Diff of /code/trunk/doc/pcre.txt

revision 836 by ph10, Wed Dec 28 17:16:11 2011 UTC revision 869 by ph10, Sat Jan 14 11:16:23 2012 UTC
# Line 25  INTRODUCTION Line 25  INTRODUCTION
25         items, and there is an option for requesting some  minor  changes  that         items, and there is an option for requesting some  minor  changes  that
26         give better JavaScript compatibility.         give better JavaScript compatibility.
27    
28           Starting with release 8.30, it is possible to compile two separate PCRE
29           libraries:  the  original,  which  supports  8-bit  character   strings
30           (including  UTF-8  strings),  and a second library that supports 16-bit
31           character strings (including UTF-16 strings). The build process  allows
32           either  one  or both to be built. The majority of the work to make this
33           possible was done by Zoltan Herczeg.
34    
35           The two libraries contain identical sets of functions, except that  the
36           names  in  the  16-bit  library start with pcre16_ instead of pcre_. To
37           avoid over-complication and reduce the documentation maintenance  load,
38           most of the documentation describes the 8-bit library, with the differ-
39           ences for the 16-bit library described separately in the  pcre16  page.
40           References  to  functions or structures of the form pcre[16]_xxx should
41           be  read  as  meaning  "pcre_xxx  when  using  the  8-bit  library  and
42           pcre16_xxx when using the 16-bit library".
43    
44         The  current implementation of PCRE corresponds approximately with Perl         The  current implementation of PCRE corresponds approximately with Perl
45         5.12, including support for UTF-8 encoded strings and  Unicode  general         5.12, including support for UTF-8/16 encoded strings and  Unicode  gen-
46         category  properties.  However,  UTF-8  and  Unicode  support has to be         eral  category properties. However, UTF-8/16 and Unicode support has to
47         explicitly enabled; it is not the default. The  Unicode  tables  corre-         be explicitly enabled; it is not the default. The Unicode tables corre-
48         spond to Unicode release 6.0.0.         spond to Unicode release 6.0.0.
49    
50         In  addition to the Perl-compatible matching function, PCRE contains an         In  addition to the Perl-compatible matching function, PCRE contains an
# Line 39  INTRODUCTION Line 55  INTRODUCTION
55    
56         PCRE  is  written  in C and released as a C library. A number of people         PCRE  is  written  in C and released as a C library. A number of people
57         have written wrappers and interfaces of various kinds.  In  particular,         have written wrappers and interfaces of various kinds.  In  particular,
58         Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now         Google  Inc.   have  provided a comprehensive C++ wrapper for the 8-bit
59         included as part of the PCRE distribution. The pcrecpp page has details         library. This is now included as part of  the  PCRE  distribution.  The
60         of  this  interface.  Other  people's contributions can be found in the         pcrecpp  page  has  details of this interface. Other people's contribu-
61         Contrib directory at the primary FTP site, which is:         tions can be found in the Contrib directory at the  primary  FTP  site,
62           which is:
63    
64         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
65    
66         Details of exactly which Perl regular expression features are  and  are         Details  of  exactly which Perl regular expression features are and are
67         not supported by PCRE are given in separate documents. See the pcrepat-         not supported by PCRE are given in separate documents. See the pcrepat-
68         tern and pcrecompat pages. There is a syntax summary in the  pcresyntax         tern  and pcrecompat pages. There is a syntax summary in the pcresyntax
69         page.         page.
70    
71         Some  features  of  PCRE can be included, excluded, or changed when the         Some features of PCRE can be included, excluded, or  changed  when  the
72         library is built. The pcre_config() function makes it  possible  for  a         library  is  built.  The pcre_config() function makes it possible for a
73         client  to  discover  which  features are available. The features them-         client to discover which features are  available.  The  features  them-
74         selves are described in the pcrebuild page. Documentation about  build-         selves  are described in the pcrebuild page. Documentation about build-
75         ing  PCRE  for various operating systems can be found in the README and         ing PCRE for various operating systems can be found in the  README  and
76         NON-UNIX-USE files in the source distribution.         NON-UNIX-USE files in the source distribution.
77    
78         The library contains a number of undocumented  internal  functions  and         The  libraries  contain a number of undocumented internal functions and
79         data  tables  that  are  used by more than one of the exported external         data tables that are used by more than one  of  the  exported  external
80         functions, but which are not intended  for  use  by  external  callers.         functions,  but  which  are  not  intended for use by external callers.
81         Their  names  all begin with "_pcre_", which hopefully will not provoke         Their names all begin with "_pcre_" or "_pcre16_", which hopefully will
82         any name clashes. In some environments, it is possible to control which         not  provoke  any name clashes. In some environments, it is possible to
83         external  symbols  are  exported when a shared library is built, and in         control which external symbols are exported when a  shared  library  is
84         these cases the undocumented symbols are not exported.         built, and in these cases the undocumented symbols are not exported.
85    
86    
87  USER DOCUMENTATION  USER DOCUMENTATION
88    
89         The user documentation for PCRE comprises a number  of  different  sec-         The  user  documentation  for PCRE comprises a number of different sec-
90         tions.  In the "man" format, each of these is a separate "man page". In         tions. In the "man" format, each of these is a separate "man page".  In
91         the HTML format, each is a separate page, linked from the  index  page.         the  HTML  format, each is a separate page, linked from the index page.
92         In  the  plain  text format, all the sections, except the pcredemo sec-         In the plain text format, all the sections, except  the  pcredemo  sec-
93         tion, are concatenated, for ease of searching. The sections are as fol-         tion, are concatenated, for ease of searching. The sections are as fol-
94         lows:         lows:
95    
96           pcre              this document           pcre              this document
97             pcre16            details of the 16-bit library
98           pcre-config       show PCRE installation configuration information           pcre-config       show PCRE installation configuration information
99           pcreapi           details of PCRE's native C API           pcreapi           details of PCRE's native C API
100           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
101           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
102           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
103           pcrecpp           details of the C++ wrapper           pcrecpp           details of the C++ wrapper for the 8-bit library
104           pcredemo          a demonstration C program that uses PCRE           pcredemo          a demonstration C program that uses PCRE
105           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command (8-bit only)
106           pcrejit           discussion of the just-in-time optimization support           pcrejit           discussion of the just-in-time optimization support
107           pcrelimits        details of size and other limits           pcrelimits        details of size and other limits
108           pcrematching      discussion of the two matching algorithms           pcrematching      discussion of the two matching algorithms
# Line 92  USER DOCUMENTATION Line 110  USER DOCUMENTATION
110           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
111                               regular expressions                               regular expressions
112           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
113           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API for the 8-bit library
114           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
115           pcresample        discussion of the pcredemo program           pcresample        discussion of the pcredemo program
116           pcrestack         discussion of stack usage           pcrestack         discussion of stack usage
117           pcresyntax        quick syntax reference           pcresyntax        quick syntax reference
118           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
119           pcreunicode       discussion of Unicode and UTF-8 support           pcreunicode       discussion of Unicode and UTF-8/16 support
120    
121         In  addition,  in the "man" and HTML formats, there is a short page for         In addition, in the "man" and HTML formats, there is a short  page  for
122         each C library function, listing its arguments and results.         each 8-bit C library function, listing its arguments and results.
123    
124    
125  AUTHOR  AUTHOR
# Line 110  AUTHOR Line 128  AUTHOR
128         University Computing Service         University Computing Service
129         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
130    
131         Putting an actual email address here seems to have been a spam  magnet,         Putting  an actual email address here seems to have been a spam magnet,
132         so  I've  taken  it away. If you want to email me, use my two initials,         so I've taken it away. If you want to email me, use  my  two  initials,
133         followed by the two digits 10, at the domain cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
134    
135    
136  REVISION  REVISION
137    
138         Last updated: 24 August 2011         Last updated: 10 January 2012
139         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
140  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
141    
142    
143    PCRE(3)                                                                PCRE(3)
144    
145    
146    NAME
147           PCRE - Perl-compatible regular expressions
148    
149           #include <pcre.h>
150    
151    
152    PCRE 16-BIT API BASIC FUNCTIONS
153    
154           pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
155                const char **errptr, int *erroffset,
156                const unsigned char *tableptr);
157    
158           pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
159                int *errorcodeptr,
160                const char **errptr, int *erroffset,
161                const unsigned char *tableptr);
162    
163           pcre16_extra *pcre16_study(const pcre16 *code, int options,
164                const char **errptr);
165    
166           void pcre16_free_study(pcre16_extra *extra);
167    
168           int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
169                PCRE_SPTR16 subject, int length, int startoffset,
170                int options, int *ovector, int ovecsize);
171    
172           int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
173                PCRE_SPTR16 subject, int length, int startoffset,
174                int options, int *ovector, int ovecsize,
175                int *workspace, int wscount);
176    
177    
178    PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
179    
180           int pcre16_copy_named_substring(const pcre16 *code,
181                PCRE_SPTR16 subject, int *ovector,
182                int stringcount, PCRE_SPTR16 stringname,
183                PCRE_UCHAR16 *buffer, int buffersize);
184    
185           int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
186                int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
187                int buffersize);
188    
189           int pcre16_get_named_substring(const pcre16 *code,
190                PCRE_SPTR16 subject, int *ovector,
191                int stringcount, PCRE_SPTR16 stringname,
192                PCRE_SPTR16 *stringptr);
193    
194           int pcre16_get_stringnumber(const pcre16 *code,
195                PCRE_SPTR16 name);
196    
197           int pcre16_get_stringtable_entries(const pcre16 *code,
198                PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
199    
200           int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
201                int stringcount, int stringnumber,
202                PCRE_SPTR16 *stringptr);
203    
204           int pcre16_get_substring_list(PCRE_SPTR16 subject,
205                int *ovector, int stringcount, PCRE_SPTR16 **listptr);
206    
207           void pcre16_free_substring(PCRE_SPTR16 stringptr);
208    
209           void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
210    
211    
212    PCRE 16-BIT API AUXILIARY FUNCTIONS
213    
214           pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
215    
216           void pcre16_jit_stack_free(pcre16_jit_stack *stack);
217    
218           void pcre16_assign_jit_stack(pcre16_extra *extra,
219                pcre16_jit_callback callback, void *data);
220    
221           const unsigned char *pcre16_maketables(void);
222    
223           int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
224                int what, void *where);
225    
226           int pcre16_refcount(pcre16 *code, int adjust);
227    
228           int pcre16_config(int what, void *where);
229    
230           const char *pcre16_version(void);
231    
232           int pcre16_pattern_to_host_byte_order(pcre16 *code,
233                pcre16_extra *extra, const unsigned char *tables);
234    
235    
236    PCRE 16-BIT API INDIRECTED FUNCTIONS
237    
238           void *(*pcre16_malloc)(size_t);
239    
240           void (*pcre16_free)(void *);
241    
242           void *(*pcre16_stack_malloc)(size_t);
243    
244           void (*pcre16_stack_free)(void *);
245    
246           int (*pcre16_callout)(pcre16_callout_block *);
247    
248    
249    PCRE 16-BIT API 16-BIT-ONLY FUNCTION
250    
251           int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
252                PCRE_SPTR16 input, int length, int *byte_order,
253                int keep_boms);
254    
255    
256    THE PCRE 16-BIT LIBRARY
257    
258           Starting  with  release  8.30, it is possible to compile a PCRE library
259           that supports 16-bit character strings, including  UTF-16  strings,  as
260           well  as  or instead of the original 8-bit library. The majority of the
261           work to make  this  possible  was  done  by  Zoltan  Herczeg.  The  two
262           libraries contain identical sets of functions, used in exactly the same
263           way. Only the names of the functions and the data types of their  argu-
264           ments  and results are different. To avoid over-complication and reduce
265           the documentation maintenance load,  most  of  the  PCRE  documentation
266           describes  the  8-bit  library,  with only occasional references to the
267           16-bit library. This page describes what is different when you use  the
268           16-bit library.
269    
270           WARNING:  A  single  application can be linked with both libraries, but
271           you must take care when processing any particular pattern to use  func-
272           tions  from  just one library. For example, if you want to study a pat-
273           tern that was compiled with  pcre16_compile(),  you  must  do  so  with
274           pcre16_study(), not pcre_study(), and you must free the study data with
275           pcre16_free_study().
276    
277    
278    THE HEADER FILE
279    
280           There is only one header file, pcre.h. It contains prototypes  for  all
281           the  functions  in  both  libraries,  as  well as definitions of flags,
282           structures, error codes, etc.
283    
284    
285    THE LIBRARY NAME
286    
287           In Unix-like systems, the 16-bit library is called libpcre16,  and  can
288           normally  be  accessed by adding -lpcre16 to the command for linking an
289           application that uses PCRE.
290    
291    
292    STRING TYPES
293    
294           In the 8-bit library, strings are passed to PCRE library  functions  as
295           vectors  of  bytes  with  the  C  type "char *". In the 16-bit library,
296           strings are passed as vectors of unsigned 16-bit quantities. The  macro
297           PCRE_UCHAR16  specifies  an  appropriate  data type, and PCRE_SPTR16 is
298           defined as "const PCRE_UCHAR16 *". In very  many  environments,  "short
299           int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
300           as "unsigned short int", but checks that it really is a  16-bit  data
301           type.  If it is not, the build fails with an error message telling the
302           maintainer to modify the definition appropriately.
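
As an illustration of the string types described above, here is a minimal stand-alone sketch that does not include pcre.h. The names uchar16 and ascii_to_uchar16 are hypothetical; uchar16 merely plays the role of PCRE_UCHAR16. The sketch shows one way to check at compile time that the chosen type is 16 bits wide, and a helper that widens a 7-bit ASCII string into a zero-terminated vector of 16-bit units, which is the form in which strings are passed to the 16-bit library.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16; an assumption for illustration. */
typedef unsigned short uchar16;

/* Compile-time check: the array size is negative (a compile error)
   unless uchar16 occupies exactly 16 bits. */
typedef char uchar16_must_be_16_bits[
  (sizeof(uchar16) * CHAR_BIT == 16) ? 1 : -1];

/* Widen a zero-terminated ASCII string into 16-bit units. Returns the
   number of units written, excluding the terminator, or -1 if the
   output buffer (capacity in units) is too small or a byte is not
   7-bit ASCII. */
int ascii_to_uchar16(const char *src, uchar16 *dst, size_t capacity)
{
  size_t i;
  for (i = 0; src[i] != '\0'; i++)
    {
    unsigned char c = (unsigned char)src[i];
    if (c > 0x7f || i + 1 >= capacity) return -1;
    dst[i] = (uchar16)c;
    }
  if (capacity == 0) return -1;
  dst[i] = 0;
  return (int)i;
}
```

A caller would typically declare a buffer such as "uchar16 pattern[64];", fill it with ascii_to_uchar16("a+b", pattern, 64), and then pass the buffer where a PCRE_SPTR16 argument is expected.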
303    
304    
305    STRUCTURE TYPES
306    
307           The types of the opaque structures that are used  for  compiled  16-bit
308           patterns  and  JIT stacks are pcre16 and pcre16_jit_stack respectively.
309           The  type  of  the  user-accessible  structure  that  is  returned   by
310           pcre16_study()  is  pcre16_extra, and the type of the structure that is
311           used for passing data to a callout  function  is  pcre16_callout_block.
312           These structures contain the same fields, with the same names, as their
313           8-bit counterparts. The only difference is that pointers  to  character
314           strings are 16-bit instead of 8-bit types.
315    
316    
317    16-BIT FUNCTIONS
318    
319           For  every function in the 8-bit library there is a corresponding func-
320           tion in the 16-bit library with a name that starts with pcre16_ instead
321           of  pcre_.  The  prototypes are listed above. In addition, there is one
322           extra function, pcre16_utf16_to_host_byte_order(). This  is  a  utility
323           function  that converts a UTF-16 character string to host byte order if
324           necessary. The other 16-bit  functions  expect  the  strings  they  are
325           passed to be in host byte order.
326    
327           The input and output arguments of pcre16_utf16_to_host_byte_order() may
328           point to the same address, that is, conversion in place  is  supported.
329           The output buffer must be at least as long as the input.
330    
331           The  length  argument  specifies the number of 16-bit data units in the
332           input string; a negative value specifies a zero-terminated string.
333    
334           If byte_order is NULL, it is assumed that the string starts off in host
335           byte  order. This may be changed by byte-order marks (BOMs) anywhere in
336           the string (commonly as the first character).
337    
338           If byte_order is not NULL, a non-zero value of the integer to which  it
339           points  means  that  the input starts off in host byte order, otherwise
340           the opposite order is assumed. Again, BOMs in  the  string  can  change
341           this. The final byte order is passed back at the end of processing.
342    
343           If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
344           copied into the output string. Otherwise they are discarded.
345    
346           The result of the function is the number of 16-bit  units  placed  into
347           the  output  buffer,  including  the  zero terminator if the string was
348           zero-terminated.
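
The rules above for byte_order, keep_boms, and the result can be sketched in stand-alone C. This is not PCRE's implementation; the function name utf16_to_host_sketch and the type uchar16 are hypothetical, and the code only mirrors the behaviour described in this section so that the argument semantics are easier to follow.

```c
#include <stddef.h>

typedef unsigned short uchar16;  /* hypothetical stand-in for PCRE_UCHAR16 */

/* Sketch of the documented behaviour; not PCRE's implementation.
   Returns the number of 16-bit units written to output. */
int utf16_to_host_sketch(uchar16 *output, const uchar16 *input,
  int length, int *byte_order, int keep_boms)
{
  /* NULL byte_order, or a non-zero value, means "starts in host order". */
  int host_bo = (byte_order != NULL) ? (*byte_order != 0) : 1;
  int written = 0;
  int i;

  if (length < 0)                 /* zero-terminated: count the units, */
    {                             /* including the terminator          */
    length = 0;
    while (input[length] != 0) length++;
    length++;
    }

  for (i = 0; i < length; i++)
    {
    uchar16 c = input[i];
    if (c == 0xfeff || c == 0xfffe)
      {
      /* A BOM read as 0xfeff means the following units are in host
         order; read as 0xfffe, they are byte-swapped. */
      host_bo = (c == 0xfeff);
      if (keep_boms != 0) output[written++] = 0xfeff;
      }
    else
      {
      output[written++] = host_bo ? c :
        (uchar16)(((c >> 8) | (c << 8)) & 0xffff);
      }
    }

  /* Pass the final byte order back to the caller. */
  if (byte_order != NULL) *byte_order = host_bo;
  return written;
}
```

Because each unit is read before the corresponding output position is written, and the write index never overtakes the read index, the sketch also works when output and input point to the same buffer.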
349    
350    
351    SUBJECT STRING OFFSETS
352    
353           The offsets within subject strings that are returned  by  the  matching
354           functions are in 16-bit units rather than bytes.
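
A small sketch of what this means for callers, assuming the usual layout of the offset vector as (start, end) pairs with negative values for unset groups, as described in the pcreapi page. The helper names and the uchar16 type are hypothetical. The point is that an offset must be scaled by the unit size before it is used as a byte count.

```c
#include <stddef.h>

typedef unsigned short uchar16;  /* hypothetical stand-in for PCRE_UCHAR16 */

/* Length of capture group n in 16-bit units, given an offset vector
   laid out as pairs (start, end); -1 if the group is unset or the
   pair does not describe a forward span. */
int group_length_units(const int *ovector, int n)
{
  int start = ovector[2 * n];
  int end = ovector[2 * n + 1];
  if (start < 0 || end < start) return -1;
  return end - start;
}

/* The same span expressed in bytes, for example as a memcpy() size. */
size_t group_length_bytes(const int *ovector, int n)
{
  int units = group_length_units(ovector, n);
  return (units < 0) ? 0 : (size_t)units * sizeof(uchar16);
}
```

Pointer arithmetic on a PCRE_SPTR16 subject already works in 16-bit units, so "subject + ovector[0]" addresses the start of the match directly; only byte-oriented calls need the scaled value.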
355    
356    
357    NAMED SUBPATTERNS
358    
359           The  name-to-number translation table that is maintained for named sub-
360           patterns uses 16-bit characters.  The  pcre16_get_stringtable_entries()
361           function returns the length of each entry in the table as the number of
362           16-bit data units.
363    
364    
365    OPTION NAMES
366    
367           There   are   two   new   general   option   names,   PCRE_UTF16    and
368           PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
369           PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
370           define the same bits in the options word.
371    
372           For  the  pcre16_config() function there is an option PCRE_CONFIG_UTF16
373           that returns 1 if UTF-16 support is configured, otherwise  0.  If  this
374           option  is given to pcre_config(), or if the PCRE_CONFIG_UTF8 option is
375           given to pcre16_config(), the result is the PCRE_ERROR_BADOPTION error.
376    
377    
378    CHARACTER CODES
379    
380           In 16-bit mode, when  PCRE_UTF16  is  not  set,  character  values  are
381           treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
382           that they can range from 0 to 0xffff instead of 0  to  0xff.  Character
383           types  for characters less than 0xff can therefore be influenced by the
384           locale in the same way as before.  Characters greater  than  0xff  have
385           only one case, and no "type" (such as letter or digit).
386    
387           In  UTF-16  mode,  the  character  code  is  Unicode, in the range 0 to
388           0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
389           because  those  are "surrogate" values that are used in pairs to encode
390           values greater than 0xffff.
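
The surrogate arithmetic can be illustrated with a short stand-alone sketch. The function names are hypothetical and the code follows the standard UTF-16 encoding rules rather than anything specific to PCRE.

```c
/* Combine a high (0xd800-0xdbff) and low (0xdc00-0xdfff) surrogate
   into a code point in the range 0x10000 to 0x10ffff. */
unsigned long utf16_combine(unsigned int high, unsigned int low)
{
  return 0x10000UL + (((unsigned long)(high & 0x3ff)) << 10) + (low & 0x3ff);
}

/* Encode a code point as one or two 16-bit units; returns the unit
   count, or 0 for a surrogate value or a value above 0x10ffff. */
int utf16_encode(unsigned long cp, unsigned short *out)
{
  if (cp >= 0xd800 && cp <= 0xdfff) return 0;
  if (cp > 0x10ffffUL) return 0;
  if (cp < 0x10000UL) { out[0] = (unsigned short)cp; return 1; }
  cp -= 0x10000UL;
  out[0] = (unsigned short)(0xd800 | (cp >> 10));
  out[1] = (unsigned short)(0xdc00 | (cp & 0x3ff));
  return 2;
}
```

For example, the code point 0x1f600 is encoded as the pair 0xd83d, 0xde00, and combining that pair gives 0x1f600 again.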
391    
392           A UTF-16 string can indicate its endianness by a special code known as a
393           byte-order mark (BOM). The PCRE functions do not handle this, expecting
394           strings  to  be  in  host  byte  order.  A  utility   function   called
395           pcre16_utf16_to_host_byte_order()  is  provided  to help with this (see
396           above).
397    
398    
399    ERROR NAMES
400    
401           The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16  corre-
402           spond  to  their  8-bit  counterparts.  The error PCRE_ERROR_BADMODE is
403           given when a compiled pattern is passed to a  function  that  processes
404           patterns  in  the  other  mode, for example, if a pattern compiled with
405           pcre_compile() is passed to pcre16_exec().
406    
407           There are new error codes whose names  begin  with  PCRE_UTF16_ERR  for
408           invalid  UTF-16  strings,  corresponding to the PCRE_UTF8_ERR codes for
409           UTF-8 strings that are described in the section entitled "Reason  codes
410           for  invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
411           are:
412    
413             PCRE_UTF16_ERR1  Missing low surrogate at end of string
414             PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
415             PCRE_UTF16_ERR3  Isolated low surrogate
416             PCRE_UTF16_ERR4  Invalid character 0xfffe
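
The four conditions listed above can be sketched as a stand-alone validity check. This is an illustration only, not PCRE's own checker: the function name, the U16_ constants, their numeric values, and the choice of reporting the offset of the first offending unit are all assumptions made for this example.

```c
enum {            /* hypothetical names; numbering follows the list above */
  U16_OK = 0,
  U16_ERR1 = 1,   /* missing low surrogate at end of string */
  U16_ERR2 = 2,   /* invalid low surrogate follows high surrogate */
  U16_ERR3 = 3,   /* isolated low surrogate */
  U16_ERR4 = 4    /* invalid character 0xfffe */
};

/* Check length units of s. On failure, store the index of the
   offending unit (for ERR1 and ERR2, the high surrogate) in
   *erroroffset and return the reason code. */
int utf16_check_sketch(const unsigned short *s, int length, int *erroroffset)
{
  int i;
  for (i = 0; i < length; i++)
    {
    unsigned int c = s[i];
    int err = U16_OK;
    if ((c & 0xf800) != 0xd800)          /* not a surrogate */
      {
      if (c == 0xfffe) err = U16_ERR4;
      }
    else if ((c & 0x0400) == 0)          /* high surrogate */
      {
      if (i + 1 >= length) err = U16_ERR1;
      else if ((s[i + 1] & 0xfc00) != 0xdc00) err = U16_ERR2;
      else i++;                          /* skip the valid low surrogate */
      }
    else err = U16_ERR3;                 /* low surrogate with no high */
    if (err != U16_OK) { *erroroffset = i; return err; }
    }
  return U16_OK;
}
```

A value of 0xfffe is what a byte-order mark looks like when read in the wrong byte order, which is one reason to normalise byte order before passing strings to the matching functions.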
417    
418    
419    ERROR TEXTS
420    
421           If there is an error while compiling a pattern, the error text that  is
422           passed  back by pcre16_compile() or pcre16_compile2() is still an 8-bit
423           character string, zero-terminated.
424    
425    
426    CALLOUTS
427    
428           The subject and mark fields in the callout block that is  passed  to  a
429           callout function point to 16-bit vectors.
430    
431    
432    TESTING
433    
434           The  pcretest  program continues to operate with 8-bit input and output
435           files, but it can be used for testing the 16-bit library. If it is  run
436           with the command line option -16, patterns and subject strings are con-
437           verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
438           library  functions  are used instead of the 8-bit ones. Returned 16-bit
439           strings are converted to 8-bit for output. If the 8-bit library was not
440           compiled, pcretest defaults to 16-bit and the -16 option is ignored.
441    
442           When  PCRE  is  being built, the RunTest script that is called by "make
443           check" uses the pcretest -C option to discover which of the  8-bit  and
444           16-bit libraries has been built, and runs the tests appropriately.
445    
446    
447    NOT SUPPORTED IN 16-BIT MODE
448    
449           Not all the features of the 8-bit library are available with the 16-bit
450           library. The C++ and POSIX wrapper functions  support  only  the  8-bit
451           library, and the pcregrep program is at present 8-bit only.
452    
453    
454    AUTHOR
455    
456           Philip Hazel
457           University Computing Service
458           Cambridge CB2 3QH, England.
459    
460    
461    REVISION
462    
463           Last updated: 08 January 2012
464           Copyright (c) 1997-2012 University of Cambridge.
465    ------------------------------------------------------------------------------
466    
467    
468  PCREBUILD(3)                                                      PCREBUILD(3)  PCREBUILD(3)                                                      PCREBUILD(3)
469    
470    
# Line 158  PCRE BUILD-TIME OPTIONS Line 501  PCRE BUILD-TIME OPTIONS
501         is not described.         is not described.
502    
503    
504    BUILDING 8-BIT and 16-BIT LIBRARIES
505    
506           By  default,  a  library  called libpcre is built, containing functions
507           that take string arguments contained in vectors  of  bytes,  either  as
508           single-byte  characters,  or interpreted as UTF-8 strings. You can also
509           build a separate library, called libpcre16, in which strings  are  con-
510           tained  in  vectors of 16-bit data units and interpreted either as sin-
511           gle-unit characters or UTF-16 strings, by adding
512    
513             --enable-pcre16
514    
515           to the configure command. If you do not want the 8-bit library, add
516    
517             --disable-pcre8
518    
519           as well. At least one of the two libraries must be built. Note that the
520           C++  and  POSIX wrappers are for the 8-bit library only, and that pcre-
521           grep is an 8-bit program. None of these are built if  you  select  only
522           the 16-bit library.
523    
524    
525  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
526    
527         The  PCRE building process uses libtool to build both shared and static         The  PCRE building process uses libtool to build both shared and static
# Line 172  BUILDING SHARED AND STATIC LIBRARIES Line 536  BUILDING SHARED AND STATIC LIBRARIES
536    
537  C++ SUPPORT  C++ SUPPORT
538    
539         By default, the configure script will search for a C++ compiler and C++         By  default,  if the 8-bit library is being built, the configure script
540         header files. If it finds them, it automatically builds the C++ wrapper         will search for a C++ compiler and C++ header files. If it finds  them,
541         library for PCRE. You can disable this by adding         it  automatically  builds  the C++ wrapper library (which supports only
542           8-bit strings). You can disable this by adding
543    
544           --disable-cpp           --disable-cpp
545    
546         to the configure command.         to the configure command.
547    
548    
549  UTF-8 SUPPORT  UTF-8 and UTF-16 SUPPORT
550    
551         To build PCRE with support for UTF-8 Unicode character strings, add         To build PCRE with support for UTF Unicode character strings, add
552    
553           --enable-utf8           --enable-utf
554    
555         to  the  configure  command.  Of  itself, this does not make PCRE treat         to the configure command.  This  setting  applies  to  both  libraries,
556         strings as UTF-8. As well as compiling PCRE with this option, you  also         adding support for UTF-8 to the 8-bit library and support for UTF-16 to
557         have to set the  PCRE_UTF8  option  when you call the pcre_compile()         the 16-bit library. It is not possible to build one  library  with  UTF
558         or pcre_compile2() functions.         support and the other without in the same configuration. (For backwards
559           compatibility, --enable-utf8 is a synonym of --enable-utf.)
560    
561         If you set --enable-utf8 when compiling in an EBCDIC environment,  PCRE         Of itself, this setting does not make PCRE treat strings  as  UTF-8  or
562           UTF-16.  As  well  as  compiling  PCRE  with this option, you also have
563           to set the PCRE_UTF8 or PCRE_UTF16 option when you call one of the pat-
564           tern compiling functions.
565    
566           If  you  set --enable-utf when compiling in an EBCDIC environment, PCRE
567         expects its input to be either ASCII or UTF-8 (depending on the runtime         expects its input to be either ASCII or UTF-8 (depending on the runtime
568         option). It is not possible to support both EBCDIC and UTF-8  codes  in         option).  It  is not possible to support both EBCDIC and UTF-8 codes in
569         the  same  version  of  the  library.  Consequently,  --enable-utf8 and         the  same  version  of  the  library.  Consequently,  --enable-utf  and
570         --enable-ebcdic are mutually exclusive.         --enable-ebcdic are mutually exclusive.
571    
572    
573  UNICODE CHARACTER PROPERTY SUPPORT  UNICODE CHARACTER PROPERTY SUPPORT
574    
575         UTF-8 support allows PCRE to process character values greater than  255         UTF  support allows the libraries to process character codepoints up to
576         in  the  strings that it handles. On its own, however, it does not pro-         0x10ffff in the strings that they handle. On its own, however, it  does
577         vide any facilities for accessing the properties of such characters. If         not provide any facilities for accessing the properties of such charac-
578         you  want  to  be able to use the pattern escapes \P, \p, and \X, which         ters. If you want to be able to use the pattern escapes \P, \p, and \X,
579         refer to Unicode character properties, you must add         which refer to Unicode character properties, you must add
580    
581           --enable-unicode-properties           --enable-unicode-properties
582    
583         to the configure command. This implies UTF-8 support, even if you  have         to  the  configure  command. This implies UTF support, even if you have
584         not explicitly requested it.         not explicitly requested it.
585    
586         Including  Unicode  property  support  adds around 30K of tables to the         Including Unicode property support adds around 30K  of  tables  to  the
587         PCRE library. Only the general category properties such as  Lu  and  Nd         PCRE  library.  Only  the general category properties such as Lu and Nd
588         are supported. Details are given in the pcrepattern documentation.         are supported. Details are given in the pcrepattern documentation.
589    
590    
# Line 223  JUST-IN-TIME COMPILER SUPPORT Line 594  JUST-IN-TIME COMPILER SUPPORT
594    
595           --enable-jit           --enable-jit
596    
597         This  support  is available only for certain hardware architectures. If         This support is available only for certain hardware  architectures.  If
598         this option is set for an  unsupported  architecture,  a  compile  time         this  option  is  set  for  an unsupported architecture, a compile time
599         error  occurs.   See  the pcrejit documentation for a discussion of JIT         error occurs.  See the pcrejit documentation for a  discussion  of  JIT
600         usage. When JIT support is enabled, pcregrep automatically makes use of         usage. When JIT support is enabled, pcregrep automatically makes use of
601         it, unless you add         it, unless you add
602    
# Line 236  JUST-IN-TIME COMPILER SUPPORT Line 607  JUST-IN-TIME COMPILER SUPPORT
607    
608  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
609    
610         By  default,  PCRE interprets the linefeed (LF) character as indicating         By default, PCRE interprets the linefeed (LF) character  as  indicating
611         the end of a line. This is the normal newline  character  on  Unix-like         the  end  of  a line. This is the normal newline character on Unix-like
612         systems.  You  can compile PCRE to use carriage return (CR) instead, by         systems. You can compile PCRE to use carriage return (CR)  instead,  by
613         adding         adding
614    
615           --enable-newline-is-cr           --enable-newline-is-cr
616    
617         to the  configure  command.  There  is  also  a  --enable-newline-is-lf         to  the  configure  command.  There  is  also  a --enable-newline-is-lf
618         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
619    
620         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 255  CODE VALUE OF NEWLINE Line 626  CODE VALUE OF NEWLINE
626    
627           --enable-newline-is-anycrlf           --enable-newline-is-anycrlf
628    
629         which causes PCRE to recognize any of the three sequences  CR,  LF,  or         which  causes  PCRE  to recognize any of the three sequences CR, LF, or
630         CRLF as indicating a line ending. Finally, a fifth option, specified by         CRLF as indicating a line ending. Finally, a fifth option, specified by
631    
632           --enable-newline-is-any           --enable-newline-is-any
633    
634         causes PCRE to recognize any Unicode newline sequence.         causes PCRE to recognize any Unicode newline sequence.
635    
636         Whatever  line  ending convention is selected when PCRE is built can be         Whatever line ending convention is selected when PCRE is built  can  be
637         overridden when the library functions are called. At build time  it  is         overridden  when  the library functions are called. At build time it is
638         conventional to use the standard for your operating system.         conventional to use the standard for your operating system.
639    
640    
641  WHAT \R MATCHES  WHAT \R MATCHES
642    
643         By  default,  the  sequence \R in a pattern matches any Unicode newline         By default, the sequence \R in a pattern matches  any  Unicode  newline
644         sequence, whatever has been selected as the line  ending  sequence.  If         sequence,  whatever  has  been selected as the line ending sequence. If
645         you specify         you specify
646    
647           --enable-bsr-anycrlf           --enable-bsr-anycrlf
648    
649         the  default  is changed so that \R matches only CR, LF, or CRLF. What-         the default is changed so that \R matches only CR, LF, or  CRLF.  What-
650         ever is selected when PCRE is built can be overridden when the  library         ever  is selected when PCRE is built can be overridden when the library
651         functions are called.         functions are called.
652    
653    
654  POSIX MALLOC USAGE  POSIX MALLOC USAGE
655    
656         When PCRE is called through the POSIX interface (see the pcreposix doc-         When the 8-bit library is called through the POSIX interface  (see  the
657         umentation), additional working storage is  required  for  holding  the         pcreposix  documentation),  additional  working storage is required for
658         pointers  to capturing substrings, because PCRE requires three integers         holding the pointers to capturing  substrings,  because  PCRE  requires
659         per substring, whereas the POSIX interface provides only  two.  If  the         three integers per substring, whereas the POSIX interface provides only
660         number of expected substrings is small, the wrapper function uses space         two. If the number of expected substrings is small, the  wrapper  func-
661         on the stack, because this is faster than using malloc() for each call.         tion  uses  space  on the stack, because this is faster than using mal-
662         The default threshold above which the stack is no longer used is 10; it         loc() for each call. The default threshold above which the stack is  no
663         can be changed by adding a setting such as         longer used is 10; it can be changed by adding a setting such as
664    
665           --with-posix-malloc-threshold=20           --with-posix-malloc-threshold=20
666    
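The stack-versus-malloc() decision described above can be sketched as follows. This is only an illustration of the rule in the text, not the wrapper's actual source: the helper name choose_ovector(), its parameters, and the MALLOC_THRESHOLD macro are hypothetical, and the value 10 is simply the default quoted above.

```c
#include <stdlib.h>

/* Assumed default threshold, taken from the text above. */
#define MALLOC_THRESHOLD 10

/* Hypothetical helper (not part of PCRE or pcreposix).
   The POSIX caller supplies room for two integers per substring, but
   PCRE needs three, so working storage of 3 * nmatch ints is required.
   stack_vec must have room for 3 * MALLOC_THRESHOLD ints.
   On return, *used_malloc is 1 if the caller must free() the result. */
static int *choose_ovector(size_t nmatch, int *stack_vec, int *used_malloc)
{
  *used_malloc = 0;
  if (nmatch == 0) return NULL;                     /* nothing to capture */
  if (nmatch <= MALLOC_THRESHOLD) return stack_vec; /* small: use the stack */
  *used_malloc = 1;                                 /* large: use the heap */
  return (int *)malloc(sizeof(int) * nmatch * 3);
}
```

With the default threshold, a request for 10 substrings uses the caller's stack buffer, while a request for 11 triggers a malloc() of 33 integers.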
# Line 298  POSIX MALLOC USAGE Line 669  POSIX MALLOC USAGE
669    
670  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
671    
672         Within a compiled pattern, offset values are used  to  point  from  one         Within  a  compiled  pattern,  offset values are used to point from one
673         part  to another (for example, from an opening parenthesis to an alter-         part to another (for example, from an opening parenthesis to an  alter-
674         nation metacharacter). By default, two-byte values are used  for  these         nation  metacharacter).  By default, two-byte values are used for these
675         offsets,  leading  to  a  maximum size for a compiled pattern of around         offsets, leading to a maximum size for a  compiled  pattern  of  around
676         64K. This is sufficient to handle all but the most  gigantic  patterns.         64K.  This  is sufficient to handle all but the most gigantic patterns.
677         Nevertheless,  some  people do want to process truyl enormous patterns,         Nevertheless, some people do want to process truly  enormous  patterns,
678         so it is possible to compile PCRE to use three-byte or  four-byte  off-         so  it  is possible to compile PCRE to use three-byte or four-byte off-
679         sets by adding a setting such as         sets by adding a setting such as
680    
681           --with-link-size=3           --with-link-size=3
682    
683         to  the  configure  command.  The value given must be 2, 3, or 4. Using         to the configure command. The value given must be 2, 3, or 4.  For  the
684         longer offsets slows down the operation of PCRE because it has to  load         16-bit  library,  a value of 3 is rounded up to 4. Using longer offsets
685         additional bytes when handling them.         slows down the operation of PCRE because it has to load additional data
686           when handling them.
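As a rough illustration of the link size rules above, the following sketch computes the link size that would take effect and the largest offset that fits in that many bytes. Both function names are hypothetical and not part of PCRE. The offset bound follows from the byte width alone, so the real limit enforced by the library may be lower.

```c
/* Hypothetical helper (not part of PCRE): the link size that takes
   effect for a value given to --with-link-size. unit_bits is 8 for the
   8-bit library and 16 for the 16-bit library. As stated above, a value
   of 3 is rounded up to 4 for the 16-bit library. Returns -1 if the
   requested value is not 2, 3, or 4. */
static int effective_link_size(int requested, int unit_bits)
{
  if (requested < 2 || requested > 4) return -1;
  if (unit_bits == 16 && requested == 3) return 4;
  return requested;
}

/* Largest offset that can be stored in link_size_bytes bytes. This is
   only the arithmetic upper bound implied by the width (for 2 bytes it
   is 65535, matching the "around 64K" figure above). */
static unsigned long long max_link_offset(int link_size_bytes)
{
  return (1ULL << (8 * link_size_bytes)) - 1;
}
```

For example, effective_link_size(3, 16) yields 4, and max_link_offset(2) yields 65535.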
687    
688    
689  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
# Line 403  USING EBCDIC CODE Line 775  USING EBCDIC CODE
775         to the configure command. This setting implies --enable-rebuild-charta-         to the configure command. This setting implies --enable-rebuild-charta-
776         bles.  You  should  only  use  it if you know that you are in an EBCDIC         bles.  You  should  only  use  it if you know that you are in an EBCDIC
777         environment (for example,  an  IBM  mainframe  operating  system).  The         environment (for example,  an  IBM  mainframe  operating  system).  The
778         --enable-ebcdic option is incompatible with --enable-utf8.         --enable-ebcdic option is incompatible with --enable-utf.
779    
780    
781  PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT  PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
# Line 469  PCRETEST OPTION FOR LIBREADLINE SUPPORT Line 841  PCRETEST OPTION FOR LIBREADLINE SUPPORT
841    
842  SEE ALSO  SEE ALSO
843    
844         pcreapi(3), pcre_config(3).         pcreapi(3), pcre16, pcre_config(3).
845    
846    
847  AUTHOR  AUTHOR
# Line 481  AUTHOR Line 853  AUTHOR
853    
854  REVISION  REVISION
855    
856         Last updated: 06 September 2011         Last updated: 07 January 2012
857         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
858  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
859    
860    
861  PCREMATCHING(3)                                                PCREMATCHING(3)  PCREMATCHING(3)                                                PCREMATCHING(3)
862    
863    
# Line 498  PCRE MATCHING ALGORITHMS Line 870  PCRE MATCHING ALGORITHMS
870         This document describes the two different algorithms that are available         This document describes the two different algorithms that are available
871         in PCRE for matching a compiled regular expression against a given sub-         in PCRE for matching a compiled regular expression against a given sub-
872         ject  string.  The  "standard"  algorithm  is  the  one provided by the         ject  string.  The  "standard"  algorithm  is  the  one provided by the
873         pcre_exec() function.  This works in the same way  as  Perl's  matching         pcre_exec() and pcre16_exec() functions. These work in the same way  as
874         function, and provides a Perl-compatible matching operation.         Perl's matching function, and provide a Perl-compatible matching opera-
875           tion. The just-in-time (JIT) optimization  that  is  described  in  the
876         An  alternative  algorithm is provided by the pcre_dfa_exec() function;         pcrejit documentation is compatible with these functions.
877         this operates in a different way, and is not  Perl-compatible.  It  has  
878         advantages  and disadvantages compared with the standard algorithm, and         An  alternative  algorithm  is  provided  by  the  pcre_dfa_exec()  and
879         these are described below.         pcre16_dfa_exec() functions; they operate in a different way,  and  are
880           not  Perl-compatible. This alternative has advantages and disadvantages
881           compared with the standard algorithm, and these are described below.
882    
883         When there is only one possible way in which a given subject string can         When there is only one possible way in which a given subject string can
884         match  a pattern, the two algorithms give the same answer. A difference         match  a pattern, the two algorithms give the same answer. A difference
# Line 632  THE ALTERNATIVE MATCHING ALGORITHM Line 1006  THE ALTERNATIVE MATCHING ALGORITHM
1006         6. Callouts are supported, but the value of the  capture_top  field  is         6. Callouts are supported, but the value of the  capture_top  field  is
1007         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
1008    
1009         7.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The  \C  escape  sequence, which (in the standard algorithm) always
1010         single byte, even in UTF-8  mode,  is  not  supported  in  UTF-8  mode,         matches a single data unit, even in UTF-8 or UTF-16 modes, is not  sup-
1011         because  the alternative algorithm moves through the subject string one         ported  in these modes, because the alternative algorithm moves through
1012         character at a time, for all active paths through the tree.         the subject string one character (not data unit) at  a  time,  for  all
1013           active paths through the tree.
1014    
1015         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)         8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
1016         are  not  supported.  (*FAIL)  is supported, and behaves like a failing         are not supported. (*FAIL) is supported, and  behaves  like  a  failing
1017         negative assertion.         negative assertion.
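Item 7 turns on the difference between a character and a data unit. The sketch below, which uses hypothetical function names that are not part of PCRE, counts how many data units the standard UTF-8 and UTF-16 encodings need for one code point. Any character that needs more than one unit is a case where \C (one data unit) and a step of one character disagree.

```c
/* Number of 8-bit data units used by UTF-8 for code point c.
   Returns 0 if c is a surrogate or is above 0x10ffff. */
static int utf8_units(unsigned long c)
{
  if (c > 0x10ffffUL || (c >= 0xd800UL && c <= 0xdfffUL)) return 0;
  if (c < 0x80UL) return 1;
  if (c < 0x800UL) return 2;
  if (c < 0x10000UL) return 3;
  return 4;
}

/* Number of 16-bit data units used by UTF-16 for code point c.
   Code points above 0xffff are encoded as a surrogate pair.
   Returns 0 if c is a surrogate or is above 0x10ffff. */
static int utf16_units(unsigned long c)
{
  if (c > 0x10ffffUL || (c >= 0xd800UL && c <= 0xdfffUL)) return 0;
  return (c < 0x10000UL) ? 1 : 2;
}
```

For instance, U+20AC occupies three data units in UTF-8 but one in UTF-16, and U+1F600 occupies four and two respectively.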
1018    
1019    
1020  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
1021    
1022         Using the alternative matching algorithm provides the following  advan-         Using  the alternative matching algorithm provides the following advan-
1023         tages:         tages:
1024    
1025         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
1026         ically found, and in particular, the longest match is  found.  To  find         ically  found,  and  in particular, the longest match is found. To find
1027         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
1028         things with callouts.         things with callouts.
1029    
1030         2. Because the alternative algorithm  scans  the  subject  string  just         2.  Because  the  alternative  algorithm  scans the subject string just
1031         once,  and  never  needs to backtrack, it is possible to pass very long         once, and never needs to backtrack (except for lookbehinds), it is pos-
1032         subject strings to the matching function in  several  pieces,  checking         sible  to  pass  very  long subject strings to the matching function in
1033         for  partial  matching  each time. Although it is possible to do multi-         several pieces, checking for partial matching each time. Although it is
1034         segment matching using the standard algorithm (pcre_exec()), by retain-         possible  to  do multi-segment matching using the standard algorithm by
1035         ing  partially matched substrings, it is more complicated. The pcrepar-         retaining partially matched substrings, it  is  more  complicated.  The
1036         tial documentation gives details  of  partial  matching  and  discusses         pcrepartial  documentation  gives  details of partial matching and dis-
1037         multi-segment matching.         cusses multi-segment matching.
1038    
1039    
1040  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
1041    
1042         The alternative algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
1043    
1044         1.  It  is  substantially  slower  than the standard algorithm. This is         1. It is substantially slower than  the  standard  algorithm.  This  is
1045         partly because it has to search for all possible matches, but  is  also         partly  because  it has to search for all possible matches, but is also
1046         because it is less susceptible to optimization.         because it is less susceptible to optimization.
1047    
1048         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
# Line 685  AUTHOR Line 1060  AUTHOR
1060    
1061  REVISION  REVISION
1062    
1063         Last updated: 19 November 2011         Last updated: 08 January 2012
1064         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
1065  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
1066    
1067    
1068  PCREAPI(3)                                                          PCREAPI(3)  PCREAPI(3)                                                          PCREAPI(3)
1069    
1070    
1071  NAME  NAME
1072         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
1073    
1074           #include <pcre.h>
1075    
 PCRE NATIVE API BASIC FUNCTIONS  
1076    
1077         #include <pcre.h>  PCRE NATIVE API BASIC FUNCTIONS
1078    
1079         pcre *pcre_compile(const char *pattern, int options,         pcre *pcre_compile(const char *pattern, int options,
1080              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
# Line 719  PCRE NATIVE API BASIC FUNCTIONS Line 1094  PCRE NATIVE API BASIC FUNCTIONS
1094              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1095              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1096    
   
 PCRE NATIVE API AUXILIARY FUNCTIONS  
   
        pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);  
   
        void pcre_jit_stack_free(pcre_jit_stack *stack);  
   
        void pcre_assign_jit_stack(pcre_extra *extra,  
             pcre_jit_callback callback, void *data);  
   
1097         int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,         int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1098              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1099              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
1100              int *workspace, int wscount);              int *workspace, int wscount);
1101    
1102    
1103    PCRE NATIVE API STRING EXTRACTION FUNCTIONS
1104    
1105         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
1106              const char *subject, int *ovector,              const char *subject, int *ovector,
1107              int stringcount, const char *stringname,              int stringcount, const char *stringname,
# Line 765  PCRE NATIVE API AUXILIARY FUNCTIONS Line 1133  PCRE NATIVE API AUXILIARY FUNCTIONS
1133    
1134         void pcre_free_substring_list(const char **stringptr);         void pcre_free_substring_list(const char **stringptr);
1135    
1136    
1137    PCRE NATIVE API AUXILIARY FUNCTIONS
1138    
1139           pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
1140    
1141           void pcre_jit_stack_free(pcre_jit_stack *stack);
1142    
1143           void pcre_assign_jit_stack(pcre_extra *extra,
1144                pcre_jit_callback callback, void *data);
1145    
1146         const unsigned char *pcre_maketables(void);         const unsigned char *pcre_maketables(void);
1147    
1148         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1149              int what, void *where);              int what, void *where);
1150    
        int pcre_info(const pcre *code, int *optptr, int *firstcharptr);  
   
1151         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1152    
1153         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
1154    
1155         char *pcre_version(void);         const char *pcre_version(void);
1156    
1157           int pcre_pattern_to_host_byte_order(pcre *code,
1158                pcre_extra *extra, const unsigned char *tables);
1159    
1160    
1161  PCRE NATIVE API INDIRECTED FUNCTIONS  PCRE NATIVE API INDIRECTED FUNCTIONS
# Line 792  PCRE NATIVE API INDIRECTED FUNCTIONS Line 1171  PCRE NATIVE API INDIRECTED FUNCTIONS
1171         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
1172    
1173    
1174    PCRE 8-BIT AND 16-BIT LIBRARIES
1175    
1176           From  release  8.30,  PCRE  can  be  compiled as a library for handling
1177           16-bit character strings as  well  as,  or  instead  of,  the  original
1178           library that handles 8-bit character strings. To avoid too much compli-
1179           cation, this document describes the 8-bit versions  of  the  functions,
1180           with only occasional references to the 16-bit library.
1181    
1182           The  16-bit  functions  operate in the same way as their 8-bit counter-
1183           parts; they just use different  data  types  for  their  arguments  and
1184           results, and their names start with pcre16_ instead of pcre_. For every
1185           option that has UTF8 in its name (for example, PCRE_UTF8), there  is  a
1186           corresponding 16-bit name with UTF8 replaced by UTF16. This facility is
1187           in fact just cosmetic; the 16-bit option names define the same bit val-
1188           ues.
1189    
1190           References to bytes and UTF-8 in this document should be read as refer-
1191           ences to 16-bit data  quantities  and  UTF-16  when  using  the  16-bit
1192           library,  unless specified otherwise. More details of the specific dif-
1193           ferences for the 16-bit library are given in the pcre16 page.
1194    
1195    
1196  PCRE API OVERVIEW  PCRE API OVERVIEW
1197    
1198         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
1199         are also some wrapper functions that correspond to  the  POSIX  regular         are  also some wrapper functions (for the 8-bit library only) that cor-
1200         expression  API,  but they do not give access to all the functionality.         respond to the POSIX regular expression  API,  but  they  do  not  give
1201         They are described in the pcreposix documentation. Both of  these  APIs         access  to  all  the functionality. They are described in the pcreposix
1202         define  a  set  of  C function calls. A C++ wrapper is also distributed         documentation. Both of these APIs define a set of C function  calls.  A
1203         with PCRE. It is documented in the pcrecpp page.         C++ wrapper (again for the 8-bit library only) is also distributed with
1204           PCRE. It is documented in the pcrecpp page.
1205    
1206         The native API C function prototypes are defined  in  the  header  file         The native API C function prototypes are defined  in  the  header  file
1207         pcre.h,  and  on Unix systems the library itself is called libpcre.  It         pcre.h,  and  on Unix-like systems the (8-bit) library itself is called
1208         can normally be accessed by adding -lpcre to the command for linking an         libpcre. It can normally be accessed by adding -lpcre  to  the  command
1209         application  that  uses  PCRE.  The  header  file  defines  the  macros         for  linking an application that uses PCRE. The header file defines the
1210         PCRE_MAJOR and PCRE_MINOR to contain the major and minor  release  num-         macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
1211         bers  for  the  library.  Applications can use these to include support         numbers  for the library. Applications can use these to include support
1212         for different releases of PCRE.         for different releases of PCRE.
1213    
1214         In a Windows environment, if you want to statically link an application         In a Windows environment, if you want to statically link an application
# Line 865  PCRE API OVERVIEW Line 1267  PCRE API OVERVIEW
1267         built are used.         built are used.
1268    
1269         The  function  pcre_fullinfo()  is used to find out information about a         The  function  pcre_fullinfo()  is used to find out information about a
1270         compiled pattern; pcre_info() is an obsolete version that returns  only         compiled pattern. The function pcre_version() returns a  pointer  to  a
1271         some  of  the available information, but is retained for backwards com-         string containing the version of PCRE and its date of release.
        patibility.  The function pcre_version() returns a pointer to a  string  
        containing the version of PCRE and its date of release.  
1272    
1273         The  function  pcre_refcount()  maintains  a  reference count in a data         The  function  pcre_refcount()  maintains  a  reference count in a data
1274         block containing a compiled pattern. This is provided for  the  benefit         block containing a compiled pattern. This is provided for  the  benefit
# Line 955  SAVING PRECOMPILED PATTERNS FOR LATER US Line 1355  SAVING PRECOMPILED PATTERNS FOR LATER US
1355         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
1356         later  time,  possibly by a different program, and even on a host other         later  time,  possibly by a different program, and even on a host other
1357         than the one on which  it  was  compiled.  Details  are  given  in  the         than the one on which  it  was  compiled.  Details  are  given  in  the
1358         pcreprecompile  documentation.  However, compiling a regular expression         pcreprecompile  documentation,  which  includes  a  description  of the
1359         with one version of PCRE for use with a different version is not  guar-         pcre_pattern_to_host_byte_order() function. However, compiling a  regu-
1360         anteed to work and may cause crashes.         lar  expression  with one version of PCRE for use with a different ver-
1361           sion is not guaranteed to work and may cause crashes.
1362    
1363    
1364  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
1365    
1366         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
1367    
1368         The  function pcre_config() makes it possible for a PCRE client to dis-         The function pcre_config() makes it possible for a PCRE client to  dis-
1369         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
1370         The  pcrebuild documentation has more details about these optional fea-         The pcrebuild documentation has more details about these optional  fea-
1371         tures.         tures.
1372    
1373         The first argument for pcre_config() is an  integer,  specifying  which         The  first  argument  for pcre_config() is an integer, specifying which
1374         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
1375         into which the information is  placed.  The  following  information  is         into  which  the  information  is placed. The returned value is zero on
1376           success, or the negative error code PCRE_ERROR_BADOPTION if  the  value
1377           in  the  first argument is not recognized. The following information is
1378         available:         available:
1379    
1380           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
1381    
1382         The  output is an integer that is set to one if UTF-8 support is avail-         The output is an integer that is set to one if UTF-8 support is  avail-
1383         able; otherwise it is set to zero.         able;  otherwise  it  is  set  to  zero. If this option is given to the
1384           16-bit  version  of  this  function,  pcre16_config(),  the  result  is
1385           PCRE_ERROR_BADOPTION.
1386    
1387             PCRE_CONFIG_UTF16
1388    
1389           The output is an integer that is set to one if UTF-16 support is avail-
1390           able; otherwise it is set to zero. This value should normally be  given
1391           to the 16-bit version of this function, pcre16_config(). If it is given
1392           to the 8-bit version of this function, the result is  PCRE_ERROR_BADOP-
1393           TION.
1394    
1395           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
1396    
1397         The output is an integer that is set to  one  if  support  for  Unicode         The  output  is  an  integer  that is set to one if support for Unicode
1398         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
1399    
1400           PCRE_CONFIG_JIT           PCRE_CONFIG_JIT
# Line 991  CHECKING BUILD-TIME OPTIONS Line 1404  CHECKING BUILD-TIME OPTIONS
1404    
1405           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
1406    
1407         The output is an integer whose value specifies  the  default  character         The  output  is  an integer whose value specifies the default character
1408         sequence  that is recognized as meaning "newline". The five values that         sequence that is recognized as meaning "newline". The five values  that
1409         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
1410         and  -1  for  ANY.  Though they are derived from ASCII, the same values         and -1 for ANY.  Though they are derived from ASCII,  the  same  values
1411         are returned in EBCDIC environments. The default should normally corre-         are returned in EBCDIC environments. The default should normally corre-
1412         spond to the standard sequence for your operating system.         spond to the standard sequence for your operating system.
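The integer values listed above for PCRE_CONFIG_NEWLINE can be decoded as in the following sketch. The function newline_name() is a hypothetical helper, not part of the PCRE API. It takes the integer that pcre_config() would place in the output variable, so it can be read without the library present. Note that 3338 is the CR code (13) shifted left by eight bits, combined with the LF code (10).

```c
/* Hypothetical helper (not part of PCRE): map the integer reported for
   PCRE_CONFIG_NEWLINE to a readable name, using the values given in the
   text above. */
static const char *newline_name(int value)
{
  switch (value)
    {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";     /* 3338 */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return "unknown";
    }
}
```

An application would typically call pcre_config() first and pass the resulting integer to a helper of this kind.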
1413    
1414           PCRE_CONFIG_BSR           PCRE_CONFIG_BSR
1415    
1416         The output is an integer whose value indicates what character sequences         The output is an integer whose value indicates what character sequences
1417         the \R escape sequence matches by default. A value of 0 means  that  \R         the  \R  escape sequence matches by default. A value of 0 means that \R
1418         matches  any  Unicode  line ending sequence; a value of 1 means that \R         matches any Unicode line ending sequence; a value of 1  means  that  \R
1419         matches only CR, LF, or CRLF. The default can be overridden when a pat-         matches only CR, LF, or CRLF. The default can be overridden when a pat-
1420         tern is compiled or matched.         tern is compiled or matched.
1421    
1422           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
1423    
1424         The  output  is  an  integer that contains the number of bytes used for         The output is an integer that contains the number  of  bytes  used  for
1425         internal linkage in compiled regular expressions. The value is 2, 3, or         internal  linkage  in  compiled  regular  expressions.  For  the  8-bit
1426         4.  Larger  values  allow larger regular expressions to be compiled, at         library, the value can be 2, 3, or 4. For the 16-bit library, the value
1427         the expense of slower matching. The default value of  2  is  sufficient         is either 2 or 4 and is still a number of bytes. The default value of 2
1428         for  all  but  the  most massive patterns, since it allows the compiled         is sufficient for all but the most massive patterns,  since  it  allows
1429         pattern to be up to 64K in size.         the  compiled  pattern  to  be  up to 64K in size.  Larger values allow
1430           larger regular expressions to be compiled, at  the  expense  of  slower
1431           matching.
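
The 64K figure follows from the width of the link field: an offset stored in 2 bytes can take 65536 distinct values. The sketch below computes that span for each permitted link size. The function name link_size_span() is an assumption made for illustration, and the result is only the bound implied by the field width; the library may enforce a lower limit for the larger sizes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: number of distinct offsets
 * that fit in a link field of link_size bytes (2, 3, or 4). */
uint64_t link_size_span(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (uint64_t)1 << (8 * link_size);
}
```

For a link size of 2 the function yields 65536, which matches the "up to 64K" limit stated above.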
1432    
1433           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1434    
1435         The output is an integer that contains the threshold  above  which  the         The  output  is  an integer that contains the threshold above which the
1436         POSIX  interface  uses malloc() for output vectors. Further details are         POSIX interface uses malloc() for output vectors. Further  details  are
1437         given in the pcreposix documentation.         given in the pcreposix documentation.
1438    
1439           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
1440    
1441         The output is a long integer that gives the default limit for the  num-         The  output is a long integer that gives the default limit for the num-
1442         ber  of  internal  matching  function calls in a pcre_exec() execution.         ber of internal matching function calls  in  a  pcre_exec()  execution.
1443         Further details are given with pcre_exec() below.         Further details are given with pcre_exec() below.
1444    
1445           PCRE_CONFIG_MATCH_LIMIT_RECURSION           PCRE_CONFIG_MATCH_LIMIT_RECURSION
1446    
1447         The output is a long integer that gives the default limit for the depth         The output is a long integer that gives the default limit for the depth
1448         of   recursion  when  calling  the  internal  matching  function  in  a         of  recursion  when  calling  the  internal  matching  function  in   a
1449         pcre_exec() execution.  Further  details  are  given  with  pcre_exec()         pcre_exec()  execution.  Further  details  are  given  with pcre_exec()
1450         below.         below.
1451    
1452           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
1453    
1454         The  output is an integer that is set to one if internal recursion when         The output is an integer that is set to one if internal recursion  when
1455         running pcre_exec() is implemented by recursive function calls that use         running pcre_exec() is implemented by recursive function calls that use
1456         the  stack  to remember their state. This is the usual way that PCRE is         the stack to remember their state. This is the usual way that  PCRE  is
1457         compiled. The output is zero if PCRE was compiled to use blocks of data         compiled. The output is zero if PCRE was compiled to use blocks of data
1458         on  the  heap  instead  of  recursive  function  calls.  In  this case,         on the  heap  instead  of  recursive  function  calls.  In  this  case,
1459         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory
1460         blocks on the heap, thus avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
1461    
1462    
# Line 1058  COMPILING A PATTERN Line 1473  COMPILING A PATTERN
1473    
1474         Either of the functions pcre_compile() or pcre_compile2() can be called         Either of the functions pcre_compile() or pcre_compile2() can be called
1475         to compile a pattern into an internal form. The only difference between         to compile a pattern into an internal form. The only difference between
1476         the  two interfaces is that pcre_compile2() has an additional argument,         the two interfaces is that pcre_compile2() has an additional  argument,
1477         errorcodeptr, via which a numerical error  code  can  be  returned.  To         errorcodeptr,  via  which  a  numerical  error code can be returned. To
1478         avoid  too  much repetition, we refer just to pcre_compile() below, but         avoid too much repetition, we refer just to pcre_compile()  below,  but
1479         the information applies equally to pcre_compile2().         the information applies equally to pcre_compile2().
1480    
1481         The pattern is a C string terminated by a binary zero, and is passed in         The pattern is a C string terminated by a binary zero, and is passed in
1482         the  pattern  argument.  A  pointer to a single block of memory that is         the pattern argument. A pointer to a single block  of  memory  that  is
1483         obtained via pcre_malloc is returned. This contains the  compiled  code         obtained  via  pcre_malloc is returned. This contains the compiled code
1484         and related data. The pcre type is defined for the returned block; this         and related data. The pcre type is defined for the returned block; this
1485         is a typedef for a structure whose contents are not externally defined.         is a typedef for a structure whose contents are not externally defined.
1486         It is up to the caller to free the memory (via pcre_free) when it is no         It is up to the caller to free the memory (via pcre_free) when it is no
1487         longer required.         longer required.
1488    
1489         Although the compiled code of a PCRE regex is relocatable, that is,  it         Although  the compiled code of a PCRE regex is relocatable, that is, it
1490         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
1491         fully relocatable, because it may contain a copy of the tableptr  argu-         fully  relocatable, because it may contain a copy of the tableptr argu-
1492         ment, which is an address (see below).         ment, which is an address (see below).
1493    
1494         The options argument contains various bit settings that affect the com-         The options argument contains various bit settings that affect the com-
1495         pilation. It should be zero if no options are required.  The  available         pilation.  It  should be zero if no options are required. The available
1496         options  are  described  below. Some of them (in particular, those that         options are described below. Some of them (in  particular,  those  that
1497         are compatible with Perl, but some others as well) can also be set  and         are  compatible with Perl, but some others as well) can also be set and
1498         unset  from  within  the  pattern  (see the detailed description in the         unset from within the pattern (see  the  detailed  description  in  the
1499         pcrepattern documentation). For those options that can be different  in         pcrepattern  documentation). For those options that can be different in
1500         different  parts  of  the pattern, the contents of the options argument         different parts of the pattern, the contents of  the  options  argument
1501         specifies their settings at the start of compilation and execution. The         specifies their settings at the start of compilation and execution. The
1502         PCRE_ANCHORED,  PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and         PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK,  and
1503         PCRE_NO_START_OPT options can be set at the time of matching as well as         PCRE_NO_START_OPT options can be set at the time of matching as well as
1504         at compile time.         at compile time.
1505    
1506         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1507         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1508         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
1509         sage. This is a static string that is part of the library. You must not         sage. This is a static string that is part of the library. You must not
1510         try  to  free it. Normally, the offset from the start of the pattern to         try to free it. Normally, the offset from the start of the  pattern  to
1511         the byte that was being processed when  the  error  was  discovered  is         the  byte  that  was  being  processed when the error was discovered is
1512         placed  in the variable pointed to by erroffset, which must not be NULL         placed in the variable pointed to by erroffset, which must not be  NULL
1513         (if it is, an immediate error is given). However, for an invalid  UTF-8         (if  it is, an immediate error is given). However, for an invalid UTF-8
1514         string,  the offset is that of the first byte of the failing character.         string, the offset is that of the first byte of the failing character.
        Also, some errors are not detected until checks are  carried  out  when  
        the  whole  pattern  has been scanned; in these cases the offset passed  
        back is the length of the pattern.  
1515    
1516           Some errors are not detected until the whole pattern has been  scanned;
1517           in  these  cases,  the offset passed back is the length of the pattern.
1518         Note that the offset is in bytes, not characters, even in  UTF-8  mode.         Note that the offset is in bytes, not characters, even in  UTF-8  mode.
1519         It may sometimes point into the middle of a UTF-8 character.         It may sometimes point into the middle of a UTF-8 character.
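
Because the offset is a byte count, a program that reports the error position in characters has to convert it. The following is a minimal sketch of one way to do so, assuming the pattern is valid UTF-8 and the offset does not exceed the pattern length. Both function names are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Move a byte offset back to the first byte of the character that
 * contains it. An offset equal to the pattern length is unchanged,
 * because the terminating zero is not a continuation byte. */
size_t utf8_char_start(const char *s, size_t offset)
{
    assert(s != NULL);
    while (offset > 0 && is_continuation((unsigned char)s[offset]))
        offset--;
    return offset;
}

/* Zero-based index of the character that contains the byte offset:
 * the number of characters that begin before that character. */
size_t utf8_char_index(const char *s, size_t offset)
{
    size_t i, count = 0;
    size_t start = utf8_char_start(s, offset);
    for (i = 0; i < start; i++)
        if (!is_continuation((unsigned char)s[i]))
            count++;
    return count;
}
```

A caller would pass the pattern and the value stored in the variable that erroffset points to.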
1520    
# Line 1303  COMPILING A PATTERN Line 1717  COMPILING A PATTERN
1717         recognized. The Unicode newline sequences are the three just mentioned,         recognized. The Unicode newline sequences are the three just mentioned,
1718         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1719         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1720         (paragraph  separator,  U+2029).  The  last  two are recognized only in         (paragraph  separator, U+2029). For the 8-bit library, the last two are
1721         UTF-8 mode.         recognized only in UTF-8 mode.
1722    
1723         The newline setting in the  options  word  uses  three  bits  that  are         The newline setting in the  options  word  uses  three  bits  that  are
1724         treated as a number, giving eight possibilities. Currently only six are         treated as a number, giving eight possibilities. Currently only six are
# Line 1361  COMPILING A PATTERN Line 1775  COMPILING A PATTERN
1775           PCRE_UTF8           PCRE_UTF8
1776    
1777         This option causes PCRE to regard both the pattern and the  subject  as         This option causes PCRE to regard both the pattern and the  subject  as
1778         strings  of  UTF-8 characters instead of single-byte character strings.         strings of UTF-8 characters instead of single-byte strings. However, it
1779         However, it is available only when PCRE is built to include UTF-8  sup-         is available only when PCRE is built to include UTF  support.  If  not,
1780         port.  If not, the use of this option provokes an error. Details of how         the  use  of  this option provokes an error. Details of how this option
1781         this option changes the behaviour of PCRE are given in the  pcreunicode         changes the behaviour of PCRE are given in the pcreunicode page.
        page.  
1782    
1783           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1784    
1785         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1786         automatically checked. There is a  discussion  about  the  validity  of         automatically  checked.  There  is  a  discussion about the validity of
1787         UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of         UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence  is
1788         bytes is found, pcre_compile() returns an error. If  you  already  know         found,  pcre_compile()  returns an error. If you already know that your
1789         that your pattern is valid, and you want to skip this check for perfor-         pattern is valid, and you want to skip this check for performance  rea-
1790         mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is         sons,  you  can set the PCRE_NO_UTF8_CHECK option.  When it is set, the
1791         set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is         effect of passing an invalid UTF-8 string as a pattern is undefined. It
1792         undefined. It may cause your program to crash. Note  that  this  option         may  cause  your  program  to  crash. Note that this option can also be
1793         can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the         passed to pcre_exec() and pcre_dfa_exec(),  to  suppress  the  validity
1794         UTF-8 validity checking of subject strings.         checking of subject strings.
1795    
1796    
1797  COMPILATION ERROR CODES  COMPILATION ERROR CODES
1798    
1799         The following table lists the error  codes  that  may  be  returned  by         The  following  table  lists  the  error  codes that may be returned by
1800         pcre_compile2(),  along with the error messages that may be returned by         pcre_compile2(), along with the error messages that may be returned  by
1801         both compiling functions. As PCRE has developed, some error codes  have         both  compiling  functions.  Note  that error messages are always 8-bit
1802         fallen out of use. To avoid confusion, they have not been re-used.         ASCII strings, even in 16-bit mode. As PCRE has developed,  some  error
1803           codes  have  fallen  out of use. To avoid confusion, they have not been
1804           re-used.
1805    
1806            0  no error            0  no error
1807            1  \ at end of pattern            1  \ at end of pattern
# Line 1420  COMPILATION ERROR CODES Line 1835  COMPILATION ERROR CODES
1835           29  (?R or (?[+-]digits must be followed by )           29  (?R or (?[+-]digits must be followed by )
1836           30  unknown POSIX class name           30  unknown POSIX class name
1837           31  POSIX collating elements are not supported           31  POSIX collating elements are not supported
1838           32  this version of PCRE is not compiled with PCRE_UTF8 support           32  this version of PCRE is compiled without UTF support
1839           33  [this code is not in use]           33  [this code is not in use]
1840           34  character value in \x{...} sequence is too large           34  character value in \x{...} sequence is too large
1841           35  invalid condition (?(0)           35  invalid condition (?(0)
# Line 1432  COMPILATION ERROR CODES Line 1847  COMPILATION ERROR CODES
1847           41  unrecognized character after (?P           41  unrecognized character after (?P
1848           42  syntax error in subpattern name (missing terminator)           42  syntax error in subpattern name (missing terminator)
1849           43  two named subpatterns have the same name           43  two named subpatterns have the same name
1850           44  invalid UTF-8 string           44  invalid UTF-8 string (specifically UTF-8)
1851           45  support for \P, \p, and \X has not been compiled           45  support for \P, \p, and \X has not been compiled
1852           46  malformed \P or \p sequence           46  malformed \P or \p sequence
1853           47  unknown property name after \P or \p           47  unknown property name after \P or \p
1854           48  subpattern name is too long (maximum 32 characters)           48  subpattern name is too long (maximum 32 characters)
1855           49  too many named subpatterns (maximum 10000)           49  too many named subpatterns (maximum 10000)
1856           50  [this code is not in use]           50  [this code is not in use]
1857           51  octal value is greater than \377 (not in UTF-8 mode)           51  octal value is greater than \377 in 8-bit non-UTF-8 mode
1858           52  internal error: overran compiling workspace           52  internal error: overran compiling workspace
1859           53  internal error: previously-checked referenced subpattern           53  internal error: previously-checked referenced subpattern
1860                 not found                 not found
# Line 1458  COMPILATION ERROR CODES Line 1873  COMPILATION ERROR CODES
1873           65  different names for subpatterns of the same number are           65  different names for subpatterns of the same number are
1874                 not allowed                 not allowed
1875           66  (*MARK) must have an argument           66  (*MARK) must have an argument
1876           67  this version of PCRE is not compiled with PCRE_UCP support           67  this version of PCRE is not compiled with Unicode property
1877                   support
1878           68  \c must be followed by an ASCII character           68  \c must be followed by an ASCII character
1879           69  \k is not followed by a braced, angle-bracketed, or quoted name           69  \k is not followed by a braced, angle-bracketed, or quoted name
1880             70  internal error: unknown opcode in find_fixedlength()
1881             71  \N is not supported in a class
1882             72  too many forward references
1883             73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
1884             74  invalid UTF-16 string (specifically UTF-16)
1885    
1886         The  numbers  32  and 10000 in errors 48 and 49 are defaults; different         The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1887         values may be used if the limits were changed when PCRE was built.         values may be used if the limits were changed when PCRE was built.
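
The range quoted in error 73 is the Unicode surrogate block, which UTF-16 reserves for encoding code points above U+FFFF as pairs of 16-bit units. The sketch below shows the range test and the resulting number of 16-bit units per code point. The function names are hypothetical; this is an illustration of the rule, not PCRE's own code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true for code points in the range that
 * error 73 describes (>= 0xd800 && <= 0xdfff). */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Hypothetical helper: number of 16-bit units needed to encode a
 * valid code point in UTF-16. The caller must reject surrogates and
 * values above U+10FFFF first. */
int utf16_units_needed(uint32_t cp)
{
    assert(cp <= 0x10FFFF && !is_surrogate(cp));
    return cp >= 0x10000 ? 2 : 1;
}
```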
1888    
1889    
# Line 1471  STUDYING A PATTERN Line 1892  STUDYING A PATTERN
1892         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1893              const char **errptr);              const char **errptr);
1894    
1895         If a compiled pattern is going to be used several times,  it  is  worth         If  a  compiled  pattern is going to be used several times, it is worth
1896         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
1897         matching. The function pcre_study() takes a pointer to a compiled  pat-         matching.  The function pcre_study() takes a pointer to a compiled pat-
1898         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
1899         information that will help speed up matching,  pcre_study()  returns  a         information  that  will  help speed up matching, pcre_study() returns a
1900         pointer  to a pcre_extra block, in which the study_data field points to         pointer to a pcre_extra block, in which the study_data field points  to
1901         the results of the study.         the results of the study.
1902    
1903         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1904         pcre_exec()  or  pcre_dfa_exec(). However, a pcre_extra block also con-         pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-
1905         tains other fields that can be set by the caller before  the  block  is         tains  other  fields  that can be set by the caller before the block is
1906         passed; these are described below in the section on matching a pattern.         passed; these are described below in the section on matching a pattern.
1907    
1908         If  studying  the  pattern  does  not  produce  any useful information,         If studying the  pattern  does  not  produce  any  useful  information,
1909         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study() returns NULL. In that circumstance, if the calling program
1910         wants   to   pass   any   of   the   other  fields  to  pcre_exec()  or         wants  to  pass  any  of   the   other   fields   to   pcre_exec()   or
1911         pcre_dfa_exec(), it must set up its own pcre_extra block.         pcre_dfa_exec(), it must set up its own pcre_extra block.
1912    
1913         The second argument of pcre_study() contains option bits. There is only         The second argument of pcre_study() contains option bits. There is only
1914         one  option:  PCRE_STUDY_JIT_COMPILE.  If this is set, and the just-in-         one option: PCRE_STUDY_JIT_COMPILE. If this is set,  and  the  just-in-
1915         time compiler is  available,  the  pattern  is  further  compiled  into         time  compiler  is  available,  the  pattern  is  further compiled into
1916         machine  code  that  executes much faster than the pcre_exec() matching         machine code that executes much faster than  the  pcre_exec()  matching
1917         function. If the just-in-time compiler is not available, this option is         function. If the just-in-time compiler is not available, this option is
1918         ignored. All other bits in the options argument must be zero.         ignored. All other bits in the options argument must be zero.
1919    
1920         JIT  compilation  is  a heavyweight optimization. It can take some time         JIT compilation is a heavyweight optimization. It can  take  some  time
1921         for patterns to be analyzed, and for one-off matches  and  simple  pat-         for  patterns  to  be analyzed, and for one-off matches and simple pat-
1922         terns  the benefit of faster execution might be offset by a much slower         terns the benefit of faster execution might be offset by a much  slower
1923         study time.  Not all patterns can be optimized by the JIT compiler. For         study time.  Not all patterns can be optimized by the JIT compiler. For
1924         those  that cannot be handled, matching automatically falls back to the         those that cannot be handled, matching automatically falls back to  the
1925         pcre_exec() interpreter. For more details, see the  pcrejit  documenta-         pcre_exec()  interpreter.  For more details, see the pcrejit documenta-
1926         tion.         tion.
1927    
1928         The  third argument for pcre_study() is a pointer for an error message.         The third argument for pcre_study() is a pointer for an error  message.
1929         If studying succeeds (even if no data is  returned),  the  variable  it         If  studying  succeeds  (even  if no data is returned), the variable it
1930         points  to  is  set  to NULL. Otherwise it is set to point to a textual         points to is set to NULL. Otherwise it is set to  point  to  a  textual
1931         error message. This is a static string that is part of the library. You         error message. This is a static string that is part of the library. You
1932         must  not  try  to  free it. You should test the error pointer for NULL         must not try to free it. You should test the  error  pointer  for  NULL
1933         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
1934    
1935         When you are finished with a pattern, you can free the memory used  for         When  you are finished with a pattern, you can free the memory used for
1936         the study data by calling pcre_free_study(). This function was added to         the study data by calling pcre_free_study(). This function was added to
1937         the API for release 8.20. For earlier versions,  the  memory  could  be         the  API  for  release  8.20. For earlier versions, the memory could be
1938         freed  with  pcre_free(), just like the pattern itself. This will still         freed with pcre_free(), just like the pattern itself. This  will  still
1939         work in cases where PCRE_STUDY_JIT_COMPILE  is  not  used,  but  it  is         work  in  cases  where  PCRE_STUDY_JIT_COMPILE  is  not used, but it is
1940         advisable to change to the new function when convenient.         advisable to change to the new function when convenient.
1941    
1942         This  is  a typical way in which pcre_study() is used (except that in a         This is a typical way in which pcre_study() is used (except that  in  a
1943         real application there should be tests for errors):         real application there should be tests for errors):
1944    
1945           int rc;           int rc;
# Line 1538  STUDYING A PATTERN Line 1959  STUDYING A PATTERN
1959         Studying a pattern does two things: first, a lower bound for the length         Studying a pattern does two things: first, a lower bound for the length
1960         of subject string that is needed to match the pattern is computed. This         of subject string that is needed to match the pattern is computed. This
1961         does not mean that there are any strings of that length that match, but         does not mean that there are any strings of that length that match, but
1962         it  does  guarantee that no shorter strings match. The value is used by         it does guarantee that no shorter strings match. The value is  used  by
1963         pcre_exec() and pcre_dfa_exec() to avoid  wasting  time  by  trying  to         pcre_exec()  and  pcre_dfa_exec()  to  avoid  wasting time by trying to
1964         match  strings  that are shorter than the lower bound. You can find out         match strings that are shorter than the lower bound. You can  find  out
1965         the value in a calling program via the pcre_fullinfo() function.         the value in a calling program via the pcre_fullinfo() function.
1966    
1967         Studying a pattern is also useful for non-anchored patterns that do not         Studying a pattern is also useful for non-anchored patterns that do not
1968         have  a  single fixed starting character. A bitmap of possible starting         have a single fixed starting character. A bitmap of  possible  starting
1969         bytes is created. This speeds up finding a position in the  subject  at         bytes  is  created. This speeds up finding a position in the subject at
1970         which to start matching.         which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
1971           values less than 256.)
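
The two optimizations can be pictured with the following sketch. It is not PCRE's internal code: the study_sketch structure and the sketch_ function names are assumptions made for illustration, and the sketch assumes a minimum length of at least one. It keeps a 256-bit map of possible starting bytes together with the lower bound on the subject length, and uses both to skip positions at which no match can begin.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the data that studying produces. */
typedef struct {
    unsigned char start_bits[32];  /* one bit per possible starting byte */
    size_t min_length;             /* lower bound on a matching subject */
} study_sketch;

void sketch_init(study_sketch *st, size_t min_length)
{
    memset(st, 0, sizeof *st);
    st->min_length = min_length;
}

/* Record that a match may start with byte c. */
void sketch_allow_start(study_sketch *st, unsigned char c)
{
    st->start_bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Return the first offset at which a match attempt is worthwhile,
 * or (size_t)-1 if no position in the subject can start a match. */
size_t sketch_first_candidate(const study_sketch *st,
                              const char *subject, size_t length)
{
    size_t i;
    assert(st != NULL && subject != NULL && st->min_length >= 1);
    if (length < st->min_length)
        return (size_t)-1;         /* subject shorter than the lower bound */
    /* Stop once fewer than min_length bytes remain. */
    for (i = 0; i + st->min_length <= length; i++) {
        unsigned char c = (unsigned char)subject[i];
        if (st->start_bits[c / 8] & (1u << (c % 8)))
            return i;
    }
    return (size_t)-1;
}
```

With starting bytes 'c' and 'd' and a minimum length of 3, the subject "abcde" gives offset 2 as the first candidate, while "ab" is rejected without examining any byte.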
1972    
1973         These  two optimizations apply to both pcre_exec() and pcre_dfa_exec().         These  two optimizations apply to both pcre_exec() and pcre_dfa_exec().
1974         However, they are not used by pcre_exec()  if  pcre_study()  is  called         However, they are not used by pcre_exec()  if  pcre_study()  is  called
# Line 1623  INFORMATION ABOUT A PATTERN Line 2045  INFORMATION ABOUT A PATTERN
2045              int what, void *where);              int what, void *where);
2046    
2047         The pcre_fullinfo() function returns information about a compiled  pat-         The pcre_fullinfo() function returns information about a compiled  pat-
2048         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern.  It replaces the pcre_info() function, which was removed from the
2049         less retained for backwards compability (and is documented below).         library at version 8.30, after more than 10 years of obsolescence.
2050    
2051         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
2052         pattern.  The second argument is the result of pcre_study(), or NULL if         pattern.  The second argument is the result of pcre_study(), or NULL if
# Line 1633  INFORMATION ABOUT A PATTERN Line 2055  INFORMATION ABOUT A PATTERN
2055         variable to receive the data. The yield of the  function  is  zero  for         variable to receive the data. The yield of the  function  is  zero  for
2056         success, or one of the following negative numbers:         success, or one of the following negative numbers:
2057    
2058           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL           the argument code was NULL
2059                                 the argument where was NULL                                     the argument where was NULL
2060           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC       the "magic number" was not found
2061           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
2062                                       endianness
2063             PCRE_ERROR_BADOPTION      the value of what was invalid
2064    
2065         The  "magic  number" is placed at the start of each compiled pattern as         The  "magic  number" is placed at the start of each compiled pattern as
2066         a simple check against passing an arbitrary memory pointer. Here is  a         a simple check against passing an arbitrary memory pointer. The  endi-
2067         typical  call  of pcre_fullinfo(), to obtain the length of the compiled         anness error can occur if a compiled pattern is saved and reloaded on a
2068         pattern:         different host. Here is a typical call of  pcre_fullinfo(),  to  obtain
2069           the length of the compiled pattern:
2070    
2071           int rc;           int rc;
2072           size_t length;           size_t length;
# Line 1651  INFORMATION ABOUT A PATTERN Line 2076  INFORMATION ABOUT A PATTERN
2076             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
2077             &length);         /* where to put the data */             &length);         /* where to put the data */
2078    
2079         The possible values for the third argument are defined in  pcre.h,  and         The  possible  values for the third argument are defined in pcre.h, and
2080         are as follows:         are as follows:
2081    
2082           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
2083    
2084         Return  the  number  of  the highest back reference in the pattern. The         Return the number of the highest back reference  in  the  pattern.  The
2085         fourth argument should point to an int variable. Zero  is  returned  if         fourth  argument  should  point to an int variable. Zero is returned if
2086         there are no back references.         there are no back references.
2087    
2088           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
2089    
2090         Return  the  number of capturing subpatterns in the pattern. The fourth         Return the number of capturing subpatterns in the pattern.  The  fourth
2091         argument should point to an int variable.         argument should point to an int variable.
2092    
2093           PCRE_INFO_DEFAULT_TABLES           PCRE_INFO_DEFAULT_TABLES
2094    
2095         Return a pointer to the internal default character tables within  PCRE.         Return  a pointer to the internal default character tables within PCRE.
2096         The  fourth  argument should point to an unsigned char * variable. This         The fourth argument should point to an unsigned char *  variable.  This
2097         information call is provided for internal use by the pcre_study() func-         information call is provided for internal use by the pcre_study() func-
2098         tion.  External  callers  can  cause PCRE to use its internal tables by         tion. External callers can cause PCRE to use  its  internal  tables  by
2099         passing a NULL table pointer.         passing a NULL table pointer.
2100    
2101           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
2102    
2103         Return information about the first byte of any matched  string,  for  a         Return information about the first data unit of any matched string, for
2104         non-anchored  pattern. The fourth argument should point to an int vari-         a non-anchored pattern. (The name of this option refers  to  the  8-bit
2105         able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name         library,  where data units are bytes.) The fourth argument should point
2106         is still recognized for backwards compatibility.)         to an int variable.
2107    
2108           If there is a fixed first value, for example, the  letter  "c"  from  a
2109           pattern  such  as (cat|cow|coyote), its value is returned. In the 8-bit
2110           library, the value is always less than 256; in the 16-bit  library  the
2111           value can be up to 0xffff.
2112    
2113         If  there  is  a  fixed first byte, for example, from a pattern such as         If there is no fixed first value, and if either
        (cat|cow|coyote), its value is returned. Otherwise, if either  
2114    
2115         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
2116         branch starts with "^", or         branch starts with "^", or
2117    
2118         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2119         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
2120    
2121         -1 is returned, indicating that the pattern matches only at  the  start         -1  is  returned, indicating that the pattern matches only at the start
2122         of  a  subject string or after any newline within the string. Otherwise         of a subject string or after any newline within the  string.  Otherwise
2123         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
2124    
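The three kinds of result described for PCRE_INFO_FIRSTBYTE can be told apart with a small helper. The following is only a sketch: the enum and the function name classify_first_unit are hypothetical and are not part of PCRE. It assumes the int value has already been obtained from pcre_fullinfo().

```c
/* Hypothetical helper, not part of PCRE: classify an int value of the
   kind returned for PCRE_INFO_FIRSTBYTE. */
enum first_unit_kind {
    FIRST_UNIT_FIXED,       /* value >= 0: a fixed first data unit        */
    FIRST_UNIT_LINE_START,  /* -1: match only at start or after a newline */
    FIRST_UNIT_NONE,        /* -2: nothing recorded, or anchored pattern  */
    FIRST_UNIT_UNEXPECTED   /* any other negative value                   */
};

enum first_unit_kind classify_first_unit(int value)
{
    if (value >= 0)
        return FIRST_UNIT_FIXED;
    if (value == -1)
        return FIRST_UNIT_LINE_START;
    if (value == -2)
        return FIRST_UNIT_NONE;
    return FIRST_UNIT_UNEXPECTED;
}
```

For a pattern such as (cat|cow|coyote), the value would be the code for "c", which this helper reports as FIRST_UNIT_FIXED.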
2125           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
2126    
2127         If the pattern was studied, and this resulted in the construction of  a         If  the pattern was studied, and this resulted in the construction of a
2128         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of values for the first data  unit
2129         matching string, a pointer to the table is returned. Otherwise NULL  is         in  any  matching string, a pointer to the table is returned. Otherwise
2130         returned.  The fourth argument should point to an unsigned char * vari-         NULL is returned. The fourth argument should point to an unsigned  char
2131         able.         * variable.
2132    
2133           PCRE_INFO_HASCRORLF           PCRE_INFO_HASCRORLF
2134    
2135         Return 1 if the pattern contains any explicit  matches  for  CR  or  LF         Return  1  if  the  pattern  contains any explicit matches for CR or LF
2136         characters,  otherwise  0.  The  fourth argument should point to an int         characters, otherwise 0. The fourth argument should  point  to  an  int
2137         variable. An explicit match is either a literal CR or LF character,  or         variable.  An explicit match is either a literal CR or LF character, or
2138         \r or \n.         \r or \n.
2139    
2140           PCRE_INFO_JCHANGED           PCRE_INFO_JCHANGED
2141    
2142         Return  1  if  the (?J) or (?-J) option setting is used in the pattern,         Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2143         otherwise 0. The fourth argument should point to an int variable.  (?J)         otherwise  0. The fourth argument should point to an int variable. (?J)
2144         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
2145    
2146           PCRE_INFO_JIT           PCRE_INFO_JIT
2147    
2148         Return  1  if  the  pattern was studied with the PCRE_STUDY_JIT_COMPILE         Return 1 if the pattern was  studied  with  the  PCRE_STUDY_JIT_COMPILE
2149         option, and just-in-time compiling was successful. The fourth  argument         option,  and just-in-time compiling was successful. The fourth argument
2150         should  point  to  an  int variable. A return value of 0 means that JIT         should point to an int variable. A return value of  0  means  that  JIT
2151         support is not available in this version of PCRE, or that  the  pattern         support  is  not available in this version of PCRE, or that the pattern
2152         was not studied with the PCRE_STUDY_JIT_COMPILE option, or that the JIT         was not studied with the PCRE_STUDY_JIT_COMPILE option, or that the JIT
2153         compiler could not handle this particular pattern. See the pcrejit doc-         compiler could not handle this particular pattern. See the pcrejit doc-
2154         umentation for details of what can and cannot be handled.         umentation for details of what can and cannot be handled.
# Line 1727  INFORMATION ABOUT A PATTERN Line 2156  INFORMATION ABOUT A PATTERN
2156           PCRE_INFO_JITSIZE           PCRE_INFO_JITSIZE
2157    
2158         If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE         If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE
2159         option, return the size of the  JIT  compiled  code,  otherwise  return         option,  return  the  size  of  the JIT compiled code, otherwise return
2160         zero. The fourth argument should point to a size_t variable.         zero. The fourth argument should point to a size_t variable.
2161    
2162           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
2163    
2164         Return  the  value of the rightmost literal byte that must exist in any         Return the value of the rightmost literal data unit that must exist  in
2165         matched string, other than at its  start,  if  such  a  byte  has  been         any  matched  string, other than at its start, if such a value has been
2166         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
2167         is no such byte, -1 is returned. For anchored patterns, a last  literal         is no such value, -1 is returned. For anchored patterns, a last literal
2168         byte  is  recorded only if it follows something of variable length. For         value is recorded only if it follows something of variable length.  For
2169         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2170         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
2171    
2172           PCRE_INFO_MINLENGTH           PCRE_INFO_MINLENGTH
2173    
2174         If  the  pattern  was studied and a minimum length for matching subject         If the pattern was studied and a minimum length  for  matching  subject
2175         strings was computed, its value is  returned.  Otherwise  the  returned         strings  was  computed,  its  value is returned. Otherwise the returned
2176         value  is  -1. The value is a number of characters, not bytes (this may         value is -1. The value is a number of characters, which in  UTF-8  mode
2177         be relevant in UTF-8 mode). The fourth argument should point to an  int         may  be  different from the number of bytes. The fourth argument should
2178         variable.  A  non-negative  value is a lower bound to the length of any         point to an int variable. A non-negative value is a lower bound to  the
2179         matching string. There may not be any strings of that  length  that  do         length  of  any  matching  string. There may not be any strings of that
2180         actually match, but every string that does match is at least that long.         length that do actually match, but every string that does match  is  at
2181           least that long.
2182    
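Because the minimum length is counted in characters, a caller working in UTF-8 mode has to count characters, not bytes, before comparing. The sketch below shows one way to do that. The function names are hypothetical, the subject is assumed to be valid UTF-8, and the check is purely illustrative.

```c
#include <stddef.h>

/* Count characters in a UTF-8 string by counting every byte that is not
   a continuation byte (bit pattern 10xxxxxx). Assumes valid UTF-8. */
size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t count = 0;
    size_t i;
    for (i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical pre-check: returns 0 if the subject is certainly too
   short to match, 1 otherwise. minlength is a value of the kind
   returned for PCRE_INFO_MINLENGTH; -1 means no bound is known. */
int may_match_by_length(const unsigned char *s, size_t nbytes, int minlength)
{
    if (minlength < 0)
        return 1;
    return utf8_char_count(s, nbytes) >= (size_t)minlength;
}
```

A result of 1 does not mean the subject matches. It only means the lower bound does not rule it out.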
2183           PCRE_INFO_NAMECOUNT           PCRE_INFO_NAMECOUNT
2184           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
# Line 1768  INFORMATION ABOUT A PATTERN Line 2198  INFORMATION ABOUT A PATTERN
2198         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
2199         of  each  entry;  both  of  these  return  an int value. The entry size         of  each  entry;  both  of  these  return  an int value. The entry size
2200         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
2201         a  pointer  to  the  first  entry of the table (a pointer to char). The         a pointer to the first entry of the table. This is a pointer to char in
2202         first two bytes of each entry are the number of the capturing parenthe-         the 8-bit library, where the first two bytes of each entry are the num-
2203         sis,  most  significant byte first. The rest of the entry is the corre-         ber  of  the capturing parenthesis, most significant byte first. In the
2204         sponding name, zero terminated.         16-bit library, the pointer points to 16-bit data units, the  first  of
2205           which  contains  the  parenthesis  number. The rest of the entry is the
2206           corresponding name, zero terminated.
2207    
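The entry layout described for the 8-bit library can be decoded as in the following sketch. The function names are hypothetical. The table pointer, entry count and entry size are assumed to have come from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE respectively.

```c
#include <stdio.h>

/* Group number: the first two bytes of the entry, most significant
   byte first. */
int name_entry_number(const unsigned char *entry)
{
    return (entry[0] << 8) | entry[1];
}

/* Group name: the zero-terminated string that follows the two number
   bytes. */
const char *name_entry_name(const unsigned char *entry)
{
    return (const char *)(entry + 2);
}

/* Print every entry of a name table as "number  name". */
void print_name_table(const unsigned char *table, int count, int entry_size)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        printf("%2d  %s\n", name_entry_number(entry), name_entry_name(entry));
    }
}
```

This applies only to the 8-bit layout. In the 16-bit library the number occupies a single 16-bit data unit, so the decoding differs.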
2208         The names are in alphabetical order. Duplicate names may appear if  (?|         The names are in alphabetical order. Duplicate names may appear if  (?|
2209         is used to create multiple groups with the same number, as described in         is used to create multiple groups with the same number, as described in
# Line 1784  INFORMATION ABOUT A PATTERN Line 2216  INFORMATION ABOUT A PATTERN
2216         terns may have lower numbers.         terns may have lower numbers.
2217    
2218         As a simple example of the name/number table,  consider  the  following         As a simple example of the name/number table,  consider  the  following
2219         pattern  (assume  PCRE_EXTENDED is set, so white space - including new-         pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
2220         lines - is ignored):         set, so white space - including newlines - is ignored):
2221    
2222           (?<date> (?<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
2223           (?<month>\d\d) - (?<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
# Line 1838  INFORMATION ABOUT A PATTERN Line 2270  INFORMATION ABOUT A PATTERN
2270    
2271           PCRE_INFO_SIZE           PCRE_INFO_SIZE
2272    
2273         Return  the  size  of  the compiled pattern. The fourth argument should         Return  the size of the compiled pattern in bytes (for both libraries).
2274         point to a size_t variable. This value does not include the size of the         The fourth argument should point to a size_t variable. This value  does
2275         pcre  structure  that  is returned by pcre_compile(). The value that is         not  include  the  size  of  the  pcre  structure  that  is returned by
2276         passed as the argument to pcre_malloc() when pcre_compile() is  getting         pcre_compile(). The value that is passed as the argument  to  pcre_mal-
2277         memory  in  which  to  place the compiled data is the value returned by         loc()  when pcre_compile() is getting memory in which to place the com-
2278         this option plus the size of the pcre structure.  Studying  a  compiled         piled data is the value returned by this option plus the  size  of  the
2279         pattern, with or without JIT, does not alter the value returned by this         pcre  structure. Studying a compiled pattern, with or without JIT, does
2280         option.         not alter the value returned by this option.
2281    
2282           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
2283    
2284         Return the size of the data block pointed to by the study_data field in         Return the size in bytes of the data block pointed to by the study_data
2285         a  pcre_extra  block. If pcre_extra is NULL, or there is no study data,         field  in  a  pcre_extra  block.  If pcre_extra is NULL, or there is no
2286         zero is returned. The fourth argument should point to  a  size_t  vari-         study data, zero is returned. The fourth argument  should  point  to  a
2287         able.   The  study_data field is set by pcre_study() to record informa-         size_t  variable. The study_data field is set by pcre_study() to record
2288         tion that will speed up matching (see the section entitled "Studying  a         information that will speed  up  matching  (see  the  section  entitled
2289         pattern" above). The format of the study_data block is private, but its         "Studying a pattern" above). The format of the study_data block is pri-
2290         length is made available via this option so that it can  be  saved  and         vate, but its length is made available via this option so that  it  can
2291         restored (see the pcreprecompile documentation for details).         be  saved  and  restored  (see  the  pcreprecompile  documentation  for
2292           details).
   
 OBSOLETE INFO FUNCTION  
   
        int pcre_info(const pcre *code, int *optptr, int *firstcharptr);  
   
        The  pcre_info()  function is now obsolete because its interface is too  
        restrictive to return all the available data about a compiled  pattern.  
        New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of  
        pcre_info() is the number of capturing subpatterns, or one of the  fol-  
        lowing negative numbers:  
   
          PCRE_ERROR_NULL       the argument code was NULL  
          PCRE_ERROR_BADMAGIC   the "magic number" was not found  
   
        If  the  optptr  argument is not NULL, a copy of the options with which  
        the pattern was compiled is placed in the integer  it  points  to  (see  
        PCRE_INFO_OPTIONS above).  
   
        If  the  pattern  is  not anchored and the firstcharptr argument is not  
        NULL, it is used to pass back information about the first character  of  
        any matched string (see PCRE_INFO_FIRSTBYTE above).  
2293    
2294    
2295  REFERENCE COUNTS  REFERENCE COUNTS
2296    
2297         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
2298    
2299         The  pcre_refcount()  function is used to maintain a reference count in         The pcre_refcount() function is used to maintain a reference  count  in
2300         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
2301         benefit  of  applications  that  operate  in an object-oriented manner,         benefit of applications that  operate  in  an  object-oriented  manner,
2302         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
2303         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
2304    
2305         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
2306         zero.  It is changed only by calling this function, whose action is  to         zero.   It is changed only by calling this function, whose action is to
2307         add  the  adjust  value  (which may be positive or negative) to it. The         add the adjust value (which may be positive or  negative)  to  it.  The
2308         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
2309         is  constrained to lie between 0 and 65535, inclusive. If the new value         is constrained to lie between 0 and 65535, inclusive. If the new  value
2310         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
2311    
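The limiting rule described above can be modelled as follows. This is a sketch of the documented behaviour only, not PCRE's own code, and the function name adjusted_refcount is hypothetical.

```c
/* Model of the documented rule: add adjust to the current count and
   force the result into the range 0..65535. */
int adjusted_refcount(int current, int adjust)
{
    long long value = (long long)current + (long long)adjust;
    if (value < 0)
        value = 0;
    if (value > 65535)
        value = 65535;
    return (int)value;
}
```

One consequence of the rule is that calling pcre_refcount() with an adjust value of zero yields the current count without changing it.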
2312         Except when it is zero, the reference count is not correctly  preserved         Except  when it is zero, the reference count is not correctly preserved
2313         if  a  pattern  is  compiled on one host and then transferred to a host         if a pattern is compiled on one host and then  transferred  to  a  host
2314         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
2315    
2316    
# Line 1909  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2320  MATCHING A PATTERN: THE TRADITIONAL FUNC
2320              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
2321              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
2322    
2323         The function pcre_exec() is called to match a subject string against  a         The  function pcre_exec() is called to match a subject string against a
2324         compiled  pattern, which is passed in the code argument. If the pattern         compiled pattern, which is passed in the code argument. If the  pattern
2325         was studied, the result of the study should  be  passed  in  the  extra         was  studied,  the  result  of  the study should be passed in the extra
2326         argument.  You  can call pcre_exec() with the same code and extra argu-         argument. You can call pcre_exec() with the same code and  extra  argu-
2327         ments as many times as you like, in order to  match  different  subject         ments  as  many  times as you like, in order to match different subject
2328         strings with the same pattern.         strings with the same pattern.
2329    
2330         This  function  is  the  main  matching facility of the library, and it         This function is the main matching facility  of  the  library,  and  it
2331         operates in a Perl-like manner. For specialist use  there  is  also  an         operates  in  a  Perl-like  manner. For specialist use there is also an
2332         alternative  matching function, which is described below in the section         alternative matching function, which is described below in the  section
2333         about the pcre_dfa_exec() function.         about the pcre_dfa_exec() function.
2334    
2335         In most applications, the pattern will have been compiled (and  option-         In  most applications, the pattern will have been compiled (and option-
2336         ally  studied)  in the same process that calls pcre_exec(). However, it         ally studied) in the same process that calls pcre_exec().  However,  it
2337         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
2338         later  in  different processes, possibly even on different hosts. For a         later in different processes, possibly even on different hosts.  For  a
2339         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
2340    
2341         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1943  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2354  MATCHING A PATTERN: THE TRADITIONAL FUNC
2354    
2355     Extra data for pcre_exec()     Extra data for pcre_exec()
2356    
2357         If the extra argument is not NULL, it must point to a  pcre_extra  data         If  the  extra argument is not NULL, it must point to a pcre_extra data
2358         block.  The pcre_study() function returns such a block (when it doesn't         block. The pcre_study() function returns such a block (when it  doesn't
2359         return NULL), but you can also create one for yourself, and pass  addi-         return  NULL), but you can also create one for yourself, and pass addi-
2360         tional  information  in it. The pcre_extra block contains the following         tional information in it. The pcre_extra block contains  the  following
2361         fields (not necessarily in this order):         fields (not necessarily in this order):
2362    
2363           unsigned long int flags;           unsigned long int flags;
# Line 1958  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2369  MATCHING A PATTERN: THE TRADITIONAL FUNC
2369           const unsigned char *tables;           const unsigned char *tables;
2370           unsigned char **mark;           unsigned char **mark;
2371    
2372           In  the  16-bit  version  of  this  structure,  the mark field has type
2373           "PCRE_UCHAR16 **".
2374    
2375         The flags field is a bitmap that specifies which of  the  other  fields         The flags field is a bitmap that specifies which of  the  other  fields
2376         are set. The flag bits are:         are set. The flag bits are:
2377    
# Line 2036  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2450  MATCHING A PATTERN: THE TRADITIONAL FUNC
2450         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
2451    
2452         If  PCRE_EXTRA_MARK  is  set in the flags field, the mark field must be         If  PCRE_EXTRA_MARK  is  set in the flags field, the mark field must be
2453         set to point to a char * variable. If the pattern  contains  any  back-         set to point to a suitable variable. If the pattern contains any  back-
2454         tracking  control verbs such as (*MARK:NAME), and the execution ends up         tracking  control verbs such as (*MARK:NAME), and the execution ends up
2455         with a name to pass back, a pointer to the  name  string  (zero  termi-         with a name to pass back, a pointer to the  name  string  (zero  termi-
2456         nated)  is  placed  in  the  variable pointed to by the mark field. The         nated)  is  placed  in  the  variable pointed to by the mark field. The
2457         names are within the compiled pattern; if you wish  to  retain  such  a         names are within the compiled pattern; if you wish  to  retain  such  a
2458         name  you must copy it before freeing the memory of a compiled pattern.         name  you must copy it before freeing the memory of a compiled pattern.
2459         If there is no name to pass back, the variable pointed to by  the  mark         If there is no name to pass back, the variable pointed to by  the  mark
2460         field  set  to NULL. For details of the backtracking control verbs, see         field  is  set  to NULL. For details of the backtracking control verbs,
2461         the section entitled "Backtracking control" in the pcrepattern documen-         see the section entitled "Backtracking control" in the pcrepattern doc-
2462         tation.         umentation.
2463    
2464     Option bits for pcre_exec()     Option bits for pcre_exec()
2465    
# Line 2219  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2633  MATCHING A PATTERN: THE TRADITIONAL FUNC
2633         UTF-8  string is automatically checked when pcre_exec() is subsequently         UTF-8  string is automatically checked when pcre_exec() is subsequently
2634         called.  The value of startoffset is also checked  to  ensure  that  it         called.  The value of startoffset is also checked  to  ensure  that  it
2635         points  to  the start of a UTF-8 character. There is a discussion about         points  to  the start of a UTF-8 character. There is a discussion about
2636         the validity of UTF-8 strings in the section on UTF-8  support  in  the         the validity of UTF-8 strings in the pcreunicode page.  If  an  invalid
2637         main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,         sequence   of   bytes   is   found,   pcre_exec()   returns  the  error
2638         pcre_exec() returns  the  error  PCRE_ERROR_BADUTF8  or,  if  PCRE_PAR-         PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
2639         TIAL_HARD  is set and the problem is a truncated UTF-8 character at the         truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
2640         end of the subject, PCRE_ERROR_SHORTUTF8. In  both  cases,  information         both cases, information about the precise nature of the error may  also
2641         about  the  precise  nature  of the error may also be returned (see the         be  returned (see the descriptions of these errors in the section enti-
2642         descriptions of these errors in the section entitled Error return  val-         tled Error return values from pcre_exec() below).  If startoffset  con-
2643         ues from pcre_exec() below).  If startoffset contains a value that does         tains a value that does not point to the start of a UTF-8 character (or
2644         not point to the start of a UTF-8 character (or to the end of the  sub-         to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
2645         ject), PCRE_ERROR_BADUTF8_OFFSET is returned.  
2646           If you already know that your subject is valid, and you  want  to  skip
2647         If  you  already  know that your subject is valid, and you want to skip         these    checks    for   performance   reasons,   you   can   set   the
2648         these   checks   for   performance   reasons,   you   can    set    the         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
2649         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to         do  this  for the second and subsequent calls to pcre_exec() if you are
2650         do this for the second and subsequent calls to pcre_exec() if  you  are         making repeated calls to find all  the  matches  in  a  single  subject
2651         making  repeated  calls  to  find  all  the matches in a single subject         string.  However,  you  should  be  sure  that the value of startoffset
2652         string. However, you should be  sure  that  the  value  of  startoffset         points to the start of a character (or the end of  the  subject).  When
2653         points  to  the start of a UTF-8 character (or the end of the subject).         PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
2654         When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid  UTF-8         subject or an invalid value of startoffset is undefined.  Your  program
2655         string  as  a  subject or an invalid value of startoffset is undefined.         may crash.
        Your program may crash.  
2656    
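When PCRE_NO_UTF8_CHECK is set, it is the caller's job to make sure that startoffset is valid. The sketch below shows one way to test it. The function name is hypothetical, and the subject is assumed to be valid UTF-8, in which every continuation byte has the bit pattern 10xxxxxx.

```c
/* Returns 1 if offset is the start of a UTF-8 character or the end of
   the subject, 0 otherwise. Assumes the subject is valid UTF-8. */
int is_utf8_char_boundary(const unsigned char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;                   /* outside the subject altogether */
    if (offset == length)
        return 1;                   /* end of subject is permitted    */
    /* A continuation byte cannot start a character. */
    return (subject[offset] & 0xC0) != 0x80;
}
```

The helper checks the offset only. It does not validate the subject string itself.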
2657           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2658           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
2659    
2660         These options turn on the partial matching feature. For backwards  com-         These  options turn on the partial matching feature. For backwards com-
2661         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial         patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A  partial
2662         match occurs if the end of the subject string is reached  successfully,         match  occurs if the end of the subject string is reached successfully,
2663         but  there  are not enough subject characters to complete the match. If         but there are not enough subject characters to complete the  match.  If
2664         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
2665         matching  continues  by  testing any remaining alternatives. Only if no         matching continues by testing any remaining alternatives.  Only  if  no
2666         complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of         complete  match  can be found is PCRE_ERROR_PARTIAL returned instead of
2667         PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the         PCRE_ERROR_NOMATCH. In other words,  PCRE_PARTIAL_SOFT  says  that  the
2668         caller is prepared to handle a partial match, but only if  no  complete         caller  is  prepared to handle a partial match, but only if no complete
2669         match can be found.         match can be found.
2670    
2671         If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this         If PCRE_PARTIAL_HARD is set, it overrides  PCRE_PARTIAL_SOFT.  In  this
2672         case, if a partial match  is  found,  pcre_exec()  immediately  returns         case,  if  a  partial  match  is found, pcre_exec() immediately returns
2673         PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In         PCRE_ERROR_PARTIAL, without  considering  any  other  alternatives.  In
2674         other words, when PCRE_PARTIAL_HARD is set, a partial match is  consid-         other  words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
2675         ered to be more important than an alternative complete match.         ered to be more important than an alternative complete match.
2676    
2677         In  both  cases,  the portion of the string that was inspected when the         In both cases, the portion of the string that was  inspected  when  the
2678         partial match was found is set as the first matching string. There is a         partial match was found is set as the first matching string. There is a
2679         more  detailed  discussion  of partial and multi-segment matching, with         more detailed discussion of partial and  multi-segment  matching,  with
2680         examples, in the pcrepartial documentation.         examples, in the pcrepartial documentation.
2681    
2682     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2683    
2684         The subject string is passed to pcre_exec() as a pointer in subject,  a         The  subject string is passed to pcre_exec() as a pointer in subject, a
2685         length (in bytes) in length, and a starting byte offset in startoffset.         length in bytes in length, and a starting byte offset  in  startoffset.
2686         If this is  negative  or  greater  than  the  length  of  the  subject,         If  this  is  negative  or  greater  than  the  length  of the subject,
2687         pcre_exec()  returns  PCRE_ERROR_BADOFFSET. When the starting offset is         pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting  offset  is
2688         zero, the search for a match starts at the beginning  of  the  subject,         zero,  the  search  for a match starts at the beginning of the subject,
2689         and this is by far the most common case. In UTF-8 mode, the byte offset         and this is by far the most common case. In UTF-8 mode, the byte offset
2690         must point to the start of a UTF-8 character (or the end  of  the  sub-         must  point  to  the start of a UTF-8 character (or the end of the sub-
2691         ject).  Unlike  the pattern string, the subject may contain binary zero         ject). Unlike the pattern string, the subject may contain  binary  zero
2692         bytes.         bytes.
2693    
2694         A non-zero starting offset is useful when searching for  another  match         A  non-zero  starting offset is useful when searching for another match
2695         in  the same subject by calling pcre_exec() again after a previous suc-         in the same subject by calling pcre_exec() again after a previous  suc-
2696         cess.  Setting startoffset differs from just passing over  a  shortened         cess.   Setting  startoffset differs from just passing over a shortened
2697         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
2698         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
2699    
2700           \Biss\B           \Biss\B
2701    
2702         which finds occurrences of "iss" in the middle of  words.  (\B  matches         which  finds  occurrences  of "iss" in the middle of words. (\B matches
2703         only  if  the  current position in the subject is not a word boundary.)         only if the current position in the subject is not  a  word  boundary.)
2704         When applied to the string "Mississippi" the first call to pcre_exec()         When applied to the string "Mississippi" the first call to pcre_exec()
2705         finds the first occurrence. If pcre_exec() is called again with just         finds the first occurrence. If pcre_exec() is called again with just
2706         the remainder of the subject, namely "issippi", it does not match,         the remainder of the subject, namely "issippi", it does not match,
2707         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
2708         to be a word boundary. However, if pcre_exec()  is  passed  the  entire         to  be  a  word  boundary. However, if pcre_exec() is passed the entire
2709         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
2710         rence of "iss" because it is able to look behind the starting point  to         rence  of "iss" because it is able to look behind the starting point to
2711         discover that it is preceded by a letter.         discover that it is preceded by a letter.
2712    
2713         Finding  all  the  matches  in a subject is tricky when the pattern can         Finding all the matches in a subject is tricky  when  the  pattern  can
2714         match an empty string. It is possible to emulate Perl's /g behaviour by         match an empty string. It is possible to emulate Perl's /g behaviour by
2715         first   trying   the   match   again  at  the  same  offset,  with  the         first  trying  the  match  again  at  the   same   offset,   with   the
2716         PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED  options,  and  then  if  that         PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED  options,  and  then  if that
2717         fails,  advancing  the  starting  offset  and  trying an ordinary match         fails, advancing the starting  offset  and  trying  an  ordinary  match
2718         again. There is some code that demonstrates how to do this in the pcre-         again. There is some code that demonstrates how to do this in the pcre-
2719         demo sample program. In the most general case, you have to check to see         demo sample program. In the most general case, you have to check to see
2720         if the newline convention recognizes CRLF as a newline, and if so,  and         if  the newline convention recognizes CRLF as a newline, and if so, and
2721         the current character is CR followed by LF, advance the starting offset         the current character is CR followed by LF, advance the starting offset
2722         by two characters instead of one.         by two characters instead of one.
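The offset-advancing step described above can be sketched as a small helper. This is only an illustration: the function name and its parameters are hypothetical and not part of the PCRE API, and the caller is assumed to have already worked out whether the newline convention in force recognizes CRLF and whether the pattern was compiled in UTF-8 mode. The UTF-8 branch, which skips continuation bytes so that the new offset lands on a character boundary, is an assumption added here to satisfy the UTF-8 offset requirement stated earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Call it after an empty match
   at `offset` when the retry at the same offset (with
   PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED) has failed. It returns the
   byte offset at which the next ordinary match attempt should start.
   The caller is expected to stop looping once offset == length. */
int advance_after_empty_match(const char *subject, int length, int offset,
                              int crlf_is_newline, int utf8)
{
    assert(subject != NULL && offset >= 0 && offset < length);

    /* Treat CR LF as one unit when the newline convention recognizes it. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;

    if (utf8) {
        /* Skip UTF-8 continuation bytes (bit pattern 10xxxxxx) so that
           the new offset points at the start of a character. */
        while (offset < length &&
               ((unsigned char)subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

With the subject "a\r\nb" and an empty match at offset 1, the helper returns 3 when CRLF counts as a newline and 2 otherwise.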
2723    
2724         If a non-zero starting offset is passed when the pattern  is  anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
2725         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
2726         if the pattern does not require the match to be at  the  start  of  the         if  the  pattern  does  not require the match to be at the start of the
2727         subject.         subject.
2728    
2729     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
2730    
2731         In  general, a pattern matches a certain portion of the subject, and in         In general, a pattern matches a certain portion of the subject, and  in
2732         addition, further substrings from the subject  may  be  picked  out  by         addition,  further  substrings  from  the  subject may be picked out by
2733         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
2734         this is called "capturing" in what follows, and the  phrase  "capturing         this  is  called "capturing" in what follows, and the phrase "capturing
2735         subpattern"  is  used for a fragment of a pattern that picks out a sub-         subpattern" is used for a fragment of a pattern that picks out  a  sub-
2736         string. PCRE supports several other kinds of  parenthesized  subpattern         string.  PCRE  supports several other kinds of parenthesized subpattern
2737         that do not cause substrings to be captured.         that do not cause substrings to be captured.
2738    
2739         Captured substrings are returned to the caller via a vector of integers         Captured substrings are returned to the caller via a vector of integers
2740         whose address is passed in ovector. The number of elements in the  vec-         whose  address is passed in ovector. The number of elements in the vec-
2741         tor  is  passed in ovecsize, which must be a non-negative number. Note:         tor is passed in ovecsize, which must be a non-negative  number.  Note:
2742         this argument is NOT the size of ovector in bytes.         this argument is NOT the size of ovector in bytes.
2743    
2744         The first two-thirds of the vector is used to pass back  captured  sub-         The  first  two-thirds of the vector is used to pass back captured sub-
2745         strings,  each  substring using a pair of integers. The remaining third         strings, each substring using a pair of integers. The  remaining  third
2746         of the vector is used as workspace by pcre_exec() while  matching  cap-         of  the  vector is used as workspace by pcre_exec() while matching cap-
2747         turing  subpatterns, and is not available for passing back information.         turing subpatterns, and is not available for passing back  information.
2748         The number passed in ovecsize should always be a multiple of three.  If         The  number passed in ovecsize should always be a multiple of three. If
2749         it is not, it is rounded down.         it is not, it is rounded down.
2750    
2751         When  a  match  is successful, information about captured substrings is         When a match is successful, information about  captured  substrings  is
2752         returned in pairs of integers, starting at the  beginning  of  ovector,         returned  in  pairs  of integers, starting at the beginning of ovector,
2753         and  continuing  up  to two-thirds of its length at the most. The first         and continuing up to two-thirds of its length at the  most.  The  first
2754         element of each pair is set to the byte offset of the  first  character         element  of  each pair is set to the byte offset of the first character
2755         in  a  substring, and the second is set to the byte offset of the first         in a substring, and the second is set to the byte offset of  the  first
2756         character after the end of a substring. Note: these values  are  always         character  after  the end of a substring. Note: these values are always
2757         byte offsets, even in UTF-8 mode. They are not character counts.         byte offsets, even in UTF-8 mode. They are not character counts.
2758    
2759         The  first  pair  of  integers, ovector[0] and ovector[1], identify the         The first pair of integers, ovector[0]  and  ovector[1],  identify  the
2760         portion of the subject string matched by the entire pattern.  The  next         portion  of  the subject string matched by the entire pattern. The next
2761         pair  is  used for the first capturing subpattern, and so on. The value         pair is used for the first capturing subpattern, and so on.  The  value
2762         returned by pcre_exec() is one more than the highest numbered pair that         returned by pcre_exec() is one more than the highest numbered pair that
2763         has  been  set.  For example, if two substrings have been captured, the         has been set.  For example, if two substrings have been  captured,  the
2764         returned value is 3. If there are no capturing subpatterns, the  return         returned  value is 3. If there are no capturing subpatterns, the return
2765         value from a successful match is 1, indicating that just the first pair         value from a successful match is 1, indicating that just the first pair
2766         of offsets has been set.         of offsets has been set.
2767    
2768         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
2769         of the string that it matched that is returned.         of the string that it matched that is returned.
2770    
2771         If  the vector is too small to hold all the captured substring offsets,         If the vector is too small to hold all the captured substring  offsets,
2772         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
2773         function  returns a value of zero. If neither the actual string matched         function returns a value of zero. If neither the actual string  matched
2774         nor any captured substrings are of interest, pcre_exec() may be called         nor any captured substrings are of interest, pcre_exec() may be called
2775         with  ovector passed as NULL and ovecsize as zero. However, if the pat-         with ovector passed as NULL and ovecsize as zero. However, if the  pat-
2776         tern contains back references and the ovector  is  not  big  enough  to         tern  contains  back  references  and  the ovector is not big enough to
2777         remember  the related substrings, PCRE has to get additional memory for         remember the related substrings, PCRE has to get additional memory  for
2778         use during matching. Thus it is usually advisable to supply an  ovector         use  during matching. Thus it is usually advisable to supply an ovector
2779         of reasonable size.         of reasonable size.
2780    
2781         There  are  some  cases where zero is returned (indicating vector over-         There are some cases where zero is returned  (indicating  vector  over-
2782         flow) when in fact the vector is exactly the right size for  the  final         flow)  when  in fact the vector is exactly the right size for the final
2783         match. For example, consider the pattern         match. For example, consider the pattern
2784    
2785           (a)(?:(b)c|bd)           (a)(?:(b)c|bd)
2786    
2787         If  a  vector of 6 elements (allowing for only 1 captured substring) is         If a vector of 6 elements (allowing for only 1 captured  substring)  is
2788         given with subject string "abd", pcre_exec() will try to set the second         given with subject string "abd", pcre_exec() will try to set the second
2789         captured string, thereby recording a vector overflow, before failing to         captured string, thereby recording a vector overflow, before failing to
2790         match "c" and backing up  to  try  the  second  alternative.  The  zero         match  "c"  and  backing  up  to  try  the second alternative. The zero
2791         return,  however,  does  correctly  indicate that the maximum number of         return, however, does correctly indicate that  the  maximum  number  of
2792         slots (namely 2) have been filled. In similar cases where there is tem-         slots (namely 2) have been filled. In similar cases where there is tem-
2793         porary  overflow,  but  the final number of used slots is actually less         porary overflow, but the final number of used slots  is  actually  less
2794         than the maximum, a non-zero value is returned.         than the maximum, a non-zero value is returned.
2795    
2796         The pcre_fullinfo() function can be used to find out how many capturing         The pcre_fullinfo() function can be used to find out how many capturing
2797         subpatterns  there  are  in  a  compiled pattern. The smallest size for         subpatterns there are in a compiled  pattern.  The  smallest  size  for
2798         ovector that will allow for n captured substrings, in addition  to  the         ovector  that  will allow for n captured substrings, in addition to the
2799         offsets of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
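The sizing rules above reduce to two pieces of integer arithmetic. The sketch below uses hypothetical helper names (neither is a PCRE function). In real code the capture count n would come from pcre_fullinfo(); here it is passed in directly so that the example does not depend on the library.

```c
#include <assert.h>

/* Hypothetical helper: smallest ovecsize that leaves room for the whole
   match plus n captured substrings, following the (n+1)*3 rule. */
int ovecsize_for_captures(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Hypothetical helper: number of offset pairs that pcre_exec() can hand
   back for a given ovecsize. Only the first two-thirds of the vector
   holds pairs, and a size that is not a multiple of three is rounded
   down, so integer division by three gives the answer. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For example, a pattern with two capturing subpatterns needs an ovecsize of at least 9, and a vector of 6 elements provides 2 pairs: one for the whole match and one for a single captured substring.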
2800    
2801         It  is  possible for capturing subpattern number n+1 to match some part         It is possible for capturing subpattern number n+1 to match  some  part
2802         of the subject when subpattern n has not been used at all. For example,         of the subject when subpattern n has not been used at all. For example,
2803         if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the         if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
2804         return from the function is 4, and subpatterns 1 and 3 are matched, but         return from the function is 4, and subpatterns 1 and 3 are matched, but
2805         2  is  not.  When  this happens, both values in the offset pairs corre-         2 is not. When this happens, both values in  the  offset  pairs  corre-
2806         sponding to unused subpatterns are set to -1.         sponding to unused subpatterns are set to -1.
2807    
2808         Offset values that correspond to unused subpatterns at the end  of  the         Offset  values  that correspond to unused subpatterns at the end of the
2809         expression  are  also  set  to  -1. For example, if the string "abc" is         expression are also set to -1. For example,  if  the  string  "abc"  is
2810         matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
2811         matched.  The  return  from the function is 2, because the highest used         matched. The return from the function is 2, because  the  highest  used
2812         capturing subpattern number is 1, and the offsets for the second         capturing subpattern number is 1, and the offsets for the second
2813         and  third  capturing subpatterns (assuming the vector is large enough,         and third capturing subpatterns (assuming the vector is  large  enough,
2814         of course) are set to -1.         of course) are set to -1.
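A caller reading ovector therefore has to check both the return value and the -1 markers before using a pair. The following is a minimal sketch under stated assumptions: capture_span() is a hypothetical helper, not one of the PCRE convenience functions, and rc is assumed to be a positive return value from pcre_exec(). A return of zero (vector overflow) would have to be replaced by ovecsize/3 before calling it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: locate capture number i (0 is the whole match).
   Returns the length in bytes and stores a pointer to its first byte in
   *start, or returns -1 and stores NULL if the subpattern was not used. */
int capture_span(const char *subject, const int *ovector, int rc, int i,
                 const char **start)
{
    assert(subject != NULL && ovector != NULL && start != NULL && i >= 0);

    /* Pairs numbered rc or higher were not reported as set, and a pair
       below rc may still be unset, in which case both values are -1. */
    if (i >= rc || ovector[2 * i] < 0) {
        *start = NULL;
        return -1;
    }
    *start = subject + ovector[2 * i];
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

For "abc" matched against (a|(z))(bc), the pairs are {0,3}, {0,1}, {-1,-1}, {1,3} with a return value of 4, so the helper reports capture 2 as unset and capture 3 as the two bytes "bc".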
2815    
2816         Note: Elements in the first two-thirds of ovector that  do  not  corre-         Note:  Elements  in  the first two-thirds of ovector that do not corre-
2817         spond  to  capturing parentheses in the pattern are never changed. That         spond to capturing parentheses in the pattern are never  changed.  That
2818         is, if a pattern contains n capturing parentheses, no more  than  ovec-         is,  if  a pattern contains n capturing parentheses, no more than ovec-
2819         tor[0]  to ovector[2n+1] are set by pcre_exec(). The other elements (in         tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements  (in
2820         the first two-thirds) retain whatever values they previously had.         the first two-thirds) retain whatever values they previously had.
2821    
2822         Some convenience functions are provided  for  extracting  the  captured         Some  convenience  functions  are  provided for extracting the captured
2823         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
2824    
2825     Error return values from pcre_exec()     Error return values from pcre_exec()
2826    
2827         If  pcre_exec()  fails, it returns a negative number. The following are         If pcre_exec() fails, it returns a negative number. The  following  are
2828         defined in the header file:         defined in the header file:
2829    
2830           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 2420  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2833  MATCHING A PATTERN: THE TRADITIONAL FUNC
2833    
2834           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
2835    
2836         Either code or subject was passed as NULL,  or  ovector  was  NULL  and         Either  code  or  subject  was  passed as NULL, or ovector was NULL and
2837         ovecsize was not zero.         ovecsize was not zero.
2838    
2839           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 2429  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2842  MATCHING A PATTERN: THE TRADITIONAL FUNC
2842    
2843           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
2844    
2845         PCRE  stores a 4-byte "magic number" at the start of the compiled code,         PCRE stores a 4-byte "magic number" at the start of the compiled  code,
2846         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
2847         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
2848         an environment with the other endianness. This is the error  that  PCRE         an  environment  with the other endianness. This is the error that PCRE
2849         gives when the magic number is not present.         gives when the magic number is not present.
2850    
2851           PCRE_ERROR_UNKNOWN_OPCODE (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
2852    
2853         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
2854         compiled pattern. This error could be caused by a bug  in  PCRE  or  by         compiled  pattern.  This  error  could be caused by a bug in PCRE or by
2855         overwriting of the compiled pattern.         overwriting of the compiled pattern.
2856    
2857           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2858    
2859         If  a  pattern contains back references, but the ovector that is passed         If a pattern contains back references, but the ovector that  is  passed
2860         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
2861         PCRE  gets  a  block of memory at the start of matching to use for this         PCRE gets a block of memory at the start of matching to  use  for  this
2862         purpose. If the call via pcre_malloc() fails, this error is given.  The         purpose.  If the call via pcre_malloc() fails, this error is given. The
2863         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
2864    
2865         This  error  is also given if pcre_stack_malloc() fails in pcre_exec().         This error is also given if pcre_stack_malloc() fails  in  pcre_exec().
2866         This can happen only when PCRE has been compiled with  --disable-stack-         This  can happen only when PCRE has been compiled with --disable-stack-
2867         for-recursion.         for-recursion.
2868    
2869           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2870    
2871         This  error is used by the pcre_copy_substring(), pcre_get_substring(),         This error is used by the pcre_copy_substring(),  pcre_get_substring(),
2872         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
2873         returned by pcre_exec().         returned by pcre_exec().
2874    
2875           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
2876    
2877         The  backtracking  limit,  as  specified  by the match_limit field in a         The backtracking limit, as specified by  the  match_limit  field  in  a
2878         pcre_extra structure (or defaulted) was reached.  See  the  description         pcre_extra  structure  (or  defaulted) was reached. See the description
2879         above.         above.
2880    
2881           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
2882    
2883         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
2884         use by callout functions that want to yield a distinctive  error  code.         use  by  callout functions that want to yield a distinctive error code.
2885         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
2886    
2887           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
2888    
2889         A  string  that contains an invalid UTF-8 byte sequence was passed as a         A string that contains an invalid UTF-8 byte sequence was passed  as  a
2890         subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size  of         subject,  and the PCRE_NO_UTF8_CHECK option was not set. If the size of
2891         the  output  vector  (ovecsize)  is  at least 2, the byte offset to the         the output vector (ovecsize) is at least 2,  the  byte  offset  to  the
2892         start of the invalid UTF-8 character is placed in the first element,         start of the invalid UTF-8 character is placed in the first element,
2893         and a reason code is placed in the second element. The reason         and a reason code is placed in the second element. The reason
2894         codes are listed in the following section.  For backward compatibility,         codes are listed in the following section.  For backward compatibility,
2895         if  PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-         if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8  char-
2896         acter  at  the  end  of  the   subject   (reason   codes   1   to   5),         acter   at   the   end   of   the   subject  (reason  codes  1  to  5),
2897         PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.         PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
2898    
2899           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
2900    
2901         The  UTF-8  byte  sequence that was passed as a subject was checked and         The UTF-8 byte sequence that was passed as a subject  was  checked  and
2902         found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but  the         found  to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
2903         value  of startoffset did not point to the beginning of a UTF-8 charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
2904         ter or the end of the subject.         ter or the end of the subject.
2905    
2906           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
2907    
2908         The subject string did not match, but it did match partially.  See  the         The  subject  string did not match, but it did match partially. See the
2909         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
2910    
2911           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2912    
2913         This  code  is  no  longer  in  use.  It was formerly returned when the         This code is no longer in  use.  It  was  formerly  returned  when  the
2914         PCRE_PARTIAL option was used with a compiled pattern  containing  items         PCRE_PARTIAL  option  was used with a compiled pattern containing items
2915         that  were  not  supported  for  partial  matching.  From  release 8.00         that were  not  supported  for  partial  matching.  From  release  8.00
2916         onwards, there are no restrictions on partial matching.         onwards, there are no restrictions on partial matching.
2917    
2918           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2919    
2920         An unexpected internal error has occurred. This error could  be  caused         An  unexpected  internal error has occurred. This error could be caused
2921         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2922    
2923           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
# Line 2514  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2927  MATCHING A PATTERN: THE TRADITIONAL FUNC
2927           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
2928    
2929         The internal recursion limit, as specified by the match_limit_recursion         The internal recursion limit, as specified by the match_limit_recursion
2930         field in a pcre_extra structure (or defaulted)  was  reached.  See  the         field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2931         description above.         description above.
2932    
2933           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
# Line 2528  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2941  MATCHING A PATTERN: THE TRADITIONAL FUNC
2941    
2942           PCRE_ERROR_SHORTUTF8      (-25)           PCRE_ERROR_SHORTUTF8      (-25)
2943    
2944         This error is returned instead of PCRE_ERROR_BADUTF8 when  the  subject         This  error  is returned instead of PCRE_ERROR_BADUTF8 when the subject
2945         string  ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD         string ends with a truncated UTF-8 character and the  PCRE_PARTIAL_HARD
2946         option is set.  Information  about  the  failure  is  returned  as  for         option  is  set.   Information  about  the  failure  is returned as for
2947         PCRE_ERROR_BADUTF8.  It  is in fact sufficient to detect this case, but         PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this  case,  but
2948         this special error code for PCRE_PARTIAL_HARD precedes the  implementa-         this  special error code for PCRE_PARTIAL_HARD precedes the implementa-
2949         tion  of returned information; it is retained for backwards compatibil-         tion of returned information; it is retained for backwards  compatibil-
2950         ity.         ity.
2951    
2952           PCRE_ERROR_RECURSELOOP    (-26)           PCRE_ERROR_RECURSELOOP    (-26)
2953    
2954         This error is returned when pcre_exec() detects a recursion loop within         This error is returned when pcre_exec() detects a recursion loop within
2955         the  pattern. Specifically, it means that either the whole pattern or a         the pattern. Specifically, it means that either the whole pattern or  a
2956         subpattern has been called recursively for the second time at the  same         subpattern  has been called recursively for the second time at the same
2957         position in the subject string. Some simple patterns that might do this         position in the subject string. Some simple patterns that might do this
2958         are detected and faulted at compile time, but more  complicated  cases,         are  detected  and faulted at compile time, but more complicated cases,
2959         in particular mutual recursions between two different subpatterns, can-         in particular mutual recursions between two different subpatterns, can-
2960         not be detected until run time.         not be detected until run time.
2961    
2962           PCRE_ERROR_JIT_STACKLIMIT (-27)           PCRE_ERROR_JIT_STACKLIMIT (-27)
2963    
2964         This error is returned when a pattern  that  was  successfully  studied         This  error  is  returned  when a pattern that was successfully studied
2965         using  the PCRE_STUDY_JIT_COMPILE option is being matched, but the mem-         using the PCRE_STUDY_JIT_COMPILE option is being matched, but the  mem-
2966         ory available for  the  just-in-time  processing  stack  is  not  large         ory  available  for  the  just-in-time  processing  stack  is not large
2967         enough. See the pcrejit documentation for more details.         enough. See the pcrejit documentation for more details.
2968    
2969             PCRE_ERROR_BADMODE (-28)
2970    
2971           This error is given if a pattern that was compiled by the 8-bit library
2972           is passed to a 16-bit library function, or vice versa.
2973    
2974             PCRE_ERROR_BADENDIANNESS (-29)
2975    
2976           This  error  is  given  if  a  pattern  that  was compiled and saved is
2977           reloaded on a host with  different  endianness.  The  utility  function
2978           pcre_pattern_to_host_byte_order() can be used to convert such a pattern
2979           so that it runs on the new host.
2980    
2981         Error numbers -16 to -20 and -22 are not used by pcre_exec().         Error numbers -16 to -20 and -22 are not used by pcre_exec().
2982    
2983     Reason codes for invalid UTF-8 strings     Reason codes for invalid UTF-8 strings
2984    
2985           This section applies only  to  the  8-bit  library.  The  corresponding
2986           information for the 16-bit library is given in the pcre16 page.
2987    
2988         When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-         When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
2989         UTF8, and the size of the output vector (ovecsize) is at least  2,  the         UTF8, and the size of the output vector (ovecsize) is at least  2,  the
2990         offset  of  the  start  of the invalid UTF-8 character is placed in the         offset  of  the  start  of the invalid UTF-8 character is placed in the
# Line 2991  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 3419  MATCHING A PATTERN: THE ALTERNATIVE FUNC
3419    
3420  SEE ALSO  SEE ALSO
3421    
3422         pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),  pcrepar-         pcre16(3),  pcrebuild(3),  pcrecallout(3),  pcrecpp(3),   pcrematch-
3423         tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).         ing(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
3424           pcrestack(3).
3425    
3426    
3427  AUTHOR  AUTHOR
# Line 3004  AUTHOR Line 3433  AUTHOR
3433    
3434  REVISION  REVISION
3435    
3436         Last updated: 02 December 2011         Last updated: 07 January 2012
3437         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
3438  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3439    
3440    
3441  PCRECALLOUT(3)                                                  PCRECALLOUT(3)  PCRECALLOUT(3)                                                  PCRECALLOUT(3)
3442    
3443    
# Line 3020  PCRE CALLOUTS Line 3449  PCRE CALLOUTS
3449    
3450         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
3451    
3452           int (*pcre16_callout)(pcre16_callout_block *);
3453    
3454         PCRE provides a feature called "callout", which is a means of temporar-         PCRE provides a feature called "callout", which is a means of temporar-
3455         ily passing control to the caller of PCRE  in  the  middle  of  pattern         ily passing control to the caller of PCRE  in  the  middle  of  pattern
3456         matching.  The  caller of PCRE provides an external function by putting         matching.  The  caller of PCRE provides an external function by putting
3457         its entry point in the global variable pcre_callout. By  default,  this         its entry point in the global variable pcre_callout (pcre16_callout for
3458         variable contains NULL, which disables all calling out.         the  16-bit  library).  By  default, this variable contains NULL, which
3459           disables all calling out.
3460    
3461         Within  a  regular  expression,  (?C) indicates the points at which the         Within a regular expression, (?C) indicates the  points  at  which  the
3462         external function is to be called.  Different  callout  points  can  be         external  function  is  to  be  called. Different callout points can be
3463         identified  by  putting  a number less than 256 after the letter C. The         identified by putting a number less than 256 after the  letter  C.  The
3464         default value is zero.  For  example,  this  pattern  has  two  callout         default  value  is  zero.   For  example,  this pattern has two callout
3465         points:         points:
3466    
3467           (?C1)abc(?C2)def           (?C1)abc(?C2)def
3468    
3469         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() or         If the PCRE_AUTO_CALLOUT option bit is set when a pattern is  compiled,
3470         pcre_compile2() is called, PCRE  automatically  inserts  callouts,  all         PCRE  automatically  inserts callouts, all with number 255, before each
3471         with  number  255,  before  each  item  in the pattern. For example, if         item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
3472         PCRE_AUTO_CALLOUT is used with the pattern         pattern
3473    
3474           A(\d{2}|--)           A(\d{2}|--)
3475    
# Line 3045  PCRE CALLOUTS Line 3477  PCRE CALLOUTS
3477    
3478         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
3479    
3480         Notice that there is a callout before and after  each  parenthesis  and         Notice  that  there  is a callout before and after each parenthesis and
3481         alternation  bar.  Automatic  callouts  can  be  used  for tracking the         alternation bar. Automatic  callouts  can  be  used  for  tracking  the
3482         progress of pattern matching. The pcretest command has an  option  that         progress  of  pattern matching. The pcretest command has an option that
3483         sets  automatic callouts; when it is used, the output indicates how the         sets automatic callouts; when it is used, the output indicates how  the
3484         pattern is matched. This is useful information when you are  trying  to         pattern  is  matched. This is useful information when you are trying to
3485         optimize the performance of a particular pattern.         optimize the performance of a particular pattern.
3486    
3487         The  use  of callouts in a pattern makes it ineligible for optimization         The use of callouts in a pattern makes it ineligible  for  optimization
3488         by  the  just-in-time  compiler.  Studying  such  a  pattern  with  the         by  the  just-in-time  compiler.  Studying  such  a  pattern  with  the
3489         PCRE_STUDY_JIT_COMPILE option always fails.         PCRE_STUDY_JIT_COMPILE option always fails.
3490    
3491    
3492  MISSING CALLOUTS  MISSING CALLOUTS
3493    
3494         You  should  be  aware  that,  because of optimizations in the way PCRE         You should be aware that, because of  optimizations  in  the  way  PCRE
3495         matches patterns by default, callouts  sometimes  do  not  happen.  For         matches  patterns  by  default,  callouts  sometimes do not happen. For
3496         example, if the pattern is         example, if the pattern is
3497    
3498           ab(?C4)cd           ab(?C4)cd
3499    
3500         PCRE knows that any matching string must contain the letter "d". If the         PCRE knows that any matching string must contain the letter "d". If the
3501         subject string is "abyz", the lack of "d" means that  matching  doesn't         subject  string  is "abyz", the lack of "d" means that matching doesn't
3502         ever  start,  and  the  callout is never reached. However, with "abyd",         ever start, and the callout is never  reached.  However,  with  "abyd",
3503         though the result is still no match, the callout is obeyed.         though the result is still no match, the callout is obeyed.
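The required-character check described above can be pictured with a small sketch. This is not PCRE's actual code: the function name demo_worth_matching is hypothetical, and the sketch only assumes that the matcher knows one character (here "d") that every match must contain.

```c
#include <string.h>

/* Hypothetical illustration of a "required character" pre-check.
   Returns 1 if the subject contains the required character, so that a
   match attempt (and therefore any callout) is worth starting, and 0
   if matching can be skipped entirely. */
static int demo_worth_matching(const char *subject, size_t length,
                               char required)
{
    return memchr(subject, required, length) != NULL;
}
```

With the pattern ab(?C4)cd and required character "d", the subject "abyz" fails this check, so the callout is never reached, whereas "abyd" passes it and the callout is obeyed even though the overall result is still no match.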
3504    
3505         If the pattern is studied, PCRE knows the minimum length of a  matching         If  the pattern is studied, PCRE knows the minimum length of a matching
3506         string,  and will immediately give a "no match" return without actually         string, and will immediately give a "no match" return without  actually
3507         running a match if the subject is not long enough, or,  for  unanchored         running  a  match if the subject is not long enough, or, for unanchored
3508         patterns, if it has been scanned far enough.         patterns, if it has been scanned far enough.
3509    
3510         You  can disable these optimizations by passing the PCRE_NO_START_OPTI-         You can disable these optimizations by passing the  PCRE_NO_START_OPTI-
3511         MIZE option to pcre_compile(), pcre_exec(), or pcre_dfa_exec(),  or  by         MIZE  option  to the matching function, or by starting the pattern with
3512         starting the pattern with (*NO_START_OPT). This slows down the matching         (*NO_START_OPT). This slows down the matching process, but does  ensure
3513         process, but does ensure that callouts such as the  example  above  are         that callouts such as the example above are obeyed.
        obeyed.  
3514    
3515    
3516  THE CALLOUT INTERFACE  THE CALLOUT INTERFACE
3517    
3518         During  matching, when PCRE reaches a callout point, the external func-         During  matching, when PCRE reaches a callout point, the external func-
3519         tion defined by pcre_callout is called (if it is set). This applies  to         tion defined by pcre_callout or pcre16_callout  is  called  (if  it  is
3520         both  the  pcre_exec()  and the pcre_dfa_exec() matching functions. The         set).   This applies to both normal and DFA matching. The only argument
3521         only argument to the callout function is a pointer  to  a  pcre_callout         to the callout function is a pointer to a pcre_callout or  pcre16_call-
3522         block. This structure contains the following fields:         out block.  These structures contain the following fields:
3523    
3524           int         version;           int           version;
3525           int         callout_number;           int           callout_number;
3526           int        *offset_vector;           int          *offset_vector;
3527           const char *subject;           const char   *subject;           (8-bit version)
3528           int         subject_length;           PCRE_SPTR16   subject;           (16-bit version)
3529           int         start_match;           int           subject_length;
3530           int         current_position;           int           start_match;
3531           int         capture_top;           int           current_position;
3532           int         capture_last;           int           capture_top;
3533           void       *callout_data;           int           capture_last;
3534           int         pattern_position;           void         *callout_data;
3535           int         next_item_length;           int           pattern_position;
3536           const unsigned char *mark;           int           next_item_length;
3537             const unsigned char *mark;       (8-bit version)
3538             const PCRE_UCHAR16  *mark;       (16-bit version)
3539    
3540         The  version  field  is an integer containing the version number of the         The  version  field  is an integer containing the version number of the
3541         block format. The initial version was 0; the current version is 2.  The         block format. The initial version was 0; the current version is 2.  The
# Line 3114  THE CALLOUT INTERFACE Line 3547  THE CALLOUT INTERFACE
3547         outs, and 255 for automatically generated callouts).         outs, and 255 for automatically generated callouts).
3548    
3549         The offset_vector field is a pointer to the vector of offsets that  was         The offset_vector field is a pointer to the vector of offsets that  was
3550         passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When         passed  by  the  caller  to  the matching function. When pcre_exec() or
3551         pcre_exec() is used, the contents can be inspected in order to  extract         pcre16_exec() is used, the contents  can  be  inspected,  in  order  to
3552         substrings  that  have  been  matched  so  far,  in the same way as for         extract  substrings  that  have been matched so far, in the same way as
3553         extracting substrings after a match has completed. For  pcre_dfa_exec()         for extracting substrings after a match  has  completed.  For  the  DFA
3554         this field is not useful.         matching functions, this field is not useful.
3555    
3556         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
3557         were passed to pcre_exec().         were passed to the matching function.
3558    
3559         The start_match field normally contains the offset within  the  subject         The start_match field normally contains the offset within  the  subject
3560         at  which  the  current  match  attempt started. However, if the escape         at  which  the  current  match  attempt started. However, if the escape
# Line 3133  THE CALLOUT INTERFACE Line 3566  THE CALLOUT INTERFACE
3566         The  current_position  field  contains the offset within the subject of         The  current_position  field  contains the offset within the subject of
3567         the current match pointer.         the current match pointer.
3568    
3569         When the pcre_exec() function is used, the capture_top  field  contains         When pcre_exec() or pcre16_exec() is used,  the  capture_top  field
3570         one  more than the number of the highest numbered captured substring so         contains one more than the number of the highest numbered captured sub-
3571         far. If no substrings have been captured, the value of  capture_top  is         string so far. If no substrings have been captured, the value  of  cap-
3572         one.  This  is always the case when pcre_dfa_exec() is used, because it         ture_top  is  one.  This  is always the case when the DFA functions are
3573         does not support captured substrings.         used, because they do not support captured substrings.
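As a minimal sketch of how offset_vector and capture_top might be used together inside an 8-bit callout: the struct below is a hypothetical stand-in that repeats only some of the field names listed above, so that the example compiles without pcre.h. Real code would include pcre.h and receive a pcre_callout_block. The helper name demo_copy_capture is also an assumption. The sketch relies on offsets being stored in pairs (start, end) for each group, with -1 marking a group that has not been set.

```c
#include <string.h>

/* Hypothetical stand-in for part of the 8-bit callout block. */
typedef struct demo_capture_block {
    int        *offset_vector;   /* pairs of start/end offsets */
    const char *subject;         /* the subject string */
    int         subject_length;
    int         capture_top;     /* highest captured group number + 1 */
} demo_capture_block;

/* Copy captured group n (n >= 1) into buf as a zero-terminated string.
   Returns the substring length, or -1 if the group is out of range,
   not yet set, or does not fit in buf. */
static int demo_copy_capture(const demo_capture_block *cb, int n,
                             char *buf, int bufsize)
{
    int start, end, len;

    if (n < 1 || n >= cb->capture_top) return -1;
    start = cb->offset_vector[2 * n];
    end = cb->offset_vector[2 * n + 1];
    if (start < 0 || end < start) return -1;   /* group is unset */
    len = end - start;
    if (len >= bufsize) return -1;             /* no room for the zero */
    memcpy(buf, cb->subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Group 0 is deliberately excluded, because the overall match is not complete while a callout is running.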
3574    
3575         The capture_last field contains the number of the  most  recently  cap-         The capture_last field contains the number of the  most  recently  cap-
3576         tured  substring. If no substrings have been captured, its value is -1.         tured  substring. If no substrings have been captured, its value is -1.
3577         This is always the case when pcre_dfa_exec() is used.         This is always the case for the DFA matching functions.
3578    
3579         The callout_data field contains a value that is passed  to  pcre_exec()         The callout_data field contains a value that is passed  to  a  matching
3580         or  pcre_dfa_exec() specifically so that it can be passed back in call-         function  specifically so that it can be passed back in callouts. It is
3581         outs. It is passed in the pcre_callout field  of  the  pcre_extra  data         passed in the callout_data field of a pcre_extra or  pcre16_extra  data
3582         structure.  If  no such data was passed, the value of callout_data in a         structure.  If  no such data was passed, the value of callout_data in a
3583         pcre_callout block is NULL. There is a description  of  the  pcre_extra         callout block is NULL. There is a description of the pcre_extra  struc-
3584         structure in the pcreapi documentation.         ture in the pcreapi documentation.
3585    
3586         The  pattern_position field is present from version 1 of the pcre_call-         The  pattern_position  field  is  present from version 1 of the callout
3587         out structure. It contains the offset to the next item to be matched in         structure. It contains the offset to the next item to be matched in the
3588         the pattern string.         pattern string.
3589    
3590         The  next_item_length field is present from version 1 of the pcre_call-         The  next_item_length  field  is  present from version 1 of the callout
3591         out structure. It contains the length of the next item to be matched in         structure. It contains the length of the next item to be matched in the
3592         the  pattern  string. When the callout immediately precedes an alterna-         pattern  string.  When  the callout immediately precedes an alternation
3593         tion bar, a closing parenthesis, or the end of the pattern, the  length         bar, a closing parenthesis, or the end of the pattern,  the  length  is
3594         is  zero.  When the callout precedes an opening parenthesis, the length         zero.  When  the callout precedes an opening parenthesis, the length is
3595         is that of the entire subpattern.         that of the entire subpattern.
3596    
3597         The pattern_position and next_item_length fields are intended  to  help         The pattern_position and next_item_length fields are intended  to  help
3598         in  distinguishing between different automatic callouts, which all have         in  distinguishing between different automatic callouts, which all have
3599         the same callout number. However, they are set for all callouts.         the same callout number. However, they are set for all callouts.
3600    
3601         The mark field is present from version 2 of the pcre_callout structure.         The mark field is present from version 2 of the callout  structure.  In
3602         In  callouts  from pcre_exec() it contains a pointer to the zero-termi-         callouts from pcre_exec() or pcre16_exec() it contains a pointer to the
3603         nated name of the most recently passed (*MARK),  (*PRUNE),  or  (*THEN)         zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
3604         item in the match, or NULL if no such items have been passed. Instances         (*THEN)  item  in the match, or NULL if no such items have been passed.
3605         of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a  previous         Instances of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a
3606         (*MARK).  In  callouts  from pcre_dfa_exec() this field always contains         previous  (*MARK).  In  callouts  from  the DFA matching functions this
3607         NULL.         field always contains NULL.
3608    
3609    
3610  RETURN VALUES  RETURN VALUES
# Line 3180  RETURN VALUES Line 3613  RETURN VALUES
3613         is  zero,  matching  proceeds  as  normal. If the value is greater than         is  zero,  matching  proceeds  as  normal. If the value is greater than
3614         zero, matching fails at the current point, but  the  testing  of  other         zero, matching fails at the current point, but  the  testing  of  other
3615         matching possibilities goes ahead, just as if a lookahead assertion had         matching possibilities goes ahead, just as if a lookahead assertion had
3616         failed. If the value is less than zero, the  match  is  abandoned,  and         failed. If the value is less than zero, the  match  is  abandoned,  and
3617         pcre_exec() or pcre_dfa_exec() returns the negative value.         the matching function returns the negative value.
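The three kinds of return value can be illustrated with a sketch of a callout function. To keep it compilable without pcre.h, it uses a hypothetical stand-in struct with two of the field names from the callout block; demo_budget, demo_callout and DEMO_ERROR_CALLOUT are likewise assumptions made for this example. The constant mirrors the value of PCRE_ERROR_CALLOUT (-9), which real code would take from pcre.h.

```c
#include <stddef.h>

/* Hypothetical stand-in for part of the callout block. */
typedef struct demo_callout_block {
    int   callout_number;
    void *callout_data;      /* value supplied by the caller */
} demo_callout_block;

/* Hypothetical per-match state passed through callout_data. */
typedef struct demo_budget {
    int calls_left;          /* callouts allowed before giving up */
    int reject_number;       /* callout number that forces a local fail */
} demo_budget;

#define DEMO_ERROR_CALLOUT (-9)  /* same value as PCRE_ERROR_CALLOUT */

static int demo_callout(demo_callout_block *cb)
{
    demo_budget *b = (demo_budget *)cb->callout_data;

    if (b == NULL) return 0;                 /* no data: proceed as normal */
    if (b->calls_left-- <= 0)
        return DEMO_ERROR_CALLOUT;           /* negative: abandon the match */
    if (cb->callout_number == b->reject_number)
        return 1;                            /* positive: fail here, try
                                                other possibilities */
    return 0;                                /* zero: proceed as normal */
}
```

In a real program the function would take a pcre_callout_block pointer and its address would be stored in pcre_callout before matching.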
3618    
3619         Negative   values   should   normally   be   chosen  from  the  set  of         Negative   values   should   normally   be   chosen  from  the  set  of
3620         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
# Line 3199  AUTHOR Line 3632  AUTHOR
3632    
3633  REVISION  REVISION
3634    
3635         Last updated: 30 November 2011         Last updated: 08 January 2012
3636         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
3637  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3638    
3639    
3640  PCRECOMPAT(3)                                                    PCRECOMPAT(3)  PCRECOMPAT(3)                                                    PCRECOMPAT(3)
3641    
3642    
# Line 3217  DIFFERENCES BETWEEN PCRE AND PERL Line 3650  DIFFERENCES BETWEEN PCRE AND PERL
3650         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
3651         respect to Perl versions 5.10 and above.         respect to Perl versions 5.10 and above.
3652    
3653         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details         1. PCRE has only a subset of Perl's Unicode support. Details of what it
3654         of what it does have are given in the pcreunicode page.         does have are given in the pcreunicode page.
3655    
3656         2. PCRE allows repeat quantifiers only on parenthesized assertions, but         2. PCRE allows repeat quantifiers only on parenthesized assertions, but
3657         they  do  not mean what you might think. For example, (?!a){3} does not         they  do  not mean what you might think. For example, (?!a){3} does not
# Line 3356  DIFFERENCES BETWEEN PCRE AND PERL Line 3789  DIFFERENCES BETWEEN PCRE AND PERL
3789         even on different hosts that have the other endianness.  However,  this         even on different hosts that have the other endianness.  However,  this
3790         does not apply to optimized data created by the just-in-time compiler.         does not apply to optimized data created by the just-in-time compiler.
3791    
3792         (k)  The  alternative  matching function (pcre_dfa_exec()) matches in a         (k)   The   alternative   matching   functions   (pcre_dfa_exec()   and
3793         different way and is not Perl-compatible.         pcre16_dfa_exec()) match in a different way and are  not  Perl-compati-
3794           ble.
3795    
3796         (l) PCRE recognizes some special sequences such as (*CR) at  the  start         (l)  PCRE  recognizes some special sequences such as (*CR) at the start
3797         of a pattern that set overall options that cannot be changed within the         of a pattern that set overall options that cannot be changed within the
3798         pattern.         pattern.
3799    
# Line 3373  AUTHOR Line 3807  AUTHOR
3807    
3808  REVISION  REVISION
3809    
3810         Last updated: 14 November 2011         Last updated: 08 January 2012
3811         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
3812  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3813    
3814    
3815  PCREPATTERN(3)                                                  PCREPATTERN(3)  PCREPATTERN(3)                                                  PCREPATTERN(3)
3816    
3817    
# Line 3403  PCRE REGULAR EXPRESSION DETAILS Line 3837  PCRE REGULAR EXPRESSION DETAILS
3837         intended as reference material.         intended as reference material.
3838    
3839         The original operation of PCRE was on strings of  one-byte  characters.         The original operation of PCRE was on strings of  one-byte  characters.
3840         However,  there is now also support for UTF-8 character strings. To use         However,  there  is  now also support for UTF-8 strings in the original
3841         this, PCRE must be built to include UTF-8 support, and  you  must  call         library, and a second library that supports 16-bit and UTF-16 character
3842         pcre_compile()  or  pcre_compile2() with the PCRE_UTF8 option. There is         strings. To use these features, PCRE must be built to include appropri-
3843         also a special sequence that can be given at the start of a pattern:         ate support. When using UTF strings you must either call the  compiling
3844           function  with  the PCRE_UTF8 or PCRE_UTF16 option, or the pattern must
3845           start with one of these special sequences:
3846    
3847           (*UTF8)           (*UTF8)
3848             (*UTF16)
3849    
3850         Starting a pattern with this sequence  is  equivalent  to  setting  the         Starting a pattern with such a sequence is equivalent  to  setting  the
3851         PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting         relevant option. This feature is not Perl-compatible. How setting a UTF
3852         UTF-8 mode affects pattern matching  is  mentioned  in  several  places         mode affects pattern matching is mentioned  in  several  places  below.
3853         below.  There  is  also  a summary of UTF-8 features in the pcreunicode         There is also a summary of features in the pcreunicode page.
        page.  
3854    
3855         Another special sequence that may appear at the start of a  pattern  or         Another  special  sequence that may appear at the start of a pattern or
3856         in combination with (*UTF8) is:         in combination with (*UTF8) or (*UTF16) is:
3857    
3858           (*UCP)           (*UCP)
3859    
3860         This  has  the  same  effect  as setting the PCRE_UCP option: it causes         This has the same effect as setting  the  PCRE_UCP  option:  it  causes
3861         sequences such as \d and \w to  use  Unicode  properties  to  determine         sequences  such  as  \d  and  \w to use Unicode properties to determine
3862         character types, instead of recognizing only characters with codes less         character types, instead of recognizing only characters with codes less
3863         than 128 via a lookup table.         than 128 via a lookup table.
3864    
3865         If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as         If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
3866         setting the PCRE_NO_START_OPTIMIZE option either at compile or matching         setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
3867         time. There are also some more of these special sequences that are con-         time. There are also some more of these special sequences that are con-
3868         cerned with the handling of newlines; they are described below.         cerned with the handling of newlines; they are described below.
3869    
3870         The  remainder  of  this  document discusses the patterns that are sup-         The remainder of this document discusses the  patterns  that  are  sup-
3871         ported by PCRE when its main matching function, pcre_exec(),  is  used.         ported  by  PCRE  when  one of its main matching functions, pcre_exec()
3872         From   release   6.0,   PCRE   offers   a   second  matching  function,         (8-bit) or pcre16_exec() (16-bit), is used. PCRE also  has  alternative
3873         pcre_dfa_exec(), which matches using a different algorithm that is  not         matching  functions, pcre_dfa_exec() and pcre16_dfa_exec(), which match
3874         Perl-compatible. Some of the features discussed below are not available         using a different algorithm that is not Perl-compatible.  Some  of  the
3875         when pcre_dfa_exec() is used. The advantages and disadvantages  of  the         features  discussed  below are not available when DFA matching is used.
3876         alternative  function, and how it differs from the normal function, are         The advantages and disadvantages of the alternative functions, and  how
3877         discussed in the pcrematching page.         they  differ from the normal functions, are discussed in the pcrematch-
3878           ing page.
3879    
3880    
3881  NEWLINE CONVENTIONS  NEWLINE CONVENTIONS
# Line 3459  NEWLINE CONVENTIONS Line 3896  NEWLINE CONVENTIONS
3896           (*ANYCRLF)   any of the three above           (*ANYCRLF)   any of the three above
3897           (*ANY)       all Unicode newline sequences           (*ANY)       all Unicode newline sequences
3898    
3899         These override the default and the options given to  pcre_compile()  or         These override the default and the options given to the compiling func-
3900         pcre_compile2().  For example, on a Unix system where LF is the default         tion.  For  example,  on  a Unix system where LF is the default newline
3901         newline sequence, the pattern         sequence, the pattern
3902    
3903           (*CR)a.b           (*CR)a.b
3904    
# Line 3491  CHARACTERS AND METACHARACTERS Line 3928  CHARACTERS AND METACHARACTERS
3928    
3929         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
3930         caseless matching is specified (the PCRE_CASELESS option), letters  are         caseless matching is specified (the PCRE_CASELESS option), letters  are
3931         matched  independently  of case. In UTF-8 mode, PCRE always understands         matched  independently  of case. In a UTF mode, PCRE always understands
3932         the concept of case for characters whose values are less than  128,  so         the concept of case for characters whose values are less than  128,  so
3933         caseless  matching  is always possible. For characters with higher val-         caseless  matching  is always possible. For characters with higher val-
3934         ues, the concept of case is supported if PCRE is compiled with  Unicode         ues, the concept of case is supported if PCRE is compiled with  Unicode
3935         property  support,  but  not  otherwise.   If  you want to use caseless         property  support,  but  not  otherwise.   If  you want to use caseless
3936         matching for characters 128 and above, you must  ensure  that  PCRE  is         matching for characters 128 and above, you must  ensure  that  PCRE  is
3937         compiled with Unicode property support as well as with UTF-8 support.         compiled with Unicode property support as well as with UTF support.
3938    
3939         The  power  of  regular  expressions  comes from the ability to include         The  power  of  regular  expressions  comes from the ability to include
3940         alternatives and repetitions in the pattern. These are encoded  in  the         alternatives and repetitions in the pattern. These are encoded  in  the
# Line 3552  BACKSLASH Line 3989  BACKSLASH
3989         that  it stands for itself. In particular, if you want to match a back-         that  it stands for itself. In particular, if you want to match a back-
3990         slash, you write \\.         slash, you write \\.
3991    
3992         In UTF-8 mode, only ASCII numbers and letters have any special  meaning         In a UTF mode, only ASCII numbers and letters have any special  meaning
3993         after  a  backslash.  All  other characters (in particular, those whose         after  a  backslash.  All  other characters (in particular, those whose
3994         codepoints are greater than 127) are treated as literals.         codepoints are greater than 127) are treated as literals.
3995    
# Line 3608  BACKSLASH Line 4045  BACKSLASH
4045         inverted.  Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({         inverted.  Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({
4046         is  7B),  while  \c; becomes hex 7B (; is 3B). If the byte following \c         is  7B),  while  \c; becomes hex 7B (; is 3B). If the byte following \c
4047         has a value greater than 127, a compile-time error occurs.  This  locks         has a value greater than 127, a compile-time error occurs.  This  locks
4048         out  non-ASCII  characters in both byte mode and UTF-8 mode. (When PCRE         out non-ASCII characters in all modes. (When PCRE is compiled in EBCDIC
4049         is compiled in EBCDIC mode, all byte values are  valid.  A  lower  case         mode, all byte values are valid. A lower case letter  is  converted  to
4050         letter is converted to upper case, and then the 0xc0 bits are flipped.)         upper case, and then the 0xc0 bits are flipped.)
4051    
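The \c rule above is simple arithmetic, and the three examples in the text can be checked with a short sketch. This is an illustration in Python, not PCRE code: the function name control_escape is hypothetical, and an ASCII (non-EBCDIC) build is assumed.

```python
def control_escape(ch: str) -> int:
    """Return the character code produced by \\c followed by ch.

    Hypothetical helper, assuming an ASCII (non-EBCDIC) build.
    """
    code = ord(ch)
    if code > 127:
        # The text says this case is a compile-time error.
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= ch <= "z":
        code -= 32  # convert a lower case letter to upper case
    return code ^ 0x40  # invert bit 6 (hex 40)


print(hex(control_escape("z")), hex(control_escape("{")), hex(control_escape(";")))
```

Inverting bit 6, rather than clearing it, is what makes \c; produce hex 7B, a value above the control range.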
4052         By  default,  after  \x,  from  zero to two hexadecimal digits are read         By  default,  after  \x,  from  zero to two hexadecimal digits are read
4053         (letters can be in upper or lower case). Any number of hexadecimal dig-         (letters can be in upper or lower case). Any number of hexadecimal dig-
4054         its  may  appear between \x{ and }, but the value of the character code         its may appear between \x{ and }, but the character code is constrained
4055         must be less than 256 in non-UTF-8 mode, and less than 2**31  in  UTF-8         as follows:
4056         mode.  That is, the maximum value in hexadecimal is 7FFFFFFF. Note that  
4057         this is bigger than the largest Unicode code point, which is 10FFFF.           8-bit non-UTF mode    less than 0x100
4058             8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
4059             16-bit non-UTF mode   less than 0x10000
4060             16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
4061    
4062         If characters other than hexadecimal digits appear between \x{  and  },         Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
4063           called "surrogate" codepoints).
4064    
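The per-mode limits in the new column can be read as a small range check. The sketch below is an illustration only: the names MAX_CODE and hex_escape_allowed are hypothetical, and it assumes that 0x10ffff itself (the largest Unicode code point) is accepted in the UTF modes, although the table says "less than 0x10ffff".

```python
# Largest value accepted between \x{ and }, keyed by (library width, UTF mode).
# Assumption: the UTF limits are inclusive of 0x10FFFF.
MAX_CODE = {
    (8, False): 0xFF,
    (8, True): 0x10FFFF,
    (16, False): 0xFFFF,
    (16, True): 0x10FFFF,
}


def hex_escape_allowed(value: int, width: int = 8, utf: bool = False) -> bool:
    """Return True if value is acceptable in \\x{...} for the given mode."""
    if value < 0 or value > MAX_CODE[(width, utf)]:
        return False
    if utf and 0xD800 <= value <= 0xDFFF:
        return False  # surrogates are not valid Unicode code points
    return True


print(hex_escape_allowed(0x100), hex_escape_allowed(0x100, utf=True))
```

Note that the surrogate range is rejected only in the UTF modes; in 16-bit non-UTF mode the table places no validity requirement on values below 0x10000.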
4065           If  characters  other than hexadecimal digits appear between \x{ and },
4066         or if there is no terminating }, this form of escape is not recognized.         or if there is no terminating }, this form of escape is not recognized.
4067         Instead, the initial \x will be  interpreted  as  a  basic  hexadecimal         Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal
4068         escape,  with  no  following  digits, giving a character whose value is         escape, with no following digits, giving a  character  whose  value  is
4069         zero.         zero.
4070    
4071         If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation  of  \x         If  the  PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
4072         is  as  just described only when it is followed by two hexadecimal dig-         is as just described only when it is followed by two  hexadecimal  dig-
4073         its.  Otherwise, it matches a  literal  "x"  character.  In  JavaScript         its.   Otherwise,  it  matches  a  literal "x" character. In JavaScript
4074         mode, support for code points greater than 256 is provided by \u, which         mode, support for code points greater than 256 is provided by \u, which
4075         must be followed by four hexadecimal digits;  otherwise  it  matches  a         must  be  followed  by  four hexadecimal digits; otherwise it matches a
4076         literal "u" character.         literal "u" character.
4077    
4078         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
4079         two syntaxes for \x (or by \u in JavaScript mode). There is no  differ-         two  syntaxes for \x (or by \u in JavaScript mode). There is no differ-
4080         ence in the way they are handled. For example, \xdc is exactly the same         ence in the way they are handled. For example, \xdc is exactly the same
4081         as \x{dc} (or \u00dc in JavaScript mode).         as \x{dc} (or \u00dc in JavaScript mode).
4082    
4083         After \0 up to two further octal digits are read. If  there  are  fewer         After  \0  up  to two further octal digits are read. If there are fewer
4084         than  two  digits,  just  those  that  are  present  are used. Thus the         than two digits, just  those  that  are  present  are  used.  Thus  the
4085         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
4086         (code  value 7). Make sure you supply two digits after the initial zero         (code value 7). Make sure you supply two digits after the initial  zero
4087         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
4088    
4089         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
4090         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
4091         its as a decimal number. If the number is less than  10,  or  if  there         its  as  a  decimal  number. If the number is less than 10, or if there
4092         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
4093         expression, the entire  sequence  is  taken  as  a  back  reference.  A         expression,  the  entire  sequence  is  taken  as  a  back reference. A
4094         description  of how this works is given later, following the discussion         description of how this works is given later, following the  discussion
4095         of parenthesized subpatterns.         of parenthesized subpatterns.
4096    
4097         Inside a character class, or if the decimal number is  greater  than  9         Inside  a  character  class, or if the decimal number is greater than 9
4098         and  there have not been that many capturing subpatterns, PCRE re-reads         and there have not been that many capturing subpatterns, PCRE  re-reads
4099         up to three octal digits following the backslash, and uses them to gen-         up to three octal digits following the backslash, and uses them to gen-
4100         erate  a data character. Any subsequent digits stand for themselves. In         erate a data character. Any subsequent digits stand for themselves. The
4101         non-UTF-8 mode, the value of a character specified  in  octal  must  be         value  of  the  character  is constrained in the same way as characters
4102         less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For         specified in hexadecimal.  For example:
        example:  
4103    
4104           \040   is another way of writing a space           \040   is another way of writing a space
4105           \40    is the same, provided there are fewer than 40           \40    is the same, provided there are fewer than 40
# Line 3670  BACKSLASH Line 4112  BACKSLASH
4112           \113   might be a back reference, otherwise the           \113   might be a back reference, otherwise the
4113                     character with octal code 113                     character with octal code 113
4114           \377   might be a back reference, otherwise           \377   might be a back reference, otherwise
4115                     the byte consisting entirely of 1 bits                     the value 255 (decimal)
4116           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
4117                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
4118    
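The decision between a back reference and an octal character, as described above, can be sketched as follows. This is a simplified model in Python, not the PCRE implementation: the function name interpret_digits and its return shape are hypothetical, the digits argument is the run of digits after the backslash (first digit not 0), and the mode-dependent limit on the octal value is not checked.

```python
def interpret_digits(digits: str, captures: int, in_class: bool = False):
    """Classify a backslash followed by digits (first digit not 0).

    Returns (kind, value, literal_rest), where kind is "backref" or "char".
    captures is the number of capturing left parentheses seen so far.
    """
    if not in_class:
        number = int(digits)  # read all the digits as a decimal number
        if number < 10 or number <= captures:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    value = 0
    used = 0
    while used < 3 and used < len(digits) and digits[used] in "01234567":
        value = value * 8 + int(digits[used])
        used += 1
    # Any remaining digits stand for themselves.
    return ("char", value, digits[used:])


print(interpret_digits("40", captures=0))
print(interpret_digits("81", captures=0))
```

With no previous capturing groups, "81" contains no leading octal digit, so the result is a binary zero followed by the literal characters "8" and "1", matching the last entry in the list above.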
# Line 3755  BACKSLASH Line 4197  BACKSLASH
4197         are  used  for  accented letters, and these are then matched by \w. The         are  used  for  accented letters, and these are then matched by \w. The
4198         use of locales with Unicode is discouraged.         use of locales with Unicode is discouraged.
4199    
4200         By default, in UTF-8 mode, characters  with  values  greater  than  128         By default, in a UTF mode, characters  with  values  greater  than  128
4201         never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These         never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
4202         sequences retain their original meanings from before UTF-8 support  was         sequences retain their original meanings from before  UTF  support  was
4203         available,  mainly for efficiency reasons. However, if PCRE is compiled         available,  mainly for efficiency reasons. However, if PCRE is compiled
4204         with Unicode property support, and the PCRE_UCP option is set, the  be-         with Unicode property support, and the PCRE_UCP option is set, the  be-
4205         haviour  is  changed  so  that Unicode properties are used to determine         haviour  is  changed  so  that Unicode properties are used to determine
# Line 3776  BACKSLASH Line 4218  BACKSLASH
4218         The sequences \h, \H, \v, and \V are features that were added  to  Perl         The sequences \h, \H, \v, and \V are features that were added  to  Perl
4219         at  release  5.10. In contrast to the other sequences, which match only         at  release  5.10. In contrast to the other sequences, which match only
4220         ASCII characters by default, these  always  match  certain  high-valued         ASCII characters by default, these  always  match  certain  high-valued
4221         codepoints  in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-         codepoints,  whether or not PCRE_UCP is set. The horizontal space char-
4222         tal space characters are:         acters are:
4223    
4224           U+0009     Horizontal tab           U+0009     Horizontal tab
4225           U+0020     Space           U+0020     Space
# Line 3809  BACKSLASH Line 4251  BACKSLASH
4251           U+2028     Line separator           U+2028     Line separator
4252           U+2029     Paragraph separator           U+2029     Paragraph separator
4253    
4254           In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
4255           256 are relevant.
4256    
4257     Newline sequences     Newline sequences
4258    
4259         Outside a character class, by default, the escape sequence  \R  matches         Outside  a  character class, by default, the escape sequence \R matches
4260         any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the         any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
4261         following:         to the following:
4262    
4263           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
4264    
4265         This is an example of an "atomic group", details  of  which  are  given         This  is  an  example  of an "atomic group", details of which are given
4266         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
4267         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
4268         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
4269         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
4270         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
4271    
4272         In  UTF-8  mode, two additional characters whose codepoints are greater         In other modes, two additional characters whose codepoints are  greater
4273         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
4274         rator,  U+2029).   Unicode character property support is not needed for         rator, U+2029).  Unicode character property support is not  needed  for
4275         these characters to be recognized.         these characters to be recognized.
4276    
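As a rough model of what \R accepts, the sketch below reports how many characters of a newline sequence start at a given position. It is written in plain Python and does not call PCRE; the name newline_length and the wide flag (standing for the modes in which U+2028 and U+2029 are added) are assumptions made for illustration.

```python
SINGLE_NEWLINES = "\n\x0b\x0c\r\x85"  # LF, VT, FF, CR, NEL
WIDE_NEWLINES = "\u2028\u2029"        # LS, PS


def newline_length(subject: str, pos: int, wide: bool = True) -> int:
    """Return the length of the newline sequence at pos, or 0 if none."""
    # CRLF is tried first and treated as one unit that is never split.
    if subject.startswith("\r\n", pos):
        return 2
    if pos < len(subject):
        ch = subject[pos]
        if ch in SINGLE_NEWLINES or (wide and ch in WIDE_NEWLINES):
            return 1
    return 0


print(newline_length("a\r\nb", 1), newline_length("\u2028", 0, wide=False))
```

Returning 2 for CRLF without ever offering the one-character alternative corresponds to the atomic group in the pattern shown above.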
4277         It is possible to restrict \R to match only CR, LF, or CRLF (instead of         It is possible to restrict \R to match only CR, LF, or CRLF (instead of
4278         the  complete  set  of  Unicode  line  endings)  by  setting the option         the complete set  of  Unicode  line  endings)  by  setting  the  option
4279         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
4280         (BSR is an abbreviation for "backslash R".) This can be made the default         (BSR is an abbreviation for "backslash R".) This can be made the default
4281         when PCRE is built; if this is the case, the  other  behaviour  can  be         when  PCRE  is  built;  if this is the case, the other behaviour can be
4282         requested  via  the  PCRE_BSR_UNICODE  option.   It is also possible to         requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
4283         specify these settings by starting a pattern string  with  one  of  the         specify  these  settings  by  starting a pattern string with one of the
4284         following sequences:         following sequences:
4285    
4286           (*BSR_ANYCRLF)   CR, LF, or CRLF only           (*BSR_ANYCRLF)   CR, LF, or CRLF only
4287           (*BSR_UNICODE)   any Unicode newline sequence           (*BSR_UNICODE)   any Unicode newline sequence
4288    
4289         These  override  the default and the options given to pcre_compile() or         These override the default and the options given to the compiling func-
4290         pcre_compile2(), but  they  can  be  overridden  by  options  given  to         tion,  but  they  can  themselves  be  overridden by options given to a
4291         pcre_exec() or pcre_dfa_exec(). Note that these special settings, which         matching function. Note that these  special  settings,  which  are  not
4292         are not Perl-compatible, are recognized only at the  very  start  of  a         Perl-compatible,  are  recognized  only at the very start of a pattern,
4293         pattern,  and that they must be in upper case. If more than one of them         and that they must be in upper case.  If  more  than  one  of  them  is
4294         is present, the last one is used. They can be combined with a change of         present,  the  last  one is used. They can be combined with a change of
4295         newline convention; for example, a pattern can start with:         newline convention; for example, a pattern can start with:
4296    
4297           (*ANY)(*BSR_ANYCRLF)           (*ANY)(*BSR_ANYCRLF)
4298    
4299         They can also be combined with the (*UTF8) or (*UCP) special sequences.         They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special
4300         Inside a character class, \R  is  treated  as  an  unrecognized  escape         sequences.  Inside  a character class, \R is treated as an unrecognized
4301         sequence, and so matches the letter "R" by default, but causes an error         escape sequence, and so matches the letter "R" by default,  but  causes
4302         if PCRE_EXTRA is set.         an error if PCRE_EXTRA is set.
4303    
4304     Unicode character properties     Unicode character properties
4305    
4306         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
4307         tional  escape sequences that match characters with specific properties         tional escape sequences that match characters with specific  properties
4308         are available.  When not in UTF-8 mode, these sequences are  of  course         are  available.   When  in 8-bit non-UTF-8 mode, these sequences are of
4309         limited  to  testing characters whose codepoints are less than 256, but         course limited to testing characters whose  codepoints  are  less  than
4310         they do work in this mode.  The extra escape sequences are:         256, but they do work in this mode.  The extra escape sequences are:
4311    
4312           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
4313           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
4314           \X       an extended Unicode sequence           \X       an extended Unicode sequence
4315    
4316         The property names represented by xx above are limited to  the  Unicode         The  property  names represented by xx above are limited to the Unicode
4317         script names, the general category properties, "Any", which matches any         script names, the general category properties, "Any", which matches any
4318         character  (including  newline),  and  some  special  PCRE   properties         character   (including  newline),  and  some  special  PCRE  properties
4319         (described  in the next section).  Other Perl properties such as "InMu-         (described in the next section).  Other Perl properties such as  "InMu-
4320         sicalSymbols" are not currently supported by PCRE.  Note  that  \P{Any}         sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}
4321         does not match any characters, so always causes a match failure.         does not match any characters, so always causes a match failure.
4322    
4323         Sets of Unicode characters are defined as belonging to certain scripts.         Sets of Unicode characters are defined as belonging to certain scripts.
4324         A character from one of these sets can be matched using a script  name.         A  character from one of these sets can be matched using a script name.
4325         For example:         For example:
4326    
4327           \p{Greek}           \p{Greek}
4328           \P{Han}           \P{Han}
4329    
4330         Those  that are not part of an identified script are lumped together as         Those that are not part of an identified script are lumped together  as
4331         "Common". The current list of scripts is:         "Common". The current list of scripts is:
4332    
4333         Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,         Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
4334         Buginese,  Buhid,  Canadian_Aboriginal, Carian, Cham, Cherokee, Common,         Buginese, Buhid, Canadian_Aboriginal, Carian, Cham,  Cherokee,  Common,
4335         Coptic,  Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,   Egyp-         Coptic,   Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,  Egyp-
4336         tian_Hieroglyphs,   Ethiopic,   Georgian,  Glagolitic,  Gothic,  Greek,         tian_Hieroglyphs,  Ethiopic,  Georgian,  Glagolitic,   Gothic,   Greek,
4337         Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana,  Impe-         Gujarati,  Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana, Impe-
4338         rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,         rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
4339         Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer,  Lao,         Javanese,  Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
4340         Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,         Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,
4341         Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham,  Old_Italic,         Meetei_Mayek,  Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
4342         Old_Persian,  Old_South_Arabian,  Old_Turkic, Ol_Chiki, Oriya, Osmanya,         Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki,  Oriya,  Osmanya,
4343         Phags_Pa, Phoenician, Rejang, Runic,  Samaritan,  Saurashtra,  Shavian,         Phags_Pa,  Phoenician,  Rejang,  Runic, Samaritan, Saurashtra, Shavian,
4344         Sinhala,  Sundanese,  Syloti_Nagri,  Syriac, Tagalog, Tagbanwa, Tai_Le,         Sinhala, Sundanese, Syloti_Nagri, Syriac,  Tagalog,  Tagbanwa,  Tai_Le,
4345         Tai_Tham, Tai_Viet, Tamil, Telugu,  Thaana,  Thai,  Tibetan,  Tifinagh,         Tai_Tham,  Tai_Viet,  Tamil,  Telugu,  Thaana, Thai, Tibetan, Tifinagh,
4346         Ugaritic, Vai, Yi.         Ugaritic, Vai, Yi.
4347    
4348         Each character has exactly one Unicode general category property, spec-         Each character has exactly one Unicode general category property, spec-
4349         ified by a two-letter abbreviation. For compatibility with Perl,  nega-         ified  by a two-letter abbreviation. For compatibility with Perl, nega-
4350         tion  can  be  specified  by including a circumflex between the opening         tion can be specified by including a  circumflex  between  the  opening
4351         brace and the property name.  For  example,  \p{^Lu}  is  the  same  as         brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
4352         \P{Lu}.         \P{Lu}.
4353    
4354         If only one letter is specified with \p or \P, it includes all the gen-         If only one letter is specified with \p or \P, it includes all the gen-
4355         eral category properties that start with that letter. In this case,  in         eral  category properties that start with that letter. In this case, in
4356         the  absence of negation, the curly brackets in the escape sequence are         the absence of negation, the curly brackets in the escape sequence  are
4357         optional; these two examples have the same effect:         optional; these two examples have the same effect:
4358    
4359           \p{L}           \p{L}
# Line 3960  BACKSLASH Line 4405  BACKSLASH
4405           Zp    Paragraph separator           Zp    Paragraph separator
4406           Zs    Space separator           Zs    Space separator
4407    
4408         The special property L& is also supported: it matches a character  that         The  special property L& is also supported: it matches a character that
4409         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
4410         classified as a modifier or "other".         classified as a modifier or "other".
4411    
4412         The Cs (Surrogate) property applies only to  characters  in  the  range         The  Cs  (Surrogate)  property  applies only to characters in the range
4413         U+D800  to  U+DFFF. Such characters are not valid in UTF-8 strings (see         U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
4414         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-         so  cannot  be  tested  by  PCRE, unless UTF validity checking has been
4415         ing  has  been  turned off (see the discussion of PCRE_NO_UTF8_CHECK in         turned   off   (see   the   discussion   of   PCRE_NO_UTF8_CHECK    and
4416         the pcreapi page). Perl does not support the Cs property.         PCRE_NO_UTF16_CHECK  in the pcreapi page). Perl does not support the Cs
4417           property.
4418    
4419         The long synonyms for  property  names  that  Perl  supports  (such  as         The long synonyms for  property  names  that  Perl  supports  (such  as
4420         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
# Line 3990  BACKSLASH Line 4436  BACKSLASH
4436         by  zero  or  more  characters with the "mark" property, and treats the         by  zero  or  more  characters with the "mark" property, and treats the
4437         sequence as an atomic group (see below).  Characters  with  the  "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
4438         property  are  typically  accents  that affect the preceding character.         property  are  typically  accents  that affect the preceding character.
4439         None of them have codepoints less than 256, so  in  non-UTF-8  mode  \X         None of them have codepoints less than 256, so in 8-bit non-UTF-8  mode
4440         matches any one character.         \X matches any one character.
4441    
4442         Note that recent versions of Perl have changed \X to match what Unicode         Note that recent versions of Perl have changed \X to match what Unicode
4443         calls an "extended grapheme cluster", which has a more complicated def-         calls an "extended grapheme cluster", which has a more complicated def-
# Line 4001  BACKSLASH Line 4447  BACKSLASH
4447         to search a structure that contains  data  for  over  fifteen  thousand         to search a structure that contains  data  for  over  fifteen  thousand
4448         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
4449         \w do not use Unicode properties in PCRE by  default,  though  you  can         \w do not use Unicode properties in PCRE by  default,  though  you  can
4450         make them do so by setting the PCRE_UCP option for pcre_compile() or by         make  them do so by setting the PCRE_UCP option or by starting the pat-
4451         starting the pattern with (*UCP).         tern with (*UCP).
4452    
4453     PCRE's additional properties     PCRE's additional properties
4454    
# Line 4071  BACKSLASH Line 4517  BACKSLASH
4517         A  word  boundary is a position in the subject string where the current         A  word  boundary is a position in the subject string where the current
4518         character and the previous character do not both match \w or  \W  (i.e.         character and the previous character do not both match \w or  \W  (i.e.
4519         one  matches  \w  and the other matches \W), or the start or end of the         one  matches  \w  and the other matches \W), or the start or end of the
4520         string if the first or last  character  matches  \w,  respectively.  In         string if the first or last character matches \w,  respectively.  In  a
4521         UTF-8  mode,  the  meanings  of \w and \W can be changed by setting the         UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
4522         PCRE_UCP option. When this is done, it also affects \b and \B.  Neither         PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
4523         PCRE  nor  Perl has a separate "start of word" or "end of word" metase-         PCRE  nor  Perl has a separate "start of word" or "end of word" metase-
4524         quence. However, whatever follows \b normally determines which  it  is.         quence. However, whatever follows \b normally determines which  it  is.
# Line 4163  FULL STOP (PERIOD, DOT) AND \N Line 4609  FULL STOP (PERIOD, DOT) AND \N
4609    
4610         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
4611         ter  in  the subject string except (by default) a character that signi-         ter  in  the subject string except (by default) a character that signi-
4612         fies the end of a line. In UTF-8 mode, the  matched  character  may  be         fies the end of a line.
        more than one byte long.  
4613    
4614         When  a line ending is defined as a single character, dot never matches         When a line ending is defined as a single character, dot never  matches
4615         that character; when the two-character sequence CRLF is used, dot  does         that  character; when the two-character sequence CRLF is used, dot does
4616         not  match  CR  if  it  is immediately followed by LF, but otherwise it         not match CR if it is immediately followed  by  LF,  but  otherwise  it
4617         matches all characters (including isolated CRs and LFs). When any  Uni-         matches  all characters (including isolated CRs and LFs). When any Uni-
4618         code  line endings are being recognized, dot does not match CR or LF or         code line endings are being recognized, dot does not match CR or LF  or
4619         any of the other line ending characters.         any of the other line ending characters.
4620    
4621         The behaviour of dot with regard to newlines can  be  changed.  If  the         The  behaviour  of  dot  with regard to newlines can be changed. If the
4622         PCRE_DOTALL  option  is  set,  a dot matches any one character, without         PCRE_DOTALL option is set, a dot matches  any  one  character,  without
4623         exception. If the two-character sequence CRLF is present in the subject         exception. If the two-character sequence CRLF is present in the subject
4624         string, it takes two dots to match it.         string, it takes two dots to match it.
4625    
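The rules for dot and line endings above can be summarised in a small predicate. This is a sketch in Python rather than PCRE code: the name dot_matches, the newline argument (a single character, "\r\n", or the string "ANY" for any Unicode line ending), and the dotall flag standing in for PCRE_DOTALL are all assumptions made for illustration.

```python
UNICODE_NEWLINES = "\n\x0b\x0c\r\x85\u2028\u2029"


def dot_matches(subject: str, pos: int, newline: str = "\n",
                dotall: bool = False) -> bool:
    """Return True if a dot would match the character at pos."""
    if pos >= len(subject):
        return False
    if dotall:
        return True  # with PCRE_DOTALL a dot matches any one character
    ch = subject[pos]
    if newline == "ANY":
        return ch not in UNICODE_NEWLINES
    if newline == "\r\n":
        # Only a CR immediately followed by LF is excluded.
        return not subject.startswith("\r\n", pos)
    return ch != newline  # single-character line ending


print(dot_matches("a\r\nb", 1, newline="\r\n"), dot_matches("a\rb", 1, newline="\r\n"))
```

With the CRLF convention an isolated CR is matched, whereas the CR of a CR LF pair is not.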
4626         The  handling of dot is entirely independent of the handling of circum-         The handling of dot is entirely independent of the handling of  circum-
4627         flex and dollar, the only relationship being  that  they  both  involve         flex  and  dollar,  the  only relationship being that they both involve
4628         newlines. Dot has no special meaning in a character class.         newlines. Dot has no special meaning in a character class.
4629    
4630         The  escape  sequence  \N  behaves  like  a  dot, except that it is not         The escape sequence \N behaves like  a  dot,  except  that  it  is  not
4631         affected by the PCRE_DOTALL option. In  other  words,  it  matches  any         affected  by  the  PCRE_DOTALL  option.  In other words, it matches any
4632         character  except  one that signifies the end of a line. Perl also uses         character except one that signifies the end of a line. Perl  also  uses
4633         \N to match characters by name; PCRE does not support this.         \N to match characters by name; PCRE does not support this.
4634    
4635    
4636  MATCHING A SINGLE BYTE  MATCHING A SINGLE DATA UNIT
4637    
4638         Outside a character class, the escape sequence \C matches any one byte,         Outside  a character class, the escape sequence \C matches any one data
4639         both  in  and  out of UTF-8 mode. Unlike a dot, it always matches line-         unit, whether or not a UTF mode is set. In the 8-bit library, one  data
4640         ending characters. The feature is provided in Perl in  order  to  match         unit  is  one byte; in the 16-bit library it is a 16-bit unit. Unlike a
4641         individual  bytes  in UTF-8 mode, but it is unclear how it can usefully         dot, \C always matches line-ending characters. The feature is  provided
4642         be used. Because \C breaks up characters into individual bytes,  match-         in  Perl  in  order  to match individual bytes in UTF-8 mode, but it is
4643         ing  one  byte  with \C in UTF-8 mode means that the rest of the string         unclear how it can usefully be used. Because \C  breaks  up  characters
4644         may start with a malformed UTF-8 character. This has undefined results,         into  individual  data  units,  matching one unit with \C in a UTF mode
4645         because  PCRE  assumes that it is dealing with valid UTF-8 strings (and         means that the rest of the string may start with a malformed UTF  char-
4646         by default it checks  this  at  the  start  of  processing  unless  the         acter.  This  has  undefined  results,  because PCRE assumes that it is
4647         PCRE_NO_UTF8_CHECK option is used).         dealing with valid UTF strings (and by default it checks  this  at  the
4648           start of processing unless the PCRE_NO_UTF8_CHECK option is used).
4649    
4650         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE  does  not  allow \C to appear in lookbehind assertions (described
4651         below) in UTF-8 mode, because this would make it impossible  to  calcu-         below) in a UTF mode, because this would make it impossible  to  calcu-
4652         late the length of the lookbehind.         late the length of the lookbehind.
4653    
4654         In  general, the \C escape sequence is best avoided in UTF-8 mode. How-         In general, the \C escape sequence is best avoided. However, one way of
4655         ever, one way of using it that avoids the problem  of  malformed  UTF-8         using it that avoids the problem of malformed UTF characters is to  use
4656         characters  is to use a lookahead to check the length of the next char-         a  lookahead to check the length of the next character, as in this pat-
4657         acter, as in this pattern (ignore white space and line breaks):         tern, which could be used with a UTF-8 string (ignore white  space  and
4658           line breaks):
4659    
4660           (?| (?=[\x00-\x7f])(\C) |           (?| (?=[\x00-\x7f])(\C) |
4661               (?=[\x80-\x{7ff}])(\C)(\C) |               (?=[\x80-\x{7ff}])(\C)(\C) |
4662               (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |               (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
4663               (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))               (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
4664    
4665         A group that starts with (?| resets the capturing  parentheses  numbers         A  group  that starts with (?| resets the capturing parentheses numbers
4666         in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The         in each alternative (see "Duplicate  Subpattern  Numbers"  below).  The
4667         assertions at the start of each branch check the next  UTF-8  character         assertions  at  the start of each branch check the next UTF-8 character
4668         for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The         for values whose encoding uses 1, 2, 3, or 4 bytes,  respectively.  The
4669         character's individual bytes are then captured by the appropriate  num-         character's  individual bytes are then captured by the appropriate num-
4670         ber of groups.         ber of groups.
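
The four code point ranges tested by the lookaheads in the pattern above correspond to the UTF-8 encoded lengths of 1, 2, 3, and 4 bytes. The following C fragment is a minimal sketch of that mapping. It is not part of PCRE: the function name utf8_encoded_length is hypothetical, and the upper limit 0x1fffff is taken from the last range in the pattern rather than from the Unicode limit of 0x10ffff. It does not reject surrogate code points.

```c
#include <stdint.h>

/* Hypothetical helper, not a PCRE function. Returns the number of bytes
   (and therefore the number of \C items) needed for one character with
   the given code point in UTF-8, using the same ranges as the lookaheads
   in the pattern above. Returns 0 for values beyond the last range. */
int utf8_encoded_length(uint32_t cp)
{
    if (cp <= 0x7f)     return 1;   /* [\x00-\x7f]            */
    if (cp <= 0x7ff)    return 2;   /* [\x80-\x{7ff}]         */
    if (cp <= 0xffff)   return 3;   /* [\x{800}-\x{ffff}]     */
    if (cp <= 0x1fffff) return 4;   /* [\x{10000}-\x{1fffff}] */
    return 0;
}
```

Each branch of the pattern captures exactly as many \C items as this function would return for the character found by its lookahead, so no branch can stop in the middle of a character.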
4671    
4672    
# Line 4229  SQUARE BRACKETS AND CHARACTER CLASSES Line 4676  SQUARE BRACKETS AND CHARACTER CLASSES
4676         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
4677         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
4678         a lone closing square bracket causes a compile-time error. If a closing         a lone closing square bracket causes a compile-time error. If a closing
4679         square  bracket  is required as a member of the class, it should be the         square bracket is required as a member of the class, it should  be  the
4680         first data character in the class  (after  an  initial  circumflex,  if         first  data  character  in  the  class (after an initial circumflex, if
4681         present) or escaped with a backslash.         present) or escaped with a backslash.
4682    
4683         A  character  class matches a single character in the subject. In UTF-8         A character class matches a single character in the subject. In  a  UTF
4684         mode, the character may be more than one byte long. A matched character         mode,  the  character  may  be  more than one data unit long. A matched
4685         must be in the set of characters defined by the class, unless the first         character must be in the set of characters defined by the class, unless
4686         character in the class definition is a circumflex, in  which  case  the         the  first  character in the class definition is a circumflex, in which
4687         subject  character  must  not  be in the set defined by the class. If a         case the subject character must not be in the set defined by the class.
4688         circumflex is actually required as a member of the class, ensure it  is         If  a  circumflex is actually required as a member of the class, ensure
4689         not the first character, or escape it with a backslash.         it is not the first character, or escape it with a backslash.
4690    
4691         For  example, the character class [aeiou] matches any lower case vowel,         For example, the character class [aeiou] matches any lower case  vowel,
4692         while [^aeiou] matches any character that is not a  lower  case  vowel.         while  [^aeiou]  matches  any character that is not a lower case vowel.
4693         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
4694         characters that are in the class by enumerating those that are  not.  A         characters  that  are in the class by enumerating those that are not. A
4695         class  that starts with a circumflex is not an assertion; it still con-         class that starts with a circumflex is not an assertion; it still  con-
4696         sumes a character from the subject string, and therefore  it  fails  if         sumes  a  character  from the subject string, and therefore it fails if
4697         the current pointer is at the end of the string.         the current pointer is at the end of the string.
4698    
4699         In  UTF-8 mode, characters with values greater than 255 can be included         In UTF-8  (UTF-16)  mode,  characters  with  values  greater  than  255
4700         in a class as a literal string of bytes, or by using the  \x{  escaping         (0xffff)  can be included in a class as a literal string of data units,
4701         mechanism.         or by using the \x{ escaping mechanism.
4702    
4703         When  caseless  matching  is set, any letters in a class represent both         When caseless matching is set, any letters in a  class  represent  both
4704         their upper case and lower case versions, so for  example,  a  caseless         their  upper  case  and lower case versions, so for example, a caseless
4705         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
4706         match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always         match  "A", whereas a caseful version would. In a UTF mode, PCRE always
4707         understands  the  concept  of case for characters whose values are less         understands the concept of case for characters whose  values  are  less
4708         than 128, so caseless matching is always possible. For characters  with         than  128, so caseless matching is always possible. For characters with
4709         higher  values,  the  concept  of case is supported if PCRE is compiled         higher values, the concept of case is supported  if  PCRE  is  compiled
4710         with Unicode property support, but not otherwise.  If you want  to  use         with  Unicode  property support, but not otherwise.  If you want to use
4711         caseless  matching  in UTF8-mode for characters 128 and above, you must         caseless matching in a UTF mode for characters 128 and above, you  must
4712         ensure that PCRE is compiled with Unicode property support as  well  as         ensure  that  PCRE is compiled with Unicode property support as well as
4713         with UTF-8 support.         with UTF support.
4714    
4715         Characters  that  might  indicate  line breaks are never treated in any         Characters that might indicate line breaks are  never  treated  in  any
4716         special way  when  matching  character  classes,  whatever  line-ending         special  way  when  matching  character  classes,  whatever line-ending
4717         sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and         sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
4718         PCRE_MULTILINE options is used. A class such as [^a] always matches one         PCRE_MULTILINE options is used. A class such as [^a] always matches one
4719         of these characters.         of these characters.
4720    
4721         The  minus (hyphen) character can be used to specify a range of charac-         The minus (hyphen) character can be used to specify a range of  charac-
4722         ters in a character  class.  For  example,  [d-m]  matches  any  letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
4723         between  d  and  m,  inclusive.  If  a minus character is required in a         between d and m, inclusive. If a  minus  character  is  required  in  a
4724         class, it must be escaped with a backslash  or  appear  in  a  position         class,  it  must  be  escaped  with a backslash or appear in a position
4725         where  it cannot be interpreted as indicating a range, typically as the         where it cannot be interpreted as indicating a range, typically as  the
4726         first or last character in the class.         first or last character in the class.
4727    
4728         It is not possible to have the literal character "]" as the end charac-         It is not possible to have the literal character "]" as the end charac-
4729         ter  of a range. A pattern such as [W-]46] is interpreted as a class of         ter of a range. A pattern such as [W-]46] is interpreted as a class  of
4730         two characters ("W" and "-") followed by a literal string "46]", so  it         two  characters ("W" and "-") followed by a literal string "46]", so it
4731         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a         would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
4732         backslash it is interpreted as the end of range, so [W-\]46] is  inter-         backslash  it is interpreted as the end of range, so [W-\]46] is inter-
4733         preted  as a class containing a range followed by two other characters.         preted as a class containing a range followed by two other  characters.
4734         The octal or hexadecimal representation of "]" can also be used to  end         The  octal or hexadecimal representation of "]" can also be used to end
4735         a range.         a range.
4736    
4737         Ranges  operate in the collating sequence of character values. They can         Ranges operate in the collating sequence of character values. They  can
4738         also  be  used  for  characters  specified  numerically,  for   example         also   be  used  for  characters  specified  numerically,  for  example
4739         [\000-\037].  In UTF-8 mode, ranges can include characters whose values         [\000-\037]. Ranges can include any characters that are valid  for  the
4740         are greater than 255, for example [\x{100}-\x{2ff}].         current mode.
4741    
4742         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
4743         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
4744         to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
4745         character  tables  for  a French locale are in use, [\xc8-\xcb] matches         character tables for a French locale are in  use,  [\xc8-\xcb]  matches
4746         accented E characters in both cases. In UTF-8 mode, PCRE  supports  the         accented  E  characters  in both cases. In UTF modes, PCRE supports the
4747         concept  of  case for characters with values greater than 128 only when         concept of case for characters with values greater than 128  only  when
4748         it is compiled with Unicode property support.         it is compiled with Unicode property support.
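
The equivalence between a caseless [W-c] and [][\\^_`wxyzabc] can be checked by enumerating the range and adding the other case of every letter in it. The following C fragment is only an illustrative sketch under the default "C" locale. The function name caseless_range is hypothetical, and this is not how PCRE builds its class bitmaps internally.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. Marks in set[0..255] every
   character that the range lo-hi matches when caseless matching is set:
   each character in the range, plus the other case of each letter. */
void caseless_range(unsigned char lo, unsigned char hi,
                    unsigned char set[256])
{
    memset(set, 0, 256);
    for (int c = lo; c <= hi; c++) {
        set[c] = 1;
        set[toupper(c)] = 1;   /* non-letters are returned unchanged */
        set[tolower(c)] = 1;
    }
}
```

For the range W to c, the marked characters are W, X, Y, Z, the six punctuation characters between Z and a, and a, b, c, together with w, x, y, z and A, B, C, which is the set described in the text.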
4749    
4750         The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,         The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
4751         \w, and \W may appear in a character class, and add the characters that         \w, and \W may appear in a character class, and add the characters that
4752         they match to the class. For example, [\dABCDEF] matches any  hexadeci-         they  match to the class. For example, [\dABCDEF] matches any hexadeci-
4753         mal  digit.  In UTF-8 mode, the PCRE_UCP option affects the meanings of         mal digit. In UTF modes, the PCRE_UCP option affects  the  meanings  of
4754         \d, \s, \w and their upper case partners, just as  it  does  when  they         \d,  \s,  \w  and  their upper case partners, just as it does when they
4755         appear  outside a character class, as described in the section entitled         appear outside a character class, as described in the section  entitled
4756         "Generic character types" above. The escape sequence \b has a different         "Generic character types" above. The escape sequence \b has a different
4757         meaning  inside  a character class; it matches the backspace character.         meaning inside a character class; it matches the  backspace  character.
4758         The sequences \B, \N, \R, and \X are not  special  inside  a  character         The  sequences  \B,  \N,  \R, and \X are not special inside a character
4759         class.  Like  any other unrecognized escape sequences, they are treated         class. Like any other unrecognized escape sequences, they  are  treated
4760         as the literal characters "B", "N", "R", and "X" by default, but  cause         as  the literal characters "B", "N", "R", and "X" by default, but cause
4761         an error if the PCRE_EXTRA option is set.         an error if the PCRE_EXTRA option is set.
4762    
4763         A  circumflex  can  conveniently  be used with the upper case character         A circumflex can conveniently be used with  the  upper  case  character
4764         types to specify a more restricted set of characters than the  matching         types  to specify a more restricted set of characters than the matching
4765         lower  case  type.  For example, the class [^\W_] matches any letter or         lower case type.  For example, the class [^\W_] matches any  letter  or
4766         digit, but not underscore, whereas [\w] includes underscore. A positive         digit, but not underscore, whereas [\w] includes underscore. A positive
4767         character class should be read as "something OR something OR ..." and a         character class should be read as "something OR something OR ..." and a
4768         negative class as "NOT something AND NOT something AND NOT ...".         negative class as "NOT something AND NOT something AND NOT ...".
4769    
4770         The only metacharacters that are recognized in  character  classes  are         The  only  metacharacters  that are recognized in character classes are
4771         backslash,  hyphen  (only  where  it can be interpreted as specifying a         backslash, hyphen (only where it can be  interpreted  as  specifying  a
4772         range), circumflex (only at the start), opening  square  bracket  (only         range),  circumflex  (only  at the start), opening square bracket (only
4773         when  it can be interpreted as introducing a POSIX class name - see the         when it can be interpreted as introducing a POSIX class name - see  the
4774         next section), and the terminating  closing  square  bracket.  However,         next  section),  and  the  terminating closing square bracket. However,
4775         escaping other non-alphanumeric characters does no harm.         escaping other non-alphanumeric characters does no harm.
4776    
4777    
4778  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
4779    
4780         Perl supports the POSIX notation for character classes. This uses names         Perl supports the POSIX notation for character classes. This uses names
4781         enclosed by [: and :] within the enclosing square brackets.  PCRE  also         enclosed  by  [: and :] within the enclosing square brackets. PCRE also
4782         supports this notation. For example,         supports this notation. For example,
4783    
4784           [01[:alpha:]%]           [01[:alpha:]%]
# Line 4354  POSIX CHARACTER CLASSES Line 4801  POSIX CHARACTER CLASSES
4801           word     "word" characters (same as \w)           word     "word" characters (same as \w)
4802           xdigit   hexadecimal digits           xdigit   hexadecimal digits
4803    
4804         The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),         The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
4805         and  space  (32). Notice that this list includes the VT character (code         and space (32). Notice that this list includes the VT  character  (code
4806         11). This makes "space" different to \s, which does not include VT (for         11). This makes "space" different to \s, which does not include VT (for
4807         Perl compatibility).         Perl compatibility).
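
The difference between [:space:] and \s described above can be written out as two predicates. This C fragment is a sketch based only on the character list given in the text. The function names are hypothetical and are not part of the PCRE API.

```c
/* Hypothetical helpers, not PCRE functions. */

/* POSIX "space" as listed above: HT (9), LF (10), VT (11), FF (12),
   CR (13), and space (32). */
int is_posix_space(int c)
{
    return (c >= 9 && c <= 13) || c == 32;
}

/* \s as described above: the same list without VT (11). */
int is_perl_space(int c)
{
    return c != 11 && is_posix_space(c);
}
```

The two predicates disagree only for code 11, which is the single character that distinguishes the class from the escape sequence.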
4808    
4809         The  name  "word"  is  a Perl extension, and "blank" is a GNU extension         The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
4810         from Perl 5.8. Another Perl extension is negation, which  is  indicated         from  Perl  5.8. Another Perl extension is negation, which is indicated
4811         by a ^ character after the colon. For example,         by a ^ character after the colon. For example,
4812    
4813           [12[:^digit:]]           [12[:^digit:]]
4814    
4815         matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the         matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
4816         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
4817         these are not supported, and an error is given if they are encountered.         these are not supported, and an error is given if they are encountered.
4818    
4819         By  default,  in UTF-8 mode, characters with values greater than 128 do         By default, in UTF modes, characters with values greater  than  128  do
4820         not match any of the POSIX character classes. However, if the  PCRE_UCP         not  match any of the POSIX character classes. However, if the PCRE_UCP
4821         option  is passed to pcre_compile(), some of the classes are changed so         option is passed to pcre_compile(), some of the classes are changed  so
4822         that Unicode character properties are used. This is achieved by replac-         that Unicode character properties are used. This is achieved by replac-
4823         ing the POSIX classes by other sequences, as follows:         ing the POSIX classes by other sequences, as follows:
4824    
# Line 4384  POSIX CHARACTER CLASSES Line 4831  POSIX CHARACTER CLASSES
4831           [:upper:]  becomes  \p{Lu}           [:upper:]  becomes  \p{Lu}
4832           [:word:]   becomes  \p{Xwd}           [:word:]   becomes  \p{Xwd}
4833    
4834         Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other         Negated versions, such as [:^alpha:] use \P instead of  \p.  The  other
4835         POSIX classes are unchanged, and match only characters with code points         POSIX classes are unchanged, and match only characters with code points
4836         less than 128.         less than 128.
4837    
4838    
4839  VERTICAL BAR  VERTICAL BAR
4840    
4841         Vertical  bar characters are used to separate alternative patterns. For         Vertical bar characters are used to separate alternative patterns.  For
4842         example, the pattern         example, the pattern
4843    
4844           gilbert|sullivan           gilbert|sullivan
4845    
4846         matches either "gilbert" or "sullivan". Any number of alternatives  may         matches  either "gilbert" or "sullivan". Any number of alternatives may
4847         appear,  and  an  empty  alternative  is  permitted (matching the empty         appear, and an empty  alternative  is  permitted  (matching  the  empty
4848         string). The matching process tries each alternative in turn, from left         string). The matching process tries each alternative in turn, from left
4849         to  right, and the first one that succeeds is used. If the alternatives         to right, and the first one that succeeds is used. If the  alternatives
4850         are within a subpattern (defined below), "succeeds" means matching  the         are  within a subpattern (defined below), "succeeds" means matching the
4851         rest of the main pattern as well as the alternative in the subpattern.         rest of the main pattern as well as the alternative in the subpattern.
4852    
4853    
4854  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
4855    
4856         The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and         The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
4857         PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from         PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from
4858         within  the  pattern  by  a  sequence  of  Perl option letters enclosed         within the pattern by  a  sequence  of  Perl  option  letters  enclosed
4859         between "(?" and ")".  The option letters are         between "(?" and ")".  The option letters are
4860    
4861           i  for PCRE_CASELESS           i  for PCRE_CASELESS
# Line 4418  INTERNAL OPTION SETTING Line 4865  INTERNAL OPTION SETTING
4865    
4866         For example, (?im) sets caseless, multiline matching. It is also possi-         For example, (?im) sets caseless, multiline matching. It is also possi-
4867         ble to unset these options by preceding the letter with a hyphen, and a         ble to unset these options by preceding the letter with a hyphen, and a
4868         combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-         combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-
4869         LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,         LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,
4870         is also permitted. If a  letter  appears  both  before  and  after  the         is  also  permitted.  If  a  letter  appears  both before and after the
4871         hyphen, the option is unset.         hyphen, the option is unset.
4872    
4873         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA         The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
4874         can be changed in the same way as the Perl-compatible options by  using         can  be changed in the same way as the Perl-compatible options by using
4875         the characters J, U and X respectively.         the characters J, U and X respectively.
4876    
4877         When  one  of  these  option  changes occurs at top level (that is, not         When one of these option changes occurs at  top  level  (that  is,  not
4878         inside subpattern parentheses), the change applies to the remainder  of         inside  subpattern parentheses), the change applies to the remainder of
4879         the pattern that follows. If the change is placed right at the start of         the pattern that follows. If the change is placed right at the start of
4880         a pattern, PCRE extracts it into the global options (and it will there-         a pattern, PCRE extracts it into the global options (and it will there-
4881         fore show up in data extracted by the pcre_fullinfo() function).         fore show up in data extracted by the pcre_fullinfo() function).
4882    
4883         An  option  change  within a subpattern (see below for a description of         An option change within a subpattern (see below for  a  description  of
4884         subpatterns) affects only that part of the subpattern that follows  it,         subpatterns)  affects only that part of the subpattern that follows it,
4885         so         so
4886    
4887           (a(?i)b)c           (a(?i)b)c
4888    
4889         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
4890         used).  By this means, options can be made to have  different  settings         used).   By  this means, options can be made to have different settings
4891         in  different parts of the pattern. Any changes made in one alternative         in different parts of the pattern. Any changes made in one  alternative
4892         do carry on into subsequent branches within the  same  subpattern.  For         do  carry  on  into subsequent branches within the same subpattern. For
4893         example,         example,
4894    
4895           (a(?i)b|c)           (a(?i)b|c)
4896    
4897         matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the         matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
4898         first branch is abandoned before the option setting.  This  is  because         first  branch  is  abandoned before the option setting. This is because
4899         the  effects  of option settings happen at compile time. There would be         the effects of option settings happen at compile time. There  would  be
4900         some very weird behaviour otherwise.         some very weird behaviour otherwise.
4901    
4902         Note: There are other PCRE-specific options that  can  be  set  by  the         Note:  There  are  other  PCRE-specific  options that can be set by the
4903         application  when  the  compile  or match functions are called. In some         application when the compiling or matching  functions  are  called.  In
4904         cases the pattern can contain special leading sequences such as (*CRLF)         some  cases  the  pattern can contain special leading sequences such as
4905         to  override  what  the application has set or what has been defaulted.         (*CRLF) to override what the application  has  set  or  what  has  been
4906         Details are given in the section entitled  "Newline  sequences"  above.         defaulted.   Details   are  given  in  the  section  entitled  "Newline
4907         There  are  also  the  (*UTF8) and (*UCP) leading sequences that can be         sequences" above. There are also  the  (*UTF8),  (*UTF16),  and  (*UCP)
4908         used to set UTF-8 and Unicode property modes; they  are  equivalent  to         leading  sequences  that  can  be  used to set UTF and Unicode property
4909         setting the PCRE_UTF8 and the PCRE_UCP options, respectively.         modes; they are equivalent to setting the  PCRE_UTF8,  PCRE_UTF16,  and
4910           the PCRE_UCP options, respectively.
4911    
4912    
4913  SUBPATTERNS  SUBPATTERNS
# Line 4477  SUBPATTERNS Line 4925  SUBPATTERNS
4925         2.  It  sets  up  the  subpattern as a capturing subpattern. This means         2.  It  sets  up  the  subpattern as a capturing subpattern. This means
4926         that, when the whole pattern  matches,  that  portion  of  the  subject         that, when the whole pattern  matches,  that  portion  of  the  subject
4927         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
4928         ovector argument of pcre_exec(). Opening parentheses are  counted  from         ovector argument of the matching function. (This applies  only  to  the
4929         left  to  right  (starting  from 1) to obtain numbers for the capturing         traditional  matching functions; the DFA matching functions do not sup-
4930         subpatterns. For example, if the  string  "the  red  king"  is  matched         port capturing.)
4931         against the pattern  
4932           Opening parentheses are counted from left to right (starting from 1) to
4933           obtain  numbers  for  the  capturing  subpatterns.  For example, if the
4934           string "the red king" is matched against the pattern
4935    
4936           the ((red|white) (king|queen))           the ((red|white) (king|queen))
4937    
4938         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
4939         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
4940    
4941         The fact that plain parentheses fulfil  two  functions  is  not  always         The  fact  that  plain  parentheses  fulfil two functions is not always
4942         helpful.   There are often times when a grouping subpattern is required         helpful.  There are often times when a grouping subpattern is  required
4943         without a capturing requirement. If an opening parenthesis is  followed         without  a capturing requirement. If an opening parenthesis is followed
4944         by  a question mark and a colon, the subpattern does not do any captur-         by a question mark and a colon, the subpattern does not do any  captur-
4945         ing, and is not counted when computing the  number  of  any  subsequent         ing,  and  is  not  counted when computing the number of any subsequent
4946         capturing  subpatterns. For example, if the string "the white queen" is         capturing subpatterns. For example, if the string "the white queen"  is
4947         matched against the pattern         matched against the pattern
4948    
4949           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
# Line 4500  SUBPATTERNS Line 4951  SUBPATTERNS
4951         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
4952         1 and 2. The maximum number of capturing subpatterns is 65535.         1 and 2. The maximum number of capturing subpatterns is 65535.
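
The numbering rule can be illustrated by counting opening parentheses from left to right and skipping those that do not capture. The following C fragment is a deliberately simplified sketch with a hypothetical function name. It assumes that every group starting with "(?" or "(*" is non-capturing, which holds for the patterns in this section but not for named subpatterns, and it does not treat parentheses inside character classes specially.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Counts the capturing
   subpatterns in a pattern by scanning for unescaped "(" that is not
   followed by "?" or "*". Simplified: named groups and parentheses
   inside character classes are not handled. */
int count_capturing_groups(const char *pattern)
{
    int count = 0;
    for (size_t i = 0; pattern[i] != '\0'; i++) {
        if (pattern[i] == '\\' && pattern[i + 1] != '\0') {
            i++;                    /* skip an escaped character such as \( */
        } else if (pattern[i] == '(' &&
                   pattern[i + 1] != '?' && pattern[i + 1] != '*') {
            count++;                /* this parenthesis gets the next number */
        }
    }
    return count;
}
```

Applied to the two patterns above, the count is 3 for "the ((red|white) (king|queen))" and 2 for "the ((?:red|white) (king|queen))", matching the numbers of captured substrings given in the text.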
4953    
4954         As  a  convenient shorthand, if any option settings are required at the         As a convenient shorthand, if any option settings are required  at  the
4955         start of a non-capturing subpattern,  the  option  letters  may  appear         start  of  a  non-capturing  subpattern,  the option letters may appear
4956         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
4957    
4958           (?i:saturday|sunday)           (?i:saturday|sunday)
4959           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
4960    
4961         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
4962         tried from left to right, and options are not reset until  the  end  of         tried  from  left  to right, and options are not reset until the end of
4963         the  subpattern is reached, an option setting in one branch does affect         the subpattern is reached, an option setting in one branch does  affect
4964         subsequent branches, so the above patterns match "SUNDAY"  as  well  as         subsequent  branches,  so  the above patterns match "SUNDAY" as well as
4965         "Saturday".         "Saturday".
4966    
4967    
4968  DUPLICATE SUBPATTERN NUMBERS  DUPLICATE SUBPATTERN NUMBERS
4969    
4970         Perl 5.10 introduced a feature whereby each alternative in a subpattern         Perl 5.10 introduced a feature whereby each alternative in a subpattern
4971         uses the same numbers for its capturing parentheses. Such a  subpattern         uses  the same numbers for its capturing parentheses. Such a subpattern
4972         starts  with (?| and is itself a non-capturing subpattern. For example,         starts with (?| and is itself a non-capturing subpattern. For  example,
4973         consider this pattern:         consider this pattern:
4974    
4975           (?|(Sat)ur|(Sun))day           (?|(Sat)ur|(Sun))day
4976    
4977         Because the two alternatives are inside a (?| group, both sets of  cap-         Because  the two alternatives are inside a (?| group, both sets of cap-
4978         turing  parentheses  are  numbered one. Thus, when the pattern matches,         turing parentheses are numbered one. Thus, when  the  pattern  matches,
4979         you can look at captured substring number  one,  whichever  alternative         you  can  look  at captured substring number one, whichever alternative
4980         matched.  This  construct  is useful when you want to capture part, but         matched. This construct is useful when you want to  capture  part,  but
4981         not all, of one of a number of alternatives. Inside a (?| group, paren-         not all, of one of a number of alternatives. Inside a (?| group, paren-
4982         theses  are  numbered as usual, but the number is reset at the start of         theses are numbered as usual, but the number is reset at the  start  of
4983         each branch. The numbers of any capturing parentheses that  follow  the         each  branch.  The numbers of any capturing parentheses that follow the
4984         subpattern  start after the highest number used in any branch. The fol-         subpattern start after the highest number used in any branch. The  fol-
4985         lowing example is taken from the Perl documentation. The numbers under-         lowing example is taken from the Perl documentation. The numbers under-
4986         neath show in which buffer the captured content will be stored.         neath show in which buffer the captured content will be stored.
4987    
# Line 4538  DUPLICATE SUBPATTERN NUMBERS Line 4989  DUPLICATE SUBPATTERN NUMBERS
4989           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
4990           # 1            2         2  3        2     3     4           # 1            2         2  3        2     3     4
4991    
4992         A  back  reference  to a numbered subpattern uses the most recent value         A back reference to a numbered subpattern uses the  most  recent  value
4993         that is set for that number by any subpattern.  The  following  pattern         that  is  set  for that number by any subpattern. The following pattern
4994         matches "abcabc" or "defdef":         matches "abcabc" or "defdef":
4995    
4996           /(?|(abc)|(def))\1/           /(?|(abc)|(def))\1/
4997    
4998         In  contrast,  a subroutine call to a numbered subpattern always refers         In contrast, a subroutine call to a numbered subpattern  always  refers
4999         to the first one in the pattern with the given  number.  The  following         to  the  first  one in the pattern with the given number. The following
5000         pattern matches "abcabc" or "defabc":         pattern matches "abcabc" or "defabc":
5001    
5002           /(?|(abc)|(def))(?1)/           /(?|(abc)|(def))(?1)/
5003    
5004         If  a condition test for a subpattern's having matched refers to a non-         If a condition test for a subpattern's having matched refers to a  non-
5005         unique number, the test is true if any of the subpatterns of that  num-         unique  number, the test is true if any of the subpatterns of that num-
5006         ber have matched.         ber have matched.
5007    
5008         An  alternative approach to using this "branch reset" feature is to use         An alternative approach to using this "branch reset" feature is to  use
5009         duplicate named subpatterns, as described in the next section.         duplicate named subpatterns, as described in the next section.
5010    
5011    
5012  NAMED SUBPATTERNS  NAMED SUBPATTERNS
5013    
5014         Identifying capturing parentheses by number is simple, but  it  can  be         Identifying  capturing  parentheses  by number is simple, but it can be
5015         very  hard  to keep track of the numbers in complicated regular expres-         very hard to keep track of the numbers in complicated  regular  expres-
5016         sions. Furthermore, if an  expression  is  modified,  the  numbers  may         sions.  Furthermore,  if  an  expression  is  modified, the numbers may
5017         change.  To help with this difficulty, PCRE supports the naming of sub-         change. To help with this difficulty, PCRE supports the naming of  sub-
5018         patterns. This feature was not added to Perl until release 5.10. Python         patterns. This feature was not added to Perl until release 5.10. Python
5019         had  the  feature earlier, and PCRE introduced it at release 4.0, using         had the feature earlier, and PCRE introduced it at release  4.0,  using
5020         the Python syntax. PCRE now supports both the Perl and the Python  syn-         the  Python syntax. PCRE now supports both the Perl and the Python syn-
5021         tax.  Perl  allows  identically  numbered subpatterns to have different         tax. Perl allows identically numbered  subpatterns  to  have  different
5022         names, but PCRE does not.         names, but PCRE does not.
5023    
5024         In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)         In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)
5025         or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References         or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
5026         to capturing parentheses from other parts of the pattern, such as  back         to  capturing parentheses from other parts of the pattern, such as back
5027         references,  recursion,  and conditions, can be made by name as well as         references, recursion, and conditions, can be made by name as  well  as
5028         by number.         by number.
5029    
5030         Names consist of up to  32  alphanumeric  characters  and  underscores.         Names  consist  of  up  to  32 alphanumeric characters and underscores.
5031         Named  capturing  parentheses  are  still  allocated numbers as well as         Named capturing parentheses are still  allocated  numbers  as  well  as
5032         names, exactly as if the names were not present. The PCRE API  provides         names,  exactly as if the names were not present. The PCRE API provides
5033         function calls for extracting the name-to-number translation table from         function calls for extracting the name-to-number translation table from
5034         a compiled pattern. There is also a convenience function for extracting         a compiled pattern. There is also a convenience function for extracting
5035         a captured substring by name.         a captured substring by name.
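
       The point that named parentheses still receive ordinary numbers can be
       sketched as follows. This is an illustration only: it uses Python's re
       module rather than the PCRE API, relying on the (?P<name>...) syntax
       that both accept, and the pattern, group names and subject string are
       hypothetical. Python's groupindex mapping plays a role loosely similar
       to the name-to-number table mentioned above.

```python
import re

# Stand-in for PCRE: Python's re accepts the (?P<name>...) form, which PCRE
# also supports. The pattern and names below are made up for illustration.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Named groups are numbered exactly as if the names were absent; the
# unnamed third group simply has no entry in the name table.
name_to_number = dict(pattern.groupindex)

match = pattern.search("released 2012-01-14")
by_name = match.group("year")    # lookup by name
by_number = match.group(1)       # the same group, looked up by number
day = match.group(3)             # unnamed group, reachable only by number
print(name_to_number, by_name, by_number, day)
```

       For the real interfaces, see the pcreapi documentation.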
5036    
5037         By  default, a name must be unique within a pattern, but it is possible         By default, a name must be unique within a pattern, but it is  possible
5038         to relax this constraint by setting the PCRE_DUPNAMES option at compile         to relax this constraint by setting the PCRE_DUPNAMES option at compile
5039         time.  (Duplicate  names are also always permitted for subpatterns with         time. (Duplicate names are also always permitted for  subpatterns  with
5040         the same number, set up as described in the previous  section.)  Dupli-         the  same  number, set up as described in the previous section.) Dupli-
5041         cate  names  can  be useful for patterns where only one instance of the         cate names can be useful for patterns where only one  instance  of  the
5042         named parentheses can match. Suppose you want to match the  name  of  a         named  parentheses  can  match. Suppose you want to match the name of a
5043         weekday,  either as a 3-letter abbreviation or as the full name, and in         weekday, either as a 3-letter abbreviation or as the full name, and  in
5044         both cases you want to extract the abbreviation. This pattern (ignoring         both cases you want to extract the abbreviation. This pattern (ignoring
5045         the line breaks) does the job:         the line breaks) does the job:
5046    
# Line 4599  NAMED SUBPATTERNS Line 5050  NAMED SUBPATTERNS
5050           (?<DN>Thu)(?:rsday)?|           (?<DN>Thu)(?:rsday)?|
5051           (?<DN>Sat)(?:urday)?           (?<DN>Sat)(?:urday)?
5052    
5053         There  are  five capturing substrings, but only one is ever set after a         There are five capturing substrings, but only one is ever set  after  a
5054         match.  (An alternative way of solving this problem is to use a "branch         match.  (An alternative way of solving this problem is to use a "branch
5055         reset" subpattern, as described in the previous section.)         reset" subpattern, as described in the previous section.)
5056    
5057         The  convenience  function  for extracting the data by name returns the         The convenience function for extracting the data by  name  returns  the
5058         substring for the first (and in this example, the only)  subpattern  of         substring  for  the first (and in this example, the only) subpattern of
5059         that  name  that  matched.  This saves searching to find which numbered         that name that matched. This saves searching  to  find  which  numbered
5060         subpattern it was.         subpattern it was.
5061    
5062         If you make a back reference to  a  non-unique  named  subpattern  from         If  you  make  a  back  reference to a non-unique named subpattern from
5063         elsewhere  in the pattern, the one that corresponds to the first occur-         elsewhere in the pattern, the one that corresponds to the first  occur-
5064         rence of the name is used. In the absence of duplicate numbers (see the         rence of the name is used. In the absence of duplicate numbers (see the
5065         previous  section) this is the one with the lowest number. If you use a         previous section) this is the one with the lowest number. If you use  a
5066         named reference in a condition test (see the section  about  conditions         named  reference  in a condition test (see the section about conditions
5067         below),  either  to check whether a subpattern has matched, or to check         below), either to check whether a subpattern has matched, or  to  check
5068         for recursion, all subpatterns with the same name are  tested.  If  the         for  recursion,  all  subpatterns with the same name are tested. If the
5069         condition  is  true for any one of them, the overall condition is true.         condition is true for any one of them, the overall condition  is  true.
5070         This is the same behaviour as testing by number. For further details of         This is the same behaviour as testing by number. For further details of
5071         the interfaces for handling named subpatterns, see the pcreapi documen-         the interfaces for handling named subpatterns, see the pcreapi documen-
5072         tation.         tation.
5073    
5074         Warning: You cannot use different names to distinguish between two sub-         Warning: You cannot use different names to distinguish between two sub-
5075         patterns  with  the same number because PCRE uses only the numbers when         patterns with the same number because PCRE uses only the  numbers  when
5076         matching. For this reason, an error is given at compile time if differ-         matching. For this reason, an error is given at compile time if differ-
5077         ent  names  are given to subpatterns with the same number. However, you         ent names are given to subpatterns with the same number.  However,  you
5078         can give the same name to subpatterns with the same number,  even  when         can  give  the same name to subpatterns with the same number, even when
5079         PCRE_DUPNAMES is not set.         PCRE_DUPNAMES is not set.
5080    
5081    
5082  REPETITION  REPETITION
5083    
5084         Repetition  is  specified  by  quantifiers, which can follow any of the         Repetition is specified by quantifiers, which can  follow  any  of  the
5085         following items:         following items:
5086    
5087           a literal data character           a literal data character
5088           the dot metacharacter           the dot metacharacter
5089           the \C escape sequence           the \C escape sequence
5090           the \X escape sequence (in UTF-8 mode with Unicode properties)           the \X escape sequence
5091           the \R escape sequence           the \R escape sequence
5092           an escape such as \d or \pL that matches a single character           an escape such