Diff of /code/trunk/doc/pcre.txt

revision 74 by nigel, Sat Feb 24 21:40:30 2007 UTC revision 75 by nigel, Sat Feb 24 21:40:37 2007 UTC
# Line 1  Line 1 
1    -----------------------------------------------------------------------------
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
# Line 12  PCRE(3) Line 13  PCRE(3)
13  NAME  NAME
14         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
15    
16  DESCRIPTION  INTRODUCTION
17    
18         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
19         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
20         just  a  few  differences.  The current implementation of PCRE (release         just  a  few  differences.  The current implementation of PCRE (release
21         4.x) corresponds approximately with Perl  5.8,  including  support  for         5.x) corresponds approximately with Perl  5.8,  including  support  for
22         UTF-8  encoded  strings.   However,  this  support has to be explicitly         UTF-8 encoded strings and Unicode general category properties. However,
23         enabled; it is not the default.         this support has to be explicitly enabled; it is not the default.
24    
25         PCRE is written in C and released as a C library. However, a number  of         PCRE is written in C and released as a C library. A  number  of  people
26         people  have  written  wrappers  and interfaces of various kinds. A C++         have  written  wrappers and interfaces of various kinds. A C++ class is
27         class is included in these contributions, which can  be  found  in  the         included in these contributions, which can  be  found  in  the  Contrib
28         Contrib directory at the primary FTP site, which is:         directory at the primary FTP site, which is:
29    
30         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
31    
# Line 34  DESCRIPTION Line 35  DESCRIPTION
35    
36         Some  features  of  PCRE can be included, excluded, or changed when the         Some  features  of  PCRE can be included, excluded, or changed when the
37         library is built. The pcre_config() function makes it  possible  for  a         library is built. The pcre_config() function makes it  possible  for  a
38         client  to  discover  which features are available. Documentation about         client  to  discover  which  features are available. The features them-
39         building PCRE for various operating systems can be found in the  README         selves are described in the pcrebuild page. Documentation about  build-
40         file in the source distribution.         ing  PCRE for various operating systems can be found in the README file
41           in the source distribution.
42    
43    
44  USER DOCUMENTATION  USER DOCUMENTATION
45    
46         The user documentation for PCRE has been split up into a number of dif-         The user documentation for PCRE comprises a number  of  different  sec-
47         ferent sections. In the "man" format, each of these is a separate  "man         tions.  In the "man" format, each of these is a separate "man page". In
48         page".  In  the  HTML  format, each is a separate page, linked from the         the HTML format, each is a separate page, linked from the  index  page.
49         index page. In the plain text format, all  the  sections  are  concate-         In  the  plain text format, all the sections are concatenated, for ease
50         nated, for ease of searching. The sections are as follows:         of searching. The sections are as follows:
51    
52           pcre              this document           pcre              this document
53           pcreapi           details of PCRE's native API           pcreapi           details of PCRE's native API
# Line 53  USER DOCUMENTATION Line 55  USER DOCUMENTATION
55           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
56           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
57           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
58             pcrepartial       details of the partial matching facility
59           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
60                               regular expressions                               regular expressions
61           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
62           pcreposix         the POSIX-compatible API           pcreposix         the POSIX-compatible API
63             pcreprecompile    details of saving and re-using precompiled patterns
64           pcresample        discussion of the sample program           pcresample        discussion of the sample program
65           pcretest          the pcretest testing command           pcretest          description of the pcretest testing command
66    
67         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
68         each library function, listing its arguments and results.         each library function, listing its arguments and results.
# Line 74  LIMITATIONS Line 78  LIMITATIONS
78         process  regular  expressions  that are truly enormous, you can compile         process  regular  expressions  that are truly enormous, you can compile
79         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
80         the  source  distribution and the pcrebuild documentation for details).         the  source  distribution and the pcrebuild documentation for details).
81         If these cases the limit is substantially larger.  However,  the  speed         In these cases the limit is substantially larger.  However,  the  speed
82         of execution will be slower.         of execution will be slower.
83    
84         All values in repeating quantifiers must be less than 65536.  The maxi-         All values in repeating quantifiers must be less than 65536.  The maxi-
# Line 92  LIMITATIONS Line 96  LIMITATIONS
96         processed by certain patterns.         processed by certain patterns.
97    
98    
99  UTF-8 SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
100    
101         Starting  at  release  3.3,  PCRE  has  had  some support for character         From  release  3.3,  PCRE  has  had  some support for character strings
102         strings encoded in the UTF-8 format. For  release  4.0  this  has  been         encoded in the UTF-8 format. For release 4.0 this was greatly  extended
103         greatly extended to cover most common requirements.         to  cover  most common requirements, and in release 5.0 additional sup-
104           port for Unicode general category properties was added.
105         In  order  process  UTF-8 strings, you must build PCRE to include UTF-8  
106         support in the code, and, in addition,  you  must  call  pcre_compile()         In order process UTF-8 strings, you must build PCRE  to  include  UTF-8
107         with  the PCRE_UTF8 option flag. When you do this, both the pattern and         support  in  the  code,  and, in addition, you must call pcre_compile()
108         any subject strings that are matched against it are  treated  as  UTF-8         with the PCRE_UTF8 option flag. When you do this, both the pattern  and
109           any  subject  strings  that are matched against it are treated as UTF-8
110         strings instead of just strings of bytes.         strings instead of just strings of bytes.
111    
112         If  you compile PCRE with UTF-8 support, but do not use it at run time,         If you compile PCRE with UTF-8 support, but do not use it at run  time,
113         the library will be a bit bigger, but the additional run time  overhead         the  library will be a bit bigger, but the additional run time overhead
114         is  limited  to testing the PCRE_UTF8 flag in several places, so should         is limited to testing the PCRE_UTF8 flag in several places,  so  should
115         not be very large.         not be very large.
116    
117           If PCRE is built with Unicode character property support (which implies
118           UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-
119           ported.  The available properties that can be tested are limited to the
120           general category properties such as Lu for an upper case letter  or  Nd
121           for  a decimal number. A full list is given in the pcrepattern documen-
122           tation. The PCRE library is increased in size by about 90K when Unicode
123           property support is included.
124    
125         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
126    
127         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and         1.  When you set the PCRE_UTF8 flag, the strings passed as patterns and
128         subjects  are  checked for validity on entry to the relevant functions.         subjects are checked for validity on entry to the  relevant  functions.
129         If an invalid UTF-8 string is passed, an error return is given. In some         If an invalid UTF-8 string is passed, an error return is given. In some
130         situations,  you  may  already  know  that  your strings are valid, and         situations, you may already know  that  your  strings  are  valid,  and
131         therefore want to skip these checks in order to improve performance. If         therefore want to skip these checks in order to improve performance. If
132         you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,         you set the PCRE_NO_UTF8_CHECK flag at compile time  or  at  run  time,
133         PCRE assumes that the pattern or subject  it  is  given  (respectively)         PCRE  assumes  that  the  pattern or subject it is given (respectively)
134         contains  only valid UTF-8 codes. In this case, it does not diagnose an         contains only valid UTF-8 codes. In this case, it does not diagnose  an
135         invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when         invalid  UTF-8 string. If you pass an invalid UTF-8 string to PCRE when
136         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may         PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program  may
137         crash.         crash.
138    
139         2. In a pattern, the escape sequence \x{...}, where the contents of the         2. In a pattern, the escape sequence \x{...}, where the contents of the
140         braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8         braces is a string of hexadecimal digits, is  interpreted  as  a  UTF-8
141         character whose code number is the given hexadecimal number, for  exam-         character  whose code number is the given hexadecimal number, for exam-
142         ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,         ple: \x{1234}. If a non-hexadecimal digit appears between  the  braces,
143         the item is not recognized.  This escape sequence can be used either as         the item is not recognized.  This escape sequence can be used either as
144         a literal, or within a character class.         a literal, or within a character class.
145    
146         3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte         3. The original hexadecimal escape sequence, \xhh, matches  a  two-byte
147         UTF-8 character if the value is greater than 127.         UTF-8 character if the value is greater than 127.
148    
149         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-         4.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-
150         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
151    
152         5.  The  dot  metacharacter  matches  one  UTF-8 character instead of a         5. The dot metacharacter matches one UTF-8 character instead of a  sin-
153         single byte.         gle byte.
154    
155         6. The escape sequence \C can be used to match a single byte  in  UTF-8         6.  The  escape sequence \C can be used to match a single byte in UTF-8
156         mode, but its use can lead to some strange effects.         mode, but its use can lead to some strange effects.
157    
158         7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
159         test characters of any code value, but the characters that PCRE  recog-         test  characters of any code value, but the characters that PCRE recog-
160         nizes  as  digits,  spaces,  or  word characters remain the same set as         nizes as digits, spaces, or word characters  remain  the  same  set  as
161         before, all with values less than 256.         before, all with values less than 256. This remains true even when PCRE
162           includes Unicode property support, because to do otherwise  would  slow
163         8. Case-insensitive matching applies only to  characters  whose  values         down  PCRE in many common cases. If you really want to test for a wider
164         are  less  than  256.  PCRE  does  not support the notion of "case" for         sense of, say, "digit", you must use Unicode  property  tests  such  as
165         higher-valued characters.         \p{Nd}.
166    
167         9. PCRE does not support the use of Unicode tables  and  properties  or         8.  Similarly,  characters that match the POSIX named character classes
168         the Perl escapes \p, \P, and \X.         are all low-valued characters.
169    
170           9. Case-insensitive matching applies only to  characters  whose  values
171           are  less than 128, unless PCRE is built with Unicode property support.
172           Even when Unicode property support is available, PCRE  still  uses  its
173           own  character  tables when checking the case of low-valued characters,
174           so as not to degrade performance.  The Unicode property information  is
175           used only for characters with higher values.
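
Item 1 above notes that setting PCRE_NO_UTF8_CHECK moves responsibility for
string validity onto the caller. The following is a minimal sketch of such a
caller-side check. It is not PCRE's internal test: the function name
is_valid_utf8 is hypothetical, and the rules applied are those of RFC 3629
(at most four bytes per character, no overlong forms, no surrogates), which
may be stricter than what a PCRE release of this era accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns 1 if the len bytes at s
   are well-formed UTF-8 according to RFC 3629, and 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;        /* continuation bytes expected after the lead */
        size_t k;
        unsigned long cp;    /* decoded code point */
        unsigned long min;   /* smallest code point allowed at this length */

        if (c < 0x80) {      /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;        /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= extra)
            return 0;        /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;    /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min)                       return 0;  /* overlong form */
        if (cp > 0x10FFFF)                  return 0;  /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF)   return 0;  /* surrogate */

        i += extra + 1;
    }
    return 1;
}
```

A program might run a check of this kind once, when data enters the system,
and only then pass PCRE_NO_UTF8_CHECK on later calls that reuse the same
strings.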
176    
177    
178  AUTHOR  AUTHOR
# Line 162  AUTHOR Line 182  AUTHOR
182         Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
183         Phone: +44 1223 334714         Phone: +44 1223 334714
184    
185  Last updated: 20 August 2003  Last updated: 09 September 2004
186  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
187  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
188    
189  PCRE(3)                                                                PCRE(3)  PCRE(3)                                                                PCRE(3)
# Line 177  PCRE BUILD-TIME OPTIONS Line 197  PCRE BUILD-TIME OPTIONS
197    
198         This  document  describes  the  optional  features  of PCRE that can be         This  document  describes  the  optional  features  of PCRE that can be
199         selected when the library is compiled. They are all selected, or  dese-         selected when the library is compiled. They are all selected, or  dese-
200         lected,  by  providing  options  to  the  configure script which is run         lected, by providing options to the configure script that is run before
201         before the make command. The complete list  of  options  for  configure         the make command. The complete list of  options  for  configure  (which
202         (which  includes the standard ones such as the selection of the instal-         includes  the  standard  ones such as the selection of the installation
203         lation directory) can be obtained by running         directory) can be obtained by running
204    
205           ./configure --help           ./configure --help
206    
# Line 204  UTF-8 SUPPORT Line 224  UTF-8 SUPPORT
224         function.         function.
225    
226    
227    UNICODE CHARACTER PROPERTY SUPPORT
228    
229           UTF-8 support allows PCRE to process character values greater than  255
230           in  the  strings that it handles. On its own, however, it does not pro-
231           vide any facilities for accessing the properties of such characters. If
232           you  want  to  be able to use the pattern escapes \P, \p, and \X, which
233           refer to Unicode character properties, you must add
234    
235             --enable-unicode-properties
236    
237           to the configure command. This implies UTF-8 support, even if you  have
238           not explicitly requested it.
239    
240           Including  Unicode  property  support  adds around 90K of tables to the
241           PCRE library, approximately doubling its size. Only the  general  cate-
242           gory  properties  such as Lu and Nd are supported. Details are given in
243           the pcrepattern documentation.
244    
245    
246  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
247    
248         By default, PCRE treats character 10 (linefeed) as the newline  charac-         By default, PCRE treats character 10 (linefeed) as the newline  charac-
# Line 231  BUILDING SHARED AND STATIC LIBRARIES Line 270  BUILDING SHARED AND STATIC LIBRARIES
270    
271  POSIX MALLOC USAGE  POSIX MALLOC USAGE
272    
273         When PCRE is called through the  POSIX  interface  (see  the  pcreposix         When PCRE is called through the POSIX interface (see the pcreposix doc-
274         documentation),  additional working storage is required for holding the         umentation),  additional  working  storage  is required for holding the
275         pointers to capturing substrings because PCRE requires  three  integers         pointers to capturing substrings, because PCRE requires three  integers
276         per  substring,  whereas  the POSIX interface provides only two. If the         per  substring,  whereas  the POSIX interface provides only two. If the
277         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
278         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
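
The trade-off described above (three integers per substring internally, two
exposed through the POSIX interface, stack storage for small counts) can be
sketched as follows. This is an illustration only, not the code of the PCRE
wrapper: SMALL_THRESHOLD, match_pair, matcher_fn, wrapper_exec and fake_match
are all hypothetical names, and the matcher is a stand-in so that the example
does not depend on pcre.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SMALL_THRESHOLD 10      /* assumed cut-off for using the stack */

/* Stand-in for the POSIX regmatch_t: two offsets per substring. */
typedef struct { int rm_so; int rm_eo; } match_pair;

/* Stand-in for a matcher that needs three ints of workspace per substring.
   It fills ovector[2*i], ovector[2*i+1] for each substring it sets and
   returns the number of substrings set (0 or more). */
typedef int (*matcher_fn)(int *ovector, int ovecsize);

/* Demo matcher: reports one substring covering offsets 2..5. */
int fake_match(int *ovector, int ovecsize)
{
    if (ovecsize < 3)
        return 0;
    ovector[0] = 2;
    ovector[1] = 5;
    return 1;
}

int wrapper_exec(matcher_fn match, size_t nmatch, match_pair *pmatch)
{
    int small[SMALL_THRESHOLD * 3];   /* stack workspace for small counts */
    int *ovector = small;
    int rc;
    size_t i;

    assert(match != NULL);

    if (nmatch > SMALL_THRESHOLD) {   /* too big for the stack buffer */
        ovector = malloc(sizeof(int) * nmatch * 3);
        if (ovector == NULL)
            return -1;
    }

    rc = match(ovector, (int)(nmatch * 3));

    /* Copy only the offset pairs out; the third int per substring is
       workspace that the two-field POSIX-style structure has no room for. */
    for (i = 0; i < nmatch; i++) {
        if (rc > 0 && i < (size_t)rc) {
            pmatch[i].rm_so = ovector[2 * i];
            pmatch[i].rm_eo = ovector[2 * i + 1];
        } else {
            pmatch[i].rm_so = pmatch[i].rm_eo = -1;   /* unset substring */
        }
    }

    if (ovector != small)
        free(ovector);
    return rc;
}
```

Calling wrapper_exec(fake_match, 2, pairs) stays on the stack buffer, while a
call with nmatch of 20 takes the malloc() path; both fill pairs[0] with the
offsets 2 and 5 and mark the remaining entries as unset.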
# Line 247  POSIX MALLOC USAGE Line 286  POSIX MALLOC USAGE
286    
287  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
288    
289         Internally,  PCRE  has a function called match() which it calls repeat-         Internally,  PCRE has a function called match(), which it calls repeat-
290         edly (possibly recursively) when performing a  matching  operation.  By         edly (possibly recursively) when matching a pattern. By controlling the
291         limiting  the  number of times this function may be called, a limit can         maximum  number  of  times  this function may be called during a single
292         be placed on the resources used by a single call  to  pcre_exec().  The         matching operation, a limit can be placed on the resources  used  by  a
293         limit  can be changed at run time, as described in the pcreapi documen-         single  call  to  pcre_exec(). The limit can be changed at run time, as
294         tation. The default is 10 million, but this can be changed by adding  a         described in the pcreapi documentation. The default is 10 million,  but
295         setting such as         this can be changed by adding a setting such as
296    
297           --with-match-limit=500000           --with-match-limit=500000
298    
# Line 264  HANDLING VERY LARGE PATTERNS Line 303  HANDLING VERY LARGE PATTERNS
303    
304         Within  a  compiled  pattern,  offset values are used to point from one         Within  a  compiled  pattern,  offset values are used to point from one
305         part to another (for example, from an opening parenthesis to an  alter-         part to another (for example, from an opening parenthesis to an  alter-
306         nation  metacharacter).  By  default two-byte values are used for these         nation  metacharacter).  By default, two-byte values are used for these
307         offsets, leading to a maximum size for a  compiled  pattern  of  around         offsets, leading to a maximum size for a  compiled  pattern  of  around
308         64K.  This  is sufficient to handle all but the most gigantic patterns.         64K.  This  is sufficient to handle all but the most gigantic patterns.
309         Nevertheless, some people do want to process enormous patterns,  so  it         Nevertheless, some people do want to process enormous patterns,  so  it
# Line 297  AVOIDING EXCESSIVE STACK USAGE Line 336  AVOIDING EXCESSIVE STACK USAGE
336           --disable-stack-for-recursion           --disable-stack-for-recursion
337    
338         to the configure command. With this configuration, PCRE  will  use  the         to the configure command. With this configuration, PCRE  will  use  the
339         pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory         pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
340         management functions. Separate functions are provided because the usage         ment functions. Separate functions are provided because  the  usage  is
341         is very predictable: the block sizes requested are always the same, and         very  predictable:  the  block sizes requested are always the same, and
342         the blocks are always freed in reverse order. A calling  program  might         the blocks are always freed in reverse order. A calling  program  might
343         be  able  to implement optimized functions that perform better than the         be  able  to implement optimized functions that perform better than the
344         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more
# Line 309  AVOIDING EXCESSIVE STACK USAGE Line 348  AVOIDING EXCESSIVE STACK USAGE
348  USING EBCDIC CODE  USING EBCDIC CODE
349    
350         PCRE  assumes  by  default that it will run in an environment where the         PCRE  assumes  by  default that it will run in an environment where the
351         character code is ASCII (or UTF-8, which is a superset of ASCII).  PCRE         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
352         can, however, be compiled to run in an EBCDIC environment by adding         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by
353           adding
354    
355           --enable-ebcdic           --enable-ebcdic
356    
357         to the configure command.         to the configure command.
358    
359  Last updated: 09 December 2003  Last updated: 09 September 2004
360  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
361  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
362    
363  PCRE(3)                                                                PCRE(3)  PCRE(3)                                                                PCRE(3)
# Line 327  PCRE(3) Line 367  PCRE(3)
367  NAME  NAME
368         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
369    
370  SYNOPSIS OF PCRE API  PCRE NATIVE API
371    
372         #include <pcre.h>         #include <pcre.h>
373    
# Line 392  SYNOPSIS OF PCRE API Line 432  SYNOPSIS OF PCRE API
432         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
433    
434    
435  PCRE API  PCRE API OVERVIEW
436    
437         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
438         is also a set of wrapper functions that correspond to the POSIX regular         is also a set of wrapper functions that correspond to the POSIX regular
439         expression API.  These are described in the pcreposix documentation.         expression API.  These are described in the pcreposix documentation.
440    
441         The  native  API  function  prototypes  are  defined in the header file         The  native  API  function  prototypes  are  defined in the header file
442         pcre.h, and on Unix systems the library itself is called libpcre.a,  so         pcre.h, and on Unix systems the library itself is  called  libpcre.  It
443         can be accessed by adding -lpcre to the command for linking an applica-         can normally be accessed by adding -lpcre to the command for linking an
444         tion which calls it. The header file defines the macros PCRE_MAJOR  and         application  that  uses  PCRE.  The  header  file  defines  the  macros
445         PCRE_MINOR  to  contain  the  major  and  minor release numbers for the         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
446         library. Applications can use these to include  support  for  different         bers for the library.  Applications can use these  to  include  support
447         releases.         for different releases of PCRE.
448    
449         The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used         The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used
450         for compiling and matching regular expressions. A sample  program  that         for compiling and matching regular expressions. A sample  program  that
451         demonstrates  the simplest way of using them is given in the file pcre-         demonstrates  the  simplest  way  of using them is provided in the file
452         demo.c. The pcresample documentation describes how to run it.         called pcredemo.c in the source distribution. The pcresample documenta-
453           tion describes how to run it.
454         There are convenience functions for extracting captured substrings from  
455         a matched subject string. They are:         In  addition  to  the  main compiling and matching functions, there are
456           convenience functions for extracting captured substrings from a matched
457           subject string.  They are:
458    
459           pcre_copy_substring()           pcre_copy_substring()
460           pcre_copy_named_substring()           pcre_copy_named_substring()
461           pcre_get_substring()           pcre_get_substring()
462           pcre_get_named_substring()           pcre_get_named_substring()
463           pcre_get_substring_list()           pcre_get_substring_list()
464             pcre_get_stringnumber()
465    
466         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
467         to free the memory used for extracted strings.         to free the memory used for extracted strings.
468    
469         The function pcre_maketables() is used (optionally) to build a  set  of         The function pcre_maketables() is used to  build  a  set  of  character
470         character tables in the current locale for passing to pcre_compile().         tables   in  the  current  locale  for  passing  to  pcre_compile()  or
471           pcre_exec().  This is an optional facility that is  provided  for  spe-
472         The  function  pcre_fullinfo()  is used to find out information about a         cialist use. Most commonly, no special tables are passed, in which case
473         compiled pattern; pcre_info() is an obsolete version which returns only         internal tables that are generated when PCRE is built are used.
474         some  of  the available information, but is retained for backwards com-  
475         patibility.  The function pcre_version() returns a pointer to a  string         The function pcre_fullinfo() is used to find out  information  about  a
476           compiled  pattern; pcre_info() is an obsolete version that returns only
477           some of the available information, but is retained for  backwards  com-
478           patibility.   The function pcre_version() returns a pointer to a string
479         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
480    
481         The  global  variables  pcre_malloc and pcre_free initially contain the         The global variables pcre_malloc and pcre_free  initially  contain  the
482         entry points of the standard  malloc()  and  free()  functions  respec-         entry  points  of  the  standard malloc() and free() functions, respec-
483         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
484         so a calling program can replace them if it  wishes  to  intercept  the         so  a  calling  program  can replace them if it wishes to intercept the
485         calls. This should be done before calling any PCRE functions.         calls. This should be done before calling any PCRE functions.
486    
487         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
488         indirections to memory management functions.  These  special  functions         indirections  to  memory  management functions. These special functions
489         are  used  only  when  PCRE is compiled to use the heap for remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
490         data, instead of recursive function calls. This is a  non-standard  way         data,  instead  of recursive function calls. This is a non-standard way
491         of  building  PCRE,  for  use in environments that have limited stacks.         of building PCRE, for use in environments  that  have  limited  stacks.
492         Because of the greater use of memory management, it runs  more  slowly.         Because  of  the greater use of memory management, it runs more slowly.
493         Separate  functions  are provided so that special-purpose external code         Separate functions are provided so that special-purpose  external  code
494         can be used for this case. When used, these functions are always called         can be used for this case. When used, these functions are always called
495         in  a  stack-like  manner  (last obtained, first freed), and always for         in a stack-like manner (last obtained, first  freed),  and  always  for
496         memory blocks of the same size.         memory blocks of the same size.
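
Because the blocks are all the same size and are freed in reverse order of allocation, a replacement for these functions can be very simple. The following is a sketch only, under stated assumptions: the names pool_stack_malloc and pool_stack_free, the block size and the pool depth are all invented for illustration, and the code is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

#define FRAME_SIZE  256   /* assumed block size; the real size depends on the build */
#define FRAME_COUNT 64    /* assumed maximum nesting depth for this sketch */

/* A fixed pool of equally sized frames; the union keeps each frame aligned. */
static union {
    unsigned char bytes[FRAME_SIZE];
    long double   align_ld;
    void         *align_ptr;
} frames[FRAME_COUNT];

static size_t frames_in_use = 0;

/* Hand out the next unused frame, or NULL if the request cannot be met. */
void *pool_stack_malloc(size_t size)
{
    if (size > FRAME_SIZE || frames_in_use == FRAME_COUNT)
        return NULL;
    return frames[frames_in_use++].bytes;
}

/* Release the most recently obtained frame (last obtained, first freed). */
void pool_stack_free(void *block)
{
    assert(frames_in_use > 0);
    assert(block == (void *)frames[frames_in_use - 1].bytes);
    frames_in_use--;
}
```

With a heap-using build of PCRE, functions like these would be assigned to pcre_stack_malloc and pcre_stack_free. The assertions in pool_stack_free check that the caller really is freeing blocks in stack order.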
497    
498         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
499         by  the  caller  to  a "callout" function, which PCRE will then call at         by the caller to a "callout" function, which PCRE  will  then  call  at
500         specified points during a matching operation. Details are given in  the         specified  points during a matching operation. Details are given in the
501         pcrecallout documentation.         pcrecallout documentation.
502    
503    
504  MULTITHREADING  MULTITHREADING
505    
506         The  PCRE  functions  can be used in multi-threading applications, with         The PCRE functions can be used in  multi-threading  applications,  with
507         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
508         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
509         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
510    
511         The  compiled form of a regular expression is not altered during match-         The compiled form of a regular expression is not altered during  match-
512         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
513         at once.         at once.
514    
515    
516    SAVING PRECOMPILED PATTERNS FOR LATER USE
517    
518           The compiled form of a regular expression can be saved and re-used at a
519           later  time,  possibly by a different program, and even on a host other
520           than the one on which  it  was  compiled.  Details  are  given  in  the
521           pcreprecompile documentation.
522    
523    
524  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
525    
526         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
# Line 486  CHECKING BUILD-TIME OPTIONS Line 540  CHECKING BUILD-TIME OPTIONS
540         The  output is an integer that is set to one if UTF-8 support is avail-         The  output is an integer that is set to one if UTF-8 support is avail-
541         able; otherwise it is set to zero.         able; otherwise it is set to zero.
542    
543             PCRE_CONFIG_UNICODE_PROPERTIES
544    
545           The output is an integer that is set to  one  if  support  for  Unicode
546           character properties is available; otherwise it is set to zero.
547    
548           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
549    
550         The output is an integer that is set to the value of the code  that  is         The  output  is an integer that is set to the value of the code that is
551         used  for the newline character. It is either linefeed (10) or carriage         used for the newline character. It is either linefeed (10) or  carriage
552         return (13), and should normally be the  standard  character  for  your         return  (13),  and  should  normally be the standard character for your
553         operating system.         operating system.
554    
555           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
556    
557         The  output  is  an  integer that contains the number of bytes used for         The output is an integer that contains the number  of  bytes  used  for
558         internal linkage in compiled regular expressions. The value is 2, 3, or         internal linkage in compiled regular expressions. The value is 2, 3, or
559         4.  Larger  values  allow larger regular expressions to be compiled, at         4. Larger values allow larger regular expressions to  be  compiled,  at
560         the expense of slower matching. The default value of  2  is  sufficient         the  expense  of  slower matching. The default value of 2 is sufficient
561         for  all  but  the  most massive patterns, since it allows the compiled         for all but the most massive patterns, since  it  allows  the  compiled
562         pattern to be up to 64K in size.         pattern to be up to 64K in size.
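
The 64K figure follows from simple arithmetic: a 2-byte link field can hold offsets up to 65535. The sketch below shows that calculation for each link size. The function name max_link_offset is hypothetical, and the result is only the arithmetic capacity of the field; the limits that PCRE actually enforces for the larger sizes may differ.

```c
#include <assert.h>

/* Largest offset that fits in a link field of the given width in bytes. */
unsigned long max_link_offset(int link_size)
{
    assert(link_size >= 1 && link_size <= 4);
    if (link_size == 4)
        return 0xFFFFFFFFul;                 /* avoid shifting by a full 32 bits */
    return (1ul << (8 * link_size)) - 1ul;
}
```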
563    
564           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
565    
566         The output is an integer that contains the threshold  above  which  the         The  output  is  an integer that contains the threshold above which the
567         POSIX  interface  uses malloc() for output vectors. Further details are         POSIX interface uses malloc() for output vectors. Further  details  are
568         given in the pcreposix documentation.         given in the pcreposix documentation.
569    
570           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
571    
572         The output is an integer that gives the default limit for the number of         The output is an integer that gives the default limit for the number of
573         internal  matching  function  calls in a pcre_exec() execution. Further         internal matching function calls in a  pcre_exec()  execution.  Further
574         details are given with pcre_exec() below.         details are given with pcre_exec() below.
575    
576           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
577    
578         The output is an integer that is set to one if  internal  recursion  is         The  output  is  an integer that is set to one if internal recursion is
579         implemented  by recursive function calls that use the stack to remember         implemented by recursive function calls that use the stack to  remember
580         their state. This is the usual way that PCRE is compiled. The output is         their state. This is the usual way that PCRE is compiled. The output is
581         zero  if PCRE was compiled to use blocks of data on the heap instead of         zero if PCRE was compiled to use blocks of data on the heap instead  of
582         recursive  function  calls.  In  this   case,   pcre_stack_malloc   and         recursive   function   calls.   In  this  case,  pcre_stack_malloc  and
583         pcre_stack_free  are  called  to manage memory blocks on the heap, thus         pcre_stack_free are called to manage memory blocks on  the  heap,  thus
584         avoiding the use of the stack.         avoiding the use of the stack.
585    
586    
# Line 531  COMPILING A PATTERN Line 590  COMPILING A PATTERN
590              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
591              const unsigned char *tableptr);              const unsigned char *tableptr);
592    
593           The  function  pcre_compile()  is  called  to compile a pattern into an
594         The function pcre_compile() is called to  compile  a  pattern  into  an         internal form. The pattern is a C string terminated by a  binary  zero,
595         internal  form.  The pattern is a C string terminated by a binary zero,         and  is  passed in the pattern argument. A pointer to a single block of
596         and is passed in the argument pattern. A pointer to a single  block  of         memory that is obtained via pcre_malloc is returned. This contains  the
597         memory  that is obtained via pcre_malloc is returned. This contains the         compiled  code  and  related  data.  The  pcre  type is defined for the
598         compiled code and related data.  The  pcre  type  is  defined  for  the         returned block; this is a typedef for a structure  whose  contents  are
599         returned  block;  this  is a typedef for a structure whose contents are         not  externally defined. It is up to the caller to free the memory when
        not externally defined. It is up to the caller to free the memory  when  
600         it is no longer required.         it is no longer required.
601    
602         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although the compiled code of a PCRE regex is relocatable, that is,  it
603         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
604         fully relocatable, because it contains a copy of the tableptr argument,         fully relocatable, because it may contain a copy of the tableptr  argu-
605         which is an address (see below).         ment, which is an address (see below).
606    
607         The options argument contains independent bits that affect the compila-         The options argument contains independent bits that affect the compila-
608         tion.  It  should  be  zero  if  no  options  are required. Some of the         tion. It should be zero if  no  options  are  required.  The  available
609         options, in particular, those that are compatible with Perl,  can  also         options  are  described  below. Some of them, in particular, those that
610         be  set and unset from within the pattern (see the detailed description         are compatible with Perl, can also be set and  unset  from  within  the
611         of regular expressions in the  pcrepattern  documentation).  For  these         pattern  (see  the  detailed  description in the pcrepattern documenta-
612         options,  the  contents of the options argument specifies their initial         tion). For these options, the contents of the options  argument  speci-
613         settings at the start of compilation and execution.  The  PCRE_ANCHORED         fies  their initial settings at the start of compilation and execution.
614         option can be set at the time of matching as well as at compile time.         The PCRE_ANCHORED option can be set at the time of matching as well  as
615           at compile time.
616    
617         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
618         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
# Line 564  COMPILING A PATTERN Line 623  COMPILING A PATTERN
623         given.         given.
624    
625         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
626         character tables which are built when it is compiled, using the default         character tables that are  built  when  PCRE  is  compiled,  using  the
627         C  locale.  Otherwise,  tableptr  must  be  the  result  of  a  call to         default  C  locale.  Otherwise, tableptr must be an address that is the
628         pcre_maketables(). See the section on locale support below.         result of a call to pcre_maketables(). This value is  stored  with  the
629           compiled  pattern,  and used again by pcre_exec(), unless another table
630           pointer is passed to it. For more discussion, see the section on locale
631           support below.
632    
633         This code fragment shows a typical straightforward  call  to  pcre_com-         This  code  fragment  shows a typical straightforward call to pcre_com-
634         pile():         pile():
635    
636           pcre *re;           pcre *re;
# Line 581  COMPILING A PATTERN Line 643  COMPILING A PATTERN
643             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
644             NULL);            /* use default character tables */             NULL);            /* use default character tables */
645    
646         The following option bits are defined:         The following names for option bits are defined in  the  pcre.h  header
647           file:
648    
649           PCRE_ANCHORED           PCRE_ANCHORED
650    
651         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
652         is constrained to match only at the first matching point in the  string         is constrained to match only at the first matching point in the  string
653         which is being searched (the "subject string"). This effect can also be         that  is being searched (the "subject string"). This effect can also be
654         achieved by appropriate constructs in the pattern itself, which is  the         achieved by appropriate constructs in the pattern itself, which is  the
655         only way to do it in Perl.         only way to do it in Perl.
656    
657             PCRE_AUTO_CALLOUT
658    
659           If this bit is set, pcre_compile() automatically inserts callout items,
660           all with number 255, before each pattern item. For  discussion  of  the
661           callout facility, see the pcrecallout documentation.
662    
663           PCRE_CASELESS           PCRE_CASELESS
664    
665         If  this  bit is set, letters in the pattern match both upper and lower         If  this  bit is set, letters in the pattern match both upper and lower
666         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
667         changed within a pattern by a (?i) option setting.         changed  within  a  pattern  by  a (?i) option setting. When running in
668           UTF-8 mode, case support for high-valued characters is  available  only
669           when PCRE is built with Unicode character property support.
670    
671           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
672    
# Line 645  COMPILING A PATTERN Line 716  COMPILING A PATTERN
716           PCRE_MULTILINE           PCRE_MULTILINE
717    
718         By default, PCRE treats the subject string as consisting  of  a  single         By default, PCRE treats the subject string as consisting  of  a  single
719         "line"  of  characters (even if it actually contains several newlines).         line  of characters (even if it actually contains newlines). The "start
720         The "start of line" metacharacter (^) matches only at the start of  the         of line" metacharacter (^) matches only at the  start  of  the  string,
721         string,  while  the "end of line" metacharacter ($) matches only at the         while  the  "end  of line" metacharacter ($) matches only at the end of
722         end of the string, or before a terminating  newline  (unless  PCRE_DOL-         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
723         LAR_ENDONLY is set). This is the same as Perl.         is set). This is the same as Perl.
724    
725         When  PCRE_MULTILINE  is  set,  the  "start  of line" and "end of line"         When  PCRE_MULTILINE  is  set,  the  "start  of line" and "end of line"
726         constructs match immediately following or immediately before  any  new-         constructs match immediately following or immediately before  any  new-
# Line 678  COMPILING A PATTERN Line 749  COMPILING A PATTERN
749    
750         This option causes PCRE to regard both the pattern and the  subject  as         This option causes PCRE to regard both the pattern and the  subject  as
751         strings  of  UTF-8 characters instead of single-byte character strings.         strings  of  UTF-8 characters instead of single-byte character strings.
752         However, it is available only if PCRE has been built to  include  UTF-8         However, it is available only when PCRE is built to include UTF-8  sup-
753         support.  If  not, the use of this option provokes an error. Details of         port.  If not, the use of this option provokes an error. Details of how
754         how this option changes the behaviour of PCRE are given in the  section         this option changes the behaviour of PCRE are given in the  section  on
755         on UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
756    
757           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
758    
# Line 691  COMPILING A PATTERN Line 762  COMPILING A PATTERN
762         is valid, and you want to skip this check for performance reasons,  you         is valid, and you want to skip this check for performance reasons,  you
763         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of
764         passing an invalid UTF-8 string as a pattern is undefined. It may cause         passing an invalid UTF-8 string as a pattern is undefined. It may cause
765         your  program  to  crash.  Note that there is a similar option for sup-         your  program  to  crash.   Note that this option can also be passed to
766         pressing the checking of subject strings passed to pcre_exec().         pcre_exec(),  to  suppress  the  UTF-8  validity  checking  of  subject
767           strings.
768    
769    
770  STUDYING A PATTERN  STUDYING A PATTERN
# Line 701  STUDYING A PATTERN Line 772  STUDYING A PATTERN
772         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
773              const char **errptr);              const char **errptr);
774    
775         When a pattern is going to be used several times, it is worth  spending         If  a  compiled  pattern is going to be used several times, it is worth
776         more  time  analyzing it in order to speed up the time taken for match-         spending more time analyzing it in order to speed up the time taken for
777         ing. The function pcre_study() takes a pointer to a compiled pattern as         matching.  The function pcre_study() takes a pointer to a compiled pat-
778         its first argument. If studing the pattern produces additional informa-         tern as its first argument. If studying the pattern produces additional
779         tion that will help speed up matching, pcre_study() returns  a  pointer         information  that  will  help speed up matching, pcre_study() returns a
780         to  a  pcre_extra  block,  in  which the study_data field points to the         pointer to a pcre_extra block, in which the study_data field points  to
781         results of the study.         the results of the study.
782    
783         The returned value from  a  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
784         pcre_exec().  However,  the pcre_extra block also contains other fields         pcre_exec(). However, a pcre_extra block  also  contains  other  fields
785         that can be set by the caller before the block  is  passed;  these  are         that  can  be  set  by the caller before the block is passed; these are
786         described  below.  If  studying  the pattern does not produce any addi-         described below in the section on matching a pattern.
787         tional information, pcre_study() returns NULL. In that circumstance, if  
788         the  calling  program  wants  to  pass  some  of  the  other  fields to         If studying the pattern does not produce  any  additional  information,
789         pcre_exec(), it must set up its own pcre_extra block.         pcre_study() returns NULL. In that circumstance, if the calling program
790           wants to pass any of the other fields to pcre_exec(), it  must  set  up
791         The second argument contains option bits. At present,  no  options  are         its own pcre_extra block.
792         defined for pcre_study(), and this argument should always be zero.  
793           The  second  argument of pcre_study() contains option bits. At present,
794         The  third argument for pcre_study() is a pointer for an error message.         no options are defined, and this argument should always be zero.
795         If studying succeeds (even if no data is  returned),  the  variable  it  
796         points  to  is set to NULL. Otherwise it points to a textual error mes-         The third argument for pcre_study() is a pointer for an error  message.
797         sage. You should therefore test the error pointer for NULL after  call-         If  studying  succeeds  (even  if no data is returned), the variable it
798           points to is set to NULL. Otherwise it points to a textual  error  mes-
799           sage.  You should therefore test the error pointer for NULL after call-
800         ing pcre_study(), to be sure that it has run successfully.         ing pcre_study(), to be sure that it has run successfully.
801    
802         This is a typical call to pcre_study():         This is a typical call to pcre_study():
# Line 735  STUDYING A PATTERN Line 808  STUDYING A PATTERN
808             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
809    
810         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
811         that do not have a single fixed starting character. A bitmap of  possi-         that  do not have a single fixed starting character. A bitmap of possi-
812         ble starting characters is created.         ble starting bytes is created.
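
To show why such a bitmap helps, here is a minimal sketch of a 256-bit map of possible starting bytes and a scan that skips subject positions where no match can begin. It illustrates the idea only and is not PCRE's internal code; the type start_bitmap and all the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One bit per possible byte value: 256 bits in 32 bytes. */
typedef struct {
    unsigned char bits[32];
} start_bitmap;

void bitmap_clear(start_bitmap *map)
{
    memset(map->bits, 0, sizeof map->bits);
}

/* Record that a match may start with this byte. */
void bitmap_add(start_bitmap *map, unsigned char byte)
{
    map->bits[byte / 8] |= (unsigned char)(1u << (byte % 8));
}

/* Return 1 if a match may start with this byte, otherwise 0. */
int bitmap_test(const start_bitmap *map, unsigned char byte)
{
    return (map->bits[byte / 8] >> (byte % 8)) & 1;
}

/* Return the offset of the first position at or after 'from' whose byte
   is in the map, or 'length' if there is no such position. */
size_t bitmap_first_candidate(const start_bitmap *map, const char *subject,
                              size_t length, size_t from)
{
    size_t i;
    assert(from <= length);
    for (i = from; i < length; i++)
        if (bitmap_test(map, (unsigned char)subject[i]))
            return i;
    return length;
}
```

For a pattern such as (dog|[xy]z), the map would contain d, x and y, so a matcher could move directly to the first of those bytes in the subject instead of attempting a match at every position.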
813    
814    
815  LOCALE SUPPORT  LOCALE SUPPORT
816    
817         PCRE  handles  caseless matching, and determines whether characters are         PCRE handles caseless matching, and determines whether  characters  are
818         letters, digits, or whatever, by reference to a  set  of  tables.  When         letters,  digits, or whatever, by reference to a set of tables, indexed
819         running  in UTF-8 mode, this applies only to characters with codes less         by character value. (When running in UTF-8 mode, this applies  only  to
820         than 256. The library contains a default set of tables that is  created         characters  with  codes  less than 128. Higher-valued codes never match
821         in  the  default  C locale when PCRE is compiled. This is used when the         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
822         final argument of pcre_compile() is NULL, and is  sufficient  for  many         with Unicode character property support.)
823         applications.  
824           An  internal set of tables is created in the default C locale when PCRE
825         An alternative set of tables can, however, be supplied. Such tables are         is built. This is used when the final  argument  of  pcre_compile()  is
826         built by calling the pcre_maketables() function,  which  has  no  argu-         NULL,  and  is  sufficient for many applications. An alternative set of
827         ments,  in  the  relevant  locale.  The  result  can  then be passed to         tables can, however, be supplied. These may be created in  a  different
828         pcre_compile() as often as necessary. For example,  to  build  and  use         locale  from the default. As more and more applications change to using
829         tables that are appropriate for the French locale (where accented char-         Unicode, the need for this locale support is expected to die away.
830         acters with codes greater than 128 are treated as letters), the follow-  
831         ing code could be used:         External tables are built by calling  the  pcre_maketables()  function,
832           which  has no arguments, in the relevant locale. The result can then be
833           passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
834           example,  to  build  and use tables that are appropriate for the French
835           locale (where accented characters with  values  greater  than  128  are
836           treated as letters), the following code could be used:
837    
838           setlocale(LC_CTYPE, "fr");           setlocale(LC_CTYPE, "fr_FR");
839           tables = pcre_maketables();           tables = pcre_maketables();
840           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
841    
842         The  tables  are  built in memory that is obtained via pcre_malloc. The         When  pcre_maketables()  runs,  the  tables are built in memory that is
843         pointer that is passed to pcre_compile is saved with the compiled  pat-         obtained via pcre_malloc. It is the caller's responsibility  to  ensure
844         tern, and the same tables are used via this pointer by pcre_study() and         that  the memory containing the tables remains available for as long as
845         pcre_exec(). Thus, for any single pattern,  compilation,  studying  and         it is needed.
846         matching  all  happen in the same locale, but different patterns can be  
847         compiled in different locales. It is  the  caller's  responsibility  to         The pointer that is passed to pcre_compile() is saved with the compiled
848         ensure  that  the memory containing the tables remains available for as         pattern,  and the same tables are used via this pointer by pcre_study()
849         long as it is needed.         and normally also by pcre_exec(). Thus, by default, for any single pat-
850           tern, compilation, studying and matching all happen in the same locale,
851           but different patterns can be compiled in different locales.
852    
853           It is possible to pass a table pointer or NULL (indicating the  use  of
854           the  internal  tables)  to  pcre_exec(). Although not intended for this
855           purpose, this facility could be used to match a pattern in a  different
856           locale from the one in which it was compiled. Passing table pointers at
857           run time is discussed below in the section on matching a pattern.
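
The idea of tables indexed by character value can be sketched with the standard ctype functions. The code below is a much simplified, hypothetical analogue of what a table builder does in the current locale; the structure demo_tables and its fields are invented for illustration and do not reflect the layout of the real PCRE tables.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical table layout, indexed by byte value (0 to 255). */
typedef struct {
    unsigned char lower[256];      /* lower-case equivalent of each byte */
    unsigned char flip[256];       /* opposite-case equivalent of each byte */
    unsigned char is_letter[256];  /* 1 if the byte is a letter */
    unsigned char is_digit[256];   /* 1 if the byte is a digit */
} demo_tables;

/* Build the tables using whatever locale is current when this is called.
   The caller owns the returned memory and must free() it. */
demo_tables *demo_maketables(void)
{
    demo_tables *t = malloc(sizeof *t);
    int c;
    if (t == NULL)
        return NULL;
    for (c = 0; c < 256; c++) {
        t->lower[c]     = (unsigned char)tolower(c);
        t->flip[c]      = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
        t->is_letter[c] = (unsigned char)(isalpha(c) != 0);
        t->is_digit[c]  = (unsigned char)(isdigit(c) != 0);
    }
    return t;
}

/* Caseless comparison of two bytes by table lookup. */
int demo_caseless_equal(const demo_tables *t, unsigned char a, unsigned char b)
{
    assert(t != NULL);
    return t->lower[a] == t->lower[b];
}
```

Because the tables are filled in from the locale that is current at the time of the call, calling setlocale() first changes which bytes above 127 count as letters and how they map between cases. That is the same reason the example above calls setlocale() before pcre_maketables().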
858    
859    
860  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
# Line 792  INFORMATION ABOUT A PATTERN Line 878  INFORMATION ABOUT A PATTERN
878           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
879           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
880    
881         Here  is a typical call of pcre_fullinfo(), to obtain the length of the         The  "magic  number" is placed at the start of each compiled pattern as
882         compiled pattern:         a  simple check against passing an arbitrary memory pointer. Here is  a
883           typical  call  of pcre_fullinfo(), to obtain the length of the compiled
884           pattern:
885    
886           int rc;           int rc;
887           unsigned long int length;           unsigned long int length;
# Line 817  INFORMATION ABOUT A PATTERN Line 905  INFORMATION ABOUT A PATTERN
905         Return  the  number of capturing subpatterns in the pattern. The fourth         Return  the  number of capturing subpatterns in the pattern. The fourth
906         argument should point to an int variable.         argument should point to an int variable.
907    
908             PCRE_INFO_DEFAULTTABLES
909    
910           Return a pointer to the internal default character tables within  PCRE.
911           The  fourth  argument should point to an unsigned char * variable. This
912           information call is provided for internal use by the pcre_study() func-
913           tion.  External  callers  can  cause PCRE to use its internal tables by
914           passing a NULL table pointer.
915    
916           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
917    
918         Return information about the first byte of any matched  string,  for  a         Return information about the first byte of any matched  string,  for  a
# Line 824  INFORMATION ABOUT A PATTERN Line 920  INFORMATION ABOUT A PATTERN
920         PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards         PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards
921         compatibility.)         compatibility.)
922    
923         If  there  is  a  fixed  first  byte,  e.g.  from  a  pattern  such  as         If  there  is  a  fixed first byte, for example, from a pattern such as
924         (cat|cow|coyote), it is returned in the integer pointed  to  by  where.         (cat|cow|coyote), it is returned in the integer pointed  to  by  where.
925         Otherwise, if either         Otherwise, if either
926    
# Line 862  INFORMATION ABOUT A PATTERN Line 958  INFORMATION ABOUT A PATTERN
958    
959         PCRE  supports the use of named as well as numbered capturing parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
960         ses. The names are just an additional way of identifying the  parenthe-         ses. The names are just an additional way of identifying the  parenthe-
961         ses,  which still acquire a number. A caller that wants to extract data         ses,  which  still  acquire  numbers.  A  convenience  function  called
962         from a named subpattern must convert the name to a number in  order  to         pcre_get_named_substring() is provided  for  extracting  an  individual
963         access  the  correct  pointers  in  the  output  vector (described with         captured  substring  by  name.  It is also possible to extract the data
964         pcre_exec() below). In order to do this, it must first use these  three         directly, by first converting the name to a number in order  to  access
965         values to obtain the name-to-number mapping table for the pattern.         the  correct  pointers in the output vector (described with pcre_exec()
966           below). To do the conversion, you need to use the  name-to-number  map,
967           which is described by these three values.
968    
969         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
970         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
# Line 884  INFORMATION ABOUT A PATTERN Line 982  INFORMATION ABOUT A PATTERN
982    
983         There are four named subpatterns, so the table has  four  entries,  and         There are four named subpatterns, so the table has  four  entries,  and
984         each  entry  in the table is eight bytes long. The table is as follows,         each  entry  in the table is eight bytes long. The table is as follows,
985         with non-printing bytes shown in hex, and undefined bytes shown as ??:         with non-printing bytes shown in hexadecimal, and undefined bytes shown
986           as ??:
987    
988           00 01 d  a  t  e  00 ??           00 01 d  a  t  e  00 ??
989           00 05 d  a  y  00 ?? ??           00 05 d  a  y  00 ?? ??
990           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
991           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
992    
993         When writing code to extract data from named subpatterns, remember that         When  writing  code  to  extract  data from named subpatterns using the
994         the length of each entry may be different for each compiled pattern.         name-to-number map, remember that the length of each entry is likely to
995           be different for each compiled pattern.
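
The lookup described above can be sketched in C as follows. This is a minimal illustration, not code from the PCRE distribution: lookup_name() and example_table are hypothetical names, the table is written out by hand instead of being obtained from pcre_fullinfo(), and the undefined (??) bytes are arbitrarily set to zero. It assumes that the first two bytes of each entry hold the subpattern number, most significant byte first, followed by the zero-terminated name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: look up a subpattern name in a name-to-number map.
 *   table       first byte of the map (as given by PCRE_INFO_NAMETABLE)
 *   count       number of entries     (as given by PCRE_INFO_NAMECOUNT)
 *   entry_size  bytes per entry       (as given by PCRE_INFO_NAMEENTRYSIZE)
 * Returns the subpattern number, or -1 if the name is not in the map. */
static int lookup_name(const unsigned char *table, int count,
                       int entry_size, const char *name)
{
    int i;
    assert(entry_size >= 3);  /* two number bytes plus a terminating zero */
    for (i = 0; i < count; i++) {
        /* Step by the entry size reported for THIS pattern, never a constant. */
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];  /* most significant byte first */
    }
    return -1;
}

/* The four-entry example map shown above, eight bytes per entry.
 * Undefined (??) bytes are written as zero here. */
static const unsigned char example_table[] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

With this map, lookup_name(example_table, 4, 8, "month") yields 4. Passing the entry size as a parameter, rather than hard-coding 8, is what keeps the helper usable with other compiled patterns.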
996    
997           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
998    
# Line 922  INFORMATION ABOUT A PATTERN Line 1022  INFORMATION ABOUT A PATTERN
1022    
1023           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1024    
1025         Returns  the  size of the data block pointed to by the study_data field         Return the size of the data block pointed to by the study_data field in
1026         in a pcre_extra block. That is, it is the  value  that  was  passed  to         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to
1027         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1028         created by pcre_study(). The fourth argument should point to  a  size_t         created by pcre_study(). The fourth argument should point to  a  size_t
1029         variable.         variable.
# Line 958  MATCHING A PATTERN Line 1058  MATCHING A PATTERN
1058              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1059    
1060         The  function pcre_exec() is called to match a subject string against a         The  function pcre_exec() is called to match a subject string against a
1061         pre-compiled pattern, which is passed in the code argument. If the pat-         compiled pattern, which is passed in the code argument. If the  pattern
1062         tern  has been studied, the result of the study should be passed in the         has been studied, the result of the study should be passed in the extra
1063         extra argument.         argument.
1064    
1065           In most applications, the pattern will have been compiled (and  option-
1066           ally  studied)  in the same process that calls pcre_exec(). However, it
1067           is possible to save compiled patterns and study data, and then use them
1068           later  in  different processes, possibly even on different hosts. For a
1069           discussion about this, see the pcreprecompile documentation.
1070    
1071         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
1072    
# Line 973  MATCHING A PATTERN Line 1079  MATCHING A PATTERN
1079             11,             /* the length of the subject string */             11,             /* the length of the subject string */
1080             0,              /* start at offset 0 in the subject */             0,              /* start at offset 0 in the subject */
1081             0,              /* default options */             0,              /* default options */
1082             ovector,        /* vector for substring information */             ovector,        /* vector of integers for substring information */
1083             30);            /* number of elements in the vector */             30);            /* number of elements in the vector  (NOT  size  in
1084           bytes) */
1085         If the extra argument is not NULL, it must point to a  pcre_extra  data  
1086         block.  The pcre_study() function returns such a block (when it doesn't     Extra data for pcre_exec()
1087         return NULL), but you can also create one for yourself, and pass  addi-  
1088         tional information in it. The fields in the block are as follows:         If  the  extra argument is not NULL, it must point to a pcre_extra data
1089           block. The pcre_study() function returns such a block (when it  doesn't
1090           return  NULL), but you can also create one for yourself, and pass addi-
1091           tional information in it. The fields in a pcre_extra block are as  fol-
1092           lows:
1093    
1094           unsigned long int flags;           unsigned long int flags;
1095           void *study_data;           void *study_data;
1096           unsigned long int match_limit;           unsigned long int match_limit;
1097           void *callout_data;           void *callout_data;
1098             const unsigned char *tables;
1099    
1100         The  flags  field  is a bitmap that specifies which of the other fields         The  flags  field  is a bitmap that specifies which of the other fields
1101         are set. The flag bits are:         are set. The flag bits are:
# Line 992  MATCHING A PATTERN Line 1103  MATCHING A PATTERN
1103           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1104           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1105           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1106             PCRE_EXTRA_TABLES
1107    
1108         Other flag bits should be set to zero. The study_data field is  set  in         Other flag bits should be set to zero. The study_data field is  set  in
1109         the  pcre_extra  block  that is returned by pcre_study(), together with         the  pcre_extra  block  that is returned by pcre_study(), together with
1110         the appropriate flag bit. You should not set this yourself, but you can         the appropriate flag bit. You should not set this yourself, but you may
1111         add to the block by setting the other fields.         add  to  the  block by setting the other fields and their corresponding
1112           flag bits.
1113    
1114         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1115         a vast amount of resources when running patterns that are not going  to         a  vast amount of resources when running patterns that are not going to
1116         match,  but  which  have  a very large number of possibilities in their         match, but which have a very large number  of  possibilities  in  their
1117         search trees. The classic  example  is  the  use  of  nested  unlimited         search  trees.  The  classic  example  is  the  use of nested unlimited
1118         repeats. Internally, PCRE uses a function called match() which it calls         repeats.
1119         repeatedly (sometimes recursively). The limit is imposed on the  number  
1120         of  times  this function is called during a match, which has the effect         Internally, PCRE uses a function called match() which it calls  repeat-
1121         of limiting the amount of recursion  and  backtracking  that  can  take         edly  (sometimes  recursively).  The  limit is imposed on the number of
1122         place.  For  patterns that are not anchored, the count starts from zero         times this function is called during a match, which has the  effect  of
1123         for each position in the subject string.         limiting  the amount of recursion and backtracking that can take place.
1124           For patterns that are not anchored, the count starts from zero for each
1125         The default limit for the library can be set when PCRE  is  built;  the         position in the subject string.
1126         default  default  is 10 million, which handles all but the most extreme  
1127         cases. You can reduce  the  default  by  supplying pcre_exec()  with  a         The  default  limit  for the library can be set when PCRE is built; the
1128         pcre_extra  block  in  which match_limit is set to a smaller value, and         default default is 10 million, which handles all but the  most  extreme
1129         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is         cases.  You  can  reduce  the  default  by supplying pcre_exec() with a
1130           pcre_extra block in which match_limit is set to a  smaller  value,  and
1131           PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
1132         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1133    
1134         The  callout_data  field is used in conjunction with the "callout" fea-         The callout_data field is used in conjunction with the  "callout"  fea-
1135         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1136    
1137         The PCRE_ANCHORED option can be passed in the options  argument,  whose         The  tables  field  is  used  to  pass  a  character  tables pointer to
1138         unused  bits  must  be zero. This limits pcre_exec() to matching at the         pcre_exec(); this overrides the value that is stored with the  compiled
1139         first matching position.  However,  if  a  pattern  was  compiled  with         pattern.  A  non-NULL value is stored with the compiled pattern only if
1140         PCRE_ANCHORED,  or turned out to be anchored by virtue of its contents,         custom tables were supplied to pcre_compile() via  its  tableptr  argu-
1141         it cannot be made unanchored at matching time.        ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1142           PCRE's internal tables to be used. This facility is  helpful  when  re-
1143         When PCRE_UTF8 was set at compile time, the validity of the subject  as         using  patterns  that  have been saved after compiling with an external
1144         a  UTF-8  string is automatically checked, and the value of startoffset         set of tables, because the external tables  might  be  at  a  different
1145         is also checked to ensure that it points to the start of a UTF-8  char-         address  when  pcre_exec() is called. See the pcreprecompile documenta-
1146         acter.  If  an  invalid  UTF-8  sequence of bytes is found, pcre_exec()         tion for a discussion of saving compiled patterns for later use.
1147         returns  the  error  PCRE_ERROR_BADUTF8.  If  startoffset  contains  an  
1148         invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.     Option bits for pcre_exec()
1149    
1150           The unused bits of the options argument for pcre_exec() must  be  zero.
1151           The   only  bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NOTBOL,
1152           PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.
1153    
1154         If  you  already  know that your subject is valid, and you want to skip           PCRE_ANCHORED
        these   checks   for   performance   reasons,   you   can    set    the  
        PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to  
        do this for the second and subsequent calls to pcre_exec() if  you  are  
        making  repeated  calls  to  find  all  the matches in a single subject  
        string. However, you should be  sure  that  the  value  of  startoffset  
        points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is  
        set, the effect of passing an invalid UTF-8 string as a subject,  or  a  
        value  of startoffset that does not point to the start of a UTF-8 char-  
        acter, is undefined. Your program may crash.  
1155    
1156         There are also three further options that can be set only  at  matching         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1157         time:         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1158           turned out to be anchored by virtue of its contents, it cannot be  made
1159           unanchored at matching time.
1160    
1161           PCRE_NOTBOL           PCRE_NOTBOL
1162    
1163         The  first  character  of the string is not the beginning of a line, so         This option specifies that the first character of the subject string is not
1164         the circumflex metacharacter should not match before it.  Setting  this         the beginning of a line, so the  circumflex  metacharacter  should  not
1165         without  PCRE_MULTILINE  (at  compile  time) causes circumflex never to         match  before it. Setting this without PCRE_MULTILINE (at compile time)
1166         match.         causes  circumflex  never  to  match.  This  option  affects  only  the
1167           behaviour of the circumflex metacharacter. It does not affect \A.
1168    
1169           PCRE_NOTEOL           PCRE_NOTEOL
1170    
1171         The end of the string is not the end of a line, so the dollar metachar-         This option specifies that the end of the subject string is not the end
1172         acter  should  not  match  it  nor (except in multiline mode) a newline         of a line, so the dollar metacharacter should not match it nor  (except
1173         immediately before it. Setting this without PCRE_MULTILINE (at  compile         in  multiline mode) a newline immediately before it. Setting this with-
1174         time) causes dollar never to match.         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1175           option  affects only the behaviour of the dollar metacharacter. It does
1176           not affect \Z or \z.
1177    
1178           PCRE_NOTEMPTY           PCRE_NOTEMPTY
1179    
1180         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
1181         set. If there are alternatives in the pattern, they are tried.  If  all         set.  If  there are alternatives in the pattern, they are tried. If all
1182         the  alternatives  match  the empty string, the entire match fails. For         the alternatives match the empty string, the entire  match  fails.  For
1183         example, if the pattern         example, if the pattern
1184    
1185           a?b?           a?b?
1186    
1187         is applied to a string not beginning with "a" or "b",  it  matches  the         is  applied  to  a string not beginning with "a" or "b", it matches the
1188         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
1189         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
1190         rences of "a" or "b".         rences of "a" or "b".
1191    
1192         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1193         cial case of a pattern match of the empty  string  within  its  split()         cial  case  of  a  pattern match of the empty string within its split()
1194         function,  and  when  using  the /g modifier. It is possible to emulate         function, and when using the /g modifier. It  is  possible  to  emulate
1195         Perl's behaviour after matching a null string by first trying the match         Perl's behaviour after matching a null string by first trying the match
1196         again at the same offset with PCRE_NOTEMPTY set, and then if that fails         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1197         by advancing the starting offset (see below)  and  trying  an  ordinary         if  that  fails by advancing the starting offset (see below) and trying
1198         match again.         an ordinary match again. There is some code that demonstrates how to do
1199           this in the pcredemo.c sample program.
1200    
1201             PCRE_NO_UTF8_CHECK
1202    
1203           When PCRE_UTF8 is set at compile time, the validity of the subject as a
1204           UTF-8 string is automatically checked when pcre_exec() is  subsequently
1205           called.   The  value  of  startoffset is also checked to ensure that it
1206           points to the start of a UTF-8 character. If an invalid UTF-8  sequence
1207           of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1208           startoffset contains an  invalid  value,  PCRE_ERROR_BADUTF8_OFFSET  is
1209           returned.
1210    
1211           If  you  already  know that your subject is valid, and you want to skip
1212           these   checks   for   performance   reasons,   you   can    set    the
1213           PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
1214           do this for the second and subsequent calls to pcre_exec() if  you  are
1215           making  repeated  calls  to  find  all  the matches in a single subject
1216           string. However, you should be  sure  that  the  value  of  startoffset
1217           points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1218           set, the effect of passing an invalid UTF-8 string as a subject,  or  a
1219           value  of startoffset that does not point to the start of a UTF-8 char-
1220           acter, is undefined. Your program may crash.
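
Because behaviour is undefined when PCRE_NO_UTF8_CHECK is set and startoffset is wrong, a caller may want a cheap boundary test of its own before skipping the library's checks. The following is a minimal sketch; is_utf8_char_start() is a hypothetical helper, not part of PCRE. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx. It does not validate the subject as a whole, and treating the end of the subject as a boundary is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if offset lies on a UTF-8 character
 * boundary within subject (or exactly at its end), and 0 otherwise.
 * Only the byte at the offset is examined; the rest of the subject is
 * NOT checked for validity. */
static int is_utf8_char_start(const char *subject, int length, int offset)
{
    assert(subject != NULL && length >= 0);
    if (offset < 0 || offset > length)
        return 0;                 /* outside the subject altogether */
    if (offset == length)
        return 1;                 /* end of subject (assumed acceptable) */
    /* Continuation bytes are 10xxxxxx; anything else starts a character. */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

For the four-byte subject "a\xC3\xA9z" (the letter a, a two-byte character, the letter z), offsets 0, 1 and 3 are character starts and offset 2 is not.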
1221    
1222             PCRE_PARTIAL
1223    
1224           This option turns on the  partial  matching  feature.  If  the  subject
1225           string  fails to match the pattern, but at some point during the match-
1226           ing process the end of the subject was reached (that  is,  the  subject
1227           partially  matches  the  pattern and the failure to match occurred only
1228           because there were not enough subject characters), pcre_exec()  returns
1229           PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
1230           used, there are restrictions on what may appear in the  pattern.  These
1231           are discussed in the pcrepartial documentation.
1232    
1233       The string to be matched by pcre_exec()
1234    
1235         The  subject string is passed to pcre_exec() as a pointer in subject, a         The  subject string is passed to pcre_exec() as a pointer in subject, a
1236         length in length, and a starting byte offset in startoffset. Unlike the         length in length, and a starting byte offset in startoffset.  In  UTF-8
1237         pattern  string,  the  subject  may contain binary zero bytes. When the         mode,  the  byte  offset  must point to the start of a UTF-8 character.
1238         starting offset is zero, the search for a match starts at the beginning         Unlike the pattern string, the subject may contain binary  zero  bytes.
1239         of the subject, and this is by far the most common case.         When  the starting offset is zero, the search for a match starts at the
1240           beginning of the subject, and this is by far the most common case.
1241         If the pattern was compiled with the PCRE_UTF8 option, the subject must  
1242         be a sequence of bytes that is a valid UTF-8 string, and  the  starting         A non-zero starting offset is useful when searching for  another  match
1243         offset  must point to the beginning of a UTF-8 character. If an invalid         in  the same subject by calling pcre_exec() again after a previous suc-
1244         UTF-8 string or offset is passed, an error  (either  PCRE_ERROR_BADUTF8         cess.  Setting startoffset differs from just passing over  a  shortened
1245         or   PCRE_ERROR_BADUTF8_OFFSET)   is   returned,   unless   the  option         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
        PCRE_NO_UTF8_CHECK is set,  in  which  case  PCRE's  behaviour  is  not  
        defined.  
   
        A  non-zero  starting offset is useful when searching for another match  
        in the same subject by calling pcre_exec() again after a previous  suc-  
        cess.   Setting  startoffset differs from just passing over a shortened  
        string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins  
1246         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
1247    
1248           \Biss\B           \Biss\B
1249    
1250         which  finds  occurrences  of "iss" in the middle of words. (\B matches         which finds occurrences of "iss" in the middle of  words.  (\B  matches
1251         only if the current position in the subject is not  a  word  boundary.)         only  if  the  current position in the subject is not a word boundary.)
1252         When  applied  to the string "Mississippi" the first call to pcre_exec()         When applied to the string "Mississippi" the first call  to  pcre_exec()
1253         finds the first occurrence. If pcre_exec() is called  again  with  just         finds  the  first  occurrence. If pcre_exec() is called again with just
1254         the  remainder  of  the  subject,  namely  "issippi", it does not match,         the remainder of the subject,  namely  "issippi",  it  does  not  match,
1255         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
1256         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to be a word boundary. However, if pcre_exec()  is  passed  the  entire
1257         string again, but with startoffset  set  to  4,  it  finds  the  second         string again, but with startoffset set to 4, it finds the second occur-
1258         occurrence  of  "iss"  because  it  is able to look behind the starting         rence of "iss" because it is able to look behind the starting point  to
1259         point to discover that it is preceded by a letter.         discover that it is preceded by a letter.
1260    
1261         If a non-zero starting offset is passed when the pattern  is  anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
1262         one  attempt  to match at the given offset is tried. This can only suc-         one attempt to match at the given offset is made. This can only succeed
1263         ceed if the pattern does not require the match to be at  the  start  of         if  the  pattern  does  not require the match to be at the start of the
1264         the subject.         subject.
1265    
1266       How pcre_exec() returns captured substrings
1267    
1268         In  general, a pattern matches a certain portion of the subject, and in         In general, a pattern matches a certain portion of the subject, and  in
1269         addition, further substrings from the subject  may  be  picked  out  by         addition,  further  substrings  from  the  subject may be picked out by
1270         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
1271         this is called "capturing" in what follows, and the  phrase  "capturing         this  is  called "capturing" in what follows, and the phrase "capturing
1272         subpattern"  is  used for a fragment of a pattern that picks out a sub-         subpattern" is used for a fragment of a pattern that picks out  a  sub-
1273         string. PCRE supports several other kinds of  parenthesized  subpattern         string.  PCRE  supports several other kinds of parenthesized subpattern
1274         that do not cause substrings to be captured.         that do not cause substrings to be captured.
1275    
1276         Captured  substrings are returned to the caller via a vector of integer         Captured substrings are returned to the caller via a vector of  integer
1277         offsets whose address is passed in ovector. The number of  elements  in         offsets  whose  address is passed in ovector. The number of elements in
1278         the vector is passed in ovecsize. The first two-thirds of the vector is         the vector is passed in ovecsize, which must be a non-negative  number.
1279         used to pass back captured substrings, each substring using a  pair  of         Note: this argument is NOT the size of ovector in bytes.
1280         integers.  The  remaining  third  of the vector is used as workspace by  
1281         pcre_exec() while matching capturing subpatterns, and is not  available         The  first  two-thirds of the vector is used to pass back captured sub-
1282         for  passing  back  information.  The  length passed in ovecsize should         strings, each substring using a pair of integers. The  remaining  third
1283         always be a multiple of three. If it is not, it is rounded down.         of  the  vector is used as workspace by pcre_exec() while matching cap-
1284           turing subpatterns, and is not available for passing back  information.
1285           The  length passed in ovecsize should always be a multiple of three. If
1286           it is not, it is rounded down.
1287    
1288         When a match has been successful, information about captured substrings         When a match is successful, information about  captured  substrings  is
1289         is returned in pairs of integers, starting at the beginning of ovector,         returned  in  pairs  of integers, starting at the beginning of ovector,
1290         and continuing up to two-thirds of its length at the  most.  The  first         and continuing up to two-thirds of its length at the  most.  The  first
1291         element of a pair is set to the offset of the first character in a sub-         element of a pair is set to the offset of the first character in a sub-
1292         string, and the second is set to the  offset  of  the  first  character         string, and the second is set to the  offset  of  the  first  character
# Line 1161  MATCHING A PATTERN Line 1309  MATCHING A PATTERN
1309         offset values corresponding to the unused subpattern are set to -1.         offset values corresponding to the unused subpattern are set to -1.
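
Reading the offset pairs directly might look like the sketch below. copy_capture() is a hypothetical helper written for illustration (the library's own pcre_copy_substring() serves a similar purpose), and its return codes are inventions of this sketch. It assumes rc is the positive value returned by pcre_exec(), that pair n occupies ovector[2*n] and ovector[2*n+1], and that the second element of a pair is the offset just past the end of the substring.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy captured substring n into buf as a
 * zero-terminated string.
 *   rc  the positive value returned by pcre_exec(), i.e. the number of
 *       offset pairs that have been set
 * Returns the substring length, -1 if substring n is unset or out of
 * range, or -2 if buf is too small. */
static int copy_capture(const char *subject, const int *ovector, int rc,
                        int n, char *buf, int bufsize)
{
    int start, len;
    assert(rc > 0 && bufsize > 0);
    if (n < 0 || n >= rc)
        return -1;                  /* no pair was set for this number */
    start = ovector[2 * n];
    if (start < 0)
        return -1;                  /* unused subpattern: offsets are -1 */
    len = ovector[2 * n + 1] - start;
    if (len >= bufsize)
        return -2;                  /* no room for the terminating zero */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Checking for a negative start offset before computing the length matters because an unused subpattern can sit between two subpatterns that were set.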
1310    
1311         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1312         of the string that it matched that gets returned.         of the string that it matched that is returned.
1313    
1314         If the vector is too small to hold all the captured substrings,  it  is         If the vector is too small to hold all the captured substring  offsets,
1315         used as far as possible (up to two-thirds of its length), and the func-         it is used as far as possible (up to two-thirds of its length), and the
1316         tion returns a value of zero. In particular, if the  substring  offsets         function returns a value of zero. In particular, if the substring  off-
1317         are  not  of interest, pcre_exec() may be called with ovector passed as         sets are not of interest, pcre_exec() may be called with ovector passed
1318         NULL and ovecsize as zero. However, if the pattern contains back refer-         as NULL and ovecsize as zero. However, if  the  pattern  contains  back
1319         ences  and  the  ovector  isn't big enough to remember the related sub-         references  and  the  ovector is not big enough to remember the related
1320         strings, PCRE has to get additional memory  for  use  during  matching.         substrings, PCRE has to get additional memory for use during  matching.
1321         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1322    
1323         Note  that  pcre_info() can be used to find out how many capturing sub-         Note  that  pcre_info() can be used to find out how many capturing sub-
# Line 1177  MATCHING A PATTERN Line 1325  MATCHING A PATTERN
1325         that  will  allow for n captured substrings, in addition to the offsets         that  will  allow for n captured substrings, in addition to the offsets
1326         of the substring matched by the whole pattern, is (n+1)*3.         of the substring matched by the whole pattern, is (n+1)*3.
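
The sizing arithmetic can be written down as a short sketch. The function names here are hypothetical and are not part of the PCRE API; they only restate the (n+1)*3 rule and the rounding down of ovecsize to a multiple of three.

```c
#include <assert.h>

/* Smallest ovector size (in int elements, not bytes) that allows for
 * n captured substrings plus the substring matched by the whole pattern. */
static int ovector_size_for(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of elements actually used when ovecsize is not a multiple of
 * three: the value is rounded down. */
static int usable_ovecsize(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize - ovecsize % 3;
}

/* Number of offset pairs that can be passed back: the first two-thirds
 * of the usable vector hold pairs, the final third is workspace. */
static int max_offset_pairs(int ovecsize)
{
    return usable_ovecsize(ovecsize) / 3;
}
```

For a pattern with four capturing subpatterns, ovector_size_for(4) gives 15. A vector of 30 elements, as in the pcre_exec() example call, can pass back at most 10 pairs.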
1327    
1328       Return values from pcre_exec()
1329    
1330         If pcre_exec() fails, it returns a negative number. The  following  are         If pcre_exec() fails, it returns a negative number. The  following  are
1331         defined in the header file:         defined in the header file:
1332    
# Line 1196  MATCHING A PATTERN Line 1346  MATCHING A PATTERN
1346           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1347    
1348         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE stores a 4-byte "magic number" at the start of the compiled  code,
1349         to  catch  the case when it is passed a junk pointer. This is the error         to catch the case when it is passed a junk pointer and to detect when a
1350         it gives when the magic number isn't present.         pattern that was compiled in an environment of one endianness is run in
1351           an  environment  with the other endianness. This is the error that PCRE
1352           gives when the magic number is not present.
1353    
1354           PCRE_ERROR_UNKNOWN_NODE   (-5)           PCRE_ERROR_UNKNOWN_NODE   (-5)
1355    
# Line 1211  MATCHING A PATTERN Line 1363  MATCHING A PATTERN
1363         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
1364         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE gets a block of memory at the start of matching to  use  for  this
1365         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose.  If the call via pcre_malloc() fails, this error is given. The
1366         memory is freed at the end of matching.         memory is automatically freed at the end of matching.
1367    
1368           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1369    
# Line 1242  MATCHING A PATTERN Line 1394  MATCHING A PATTERN
1394         value of startoffset did not point to the beginning of a UTF-8  charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
1395         ter.         ter.
1396    
1397             PCRE_ERROR_PARTIAL (-12)
1398    
1399           The  subject  string did not match, but it did match partially. See the
1400           pcrepartial documentation for details of partial matching.
1401    
1402             PCRE_ERROR_BAD_PARTIAL (-13)
1403    
1404           The PCRE_PARTIAL option was used with  a  compiled  pattern  containing
1405           items  that are not supported for partial matching. See the pcrepartial
1406           documentation for details of partial matching.
1407    
1408             PCRE_ERROR_INTERNAL (-14)
1409    
1410           An unexpected internal error has occurred. This error could  be  caused
1411           by a bug in PCRE or by overwriting of the compiled pattern.
1412    
1413             PCRE_ERROR_BADCOUNT (-15)
1414    
1415           This  error is given if the value of the ovecsize argument is negative.
1416    
1417    
1418  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1419    
# Line 1256  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1428  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1428         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
1429              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
1430    
1431         Captured  substrings  can  be  accessed  directly  by using the offsets         Captured substrings can be  accessed  directly  by  using  the  offsets
1432         returned by pcre_exec() in  ovector.  For  convenience,  the  functions         returned  by  pcre_exec()  in  ovector.  For convenience, the functions
1433         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
1434         string_list() are provided for extracting captured substrings  as  new,         string_list()  are  provided for extracting captured substrings as new,
1435         separate,  zero-terminated strings. These functions identify substrings         separate, zero-terminated strings. These functions identify  substrings
1436         by number. The next section describes functions  for  extracting  named         by  number.  The  next section describes functions for extracting named
1437         substrings.  A  substring  that  contains  a  binary  zero is correctly         substrings. A substring  that  contains  a  binary  zero  is  correctly
1438         extracted and has a further zero added on the end, but  the  result  is         extracted  and  has  a further zero added on the end, but the result is
1439         not, of course, a C string.         not, of course, a C string.
1440    
1441         The  first  three  arguments  are the same for all three of these func-         The first three arguments are the same for all  three  of  these  func-
1442         tions: subject is the subject string which has just  been  successfully         tions:  subject  is  the subject string that has just been successfully
1443         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
1444         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
1445         were  captured  by  the match, including the substring that matched the         were captured by the match, including the substring  that  matched  the
1446         entire regular expression. This is the value returned by  pcre_exec  if         entire regular expression. This is the value returned by pcre_exec() if
1447         it  is greater than zero. If pcre_exec() returned zero, indicating that         it is greater than zero. If pcre_exec() returned zero, indicating  that
1448         it ran out of space in ovector, the value passed as stringcount  should         it  ran out of space in ovector, the value passed as stringcount should
1449         be the size of the vector divided by three.         be the number of elements in the vector divided by three.
1450    
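The rule for choosing stringcount can be sketched as a small helper. This is an illustration only: stringcount_for() is a hypothetical name and not part of PCRE. Here rc stands for the value returned by pcre_exec(), and ovecsize for the number of int elements in ovector.

```c
/* Hypothetical helper, not part of PCRE: derive the stringcount argument
   for the substring extraction functions from the pcre_exec() result. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* substrings captured, including the whole match */
    if (rc == 0)
        return ovecsize / 3;  /* ovector too small: number of elements divided by three */
    return 0;                 /* negative rc is an error code: nothing to extract */
}
```

With a 30-element ovector, a pcre_exec() result of zero would therefore give a stringcount of 10.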
1451         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The functions pcre_copy_substring() and pcre_get_substring() extract  a
1452         single substring, whose number is given as  stringnumber.  A  value  of         single  substring,  whose  number  is given as stringnumber. A value of
1453         zero  extracts  the  substring  that  matched the entire pattern, while         zero extracts the substring that matched the  entire  pattern,  whereas
1454         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
1455         string(),  the  string  is  placed  in buffer, whose length is given by         string(), the string is placed in buffer,  whose  length  is  given  by
1456         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize,  while  for  pcre_get_substring()  a new block of memory is
1457         obtained  via  pcre_malloc,  and its address is returned via stringptr.         obtained via pcre_malloc, and its address is  returned  via  stringptr.
1458         The yield of the function is the length of the  string,  not  including         The  yield  of  the function is the length of the string, not including
1459         the terminating zero, or one of         the terminating zero, or one of
1460    
1461           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1462    
1463         The  buffer  was too small for pcre_copy_substring(), or the attempt to         The buffer was too small for pcre_copy_substring(), or the  attempt  to
1464         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
1465    
1466           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1467    
1468         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
1469    
1470         The pcre_get_substring_list()  function  extracts  all  available  sub-         The  pcre_get_substring_list()  function  extracts  all  available sub-
1471         strings  and  builds  a list of pointers to them. All this is done in a         strings and builds a list of pointers to them. All this is  done  in  a
1472         single block of memory which is obtained via pcre_malloc.  The  address         single block of memory that is obtained via pcre_malloc. The address of
1473         of the memory block is returned via listptr, which is also the start of         the memory block is returned via listptr, which is also  the  start  of
1474         the list of string pointers. The end of the list is marked  by  a  NULL         the  list  of  string pointers. The end of the list is marked by a NULL
1475         pointer. The yield of the function is zero if all went well, or         pointer. The yield of the function is zero if all went well, or
1476    
1477           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1478    
1479         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
1480    
1481         When  any of these functions encounter a substring that is unset, which         When any of these functions encounter a substring that is unset,  which
1482         can happen when capturing subpattern number n+1 matches  some  part  of         can  happen  when  capturing subpattern number n+1 matches some part of
1483         the  subject, but subpattern n has not been used at all, they return an         the subject, but subpattern n has not been used at all, they return  an
1484         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
1485         string  by inspecting the appropriate offset in ovector, which is nega-         string by inspecting the appropriate offset in ovector, which is  nega-
1486         tive for unset substrings.         tive for unset substrings.
1487    
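As a sketch of direct access through the offsets, the helper below copies substring n out of the subject and reports an unset substring separately from an empty one. The name copy_by_offsets() and its return codes (-1 and -2) are assumptions made for this illustration; they are not PCRE functions or PCRE error codes. It relies only on the layout described above: the offsets for substring n are the pair ovector[2*n] and ovector[2*n+1], and they are negative when the substring is unset.

```c
#include <string.h>

/* Hypothetical helper, not part of PCRE. Copies captured substring n from
   subject into buffer, using the offset pair in ovector.
   Returns the length copied (excluding the terminating zero),
   -1 if the substring is unset, or -2 if buffer is too small. */
int copy_by_offsets(const char *subject, const int *ovector, int n,
                    char *buffer, int buffersize)
{
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    int length;

    if (start < 0 || end < 0)
        return -1;                    /* unset: negative offsets */
    length = end - start;
    if (length + 1 > buffersize)
        return -2;                    /* no room for the string plus its zero */
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

For example, if ovector holds {0, 1, -1, -1, 0, 1} for the subject "b", substring 1 is reported as unset, whereas substrings 0 and 2 both yield "b".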
1488         The    two    convenience    functions    pcre_free_substring()     and         The  two convenience functions pcre_free_substring() and pcre_free_sub-
1489         pcre_free_substring_list() can be used to free the memory returned by a         string_list() can be used to free the memory  returned  by  a  previous
1490         previous call  of  pcre_get_substring()  or  pcre_get_substring_list(),         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
1491         respectively. They do nothing more than call the function pointed to by         tively. They do nothing more than  call  the  function  pointed  to  by
1492         pcre_free, which of course could be called directly from a  C  program.         pcre_free,  which  of course could be called directly from a C program.
1493         However,  PCRE is used in some situations where it is linked via a spe-         However, PCRE is used in some situations where it is linked via a  spe-
1494         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  which  cannot  use
1495         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free directly; it is  for  these  cases  that  the  functions  are
1496         vided.         provided.
1497    
1498    
1499  EXTRACTING CAPTURED SUBSTRINGS BY NAME  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1500    
1501           int pcre_get_stringnumber(const pcre *code,
1502                const char *name);
1503    
1504         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
1505              const char *subject, int *ovector,              const char *subject, int *ovector,
1506              int stringcount, const char *stringname,              int stringcount, const char *stringname,
1507              char *buffer, int buffersize);              char *buffer, int buffersize);
1508    
        int pcre_get_stringnumber(const pcre *code,  
             const char *name);  
   
1509         int pcre_get_named_substring(const pcre *code,         int pcre_get_named_substring(const pcre *code,
1510              const char *subject, int *ovector,              const char *subject, int *ovector,
1511              int stringcount, const char *stringname,              int stringcount, const char *stringname,
1512              const char **stringptr);              const char **stringptr);
1513    
1514         To extract a substring by name, you first have to find the associated           To extract a substring by name, you first have to find the associated
1515         number. This can be done by calling pcre_get_stringnumber(). The first          number. For example, for this pattern
1516         argument is the compiled pattern, and the second is the name. For exam-  
1517         ple, for this pattern           (a+)b(?<xxx>\d+)...
1518    
1519           ab(?<xxx>\d+)...         the number of the subpattern called "xxx" is 2. You can find the number
1520           from the name by calling pcre_get_stringnumber(). The first argument is
1521         the  number  of the subpattern called "xxx" is 1. Given the number, you         the compiled pattern, and the second is the  name.  The  yield  of  the
1522         can then extract the substring directly, or use one  of  the  functions         function  is  the  subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if
1523         described  in the previous section. For convenience, there are also two         there is no subpattern of that name.
1524         functions that do the whole job.  
1525           Given the number, you can extract the substring directly, or use one of
1526           the functions described in the previous section. For convenience, there
1527           are also two functions that do the whole job.
1528    
1529         Most   of   the   arguments    of    pcre_copy_named_substring()    and         Most   of   the   arguments    of    pcre_copy_named_substring()    and
1530         pcre_get_named_substring() are the same as those for the functions that         pcre_get_named_substring()  are  the  same  as  those for the similarly
1531         extract by number, and so are not re-described here. There are just two         named functions that extract by number. As these are described  in  the
1532         differences.         previous  section,  they  are not re-described here. There are just two
1533           differences:
1534    
1535         First,  instead  of a substring number, a substring name is given. Sec-         First, instead of a substring number, a substring name is  given.  Sec-
1536         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
1537         to  the compiled pattern. This is needed in order to gain access to the         to the compiled pattern. This is needed in order to gain access to  the
1538         name-to-number translation table.         name-to-number translation table.
1539    
1540         These functions call pcre_get_stringnumber(), and if it succeeds,  they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
1541         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
1542         ate.         ate.
1543    
1544  Last updated: 09 December 2003  Last updated: 09 September 2004
1545  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
1546  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
1547    
1548  PCRE(3)                                                                PCRE(3)  PCRE(3)                                                                PCRE(3)
# Line 1392  PCRE CALLOUTS Line 1568  PCRE CALLOUTS
1568         default value is zero.  For  example,  this  pattern  has  two  callout         default value is zero.  For  example,  this  pattern  has  two  callout
1569         points:         points:
1570    
1571           (?C1)abc(?C2)def           (?C1)abc(?C2)def
1572    
1573         During matching, when PCRE reaches a callout point (and pcre_callout is         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
1574         set), the external function is called. Its only argument is  a  pointer         called, PCRE automatically  inserts  callouts,  all  with  number  255,
1575         to a pcre_callout block. This contains the following variables:         before  each  item in the pattern. For example, if PCRE_AUTO_CALLOUT is
1576           used with the pattern
1577    
1578             A(\d{2}|--)
1579    
1580           it is processed as if it were
1581    
1582           (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
1583    
1584           Notice that there is a callout before and after  each  parenthesis  and
1585           alternation  bar.  Automatic  callouts  can  be  used  for tracking the
1586           progress of pattern matching. The pcretest command has an  option  that
1587           sets  automatic callouts; when it is used, the output indicates how the
1588           pattern is matched. This is useful information when you are  trying  to
1589           optimize the performance of a particular pattern.
1590    
1591    
1592    MISSING CALLOUTS
1593    
1594           You  should  be  aware  that,  because of optimizations in the way PCRE
1595           matches patterns, callouts sometimes do not happen. For example, if the
1596           pattern is
1597    
1598             ab(?C4)cd
1599    
1600           PCRE knows that any matching string must contain the letter "d". If the
1601           subject string is "abyz", the lack of "d" means that  matching  doesn't
1602           ever  start,  and  the  callout is never reached. However, with "abyd",
1603           though the result is still no match, the callout is obeyed.
1604    
1605    
1606    THE CALLOUT INTERFACE
1607    
1608           During matching, when PCRE reaches a callout point, the external  func-
1609           tion  defined  by pcre_callout is called (if it is set). The only argu-
1610           ment is a pointer to a pcre_callout block. This structure contains  the
1611           following fields:
1612    
1613           int          version;           int          version;
1614           int          callout_number;           int          callout_number;
# Line 1408  PCRE CALLOUTS Line 1620  PCRE CALLOUTS
1620           int          capture_top;           int          capture_top;
1621           int          capture_last;           int          capture_last;
1622           void        *callout_data;           void        *callout_data;
1623             int          pattern_position;
1624             int          next_item_length;
1625    
1626         The  version  field  is an integer containing the version number of the         The  version  field  is an integer containing the version number of the
1627         block format. The current version  is  zero.  The  version  number  may         block format. The initial version was 0; the current version is 1.  The
1628         change  in  future if additional fields are added, but the intention is         version  number  will  change  again in future if additional fields are
1629         never to remove any of the existing fields.         added, but the intention is never to remove any of the existing fields.
1630    
1631         The callout_number field contains the number of the  callout,  as  com-         The  callout_number  field  contains the number of the callout, as com-
1632         piled into the pattern (that is, the number after ?C).         piled into the pattern (that is, the number after ?C for  manual  call-
1633           outs, and 255 for automatically generated callouts).
1634    
1635         The  offset_vector field is a pointer to the vector of offsets that was         The  offset_vector field is a pointer to the vector of offsets that was
1636         passed by the caller to pcre_exec(). The contents can be  inspected  in         passed by the caller to pcre_exec(). The contents can be  inspected  in
1637         order  to extract substrings that have been matched so far, in the same         order  to extract substrings that have been matched so far, in the same
1638         way as for extracting substrings after a match has completed.         way as for extracting substrings after a match has completed.
1639    
1640         The subject and subject_length fields contain copies  the  values  that         The subject and subject_length fields contain copies of the values that
1641         were passed to pcre_exec().         were passed to pcre_exec().
1642    
1643         The  start_match  field contains the offset within the subject at which         The  start_match  field contains the offset within the subject at which
1644         the current match attempt started. If the pattern is not anchored,  the         the current match attempt started. If the pattern is not anchored,  the
1645         callout  function  may  be  called several times for different starting         callout function may be called several times from the same point in the
1646         points.         pattern for different starting points in the subject.
1647    
1648         The current_position field contains the offset within  the  subject  of         The current_position field contains the offset within  the  subject  of
1649         the current match pointer.         the current match pointer.
1650    
1651         The  capture_top field contains one more than the number of the highest         The  capture_top field contains one more than the number of the highest
1652         numbered  captured  substring  so  far.  If  no  substrings  have  been         numbered captured substring so far. If no  substrings  have  been  cap-
1653         captured, the value of capture_top is one.         tured, the value of capture_top is one.
1654    
1655         The  capture_last  field  contains the number of the most recently cap-         The  capture_last  field  contains the number of the most recently cap-
1656         tured substring.         tured substring. If no substrings have been captured, its value is  -1.
1657    
1658         The callout_data field contains a value that is passed  to  pcre_exec()         The  callout_data  field contains a value that is passed to pcre_exec()
1659         by  the  caller specifically so that it can be passed back in callouts.         by the caller specifically so that it can be passed back  in  callouts.
1660         It is passed in the callout_data field of the pcre_extra data                   It is passed in the callout_data field of the pcre_extra data
1661         structure. If no such data was passed, the value of callout_data in a           structure. If no such data was passed, the value of callout_data in a
1662         pcre_callout block is NULL. There is a description  of  the  pcre_extra         pcre_callout  block  is  NULL. There is a description of the pcre_extra
1663         structure in the pcreapi documentation.         structure in the pcreapi documentation.
1664    
1665           The pattern_position field is present from version 1 of the  pcre_call-
1666           out structure. It contains the offset to the next item to be matched in
1667           the pattern string.
1668    
1669           The next_item_length field is present from version 1 of the  pcre_call-
1670           out structure. It contains the length of the next item to be matched in
1671           the pattern string. When the callout immediately precedes  an  alterna-
1672           tion  bar, a closing parenthesis, or the end of the pattern, the length
1673           is zero. When the callout precedes an opening parenthesis,  the  length
1674           is that of the entire subpattern.
1675    
1676           The  pattern_position  and next_item_length fields are intended to help
1677           in distinguishing between different automatic callouts, which all  have
1678           the same callout number. However, they are set for all callouts.
1679    
1680    
1681  RETURN VALUES  RETURN VALUES
1682    
1683         The callout function returns an integer. If the value is zero, matching         The  external callout function returns an integer to PCRE. If the value
1684         proceeds as normal. If the value is greater than zero,  matching  fails         is zero, matching proceeds as normal. If  the  value  is  greater  than
1685         at the current point, but backtracking to test other possibilities goes         zero,  matching  fails  at  the current point, but backtracking to test
1686         ahead, just as if a lookahead assertion had failed.  If  the  value  is         other matching possibilities goes ahead, just as if a lookahead  asser-
1687         less  than  zero,  the  match is abandoned, and pcre_exec() returns the         tion  had  failed.  If  the value is less than zero, the match is aban-
1688         value.         doned, and pcre_exec() returns the negative value.
1689    
1690         Negative  values  should  normally  be   chosen   from   the   set   of         Negative  values  should  normally  be   chosen   from   the   set   of
1691         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
# Line 1464  RETURN VALUES Line 1693  RETURN VALUES
1693         reserved  for  use  by callout functions; it will never be used by PCRE         reserved  for  use  by callout functions; it will never be used by PCRE
1694         itself.         itself.
1695    
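The three kinds of return value can be illustrated with a small callout function. This is a sketch under stated assumptions, not PCRE's own code: sketch_callout_block is a stand-in that carries only three of the fields listed above (real code would use the block type declared in pcre.h), SKETCH_ERROR_CALLOUT is assumed to mirror the value -9 that is reserved for callout functions, and the policy itself (a call budget passed through callout_data, and a refusal at callout number 2) is invented for the example.

```c
#include <stddef.h>

/* Stand-in for the callout block, limited to the fields this sketch reads. */
typedef struct {
    int   callout_number;
    int   current_position;
    void *callout_data;
} sketch_callout_block;

#define SKETCH_ERROR_CALLOUT (-9)  /* assumed to mirror PCRE_ERROR_CALLOUT */

/* Hypothetical policy:
   - abandon the whole match once the call budget in callout_data runs out;
   - at callout number 2, refuse subject offset 0 but let backtracking go on;
   - otherwise let matching continue. */
int sketch_callout(sketch_callout_block *cb)
{
    int *budget = (int *)cb->callout_data;

    if (budget != NULL && (*budget)-- <= 0)
        return SKETCH_ERROR_CALLOUT;  /* negative: match abandoned, value returned to caller */
    if (cb->callout_number == 2 && cb->current_position == 0)
        return 1;                     /* positive: fail at this point, backtracking continues */
    return 0;                         /* zero: matching proceeds as normal */
}
```

A real callout would have the same shape but take a pointer to the library's own block type, and would be installed by assigning it to pcre_callout.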
1696  Last updated: 21 January 2003  Last updated: 09 September 2004
1697  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
1698  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
1699    
1700  PCRE(3)                                                                PCRE(3)  PCRE(3)                                                                PCRE(3)
# Line 1475  PCRE(3) Line 1704  PCRE(3)
1704  NAME  NAME
1705         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
1706    
1707  DIFFERENCES FROM PERL  DIFFERENCES BETWEEN PCRE AND PERL
1708    
1709         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
1710         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
# Line 1498  DIFFERENCES FROM PERL Line 1727  DIFFERENCES FROM PERL
1727    
1728         4. Though binary zero characters are supported in the  subject  string,         4. Though binary zero characters are supported in the  subject  string,
1729         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
1730         mal C string, terminated by zero. The escape sequence "\0" can be  used         mal C string, terminated by zero. The escape sequence \0 can be used in
1731         in the pattern to represent a binary zero.         the pattern to represent a binary zero.
1732    
1733         5.  The  following Perl escape sequences are not supported: \l, \u, \L,         5.  The  following Perl escape sequences are not supported: \l, \u, \L,
1734         \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general         \U, and \N. In fact these are implemented by Perl's general string-han-
1735         string-handling and are not part of its pattern matching engine. If any         dling  and are not part of its pattern matching engine. If any of these
1736         of these are encountered by PCRE, an error is generated.         are encountered by PCRE, an error is generated.
1737    
1738         6. PCRE does support the \Q...\E escape for quoting substrings. Charac-         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
1739         ters  in  between  are  treated as literals. This is slightly different         is  built  with Unicode character property support. The properties that
1740         from Perl in that $ and @ are  also  handled  as  literals  inside  the         can be tested with \p and \P are limited to the general category  prop-
1741         quotes.  In Perl, they cause variable interpolation (but of course PCRE         erties such as Lu and Nd.
1742    
1743           7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
1744           ters in between are treated as literals.  This  is  slightly  different
1745           from  Perl  in  that  $  and  @ are also handled as literals inside the
1746           quotes. In Perl, they cause variable interpolation (but of course  PCRE
1747         does not have variables). Note the following examples:         does not have variables). Note the following examples:
1748    
1749             Pattern            PCRE matches      Perl matches             Pattern            PCRE matches      Perl matches
# Line 1519  DIFFERENCES FROM PERL Line 1753  DIFFERENCES FROM PERL
1753             \Qabc\$xyz\E       abc\$xyz          abc\$xyz             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
1754             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
1755    
1756         The \Q...\E sequence is recognized both inside  and  outside  character         The  \Q...\E  sequence  is recognized both inside and outside character
1757         classes.         classes.
1758    
1759         7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})         8. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
1760         constructions. However, there is some experimental support  for  recur-         constructions.  However,  there is support for recursive patterns using
1761         sive  patterns  using the non-Perl items (?R), (?number) and (?P>name).         the non-Perl items (?R),  (?number),  and  (?P>name).  Also,  the  PCRE
1762         Also, the PCRE "callout" feature allows  an  external  function  to  be         "callout"  feature allows an external function to be called during pat-
1763         called during pattern matching.         tern matching. See the pcrecallout documentation for details.
1764    
1765         8.  There  are some differences that are concerned with the settings of         9. There are some differences that are concerned with the  settings  of
1766         captured strings when part of  a  pattern  is  repeated.  For  example,         captured  strings  when  part  of  a  pattern is repeated. For example,
1767         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2         matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
1768         unset, but in PCRE it is set to "b".         unset, but in PCRE it is set to "b".
1769    
1770         9. PCRE  provides  some  extensions  to  the  Perl  regular  expression         10. PCRE provides some extensions to the Perl regular expression facil-
1771         facilities:         ities:
1772    
1773         (a)  Although  lookbehind  assertions  must match fixed length strings,         (a) Although lookbehind assertions must  match  fixed  length  strings,
1774         each alternative branch of a lookbehind assertion can match a different         each alternative branch of a lookbehind assertion can match a different
1775         length of string. Perl requires them all to have the same length.         length of string. Perl requires them all to have the same length.
1776    
1777         (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $         (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $
1778         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
1779    
1780         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
1781         cial meaning is faulted.         cial meaning is faulted.
1782    
1783         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
1784         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
1785         lowed by a question mark they are.         lowed by a question mark they are.
1786    
1787         (e)  PCRE_ANCHORED  can  be used to force a pattern to be tried only at         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
1788         the first matching position in the subject string.         tried only at the first matching position in the subject string.
1789    
1790         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
1791         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
1792    
1793         (g)  The (?R), (?number), and (?P>name) constructs  allow for recursive         (g) The (?R), (?number), and (?P>name) constructs  allow for  recursive
1794         pattern matching (Perl can do  this  using  the  (?p{code})  construct,         pattern  matching  (Perl  can  do  this using the (?p{code}) construct,
1795         which PCRE cannot support.)         which PCRE cannot support.)
1796    
1797         (h)  PCRE supports named capturing substrings, using the Python syntax.         (h) PCRE supports named capturing substrings, using the Python  syntax.
1798    
1799         (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from         (i)  PCRE  supports  the  possessive quantifier "++" syntax, taken from
1800         Sun's Java package.         Sun's Java package.
1801    
1802         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j) The (R) condition, for testing recursion, is a PCRE extension.
1803    
1804         (k) The callout facility is PCRE-specific.         (k) The callout facility is PCRE-specific.
1805    
1806  Last updated: 09 December 2003         (l) The partial matching facility is PCRE-specific.
1807  Copyright (c) 1997-2003 University of Cambridge.  
1808           (m) Patterns compiled by PCRE can be saved and re-used at a later time,
1809           even on different hosts that have the other endianness.
1810    
1811    Last updated: 09 September 2004
1812    Copyright (c) 1997-2004 University of Cambridge.
1813  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
1814    
1815  PCRE(3)                                                                PCRE(3)  PCRE(3)                                                                PCRE(3)
# Line 1584  PCRE REGULAR EXPRESSION DETAILS Line 1823  PCRE REGULAR EXPRESSION DETAILS
1823    
1824         The  syntax  and semantics of the regular expressions supported by PCRE         The  syntax  and semantics of the regular expressions supported by PCRE
1825         are described below. Regular expressions are also described in the Perl         are described below. Regular expressions are also described in the Perl
1826         documentation  and in a number of other books, some of which have copi-         documentation  and  in  a  number  of books, some of which have copious
1827         ous examples. Jeffrey Friedl's "Mastering  Regular  Expressions",  pub-         examples.  Jeffrey Friedl's "Mastering Regular Expressions",  published
1828         lished  by  O'Reilly, covers them in great detail. The description here         by  O'Reilly, covers regular expressions in great detail. This descrip-
1829         is intended as reference documentation.         tion of PCRE's regular expressions is intended as reference material.
1830    
1831         The basic operation of PCRE is on strings of bytes. However,  there  is         The original operation of PCRE was on strings of  one-byte  characters.
1832         also  support for UTF-8 character strings. To use this support you must         However,  there is now also support for UTF-8 character strings. To use
1833         build PCRE to include UTF-8 support, and then call pcre_compile()  with         this, you must build PCRE to  include  UTF-8  support,  and  then  call
1834         the  PCRE_UTF8  option.  How  this affects the pattern matching is men-         pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern
1835         tioned in several places below. There is also a summary of  UTF-8  fea-         matching is mentioned in several places below. There is also a  summary
1836         tures in the section on UTF-8 support in the main pcre page.         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
1837           page.
1838         A  regular  expression  is  a pattern that is matched against a subject  
1839         string from left to right. Most characters stand for  themselves  in  a         A regular expression is a pattern that is  matched  against  a  subject
1840         pattern,  and  match  the corresponding characters in the subject. As a         string  from  left  to right. Most characters stand for themselves in a
1841           pattern, and match the corresponding characters in the  subject.  As  a
1842         trivial example, the pattern         trivial example, the pattern
1843    
1844           The quick brown fox           The quick brown fox
1845    
1846         matches a portion of a subject string that is identical to itself.  The         matches  a portion of a subject string that is identical to itself. The
1847         power of regular expressions comes from the ability to include alterna-         power of regular expressions comes from the ability to include alterna-
1848         tives and repetitions in the pattern. These are encoded in the  pattern         tives  and repetitions in the pattern. These are encoded in the pattern
1849         by  the  use  of meta-characters, which do not stand for themselves but         by the use of metacharacters, which do not  stand  for  themselves  but
1850         instead are interpreted in some special way.         instead are interpreted in some special way.
1851    
1852         There are two different sets of meta-characters: those that are  recog-         There  are  two different sets of metacharacters: those that are recog-
1853         nized  anywhere in the pattern except within square brackets, and those         nized anywhere in the pattern except within square brackets, and  those
1854         that are recognized in square brackets. Outside  square  brackets,  the         that  are  recognized  in square brackets. Outside square brackets, the
1855         meta-characters are as follows:         metacharacters are as follows:
1856    
1857           \      general escape character with several uses           \      general escape character with several uses
1858           ^      assert start of string (or line, in multiline mode)           ^      assert start of string (or line, in multiline mode)
# Line 1630  PCRE REGULAR EXPRESSION DETAILS Line 1870  PCRE REGULAR EXPRESSION DETAILS
1870                  also "possessive quantifier"                  also "possessive quantifier"
1871           {      start min/max quantifier           {      start min/max quantifier
1872    
1873         Part  of  a  pattern  that is in square brackets is called a "character         Part of a pattern that is in square brackets  is  called  a  "character
1874         class". In a character class the only meta-characters are:         class". In a character class the only metacharacters are:
1875    
1876           \      general escape character           \      general escape character
1877           ^      negate the class, but only if the first character           ^      negate the class, but only if the first character
# Line 1640  PCRE REGULAR EXPRESSION DETAILS Line 1880  PCRE REGULAR EXPRESSION DETAILS
1880                    syntax)                    syntax)
1881           ]      terminates the character class           ]      terminates the character class
1882    
1883         The following sections describe the use of each of the meta-characters.         The  following sections describe the use of each of the metacharacters.
1884    
1885    
1886  BACKSLASH  BACKSLASH
1887    
1888         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
1889         a non-alphameric character, it takes  away  any  special  meaning  that         a  non-alphanumeric  character,  it takes away any special meaning that
1890         character  may  have.  This  use  of  backslash  as an escape character         character may have. This  use  of  backslash  as  an  escape  character
1891         applies both inside and outside character classes.         applies both inside and outside character classes.
1892    
1893         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
1894         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
1895         character would otherwise be interpreted as a meta-character, so it  is         character  would  otherwise be interpreted as a metacharacter, so it is
1896         always  safe to precede a non-alphameric with backslash to specify that         always safe to precede a non-alphanumeric  with  backslash  to  specify
1897         it stands for itself. In particular, if you want to match a  backslash,         that  it stands for itself. In particular, if you want to match a back-
1898         you write \\.         slash, you write \\.
1899    
1900         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
1901         the pattern (other than in a character class) and characters between  a         the  pattern (other than in a character class) and characters between a
1902         # outside a character class and the next newline character are ignored.         # outside a character class and the next newline character are ignored.
1903         An escaping backslash can be used to include a whitespace or #  charac-         An  escaping backslash can be used to include a whitespace or # charac-
1904         ter as part of the pattern.         ter as part of the pattern.
1905    
1906         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
1907         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
1908         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
1909         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
1910         tion. Note the following examples:         tion. Note the following examples:
1911    
1912           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 1676  BACKSLASH Line 1916  BACKSLASH
1916           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1917           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1918    
1919         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
1920         classes.         classes.
1921    
1922       Non-printing characters
1923    
1924         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
1925         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
1926         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
1927         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
1928         editing, it is usually easier  to  use  one  of  the  following  escape         editing,  it  is  usually  easier  to  use  one of the following escape
1929         sequences than the binary character it represents:         sequences than the binary character it represents:
1930    
1931           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 1697  BACKSLASH Line 1939  BACKSLASH
1939           \xhh      character with hex code hh           \xhh      character with hex code hh
1940           \x{hhh..} character with hex code hhh... (UTF-8 mode only)           \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1941    
1942         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
1943         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
1944         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
1945         becomes hex 7B.         becomes hex 7B.
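
The \cx rule described above is plain bit arithmetic, so it can be checked outside PCRE. The following is a minimal sketch in Python; the function name control_escape is hypothetical and is not part of the PCRE API. It models only the rule as stated in the text (upper-case a lower case letter, then invert hex 40).

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx yields under the rule in the text."""
    code = ord(x)
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Invert bit 6 of the character (hex 40).
    return code ^ 0x40


for ch in "z{;":
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

Running it reproduces the three values quoted in the paragraph: hex 1A, hex 3B, and hex 7B.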
1946    
1947         After \x, from zero to two hexadecimal digits are read (letters can  be         After  \x, from zero to two hexadecimal digits are read (letters can be
1948         in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-         in upper or lower case). In UTF-8 mode, any number of hexadecimal  dig-
1949         its may appear between \x{ and }, but the value of the  character  code         its  may  appear between \x{ and }, but the value of the character code
1950         must  be  less  than  2**31  (that is, the maximum hexadecimal value is         must be less than 2**31 (that is,  the  maximum  hexadecimal  value  is
1951         7FFFFFFF). If characters other than hexadecimal digits  appear  between         7FFFFFFF).  If  characters other than hexadecimal digits appear between
1952         \x{  and }, or if there is no terminating }, this form of escape is not         \x{ and }, or if there is no terminating }, this form of escape is  not
1953         recognized. Instead, the initial \x will be interpreted as a basic hex-         recognized. Instead, the initial \x will be interpreted as a basic hex-
1954         adecimal escape, with no following digits, giving a byte whose value is         adecimal escape, with no following digits,  giving  a  character  whose
1955         zero.         value is zero.
1956    
1957         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
1958         two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference         two syntaxes for \x when PCRE is in UTF-8 mode. There is no  difference
1959         in the way they are handled. For example, \xdc is exactly the  same  as         in  the  way they are handled. For example, \xdc is exactly the same as
1960         \x{dc}.         \x{dc}.
1961    
1962         After  \0  up  to  two further octal digits are read. In both cases, if         After \0 up to two further octal digits are read.  In  both  cases,  if
1963         there are fewer than two digits, just those that are present are  used.         there  are fewer than two digits, just those that are present are used.
1964         Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL         Thus the sequence \0\x\07 specifies two binary zeros followed by a  BEL
1965         character (code value 7). Make sure you supply  two  digits  after  the         character  (code  value  7).  Make sure you supply two digits after the
1966         initial zero if the character that follows is itself an octal digit.         initial zero if the pattern character that follows is itself  an  octal
1967           digit.
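
The digit-reading rules for \x and \0 can be modelled with a few lines of Python. This is a sketch under stated assumptions: decode_escapes is a hypothetical helper, it handles only the non-UTF-8 forms \xhh and \0oo, and it is not how PCRE itself parses patterns. It shows why \0\x\07 yields two zeros followed by 7.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
OCT_DIGITS = "01234567"


def decode_escapes(pattern: str) -> list:
    """Decode a pattern made only of \\xhh and \\0oo escapes into code values."""
    out, i = [], 0
    while i < len(pattern):
        if pattern[i] != "\\" or i + 1 >= len(pattern) or pattern[i + 1] not in "x0":
            raise ValueError("this sketch handles only \\x and \\0 escapes")
        digits, base = (HEX_DIGITS, 16) if pattern[i + 1] == "x" else (OCT_DIGITS, 8)
        j = i + 2
        # Read at most two digits; fewer are accepted if that is all there is.
        while j < len(pattern) and j < i + 4 and pattern[j] in digits:
            j += 1
        # No digits at all gives the value zero.
        out.append(int(pattern[i + 2:j] or "0", base))
        i = j
    return out


print(decode_escapes(r"\0\x\07"))
```

Because at most two digits are consumed after \0, a third octal digit in the pattern would be treated as a separate character, which is the reason for the advice to always supply two digits in that situation.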
1968    
1969         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
1970         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
# Line 1758  BACKSLASH Line 2001  BACKSLASH
2001         All the sequences that define a single byte value  or  a  single  UTF-8         All the sequences that define a single byte value  or  a  single  UTF-8
2002         character (in UTF-8 mode) can be used both inside and outside character         character (in UTF-8 mode) can be used both inside and outside character
2003         classes. In addition, inside a character  class,  the  sequence  \b  is         classes. In addition, inside a character  class,  the  sequence  \b  is
2004         interpreted  as  the  backspace character (hex 08). Outside a character         interpreted as the backspace character (hex 08), and the sequence \X is
2005         class it has a different meaning (see below).         interpreted as the character "X".  Outside  a  character  class,  these
2006           sequences have different meanings (see below).
2007    
2008       Generic character types
2009    
2010         The third use of backslash is for specifying generic character types:         The  third  use of backslash is for specifying generic character types.
2011           The following are always recognized:
2012    
2013           \d     any decimal digit           \d     any decimal digit
2014           \D     any character that is not a decimal digit           \D     any character that is not a decimal digit
# Line 1774  BACKSLASH Line 2021  BACKSLASH
2021         into  two disjoint sets. Any given character matches one, and only one,         into  two disjoint sets. Any given character matches one, and only one,
2022         of each pair.         of each pair.
2023    
2024         In UTF-8 mode, characters with values greater than 255 never match  \d,         These character type sequences can appear both inside and outside char-
2025         \s, or \w, and always match \D, \S, and \W.         acter  classes.  They each match one character of the appropriate type.
2026           If the current matching point is at the end of the subject string,  all
2027           of them fail, since there is no character to match.
2028    
2029         For  compatibility  with Perl, \s does not match the VT character (code         For  compatibility  with Perl, \s does not match the VT character (code
2030         11).  This makes it different from the POSIX "space" class. The  \s         11).  This makes it different from the POSIX "space" class. The  \s
2031         characters are HT (9), LF (10), FF (12), CR (13), and space (32).         characters are HT (9), LF (10), FF (12), CR (13), and space (32).
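
The difference between \s and the POSIX "space" class can be made concrete by comparing the two sets of character codes. The sketch below is in Python and uses string.whitespace as a stand-in for the ASCII characters in the POSIX class; the names PCRE_SPACE and matches_backslash_s are hypothetical and chosen for illustration only.

```python
import string

# Character codes matched by \s according to the text: HT, LF, FF, CR, space.
PCRE_SPACE = {9, 10, 12, 13, 32}


def matches_backslash_s(ch: str) -> bool:
    """Model of \\s for a single one-byte character."""
    return ord(ch) in PCRE_SPACE


# ASCII whitespace as Python defines it, which includes VT (code 11).
POSIX_SPACE = {ord(c) for c in string.whitespace}

print(sorted(POSIX_SPACE - PCRE_SPACE))  # [11]
```

The only code present in the POSIX set and absent from the \s set is 11, the VT character.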
2032    
2033         A  "word" character is any letter or digit or the underscore character,         A "word" character is an underscore or any character less than 256 that
2034         that is, any character which can be part of a Perl "word". The  defini-         is a letter or digit. The definition of  letters  and  digits  is  con-
2035         tion  of  letters  and digits is controlled by PCRE's character tables,         trolled  by PCRE's low-valued character tables, and may vary if locale-
2036         and may vary if locale- specific matching is taking place (see  "Locale         specific matching is taking place (see "Locale support" in the  pcreapi
2037         support"  in  the  pcreapi  page).  For  example,  in the "fr" (French)         page).  For  example,  in  the  "fr_FR" (French) locale, some character
2038         locale, some character codes greater than 128  are  used  for  accented         codes greater than 128 are used for accented  letters,  and  these  are
2039         letters, and these are matched by \w.         matched by \w.
2040    
2041           In  UTF-8 mode, characters with values greater than 128 never match \d,
2042           \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2043           code character property support is available.
2044    
2045       Unicode character properties
2046    
2047           When PCRE is built with Unicode character property support, three addi-
2048           tional escape sequences to match generic character types are  available
2049           when UTF-8 mode is selected. They are:
2050    
2051            \p{xx}   a character with the xx property
2052            \P{xx}   a character without the xx property
2053            \X       an extended Unicode sequence
2054    
2055           The  property  names represented by xx above are limited to the Unicode
2056           general category properties. Each character has exactly one such  prop-
2057           erty,  specified  by  a two-letter abbreviation. For compatibility with
2058           Perl, negation can be specified by including a circumflex  between  the
2059           opening  brace  and the property name. For example, \p{^Lu} is the same
2060           as \P{Lu}.
2061    
2062           If only one letter is specified with \p or  \P,  it  includes  all  the
2063           properties that start with that letter. In this case, in the absence of
2064           negation, the curly brackets in the escape sequence are optional; these
2065           two examples have the same effect:
2066    
2067             \p{L}
2068             \pL
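
The matching rule for \p and \P (a two-letter code names one general category, a one-letter code covers every category starting with that letter, and ^ negates) can be sketched in Python with the standard unicodedata module. The helper has_property is hypothetical. Python's Unicode tables may differ in version from those built into PCRE, so this illustrates the rule rather than PCRE's exact results.

```python
import unicodedata


def has_property(ch: str, prop: str) -> bool:
    """Model \\p{prop}; a leading ^ negates, so "^Lu" behaves like \\P{Lu}."""
    negate = prop.startswith("^")
    name = prop[1:] if negate else prop
    # A one-letter name such as "L" matches Ll, Lm, Lo, Lt, and Lu alike.
    result = unicodedata.category(ch).startswith(name)
    return not result if negate else result


print(has_property("A", "Lu"), has_property("a", "Lu"), has_property("a", "L"))
```

With this model, has_property(ch, "^Lu") always gives the opposite answer to has_property(ch, "Lu"), which mirrors the equivalence of \p{^Lu} and \P{Lu}.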
2069    
2070           The following property codes are supported:
2071    
2072             C     Other
2073             Cc    Control
2074             Cf    Format
2075             Cn    Unassigned
2076             Co    Private use
2077             Cs    Surrogate
2078    
2079             L     Letter
2080             Ll    Lower case letter
2081             Lm    Modifier letter
2082             Lo    Other letter
2083             Lt    Title case letter
2084             Lu    Upper case letter
2085    
2086             M     Mark
2087             Mc    Spacing mark
2088             Me    Enclosing mark
2089             Mn    Non-spacing mark
2090    
2091             N     Number
2092             Nd    Decimal number
2093             Nl    Letter number
2094             No    Other number
2095    
2096             P     Punctuation
2097             Pc    Connector punctuation
2098             Pd    Dash punctuation
2099             Pe    Close punctuation
2100             Pf    Final punctuation
2101             Pi    Initial punctuation
2102             Po    Other punctuation
2103             Ps    Open punctuation
2104    
2105             S     Symbol
2106             Sc    Currency symbol
2107             Sk    Modifier symbol
2108             Sm    Mathematical symbol
2109             So    Other symbol
2110    
2111             Z     Separator
2112             Zl    Line separator
2113             Zp    Paragraph separator
2114             Zs    Space separator
2115    
2116           Extended  properties such as "Greek" or "InMusicalSymbols" are not sup-
2117           ported by PCRE.
2118    
2119           Specifying caseless matching does not affect  these  escape  sequences.
2120           For example, \p{Lu} always matches only upper case letters.
2121    
2122           The  \X  escape  matches  any number of Unicode characters that form an
2123           extended Unicode sequence. \X is equivalent to
2124    
2125             (?>\PM\pM*)
2126    
2127           That is, it matches a character without the "mark"  property,  followed
2128           by  zero  or  more  characters with the "mark" property, and treats the
2129           sequence as an atomic group (see below).  Characters  with  the  "mark"
2130           property are typically accents that affect the preceding character.
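
The behaviour of \X, as defined by the equivalent pattern (?>\PM\pM*), can be modelled as a short greedy scan. The following Python sketch makes that explicit; match_extended_sequence and is_mark are hypothetical names, and the category data comes from Python's unicodedata module rather than from PCRE's own tables.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for characters whose general category starts with M."""
    return unicodedata.category(ch).startswith("M")


def match_extended_sequence(subject: str, pos: int):
    """Model (?>\\PM\\pM*) at pos: return the end offset, or None on failure."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None  # \PM needs exactly one character without the mark property
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1  # \pM* takes every following mark and never gives any back
    return end


# "e" followed by U+0301 (combining acute accent), then "x".
print(match_extended_sequence("e\u0301x", 0))
```

The loop never backtracks, which corresponds to the atomic group in the equivalent pattern: once the marks have been consumed, none of them is released to let a later part of the pattern match.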
2131    
2132           Matching  characters  by Unicode property is not fast, because PCRE has
2133           to search a structure that contains  data  for  over  fifteen  thousand
2134           characters. That is why the traditional escape sequences such as \d and
2135           \w do not use Unicode properties in PCRE.
2136    
2137         These character type sequences can appear both inside and outside char-     Simple assertions
        acter classes. They each match one character of the  appropriate  type.  
        If  the current matching point is at the end of the subject string, all  
        of them fail, since there is no character to match.  
2138    
2139         The fourth use of backslash is for certain simple assertions. An asser-         The fourth use of backslash is for certain simple assertions. An asser-
2140         tion  specifies a condition that has to be met at a particular point in         tion  specifies a condition that has to be met at a particular point in
2141         a match, without consuming any characters from the subject string.  The         a match, without consuming any characters from the subject string.  The
2142         use  of subpatterns for more complicated assertions is described below.         use  of subpatterns for more complicated assertions is described below.
2143         The backslashed assertions are         The backslashed assertions are:
2144    
2145           \b     matches at a word boundary           \b     matches at a word boundary
2146           \B     matches when not at a word boundary           \B     matches when not at a word boundary
# Line 1817  BACKSLASH Line 2159  BACKSLASH
2159         string if the first or last character matches \w, respectively.         string if the first or last character matches \w, respectively.
2160    
2161         The  \A,  \Z,  and \z assertions differ from the traditional circumflex         The  \A,  \Z,  and \z assertions differ from the traditional circumflex
2162         and dollar (described below) in that they only ever match at  the  very         and dollar (described in the next section) in that they only ever match
2163         start  and  end  of the subject string, whatever options are set. Thus,         at  the  very start and end of the subject string, whatever options are
2164         they are independent of multiline mode.         set. Thus, they are independent of multiline mode. These  three  asser-
2165           tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
2166         They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the         affect only the behaviour of the circumflex and dollar  metacharacters.
2167         startoffset argument of pcre_exec() is non-zero, indicating that match-         However,  if the startoffset argument of pcre_exec() is non-zero, indi-
2168         ing is to start at a point other than the beginning of the subject,  \A         cating that matching is to start at a point other than the beginning of
2169         can  never  match.  The difference between \Z and \z is that \Z matches         the  subject,  \A  can never match. The difference between \Z and \z is
2170         before a newline that is the last character of the string as well as at         that \Z matches before a newline that is  the  last  character  of  the
2171         the end of the string, whereas \z matches only at the end.         string  as well as at the end of the string, whereas \z matches only at
2172           the end.
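
The difference between \Z and \z comes down to one extra permitted position. A minimal Python sketch of the two conditions follows; the function names are hypothetical, and the code models only the rule stated above, not PCRE's implementation.

```python
def at_small_z(subject: str, pos: int) -> bool:
    """\\z: true only at the very end of the subject."""
    return pos == len(subject)


def at_capital_z(subject: str, pos: int) -> bool:
    """\\Z: true at the very end, or just before a newline that is the
    last character of the subject."""
    if pos == len(subject):
        return True
    return pos == len(subject) - 1 and subject[pos] == "\n"


subject = "abc\n"
print(at_capital_z(subject, 3), at_small_z(subject, 3))
```

For the subject "abc\n", position 3 (just before the final newline) satisfies \Z but not \z, while position 4 satisfies both.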
2173         The  \G assertion is true only when the current matching position is at  
2174         the start point of the match, as specified by the startoffset  argument         The \G assertion is true only when the current matching position is  at
2175         of  pcre_exec().  It  differs  from \A when the value of startoffset is         the  start point of the match, as specified by the startoffset argument
2176         non-zero. By calling pcre_exec() multiple times with appropriate  argu-         of pcre_exec(). It differs from \A when the  value  of  startoffset  is
2177           non-zero.  By calling pcre_exec() multiple times with appropriate argu-
2178         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
2179         mentation where \G can be useful.         mentation where \G can be useful.
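
The idea of mimicking /g by repeated calls with an advancing start offset can be sketched as a loop. The example below uses Python's re module as a stand-in, on the assumption that a compiled pattern's match(subject, pos) method, which anchors the match at pos, plays the role of a \G-anchored pattern passed to pcre_exec() with startoffset set to pos. The function match_all_anchored is hypothetical.

```python
import re


def match_all_anchored(pattern: str, subject: str) -> list:
    """Collect successive matches, each required to start where the last ended."""
    compiled = re.compile(pattern)
    pos, found = 0, []
    while pos <= len(subject):
        m = compiled.match(subject, pos)  # anchored at pos, like \G at startoffset
        if m is None:
            break
        found.append(m.group())
        # Step past an empty match so that the loop always advances.
        pos = m.end() if m.end() > pos else pos + 1
    return found


print(match_all_anchored(r"\d+,?", "12,345,x6"))
```

The scan stops at the "x" because each match must begin exactly at the current offset; the trailing "6" is never reached. An unanchored search would have skipped ahead and found it.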
2180    
2181         Note, however, that PCRE's interpretation of \G, as the  start  of  the         Note,  however,  that  PCRE's interpretation of \G, as the start of the
2182         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
2183         end of the previous match. In Perl, these can  be  different  when  the         end  of  the  previous  match. In Perl, these can be different when the
2184         previously  matched  string was empty. Because PCRE does just one match         previously matched string was empty. Because PCRE does just  one  match
2185         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
2186    
2187         If all the alternatives of a pattern begin with \G, the  expression  is         If  all  the alternatives of a pattern begin with \G, the expression is
2188         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
2189         in the compiled regular expression.         in the compiled regular expression.
2190    
# Line 1849  BACKSLASH Line 2192  BACKSLASH
2192  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
2193    
2194         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
2195         character  is  an  assertion which is true only if the current matching         character is an assertion that is true only  if  the  current  matching
2196         point is at the start of the subject string. If the  startoffset  argu-         point  is  at the start of the subject string. If the startoffset argu-
2197         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
2198         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
2199         has an entirely different meaning (see below).         has an entirely different meaning (see below).
2200    
2201         Circumflex  need  not be the first character of the pattern if a number         Circumflex need not be the first character of the pattern if  a  number
2202         of alternatives are involved, but it should be the first thing in  each         of  alternatives are involved, but it should be the first thing in each
2203         alternative  in  which  it appears if the pattern is ever to match that         alternative in which it appears if the pattern is ever  to  match  that
2204         branch. If all possible alternatives start with a circumflex, that  is,         branch.  If all possible alternatives start with a circumflex, that is,
2205         if  the  pattern  is constrained to match only at the start of the sub-         if the pattern is constrained to match only at the start  of  the  sub-
2206         ject, it is said to be an "anchored" pattern.  (There  are  also  other         ject,  it  is  said  to be an "anchored" pattern. (There are also other
2207         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
2208    
2209         A  dollar  character  is an assertion which is true only if the current         A dollar character is an assertion that is true  only  if  the  current
2210         matching point is at the end of  the  subject  string,  or  immediately         matching  point  is  at  the  end of the subject string, or immediately
2211         before a newline character that is the last character in the string (by         before a newline character that is the last character in the string (by
2212         default). Dollar need not be the last character of  the  pattern  if  a         default).  Dollar  need  not  be the last character of the pattern if a
2213         number  of alternatives are involved, but it should be the last item in         number of alternatives are involved, but it should be the last item  in
2214         any branch in which it appears.  Dollar has no  special  meaning  in  a         any  branch  in  which  it appears.  Dollar has no special meaning in a
2215         character class.         character class.
2216    
2217         The  meaning  of  dollar  can be changed so that it matches only at the         The meaning of dollar can be changed so that it  matches  only  at  the
2218         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at         very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
2219         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
2220    
2221         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
2222         PCRE_MULTILINE option is set. When this is the case, they match immedi-         PCRE_MULTILINE option is set. When this is the case, they match immedi-
2223         ately  after  and  immediately  before  an  internal newline character,         ately after and  immediately  before  an  internal  newline  character,
2224         respectively, in addition to matching at the start and end of the  sub-         respectively,  in addition to matching at the start and end of the sub-
2225         ject  string.  For  example,  the  pattern  /^abc$/ matches the subject         ject string. For example,  the  pattern  /^abc$/  matches  the  subject
2226         string "def\nabc" in multiline mode, but not  otherwise.  Consequently,         string  "def\nabc"  (where \n represents a newline character) in multi-
2227         patterns  that  are  anchored  in single line mode because all branches         line mode, but not otherwise.  Consequently, patterns that are anchored
2228         start with ^ are not anchored in multiline mode, and a match  for  cir-         in  single line mode because all branches start with ^ are not anchored
2229         cumflex  is  possible  when  the startoffset argument of pcre_exec() is         in multiline mode, and a match for  circumflex  is  possible  when  the
2230         non-zero. The PCRE_DOLLAR_ENDONLY option is ignored  if  PCRE_MULTILINE         startoffset   argument   of  pcre_exec()  is  non-zero.  The  PCRE_DOL-
2231         is set.         LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
2232    
2233         Note  that  the sequences \A, \Z, and \z can be used to match the start         Note that the sequences \A, \Z, and \z can be used to match  the  start
2234         and end of the subject in both modes, and if all branches of a  pattern         and  end of the subject in both modes, and if all branches of a pattern
2235         start  with  \A it is always anchored, whether PCRE_MULTILINE is set or         start with \A it is always anchored, whether PCRE_MULTILINE is  set  or
2236         not.         not.
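
The anchoring rules described above can be tried out with a small sketch. It uses Python's re module as a stand-in rather than PCRE itself, on the assumption that re.MULTILINE behaves like PCRE_MULTILINE for these simple cases. The helper name matches() is purely illustrative. PCRE_DOLLAR_ENDONLY has no direct counterpart in Python's re and is not shown.

```python
import re

def matches(pattern, subject, flags=0):
    """Return True if pattern matches anywhere in subject."""
    return re.search(pattern, subject, flags) is not None

# By default, dollar matches at the very end of the subject,
# or immediately before a newline that is its last character.
at_end = matches(r"abc$", "abc")
before_final_newline = matches(r"abc$", "abc\n")

# An internal newline is not an anchor point in single line mode...
single_line = matches(r"^abc$", "def\nabc")
# ...but it is in multiline mode.
multi_line = matches(r"^abc$", "def\nabc", re.MULTILINE)

# \A anchors to the start of the subject even in multiline mode.
start_only = matches(r"\Aabc", "def\nabc", re.MULTILINE)
```

Only the last call differs from its circumflex equivalent: in multiline mode ^abc finds the "abc" after the newline, whereas \Aabc does not.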
2237    
2238    
2239  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
2240    
2241         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
2242         ter  in  the  subject,  including a non-printing character, but not (by         ter in the subject, including a non-printing  character,  but  not  (by
2243         default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,         default)  newline.   In  UTF-8 mode, a dot matches any UTF-8 character,
2244         which  might  be  more than one byte long, except (by default) for new-         which might be more than one byte long, except (by default) newline. If
2245         line. If the PCRE_DOTALL option is set, dots match  newlines  as  well.         the  PCRE_DOTALL  option  is set, dots match newlines as well. The han-
2246         The  handling of dot is entirely independent of the handling of circum-         dling of dot is entirely independent of the handling of circumflex  and
2247         flex and dollar, the only relationship being  that  they  both  involve         dollar,  the  only  relationship  being  that they both involve newline
2248         newline characters. Dot has no special meaning in a character class.         characters. Dot has no special meaning in a character class.
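
As a rough illustration of the dot behaviour just described, the following sketch again uses Python's re module in place of PCRE, assuming that re.DOTALL corresponds to PCRE_DOTALL. It also shows that the multiline setting has no effect on dot.

```python
import re

# By default, a dot does not match a newline.
dot_default = re.search(r"a.c", "a\nc")

# With the "dot matches all" option set, it does.
dot_all = re.search(r"a.c", "a\nc", re.DOTALL)

# The multiline option changes ^ and $ only; dot is unaffected.
dot_multiline = re.search(r"a.c", "a\nc", re.MULTILINE)

# Inside a character class, a dot is an ordinary character.
class_vs_letter = re.search(r"a[.]c", "abc")
class_vs_dot = re.search(r"a[.]c", "a.c")
```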
2249    
2250    
2251  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
2252    
2253         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
2254         both in and out of UTF-8 mode. Unlike a dot, it always matches  a  new-         both  in  and  out of UTF-8 mode. Unlike a dot, it can match a newline.
2255         line.  The  feature  is  provided  in Perl in order to match individual         The feature is provided in Perl in order to match individual  bytes  in
2256         bytes in UTF-8 mode.  Because it breaks up UTF-8 characters into  indi-         UTF-8  mode.  Because  it  breaks  up  UTF-8 characters into individual
2257         vidual  bytes,  what  remains  in  the  string may be a malformed UTF-8         bytes, what remains in the string may be a malformed UTF-8 string.  For
2258         string. For this reason it is best avoided.         this reason, the \C escape sequence is best avoided.
2259    
2260         PCRE does not allow \C to appear in lookbehind assertions (see  below),         PCRE  does  not  allow \C to appear in lookbehind assertions (described
2261         because in UTF-8 mode it makes it impossible to calculate the length of         below), because in UTF-8 mode this would make it impossible  to  calcu-
2262         the lookbehind.         late the length of the lookbehind.
2263    
2264    
2265  SQUARE BRACKETS  SQUARE BRACKETS AND CHARACTER CLASSES
2266    
2267         An opening square bracket introduces a character class, terminated by a         An opening square bracket introduces a character class, terminated by a
2268         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
2269         cial. If a closing square bracket is required as a member of the class,         cial. If a closing square bracket is required as a member of the class,
2270         it  should  be  the first data character in the class (after an initial         it should be the first data character in the class  (after  an  initial
2271         circumflex, if present) or escaped with a backslash.         circumflex, if present) or escaped with a backslash.
2272    
2273         A character class matches a single character in the subject.  In  UTF-8         A  character  class matches a single character in the subject. In UTF-8
2274         mode,  the character may occupy more than one byte. A matched character         mode, the character may occupy more than one byte. A matched  character
2275         must be in the set of characters defined by the class, unless the first         must be in the set of characters defined by the class, unless the first
2276         character  in  the  class definition is a circumflex, in which case the         character in the class definition is a circumflex, in  which  case  the
2277         subject character must not be in the set defined by  the  class.  If  a         subject  character  must  not  be in the set defined by the class. If a
2278         circumflex  is actually required as a member of the class, ensure it is         circumflex is actually required as a member of the class, ensure it  is
2279         not the first character, or escape it with a backslash.         not the first character, or escape it with a backslash.
2280    
2281         For example, the character class [aeiou] matches any lower case  vowel,         For  example, the character class [aeiou] matches any lower case vowel,
2282         while  [^aeiou]  matches  any character that is not a lower case vowel.         while [^aeiou] matches any character that is not a  lower  case  vowel.
2283         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
2284         characters which are in the class by enumerating those that are not. It         characters that are in the class by enumerating those that are  not.  A
2285         is not an assertion: it still consumes a  character  from  the  subject         class  that starts with a circumflex is not an assertion: it still con-
2286         string, and fails if the current pointer is at the end of the string.         sumes a character from the subject string, and therefore  it  fails  if
2287           the current pointer is at the end of the string.
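
The points above about negated classes and about placing "]" and "^" in a class can be checked with a short sketch. It uses Python's re module, not PCRE, on the assumption that the two agree for these basic cases.

```python
import re

# A class matches exactly one character from its set.
vowel = re.fullmatch(r"[aeiou]", "e")

# The negated class matches any character not in the set; with
# caseful matching, an upper case vowel is not in the set.
upper_vowel = re.fullmatch(r"[^aeiou]", "E")

# A negated class is not an assertion: it must consume a character,
# so it fails when the matching point is at the end of the subject.
at_end = re.search(r"x[^aeiou]", "x")

# A closing square bracket placed first in the class is a member.
bracket_member = re.fullmatch(r"[]a]", "]")

# A circumflex that is not first in the class is a member.
circumflex_member = re.fullmatch(r"[a^]", "^")
```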
2288    
2289         In  UTF-8 mode, characters with values greater than 255 can be included         In  UTF-8 mode, characters with values greater than 255 can be included
2290         in a class as a literal string of bytes, or by using the  \x{  escaping         in a class as a literal string of bytes, or by using the  \x{  escaping
# Line 1949  SQUARE BRACKETS Line 2293  SQUARE BRACKETS
2293         When  caseless  matching  is set, any letters in a class represent both         When  caseless  matching  is set, any letters in a class represent both
2294         their upper case and lower case versions, so for  example,  a  caseless         their upper case and lower case versions, so for  example,  a  caseless
2295         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
2296         match "A", whereas a caseful version would. PCRE does not  support  the         match "A", whereas a caseful version would. When running in UTF-8 mode,
2297         concept of case for characters with values greater than 255.         PCRE  supports  the  concept of case for characters with values greater
2298           than 128 only when it is compiled with Unicode property support.
2299    
2300         The  newline character is never treated in any special way in character         The newline character is never treated in any special way in  character
2301         classes, whatever the setting  of  the  PCRE_DOTALL  or  PCRE_MULTILINE         classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE
2302         options is. A class such as [^a] will always match a newline.         options is. A class such as [^a] will always match a newline.
2303    
2304         The  minus (hyphen) character can be used to specify a range of charac-         The minus (hyphen) character can be used to specify a range of  charac-
2305         ters in a character  class.  For  example,  [d-m]  matches  any  letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
2306         between  d  and  m,  inclusive.  If  a minus character is required in a         between d and m, inclusive. If a  minus  character  is  required  in  a
2307         class, it must be escaped with a backslash  or  appear  in  a  position         class,  it  must  be  escaped  with a backslash or appear in a position
2308         where  it cannot be interpreted as indicating a range, typically as the         where it cannot be interpreted as indicating a range, typically as  the
2309         first or last character in the class.         first or last character in the class.
2310    
2311         It is not possible to have the literal character "]" as the end charac-         It is not possible to have the literal character "]" as the end charac-
2312         ter  of a range. A pattern such as [W-]46] is interpreted as a class of         ter of a range. A pattern such as [W-]46] is interpreted as a class  of
2313         two characters ("W" and "-") followed by a literal string "46]", so  it         two  characters ("W" and "-") followed by a literal string "46]", so it
2314         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a         would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
2315         backslash it is interpreted as the end of range, so [W-\]46] is  inter-         backslash  it is interpreted as the end of range, so [W-\]46] is inter-
2316         preted  as  a  single class containing a range followed by two separate         preted as a class containing a range followed by two other  characters.
2317         characters. The octal or hexadecimal representation of "]" can also  be         The  octal or hexadecimal representation of "]" can also be used to end
2318         used to end a range.         a range.
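
The range rules above, including the treatment of "]" as a range end point, can be sketched as follows. This uses Python's re module as a stand-in for PCRE; the two are assumed to parse these particular classes in the same way.

```python
import re

# A simple range is inclusive at both ends.
inside = re.fullmatch(r"[d-m]", "m")
outside = re.fullmatch(r"[d-m]", "n")

# [W-]46] is a two-character class ("W" and "-") followed by
# the literal string "46]".
w_then_literal = re.fullmatch(r"[W-]46]", "W46]")
minus_then_literal = re.fullmatch(r"[W-]46]", "-46]")

# With the "]" escaped, the whole item is one class: the range
# W to ] plus the characters "4" and "6".
in_escaped_range = re.fullmatch(r"[W-\]46]", "Z")
extra_member = re.fullmatch(r"[W-\]46]", "4")
not_a_string = re.fullmatch(r"[W-\]46]", "W46]")

# A hyphen placed first in the class cannot indicate a range.
literal_hyphen = re.fullmatch(r"[-az]", "-")

# A negated class matches a newline, whatever options are set.
newline_in_class = re.fullmatch(r"[^a]", "\n")
```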
2319    
2320         Ranges  operate in the collating sequence of character values. They can         Ranges operate in the collating sequence of character values. They  can
2321         also  be  used  for  characters  specified  numerically,  for   example         also   be  used  for  characters  specified  numerically,  for  example
2322         [\000-\037].  In UTF-8 mode, ranges can include characters whose values         [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
2323         are greater than 255, for example [\x{100}-\x{2ff}].         are greater than 255, for example [\x{100}-\x{2ff}].
2324    
2325         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
2326         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
2327         to [][\^_`wxyzabc], matched caselessly, and if character tables for the         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
2328         "fr"  locale  are  in use, [\xc8-\xcb] matches accented E characters in         character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
2329         both cases.         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
2330           concept of case for characters with values greater than 128  only  when
2331         The character types \d, \D, \s, \S, \w, and \W may  also  appear  in  a         it is compiled with Unicode property support.
2332         character  class,  and add the characters that they match to the class.  
2333         For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can         The  character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
2334         conveniently  be  used with the upper case character types to specify a         in a character class, and add the characters that  they  match  to  the
2335         more restricted set of characters than the matching  lower  case  type.         class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
2336         For  example,  the  class  [^\W_]  matches any letter or digit, but not         flex can conveniently be used with the upper case  character  types  to
2337         underscore.         specify  a  more  restricted  set of characters than the matching lower
2338           case type. For example, the class [^\W_] matches any letter  or  digit,
2339         All non-alphameric characters other than \, -, ^ (at the start) and the         but not underscore.
2340         terminating ] are non-special in character classes, but it does no harm  
2341         if they are escaped.         The  only  metacharacters  that are recognized in character classes are
2342           backslash, hyphen (only where it can be  interpreted  as  specifying  a
2343           range),  circumflex  (only  at the start), opening square bracket (only
2344           when it can be interpreted as introducing a POSIX class name - see  the
2345           next  section),  and  the  terminating closing square bracket. However,
2346           escaping other non-alphanumeric characters does no harm.
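
The use of character types inside a class can be illustrated with the two examples given above. The sketch uses Python's re module rather than PCRE, and passes re.ASCII so that \d and \w are restricted to ASCII characters; that flag is an assumption made to keep the behaviour close to PCRE's default tables.

```python
import re

# \d adds the decimal digits to the explicitly listed letters.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)

# Negating \W (non-word) together with "_" leaves letters and digits.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

hex_results = [hex_digit.fullmatch(c) is not None for c in "7CGc"]
word_results = [letter_or_digit.fullmatch(c) is not None for c in "a5_-"]
```

With caseful matching, "c" is not accepted by the first class because only the upper case letters are listed.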
2347    
2348    
2349  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
2350    
2351         Perl supports the POSIX notation  for  character  classes,  which  uses         Perl supports the POSIX notation for character classes. This uses names
2352         names  enclosed by [: and :] within the enclosing square brackets. PCRE         enclosed  by  [: and :] within the enclosing square brackets. PCRE also
2353         also supports this notation. For example,         supports this notation. For example,
2354    
2355           [01[:alpha:]%]           [01[:alpha:]%]
2356    
# Line 2037  POSIX CHARACTER CLASSES Line 2387  POSIX CHARACTER CLASSES
2387         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
2388         these are not supported, and an error is given if they are encountered.         these are not supported, and an error is given if they are encountered.
2389    
2390         In UTF-8 mode, characters with values greater than 255 do not match any         In UTF-8 mode, characters with values greater than 128 do not match any
2391         of the POSIX character classes.         of the POSIX character classes.
2392    
2393    
# Line 2104  INTERNAL OPTION SETTING Line 2454  INTERNAL OPTION SETTING
2454         in the same way as the Perl-compatible options by using the  characters         in the same way as the Perl-compatible options by using the  characters
2455         U  and X respectively. The (?X) flag setting is special in that it must         U  and X respectively. The (?X) flag setting is special in that it must
2456         always occur earlier in the pattern than any of the additional features         always occur earlier in the pattern than any of the additional features
2457         it turns on, even when it is at top level. It is best put at the start.         it  turns on, even when it is at top level. It is best to put it at the
2458           start.
2459    
2460    
2461  SUBPATTERNS  SUBPATTERNS
2462    
2463         Subpatterns are delimited by parentheses (round brackets), which can be         Subpatterns are delimited by parentheses (round brackets), which can be
2464         nested.  Marking part of a pattern as a subpattern does two things:         nested.  Turning part of a pattern into a subpattern does two things:
2465    
2466         1. It localizes a set of alternatives. For example, the pattern         1. It localizes a set of alternatives. For example, the pattern
2467    
# Line 2120  SUBPATTERNS Line 2471  SUBPATTERNS
2471         the parentheses, it would match "cataract",  "erpillar"  or  the  empty         the parentheses, it would match "cataract",  "erpillar"  or  the  empty
2472         string.         string.
2473    
2474         2.  It  sets  up  the  subpattern as a capturing subpattern (as defined         2.  It  sets  up  the  subpattern as a capturing subpattern. This means
2475         above).  When the whole pattern matches, that portion  of  the  subject         that, when the whole pattern  matches,  that  portion  of  the  subject
2476         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
2477         ovector argument of pcre_exec(). Opening parentheses are  counted  from         ovector argument of pcre_exec(). Opening parentheses are  counted  from
2478         left  to right (starting from 1) to obtain the numbers of the capturing         left  to  right  (starting  from 1) to obtain numbers for the capturing
2479         subpatterns.         subpatterns.
2480    
2481         For example, if the string "the red king" is matched against  the  pat-         For example, if the string "the red king" is matched against  the  pat-
# Line 2169  NAMED SUBPATTERNS Line 2520  NAMED SUBPATTERNS
2520         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying  capturing  parentheses  by number is simple, but it can be
2521         very hard to keep track of the numbers in complicated  regular  expres-         very hard to keep track of the numbers in complicated  regular  expres-
2522         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions.  Furthermore,  if  an  expression  is  modified, the numbers may
2523         change. To help with the difficulty, PCRE supports the naming  of  sub-         change. To help with this difficulty, PCRE supports the naming of  sub-
2524         patterns,  something  that  Perl  does  not  provide. The Python syntax         patterns,  something  that  Perl  does  not  provide. The Python syntax
2525         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
2526         underscores, and must be unique within a pattern.         underscores, and must be unique within a pattern.
2527    
2528         Named  capturing  parentheses  are  still  allocated numbers as well as         Named  capturing  parentheses  are  still  allocated numbers as well as
2529         names. The PCRE API provides function calls for extracting the name-to-         names. The PCRE API provides function calls for extracting the name-to-
2530         number  translation  table from a compiled pattern. For further details         number  translation table from a compiled pattern. There is also a con-
2531         see the pcreapi documentation.         venience function for extracting a captured substring by name. For fur-
2532           ther details see the pcreapi documentation.
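
Because the naming syntax is taken from Python, the relationship between names and numbers can be sketched directly with Python's re module. This is only an analogy for the PCRE API functions mentioned above, not a demonstration of them; the pattern and subject are invented for illustration.

```python
import re

# Two named capturing subpatterns; they are also numbered 1 and 2.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.search("released 2004-09")

# The same captured substring is available by name or by number.
by_name = m.group("year")
by_number = m.group(1)

# The name-to-number translation table of the compiled pattern.
name_to_number = dict(date.groupindex)
```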
2533    
2534    
2535  REPETITION  REPETITION
2536    
2537         Repetition is specified by quantifiers, which can  follow  any  of  the         Repetition  is  specified  by  quantifiers, which can follow any of the
2538         following items:         following items:
2539    
2540           a literal data character           a literal data character
2541           the . metacharacter           the . metacharacter
2542           the \C escape sequence           the \C escape sequence
2543           escapes such as \d that match single characters           the \X escape sequence (in UTF-8 mode with Unicode properties)
2544             an escape such as \d that matches a single character
2545           a character class           a character class
2546           a back reference (see next section)           a back reference (see next section)
2547           a parenthesized subpattern (unless it is an assertion)           a parenthesized subpattern (unless it is an assertion)
2548    
2549         The  general repetition quantifier specifies a minimum and maximum num-         The general repetition quantifier specifies a minimum and maximum  num-
2550         ber of permitted matches, by giving the two numbers in  curly  brackets         ber  of  permitted matches, by giving the two numbers in curly brackets
2551         (braces),  separated  by  a comma. The numbers must be less than 65536,         (braces), separated by a comma. The numbers must be  less  than  65536,
2552         and the first must be less than or equal to the second. For example:         and the first must be less than or equal to the second. For example:
2553    
2554           z{2,4}           z{2,4}
2555    
2556         matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a         matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
2557         special  character.  If  the second number is omitted, but the comma is         special character. If the second number is omitted, but  the  comma  is
2558         present, there is no upper limit; if the second number  and  the  comma         present,  there  is  no upper limit; if the second number and the comma
2559         are  both omitted, the quantifier specifies an exact number of required         are both omitted, the quantifier specifies an exact number of  required
2560         matches. Thus         matches. Thus
2561    
2562           [aeiou]{3,}           [aeiou]{3,}
# Line 2212  REPETITION Line 2565  REPETITION
2565    
2566           \d{8}           \d{8}
2567    
2568         matches exactly 8 digits. An opening curly bracket that  appears  in  a         matches  exactly  8  digits. An opening curly bracket that appears in a
2569         position  where a quantifier is not allowed, or one that does not match         position where a quantifier is not allowed, or one that does not  match
2570         the syntax of a quantifier, is taken as a literal character. For  exam-         the  syntax of a quantifier, is taken as a literal character. For exam-
2571         ple, {,6} is not a quantifier, but a literal string of four characters.         ple, {,6} is not a quantifier, but a literal string of four characters.
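
The quantifier forms described above can be tried with the following sketch, which uses Python's re module as a stand-in for PCRE. Note one known difference: Python treats {,6} as a quantifier meaning "at most 6", so that particular case is deliberately not shown here.

```python
import re

def full(pattern, subject):
    """Return True if pattern matches the whole of subject."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
bounded = [full(r"z{2,4}", s) for s in ("z", "zz", "zzzz", "zzzzz")]

# {3,}: three or more repetitions, with no upper limit.
open_ended = [full(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]

# {8}: exactly eight repetitions.
exact = [full(r"\d{8}", s) for s in ("2007022", "20070224")]

# A brace that does not fit the quantifier syntax is a literal.
literal_brace = full(r"a{x}", "a{x}")
```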
2572    
2573         In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to         In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
2574         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
2575         acters, each of which is represented by a two-byte sequence.         acters, each of which is represented by a two-byte sequence. Similarly,
2576           when Unicode property support is available, \X{3} matches three Unicode
2577           extended  sequences,  each of which may be several bytes long (and they
2578           may be of different lengths).
2579    
2580         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
2581         the previous item and the quantifier were not present.         the previous item and the quantifier were not present.
# Line 2247  REPETITION Line 2603  REPETITION
2603         as  possible  (up  to  the  maximum number of permitted times), without         as  possible  (up  to  the  maximum number of permitted times), without
2604         causing the rest of the pattern to fail. The classic example  of  where         causing the rest of the pattern to fail. The classic example  of  where
2605         this gives problems is in trying to match comments in C programs. These         this gives problems is in trying to match comments in C programs. These
2606         appear between the sequences /* and */ and within the  sequence,  indi-         appear between /* and */ and within the comment,  individual  *  and  /
2607         vidual * and / characters may appear. An attempt to match C comments by         characters  may  appear. An attempt to match C comments by applying the
2608         applying the pattern         pattern
2609    
2610           /\*.*\*/           /\*.*\*/
2611    
2612         to the string         to the string
2613    
2614           /* first command */  not comment  /* second comment */           /* first comment */  not comment  /* second comment */
2615    
2616         fails, because it matches the entire string owing to the greediness  of         fails, because it matches the entire string owing to the greediness  of
2617         the .*  item.         the .*  item.
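
The effect of greediness on the comment example can be reproduced with a short sketch. It uses Python's re module in place of PCRE, which is assumed to behave identically for this pattern. The second pattern adds a question mark after the quantifier to make it match the minimum number of times.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Minimizing: .*? stops at the first */ that lets the match succeed.
lazy = re.search(r"/\*.*?\*/", subject).group(0)
```

Here greedy is the entire subject string, whereas lazy is just "/* first comment */".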
# Line 2283  REPETITION Line 2639  REPETITION
2639         words, it inverts the default behaviour.         words, it inverts the default behaviour.
2640    
2641         When  a  parenthesized  subpattern  is quantified with a minimum repeat         When  a  parenthesized  subpattern  is quantified with a minimum repeat
2642         count that is greater than 1 or with a limited maximum, more  store  is         count that is greater than 1 or with a limited maximum, more memory  is
2643         required  for  the  compiled  pattern, in proportion to the size of the         required  for  the  compiled  pattern, in proportion to the size of the
2644         minimum or maximum.         minimum or maximum.
2645    
# Line 2374  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 2730  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
2730         consists  of  an  additional  + character following a quantifier. Using         consists  of  an  additional  + character following a quantifier. Using
2731         this notation, the previous example can be rewritten as         this notation, the previous example can be rewritten as
2732    
2733           \d++bar           \d++foo
2734    
2735         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Possessive  quantifiers  are  always  greedy;  the   setting   of   the
2736         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
# Line 2399  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 2755  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
2755           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2756    
2757         it takes a long time before reporting  failure.  This  is  because  the         it takes a long time before reporting  failure.  This  is  because  the
2758         string  can  be  divided  between  the two repeats in a large number of         string  can be divided between the internal \D+ repeat and the external
2759         ways, and all have to be tried. (The example used [!?]  rather  than  a         * repeat in a large number of ways, and all  have  to  be  tried.  (The
2760         single  character  at the end, because both PCRE and Perl have an opti-         example  uses  [!?]  rather than a single character at the end, because
2761         mization that allows for fast failure when a single character is  used.         both PCRE and Perl have an optimization that allows  for  fast  failure
2762         They  remember  the last single character that is required for a match,         when  a single character is used. They remember the last single charac-
2763         and fail early if it is not present in the string.)  If the pattern  is         ter that is required for a match, and fail early if it is  not  present
2764         changed to         in  the  string.)  If  the pattern is changed so that it uses an atomic
2765           group, like this:
2766    
2767           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
2768    
2769         sequences  of non-digits cannot be broken, and failure happens quickly.         sequences of non-digits cannot be broken, and failure happens  quickly.
2770    
2771    
2772  BACK REFERENCES  BACK REFERENCES
2773    
2774         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
2775         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
2776         pattern earlier (that is, to its left) in the pattern,  provided  there         pattern  earlier  (that is, to its left) in the pattern, provided there
2777         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
2778    
2779         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
2780         it is always taken as a back reference, and causes  an  error  only  if         it  is  always  taken  as a back reference, and causes an error only if
2781         there  are  not that many capturing left parentheses in the entire pat-         there are not that many capturing left parentheses in the  entire  pat-
2782         tern. In other words, the parentheses that are referenced need  not  be         tern.  In  other words, the parentheses that are referenced need not be
2783         to  the left of the reference for numbers less than 10. See the section         to the left of the reference for numbers less than 10. See the  subsec-
2784         entitled "Backslash" above for further details of the handling of  dig-         tion  entitled  "Non-printing  characters" above for further details of
2785         its following a backslash.         the handling of digits following a backslash.
2786    
2787         A  back  reference matches whatever actually matched the capturing sub-         A back reference matches whatever actually matched the  capturing  sub-
2788         pattern in the current subject string, rather  than  anything  matching         pattern  in  the  current subject string, rather than anything matching
2789         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
2790         of doing that). So the pattern         of doing that). So the pattern
2791    
2792           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
2793    
2794         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
2795         not  "sense and responsibility". If caseful matching is in force at the         not "sense and responsibility". If caseful matching is in force at  the
2796         time of the back reference, the case of letters is relevant. For  exam-         time  of the back reference, the case of letters is relevant. For exam-
2797         ple,         ple,
2798    
2799           ((?i)rah)\s+\1           ((?i)rah)\s+\1
2800    
2801         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
2802         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
2803    
2804         Back references to named subpatterns use the Python  syntax  (?P=name).         Back  references  to named subpatterns use the Python syntax (?P=name).
2805         We could rewrite the above example as follows:         We could rewrite the above example as follows:
2806    
2807           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
2808    
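As a quick check of the back reference behaviour described above, the following sketch uses Python's re module rather than PCRE itself. That substitution is an assumption made for illustration: Python's re accepts the same \1 and (?P=name) syntax, but it is a different engine. The group name "stem" and the helper full_match are hypothetical names, not part of PCRE.

```python
import re

# Numbered back reference: \1 must repeat the text that group 1 matched.
NUMBERED = re.compile(r"(sens|respons)e and \1ibility")

# Named back reference in the Python syntax: (?P<name>...) and (?P=name).
NAMED = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")


def full_match(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None
```

With these definitions, full_match(NUMBERED, "sense and responsibility") is expected to be False, because \1 has to repeat "sens" and not merely something that the subpattern could match.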
2809         There  may be more than one back reference to the same subpattern. If a         There may be more than one back reference to the same subpattern. If  a
2810         subpattern has not actually been used in a particular match,  any  back         subpattern  has  not actually been used in a particular match, any back
2811         references to it always fail. For example, the pattern         references to it always fail. For example, the pattern
2812    
2813           (a|(bc))\2           (a|(bc))\2
2814    
2815         always  fails if it starts to match "a" rather than "bc". Because there         always fails if it starts to match "a" rather than "bc". Because  there
2816         may be many capturing parentheses in a pattern,  all  digits  following         may  be  many  capturing parentheses in a pattern, all digits following
2817         the  backslash  are taken as part of a potential back reference number.         the backslash are taken as part of a potential back  reference  number.
2818         If the pattern continues with a digit character, some delimiter must be         If the pattern continues with a digit character, some delimiter must be
2819         used  to  terminate  the back reference. If the PCRE_EXTENDED option is         used to terminate the back reference. If the  PCRE_EXTENDED  option  is
2820         set, this can be whitespace.  Otherwise an empty comment can be used.         set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
2821           ments" below) can be used.
2822    
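The two points in the preceding paragraph (a reference to an unused subpattern fails, and a delimiter is needed before a following digit) can be sketched as follows. This again uses Python's re module as a stand-in for PCRE, which is an assumption; re.VERBOSE plays the role of PCRE_EXTENDED here, and the constant names are hypothetical.

```python
import re

# Group 2 is only set when the "bc" alternative is taken, so \2 fails
# whenever the match goes through the "a" alternative.
UNSET_GROUP = re.compile(r"(a|(bc))\2")

# An empty comment terminates the back reference, so the "0" that follows
# is a literal digit rather than part of the reference number.
EMPTY_COMMENT = re.compile(r"(a)\1(?#)0")

# In extended mode, insignificant white space does the same job.
EXTENDED = re.compile(r"(a)\1 0", re.VERBOSE)
```

Both delimiter forms are meant to match the subject "aa0": the captured "a", its repetition via \1, and then a literal "0".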
2823         A back reference that occurs inside the parentheses to which it  refers         A back reference that occurs inside the parentheses to which it  refers
2824         fails  when  the subpattern is first used, so, for example, (a\1) never         fails  when  the subpattern is first used, so, for example, (a\1) never
# Line 2482  ASSERTIONS Line 2840  ASSERTIONS
2840         An assertion is a test on the characters  following  or  preceding  the         An assertion is a test on the characters  following  or  preceding  the
2841         current  matching  point that does not actually consume any characters.         current  matching  point that does not actually consume any characters.
2842         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
2843         described above.  More complicated assertions are coded as subpatterns.         described above.
2844         There are two kinds: those that look ahead of the current  position  in  
2845         the subject string, and those that look behind it.         More  complicated  assertions  are  coded as subpatterns. There are two
2846           kinds: those that look ahead of the current  position  in  the  subject
2847         An  assertion  subpattern  is matched in the normal way, except that it         string,  and  those  that  look  behind  it. An assertion subpattern is
2848         does not cause the current matching position to be  changed.  Lookahead         matched in the normal way, except that it does not  cause  the  current
2849         assertions  start with (?= for positive assertions and (?! for negative         matching position to be changed.
2850         assertions. For example,  
2851           Assertion  subpatterns  are  not  capturing subpatterns, and may not be
2852           repeated, because it makes no sense to assert the  same  thing  several
2853           times.  If  any kind of assertion contains capturing subpatterns within
2854           it, these are counted for the purposes of numbering the capturing  sub-
2855           patterns in the whole pattern.  However, substring capturing is carried
2856           out only for positive assertions, because it does not  make  sense  for
2857           negative assertions.
2858    
2859       Lookahead assertions
2860    
2861           Lookahead assertions start with (?= for positive assertions and (?! for
2862           negative assertions. For example,
2863    
2864           \w+(?=;)           \w+(?=;)
2865    
# Line 2506  ASSERTIONS Line 2876  ASSERTIONS
2876         does not find an occurrence of "bar"  that  is  preceded  by  something         does not find an occurrence of "bar"  that  is  preceded  by  something
2877         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other  than "foo"; it finds any occurrence of "bar" whatsoever, because
2878         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
2879         "bar". A lookbehind assertion is needed to achieve this effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
2880    
2881         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
2882         most convenient way to do it is  with  (?!)  because  an  empty  string         most convenient way to do it is  with  (?!)  because  an  empty  string
2883         always  matches, so an assertion that requires there not to be an empty         always  matches, so an assertion that requires there not to be an empty
2884         string must always fail.         string must always fail.
2885    
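The lookahead behaviour described in this subsection can be tried out with a small sketch. It uses Python's re module, not PCRE, on the assumption that the two agree for these simple assertions; the helper name found is hypothetical.

```python
import re


def found(pattern, subject):
    """Return the text matched by the first occurrence, or None."""
    m = re.search(pattern, subject)
    return m.group(0) if m else None


# Positive lookahead: the semicolon is required but not consumed.
word_before_semicolon = found(r"\w+(?=;)", "word;")

# (?!foo)bar still finds "bar" after "foo", because the assertion looks
# at the characters that follow the current point, which are "bar".
lookahead_result = found(r"(?!foo)bar", "foobar")

# A negative lookbehind is what actually rejects "bar" preceded by "foo".
lookbehind_result = found(r"(?<!foo)bar", "foobar")

# (?!) can never succeed, so it forces a failure at that point.
forced_failure = found(r"a(?!)", "a")
```

The design point is that an assertion never moves the matching position, so where it is placed relative to "bar" decides which characters it inspects.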
2886       Lookbehind assertions
2887    
2888         Lookbehind assertions start with (?<= for positive assertions and  (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
2889         for negative assertions. For example,         for negative assertions. For example,
2890    
# Line 2551  ASSERTIONS Line 2923  ASSERTIONS
2923    
2924         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
2925         mode)  to appear in lookbehind assertions, because it makes it impossi-         mode)  to appear in lookbehind assertions, because it makes it impossi-
2926         ble to calculate the length of the lookbehind.         ble to calculate the length of the lookbehind. The \X escape, which can
2927           match different numbers of bytes, is also not permitted.
2928    
2929         Atomic groups can be used in conjunction with lookbehind assertions  to         Atomic  groups can be used in conjunction with lookbehind assertions to
2930         specify efficient matching at the end of the subject string. Consider a         specify efficient matching at the end of the subject string. Consider a
2931         simple pattern such as         simple pattern such as
2932    
2933           abcd$           abcd$
2934    
2935         when applied to a long string that does  not  match.  Because  matching         when  applied  to  a  long string that does not match. Because matching
2936         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
2937         and then see if what follows matches the rest of the  pattern.  If  the         and  then  see  if what follows matches the rest of the pattern. If the
2938         pattern is specified as         pattern is specified as
2939    
2940           ^.*abcd$           ^.*abcd$
2941    
2942         the  initial .* matches the entire string at first, but when this fails         the initial .* matches the entire string at first, but when this  fails
2943         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
2944         last  character,  then all but the last two characters, and so on. Once         last character, then all but the last two characters, and so  on.  Once
2945         again the search for "a" covers the entire string, from right to  left,         again  the search for "a" covers the entire string, from right to left,
2946         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
2947    
2948           ^(?>.*)(?<=abcd)           ^(?>.*)(?<=abcd)
2949    
2950         or, equivalently,         or, equivalently, using the possessive quantifier syntax,
2951    
2952           ^.*+(?<=abcd)           ^.*+(?<=abcd)
2953    
2954         there  can  be  no  backtracking for the .* item; it can match only the         there can be no backtracking for the .* item; it  can  match  only  the
2955         entire string. The subsequent lookbehind assertion does a  single  test         entire  string.  The subsequent lookbehind assertion does a single test
2956         on  the last four characters. If it fails, the match fails immediately.         on the last four characters. If it fails, the match fails  immediately.
2957         For long strings, this approach makes a significant difference  to  the         For  long  strings, this approach makes a significant difference to the
2958         processing time.         processing time.
2959    
2960       Using multiple assertions
2961    
2962         Several assertions (of any sort) may occur in succession. For example,         Several assertions (of any sort) may occur in succession. For example,
2963    
2964           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
2965    
2966         matches  "foo" preceded by three digits that are not "999". Notice that         matches "foo" preceded by three digits that are not "999". Notice  that
2967         each of the assertions is applied independently at the  same  point  in         each  of  the  assertions is applied independently at the same point in
2968         the  subject  string.  First  there  is a check that the previous three         the subject string. First there is a  check  that  the  previous  three
2969         characters are all digits, and then there is  a  check  that  the  same         characters  are  all  digits,  and  then there is a check that the same
2970         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
2971         ceded by six characters, the first three of which are digits and the last         ceded by six characters, the first three of which are digits and the last
2972         three  of  which  are not "999". For example, it doesn't match "123abc-         three of which are not "999". For example, it  doesn't  match  "123abc-
2973         foo". A pattern to do that is         foo". A pattern to do that is
2974    
2975           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
2976    
2977         This time the first assertion looks at the  preceding  six  characters,         This  time  the  first assertion looks at the preceding six characters,
2978         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
2979         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
2980    
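A short sketch of the two patterns above, showing that successive assertions are each applied at the same point in the subject. It relies on Python's re module as a stand-in for PCRE (an assumption), and the names used are hypothetical.

```python
import re

# Both lookbehinds inspect the three characters immediately before "foo".
SAME_THREE = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first lookbehind inspects six characters (three digits, then
# any three characters); the second still inspects only the last three.
SIX_BEFORE = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def hits(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None
```

For instance, hits(SAME_THREE, "123abcfoo") is expected to be False while hits(SIX_BEFORE, "123abcfoo") is expected to be True.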
# Line 2607  ASSERTIONS Line 2982  ASSERTIONS
2982    
2983           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
2984    
2985         matches an occurrence of "baz" that is preceded by "bar" which in  turn         matches  an occurrence of "baz" that is preceded by "bar" which in turn
2986         is not preceded by "foo", while         is not preceded by "foo", while
2987    
2988           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
2989    
2990         is another pattern which matches "foo" preceded by three digits and any         is another pattern that matches "foo" preceded by three digits and  any
2991         three characters that are not "999".         three characters that are not "999".
2992    
        Assertion subpatterns are not capturing subpatterns,  and  may  not  be  
        repeated,  because  it  makes no sense to assert the same thing several  
        times. If any kind of assertion contains capturing  subpatterns  within  
        it,  these are counted for the purposes of numbering the capturing sub-  
        patterns in the whole pattern.  However, substring capturing is carried  
        out  only  for  positive assertions, because it does not make sense for  
        negative assertions.  
   
2993    
2994  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
2995    
2996         It is possible to cause the matching process to obey a subpattern  con-         It  is possible to cause the matching process to obey a subpattern con-
2997         ditionally  or to choose between two alternative subpatterns, depending         ditionally or to choose between two alternative subpatterns,  depending
2998         on the  result  of  an  assertion,  or  whether  a  previous  capturing         on  the result of an assertion, or whether a previous capturing subpat-
2999         subpattern  matched  or not. The two possible forms of conditional sub-         tern matched or not. The two possible forms of  conditional  subpattern
3000         pattern are         are
3001    
3002           (?(condition)yes-pattern)           (?(condition)yes-pattern)
3003           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
3004    
3005         If the condition is satisfied, the yes-pattern is used;  otherwise  the         If  the  condition is satisfied, the yes-pattern is used; otherwise the
3006         no-pattern  (if  present)  is used. If there are more than two alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
3007         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
3008    
3009         There are three kinds of condition. If the text between the parentheses         There are three kinds of condition. If the text between the parentheses
3010         consists  of  a  sequence  of digits, the condition is satisfied if the         consists of a sequence of digits, the condition  is  satisfied  if  the
3011         capturing subpattern of that number has previously matched. The  number         capturing  subpattern of that number has previously matched. The number
3012         must  be  greater than zero. Consider the following pattern, which con-         must be greater than zero. Consider the following pattern,  which  con-
3013         tains non-significant white space to make it more readable (assume  the         tains  non-significant white space to make it more readable (assume the
3014         PCRE_EXTENDED  option)  and  to  divide it into three parts for ease of         PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of
3015         discussion:         discussion:
3016    
3017           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
3018    
3019         The first part matches an optional opening  parenthesis,  and  if  that         The  first  part  matches  an optional opening parenthesis, and if that
3020         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
3021         ond part matches one or more characters that are not  parentheses.  The         ond  part  matches one or more characters that are not parentheses. The
3022         third part is a conditional subpattern that tests whether the first set         third part is a conditional subpattern that tests whether the first set
3023         of parentheses matched or not. If they did, that is, if the subject started         of parentheses matched or not. If they did, that is, if the subject started
3024         with an opening parenthesis, the condition is true, and so the yes-pat-         with an opening parenthesis, the condition is true, and so the yes-pat-
3025         tern is executed and a  closing  parenthesis  is  required.  Otherwise,         tern  is  executed  and  a  closing parenthesis is required. Otherwise,
3026         since  no-pattern  is  not  present, the subpattern matches nothing. In         since no-pattern is not present, the  subpattern  matches  nothing.  In
3027         other words,  this  pattern  matches  a  sequence  of  non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
3028         optionally enclosed in parentheses.         optionally enclosed in parentheses.
3029    
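The numbered-condition example above can be exercised with the following sketch. Python's re module is used in place of PCRE, with re.VERBOSE standing in for PCRE_EXTENDED; both are assumptions for illustration, and the names are hypothetical. Only the numeric form of condition is shown, because Python's re does not accept an assertion or (R) as a condition.

```python
import re

# ( \( )?      optional opening parenthesis, captured as group 1
# [^()]+       one or more characters that are not parentheses
# (?(1) \) )   a closing parenthesis is required only if group 1 matched
OPTIONAL_PARENS = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the conditional pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

Under these assumptions, "abc" and "(abc)" are accepted, while "(abc" and "abc)" are rejected.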
3030         If the condition is the string (R), it is satisfied if a recursive call         If the condition is the string (R), it is satisfied if a recursive call
3031         to the pattern or subpattern has been made. At "top level", the  condi-         to  the pattern or subpattern has been made. At "top level", the condi-
3032         tion  is  false.   This  is  a  PCRE  extension. Recursive patterns are         tion is false.  This  is  a  PCRE  extension.  Recursive  patterns  are
3033         described in the next section.         described in the next section.
3034    
3035         If the condition is not a sequence of digits or  (R),  it  must  be  an         If  the  condition  is  not  a sequence of digits or (R), it must be an
3036         assertion.   This may be a positive or negative lookahead or lookbehind         assertion.  This may be a positive or negative lookahead or  lookbehind
3037         assertion. Consider  this  pattern,  again  containing  non-significant         assertion.  Consider  this  pattern,  again  containing non-significant
3038         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
3039    
3040           (?(?=[^a-z]*[a-z])           (?(?=[^a-z]*[a-z])
3041           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
3042    
3043         The  condition  is  a  positive  lookahead  assertion  that  matches an         The condition  is  a  positive  lookahead  assertion  that  matches  an
3044         optional sequence of non-letters followed by a letter. In other  words,         optional  sequence of non-letters followed by a letter. In other words,
3045         it  tests  for the presence of at least one letter in the subject. If a         it tests for the presence of at least one letter in the subject.  If  a
3046         letter is found, the subject is matched against the first  alternative;         letter  is found, the subject is matched against the first alternative;
3047         otherwise  it  is  matched  against  the  second.  This pattern matches         otherwise it is  matched  against  the  second.  This  pattern  matches
3048         strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are         strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
3049         letters and dd are digits.         letters and dd are digits.
3050    
3051    
3052  COMMENTS  COMMENTS
3053    
3054         The sequence (?# marks the start of a comment which continues up to the         The sequence (?# marks the start of a comment that continues up to  the
3055         next closing parenthesis. Nested parentheses  are  not  permitted.  The         next  closing  parenthesis.  Nested  parentheses are not permitted. The
3056         characters  that make up a comment play no part in the pattern matching         characters that make up a comment play no part in the pattern  matching
3057         at all.         at all.
3058    
3059         If the PCRE_EXTENDED option is set, an unescaped # character outside  a         If  the PCRE_EXTENDED option is set, an unescaped # character outside a
3060         character class introduces a comment that continues up to the next new-         character class introduces a comment that continues up to the next new-
3061         line character in the pattern.         line character in the pattern.
3062    
3063    
3064  RECURSIVE PATTERNS  RECURSIVE PATTERNS
3065    
3066         Consider the problem of matching a string in parentheses, allowing  for         Consider  the problem of matching a string in parentheses, allowing for
3067         unlimited  nested  parentheses.  Without the use of recursion, the best         unlimited nested parentheses. Without the use of  recursion,  the  best
3068         that can be done is to use a pattern that  matches  up  to  some  fixed         that  can  be  done  is  to use a pattern that matches up to some fixed
3069         depth  of  nesting.  It  is not possible to handle an arbitrary nesting         depth of nesting. It is not possible to  handle  an  arbitrary  nesting
3070         depth. Perl has provided an experimental facility that  allows  regular         depth.  Perl  provides  a  facility  that allows regular expressions to
3071         expressions to recurse (amongst other things). It does this by interpo-         recurse (amongst other things). It does this by interpolating Perl code
3072         lating Perl code in the expression at run time, and the code can  refer         in the expression at run time, and the code can refer to the expression
3073         to the expression itself. A Perl pattern to solve the parentheses prob-         itself. A Perl pattern to solve the parentheses problem can be  created
3074         lem can be created like this:         like this:
3075    
3076           $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;           $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3077    
3078         The (?p{...}) item interpolates Perl code at run time, and in this case         The (?p{...}) item interpolates Perl code at run time, and in this case
3079         refers  recursively to the pattern in which it appears. Obviously, PCRE         refers recursively to the pattern in which it appears. Obviously,  PCRE
3080         cannot support the interpolation of Perl  code.  Instead,  it  supports         cannot  support  the  interpolation  of Perl code. Instead, it supports
3081         some  special  syntax for recursion of the entire pattern, and also for         some special syntax for recursion of the entire pattern, and  also  for
3082         individual subpattern recursion.         individual subpattern recursion.
3083    
3084         The special item that consists of (? followed by a number greater  than         The  special item that consists of (? followed by a number greater than
3085         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
3086         the given number, provided that it occurs inside that  subpattern.  (If         the  given  number, provided that it occurs inside that subpattern. (If
3087         not,  it  is  a  "subroutine" call, which is described in the next sec-         not, it is a "subroutine" call, which is described  in  the  next  sec-
3088         tion.) The special item (?R) is a recursive call of the entire  regular         tion.)  The special item (?R) is a recursive call of the entire regular
3089         expression.         expression.
3090    
3091         For  example,  this  PCRE pattern solves the nested parentheses problem         For example, this PCRE pattern solves the  nested  parentheses  problem
3092         (assume the  PCRE_EXTENDED  option  is  set  so  that  white  space  is         (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is
3093         ignored):         ignored):
3094    
3095           \( ( (?>[^()]+) | (?R) )* \)           \( ( (?>[^()]+) | (?R) )* \)
3096    
3097         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
3098         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
3099         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is,  a  correctly parenthe-
3100         sized substring).  Finally there is a closing parenthesis.         sized substring).  Finally there is a closing parenthesis.
3101    
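Python's re module has no (?R) item, so the recursive pattern above cannot be run there directly. As a rough illustration of what the pattern does, here is a hand-written recursive function that follows the same three steps. It is a sketch of the idea only, not how PCRE implements recursion, and the function name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Mimic  \\( ( (?>[^()]+) | (?R) )* \\)  anchored at pos.

    Returns the index just past the matching closing parenthesis,
    or None if there is no balanced group starting at pos.
    """
    # Step 1: an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    i = pos + 1
    while i < len(subject):
        ch = subject[i]
        if ch == ")":
            # Step 3: the closing parenthesis ends this level.
            return i + 1
        if ch == "(":
            # Step 2b: a nested group, handled by a recursive call,
            # which corresponds to the (?R) item.
            end = match_parens(subject, i)
            if end is None:
                return None
            i = end
        else:
            # Step 2a: part of a run of non-parentheses.
            i += 1
    # Ran off the end without finding the closing parenthesis.
    return None
```

Each recursive call handles one nesting level, which is why no fixed depth limit has to be built into the matcher.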
3102         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
3103         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
3104    
3105           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
3106    
3107         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
3108         refer to them instead of the whole pattern. In a larger pattern,  keep-         refer  to them instead of the whole pattern. In a larger pattern, keep-
3109         ing  track  of parenthesis numbers can be tricky. It may be more conve-         ing track of parenthesis numbers can be tricky. It may be  more  conve-
3110         nient to use named parentheses instead. For this, PCRE uses  (?P>name),         nient  to use named parentheses instead. For this, PCRE uses (?P>name),
3111         which  is  an  extension  to the Python syntax that PCRE uses for named         which is an extension to the Python syntax that  PCRE  uses  for  named
3112         parentheses (Perl does not provide named parentheses). We could rewrite         parentheses (Perl does not provide named parentheses). We could rewrite
3113         the above example as follows:         the above example as follows:
3114    
3115           (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )           (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
3116    
3117         This  particular example pattern contains nested unlimited repeats, and         This particular example pattern contains nested unlimited repeats,  and
3118         so the use of atomic grouping for matching strings  of  non-parentheses         so  the  use of atomic grouping for matching strings of non-parentheses
3119         is  important  when  applying the pattern to strings that do not match.         is important when applying the pattern to strings that  do  not  match.
3120         For example, when this pattern is applied to         For example, when this pattern is applied to
3121    
3122           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3123    
3124         it yields "no match" quickly. However, if atomic grouping is not  used,         it  yields "no match" quickly. However, if atomic grouping is not used,
3125         the  match  runs  for a very long time indeed because there are so many         the match runs for a very long time indeed because there  are  so  many
3126         different ways the + and * repeats can carve up the  subject,  and  all         different  ways  the  + and * repeats can carve up the subject, and all
3127         have to be tested before failure can be reported.         have to be tested before failure can be reported.
3128    
3129         At the end of a match, the values set for any capturing subpatterns are         At the end of a match, the values set for any capturing subpatterns are
3130         those from the outermost level of the recursion at which the subpattern         those from the outermost level of the recursion at which the subpattern
3131         value  is  set.   If  you want to obtain intermediate values, a callout         value is set.  If you want to obtain  intermediate  values,  a  callout
3132         function can be used (see below and the pcrecallout documentation).  If         function can be used (see the next section and the pcrecallout documen-
3133         the pattern above is matched against         tation). If the pattern above is matched against
3134    
3135           (ab(cd)ef)           (ab(cd)ef)
3136    
3137         the  value  for  the  capturing  parentheses is "ef", which is the last         the value for the capturing parentheses is  "ef",  which  is  the  last
3138         value taken on at the top level. If additional parentheses  are  added,         value  taken  on at the top level. If additional parentheses are added,
3139         giving         giving
3140    
3141           \( ( ( (?>[^()]+) | (?R) )* ) \)           \( ( ( (?>[^()]+) | (?R) )* ) \)
3142              ^                        ^              ^                        ^
3143              ^                        ^              ^                        ^
3144    
3145         the  string  they  capture is "ab(cd)ef", the contents of the top level         the string they capture is "ab(cd)ef", the contents of  the  top  level
3146         parentheses. If there are more than 15 capturing parentheses in a  pat-         parentheses.  If there are more than 15 capturing parentheses in a pat-
3147         tern, PCRE has to obtain extra memory to store data during a recursion,         tern, PCRE has to obtain extra memory to store data during a recursion,
3148         which it does by using pcre_malloc, freeing  it  via  pcre_free  after-         which  it  does  by  using pcre_malloc, freeing it via pcre_free after-
3149         wards.  If  no  memory  can  be  obtained,  the  match  fails  with the         wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
3150         PCRE_ERROR_NOMEMORY error.         PCRE_ERROR_NOMEMORY error.
3151    
3152         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do  not  confuse  the (?R) item with the condition (R), which tests for
3153         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.  Consider this pattern, which matches text in  angle  brack-
3154         ets, allowing for arbitrary nesting. Only digits are allowed in  nested         ets,  allowing for arbitrary nesting. Only digits are allowed in nested
3155         brackets  (that is, when recursing), whereas any characters are permit-         brackets (that is, when recursing), whereas any characters are  permit-
3156         ted at the outer level.         ted at the outer level.
3157    
3158           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
3159    
3160         In this pattern, (?(R) is the start of a conditional  subpattern,  with         In  this  pattern, (?(R) is the start of a conditional subpattern, with
3161         two  different  alternatives for the recursive and non-recursive cases.         two different alternatives for the recursive and  non-recursive  cases.
3162         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
3163    
3164    
3165  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
3166    
3167         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
3168         by  name)  is used outside the parentheses to which it refers, it oper-         by name) is used outside the parentheses to which it refers,  it  oper-
3169         ates like a subroutine in a programming language.  An  earlier  example         ates  like  a  subroutine in a programming language. An earlier example
3170         pointed out that the pattern         pointed out that the pattern
3171    
3172           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
3173    
3174         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
3175         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
3176    
3177           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
3178    
3179         is used, it does match "sense and responsibility" as well as the  other         is  used, it does match "sense and responsibility" as well as the other
3180         two  strings.  Such  references must, however, follow the subpattern to         two strings. Such references must, however, follow  the  subpattern  to
3181         which they refer.         which they refer.
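The difference between the back reference \1 and the subroutine call (?1) can be sketched without any regex engine at all. The following is only an illustration under stated assumptions: the function names match_backreference and match_subroutine are hypothetical, PCRE itself is not used, and for simplicity the whole subject must match (the patterns above are not anchored).

```c
#include <assert.h>
#include <string.h>

/* The two alternatives of the group (sens|respons). */
static const char *const stems[] = { "sens", "respons" };
enum { NSTEMS = 2 };

/* Return 1 if s is exactly stems[i] + "e and " + stems[j] + "ibility". */
static int matches_with(const char *s, int i, int j)
{
    char expected[64];   /* longest case is 27 characters plus the NUL */

    strcpy(expected, stems[i]);
    strcat(expected, "e and ");
    strcat(expected, stems[j]);
    strcat(expected, "ibility");
    return strcmp(s, expected) == 0;
}

/* Like \1: the second stem must be the text the group actually matched. */
int match_backreference(const char *s)
{
    int i;

    assert(s != NULL);
    for (i = 0; i < NSTEMS; i++)
        if (matches_with(s, i, i)) return 1;
    return 0;
}

/* Like (?1): the group's pattern is applied again, so any stem will do. */
int match_subroutine(const char *s)
{
    int i, j;

    assert(s != NULL);
    for (i = 0; i < NSTEMS; i++)
        for (j = 0; j < NSTEMS; j++)
            if (matches_with(s, i, j)) return 1;
    return 0;
}
```

With these definitions, "sense and responsibility" is rejected by match_backreference() but accepted by match_subroutine(), mirroring the behaviour described above.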
3182    
3183    
3184  CALLOUTS  CALLOUTS
3185    
3186         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
3187         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
3188         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
3189         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
3190         tion.         tion.
3191    
3192         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
3193         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
3194         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
3195         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
3196         all calling out.         all calling out.
3197    
3198         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
3199         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
3200         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
3201         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
3202         points:         points:
3203    
3204           (?C1)abc(?C2)def           (?C1)abc(?C2)def
3205    
3206           If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
3207           automatically installed before each item in the pattern. They  are  all
3208           numbered 255.
3209    
3210         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
3211         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
3212         of the callout, and, optionally, one item of data  originally  supplied         of  the callout, the position in the pattern, and, optionally, one item
3213         by  the  caller of pcre_exec(). The callout function may cause matching         of data originally supplied by the caller of pcre_exec().  The  callout
3214         to backtrack, or to fail altogether.  A  complete  description  of  the         function  may cause matching to proceed, to backtrack, or to fail alto-
3215         interface  to the callout function is given in the pcrecallout documen-         gether. A complete description of the interface to the callout function
3216         tation.         is given in the pcrecallout documentation.
3217    
3218  Last updated: 03 February 2003  Last updated: 09 September 2004
3219  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
3220    -----------------------------------------------------------------------------
3221    
3222    PCRE(3)                                                                PCRE(3)
3223    
3224    
3225    
3226    NAME
3227           PCRE - Perl-compatible regular expressions
3228    
3229    PARTIAL MATCHING IN PCRE
3230    
3231           In  normal  use  of  PCRE,  if  the  subject  string  that is passed to
3232           pcre_exec() matches as far as it goes, but is too short  to  match  the
3233           entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
3234           where it might be helpful to distinguish this case from other cases  in
3235           which there is no match.
3236    
3237           Consider, for example, an application where a human is required to type
3238           in data for a field with specific formatting requirements.  An  example
3239           might be a date in the form ddmmmyy, defined by this pattern:
3240    
3241             ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
3242    
3243           If the application sees the user's keystrokes one by one, and can check
3244           that what has been typed so far is potentially valid,  it  is  able  to
3245           raise  an  error as soon as a mistake is made, possibly beeping and not
3246           reflecting the character that has been typed. This  immediate  feedback
3247           is  likely  to  be a better user interface than a check that is delayed
3248           until the entire string has been entered.
3249    
3250           PCRE supports the concept of partial matching by means of the PCRE_PAR-
3251           TIAL  option,  which  can be set when calling pcre_exec(). When this is
3252           done,  the   return   code   PCRE_ERROR_NOMATCH   is   converted   into
3253           PCRE_ERROR_PARTIAL  if  at  any  time  during  the matching process the
3254           entire subject string matched part of the pattern. No captured data  is
3255           set when this occurs.
3256    
3257           Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
3258           the last literal byte in a pattern, and abandons  matching  immediately
3259           if  such a byte is not present in the subject string. This optimization
3260           cannot be used for a subject string that might match only partially.
3261    
3262    
3263    RESTRICTED PATTERNS FOR PCRE_PARTIAL
3264    
3265           Because of the way certain internal optimizations  are  implemented  in
3266           PCRE,  the  PCRE_PARTIAL  option  cannot  be  used  with  all patterns.
3267           Repeated single characters such as
3268    
3269             a{2,4}
3270    
3271           and repeated single metasequences such as
3272    
3273             \d+
3274    
3275           are not permitted if the maximum number of occurrences is greater  than
3276           one.  Optional items such as \d? (where the maximum is one) are permit-
3277           ted.  Quantifiers with any values are permitted after  parentheses,  so
3278           the invalid examples above can be coded thus:
3279    
3280             (a){2,4}
3281             (\d)+
3282    
3283           These  constructions  run more slowly, but for the kinds of application
3284           that are envisaged for this facility, this is not felt to  be  a  major
3285           restriction.
3286    
3287           If  PCRE_PARTIAL  is  set  for  a  pattern that does not conform to the
3288           restrictions, pcre_exec() returns the error code  PCRE_ERROR_BADPARTIAL
3289           (-13).
3290    
3291    
3292    EXAMPLE OF PARTIAL MATCHING USING PCRETEST
3293    
3294           If  the  escape  sequence  \P  is  present in a pcretest data line, the
3295           PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
3296           uses the date example quoted above:
3297    
3298               re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
3299             data> 25jun04\P
3300              0: 25jun04
3301              1: jun
3302             data> 25dec3\P
3303             Partial match
3304             data> 3ju\P
3305             Partial match
3306             data> 3juj\P
3307             No match
3308             data> j\P
3309             No match
3310    
3311           The  first  data  string  is  matched completely, so pcretest shows the
3312           matched substrings. The remaining four strings do not  match  the  com-
3313           plete pattern, but the first two are partial matches.
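The three possible outcomes (complete match, partial match, no match) can be illustrated with a hand-written checker for the ddmmmyy date format. This is a minimal sketch that does not call PCRE: the names classify_date and date_result are hypothetical, and the way the sketch treats an empty subject (as a partial match) is an assumption rather than documented PCRE behaviour.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum date_result { DATE_NO_MATCH, DATE_PARTIAL, DATE_FULL };

static const char *const months[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Classify the text after the day digits: a month name, then two digits. */
static enum date_result match_tail(const char *s)
{
    size_t len = strlen(s);
    size_t i, j;
    enum date_result best = DATE_NO_MATCH;

    for (i = 0; i < 12; i++) {
        if (len < 3) {
            /* The subject ran out inside the month name. */
            if (strncmp(s, months[i], len) == 0) best = DATE_PARTIAL;
            continue;
        }
        if (strncmp(s, months[i], 3) != 0) continue;
        if (len - 3 > 2) return DATE_NO_MATCH;       /* too many characters */
        for (j = 3; j < len; j++)
            if (!isdigit((unsigned char)s[j])) return DATE_NO_MATCH;
        return (len == 5) ? DATE_FULL : DATE_PARTIAL;
    }
    return best;
}

/* Try both readings of \d?\d (one or two day digits) and keep the best. */
enum date_result classify_date(const char *s)
{
    enum date_result best = DATE_NO_MATCH;
    size_t days, i;

    assert(s != NULL);
    for (days = 1; days <= 2; days++) {
        enum date_result r = DATE_NO_MATCH;

        for (i = 0; i < days; i++) {
            if (s[i] == '\0') { r = DATE_PARTIAL; break; }  /* ran out */
            if (!isdigit((unsigned char)s[i])) break;       /* mismatch */
        }
        if (i == days) r = match_tail(s + days);
        if (r > best) best = r;
    }
    return best;
}
```

An application that validates keystrokes one by one could call such a function after every character and reject the keystroke as soon as the result is DATE_NO_MATCH.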
3314    
3315    Last updated: 08 September 2004
3316    Copyright (c) 1997-2004 University of Cambridge.
3317    -----------------------------------------------------------------------------
3318    
3319    PCRE(3)                                                                PCRE(3)
3320    
3321    
3322    
3323    NAME
3324           PCRE - Perl-compatible regular expressions
3325    
3326    SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
3327    
3328           If  you  are running an application that uses a large number of regular
3329           expression patterns, it may be useful to store them  in  a  precompiled
3330           form  instead  of  having to compile them every time the application is
3331           run.  If you are not  using  any  private  character  tables  (see  the
3332           pcre_maketables()  documentation),  this is relatively straightforward.
3333           If you are using private tables, it is a little bit more complicated.
3334    
3335           If you save compiled patterns to a file, you can copy them to a differ-
3336           ent  host  and  run them there. This works even if the new host has the
3337           opposite endianness to the one on which  the  patterns  were  compiled.
3338           There  may  be a small performance penalty, but it should be insignifi-
3339           cant.
3340    
3341    
3342    SAVING A COMPILED PATTERN
3343           The value returned by pcre_compile() points to a single block of memory
3344           that  holds  the compiled pattern and associated data. You can find the
3345           length of this block in bytes by calling pcre_fullinfo() with an  argu-
3346           ment  of  PCRE_INFO_SIZE. You can then save the data in any appropriate
3347           manner. Here is sample code that compiles a pattern and writes it to  a
3348           file. It assumes that the variable fd refers to a file that is open for
3349           output:
3350    
3351             int erroroffset, rc; size_t size;
3352             const char *error;
3353             pcre *re;
3354    
3355             re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
3356             if (re == NULL) { ... handle errors ... }
3357             rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
3358             if (rc < 0) { ... handle errors ... }
3359             size_t written = fwrite(re, 1, size, fd);
3360             if (written != size) { ... handle errors ... }
3361    
3362           In this example, the bytes  that  comprise  the  compiled  pattern  are
3363           copied  exactly.  Note that this is binary data that may contain any of
3364           the 256 possible byte  values.  On  systems  that  make  a  distinction
3365           between binary and non-binary data, be sure that the file is opened for
3366           binary output.
3367    
3368           If you want to write more than one pattern to a file, you will have  to
3369           devise  a  way of separating them. For binary data, preceding each pat-
3370           tern with its length is probably  the  most  straightforward  approach.
3371           Another  possibility is to write out the data in hexadecimal instead of
3372           binary, one pattern to a line.
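One way to implement the length-prefix approach is sketched below. It treats each compiled pattern as an opaque block of bytes, so it uses only the standard C library and does not depend on PCRE. The function names write_record and read_record are hypothetical, and the choice of a 4-byte big-endian length is an assumption made so that the file layout does not depend on the host's byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Write one record: a 4-byte big-endian length, then the raw bytes.
   Returns 0 on success, -1 on a write error. */
int write_record(FILE *f, const void *data, uint32_t size)
{
    unsigned char hdr[4];

    assert(f != NULL && (data != NULL || size == 0));
    hdr[0] = (unsigned char)(size >> 24);
    hdr[1] = (unsigned char)(size >> 16);
    hdr[2] = (unsigned char)(size >> 8);
    hdr[3] = (unsigned char)size;
    if (fwrite(hdr, 1, 4, f) != 4) return -1;
    if (size > 0 && fwrite(data, 1, size, f) != size) return -1;
    return 0;
}

/* Read one record into a malloc()ed buffer and store its length in *size.
   Returns NULL at end of file or on error; the caller frees the buffer. */
void *read_record(FILE *f, uint32_t *size)
{
    unsigned char hdr[4];
    void *buf;

    assert(f != NULL && size != NULL);
    if (fread(hdr, 1, 4, f) != 4) return NULL;
    *size = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
            ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];
    buf = malloc(*size > 0 ? *size : 1);
    if (buf == NULL) return NULL;
    if (*size > 0 && fread(buf, 1, *size, f) != *size) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

In an application, the data passed to write_record() would be the block returned by pcre_compile() and the size obtained via PCRE_INFO_SIZE; the file must be opened in binary mode on systems that make the distinction.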
3373    
3374           Saving compiled patterns in a file is only one possible way of  storing
3375           them  for later use. They could equally well be saved in a database, or
3376           in the memory of some daemon process that passes them  via  sockets  to
3377           the processes that want them.
3378    
3379           If  the pattern has been studied, it is also possible to save the study
3380           data in a similar way to the compiled  pattern  itself.  When  studying
3381           generates  additional  information, pcre_study() returns a pointer to a
3382           pcre_extra data block. Its format is defined in the section on matching
3383           a  pattern in the pcreapi documentation. The study_data field points to
3384           the binary study data,  and  this  is  what  you  must  save  (not  the
3385           pcre_extra  block itself). The length of the study data can be obtained
3386           by calling pcre_fullinfo() with  an  argument  of  PCRE_INFO_STUDYSIZE.
3387           Remember  to check that pcre_study() did return a non-NULL value before
3388           trying to save the study data.
3389    
3390    
3391    RE-USING A PRECOMPILED PATTERN
3392    
3393           Re-using a precompiled pattern is straightforward. Having  reloaded  it
3394           into main memory, you pass its pointer to pcre_exec() in the usual way.
3395           This should work even on another host, and even if that  host  has  the
3396           opposite endianness to the one where the pattern was compiled.
3397    
3398           However,  if  you  passed a pointer to custom character tables when the
3399           pattern was compiled (the tableptr  argument  of  pcre_compile()),  you
3400           must now pass a similar pointer to pcre_exec(), because the value saved
3401           with the compiled pattern will obviously be  nonsense.  A  field  in  a
3402           pcre_extra block is used to pass this data, as described in the section
3403           on matching a pattern in the pcreapi documentation.
3404    
3405           If you did not provide custom character tables  when  the  pattern  was
3406           compiled,  the  pointer  in  the compiled pattern is NULL, which causes
3407           pcre_exec() to use PCRE's internal tables. Thus, you  do  not  need  to
3408           take any special action at run time in this case.
3409    
3410           If  you  saved study data with the compiled pattern, you need to create
3411           your own pcre_extra data block and set the study_data field to point to
3412           the  reloaded  study  data. You must also set the PCRE_EXTRA_STUDY_DATA
3413           bit in the flags field to indicate that study  data  is  present.  Then
3414           pass the pcre_extra block to pcre_exec() in the usual way.
3415    
3416    
3417    COMPATIBILITY WITH DIFFERENT PCRE RELEASES
3418    
3419           The  layout  of the control block that is at the start of the data that
3420           makes up a compiled pattern was changed for release 5.0.  If  you  have
3421           any  saved  patterns  that  were compiled with previous releases (not a
3422           facility that was previously advertised), you will  have  to  recompile
3423           them  for  release  5.0. However, from now on, it should be possible to
3424           make changes in a compatible manner.
3425    
3426    Last updated: 10 September 2004
3427    Copyright (c) 1997-2004 University of Cambridge.
3428  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
3429    
3430  PCRE(3)                                                                PCRE(3)  PCRE(3)                                                                PCRE(3)
# Line 2862  PCRE PERFORMANCE Line 3441  PCRE PERFORMANCE
3441         like  [aeiou]  than  a set of alternatives such as (a|e|i|o|u). In gen-         like  [aeiou]  than  a set of alternatives such as (a|e|i|o|u). In gen-
3442         eral, the simplest construction that provides the required behaviour is         eral, the simplest construction that provides the required behaviour is
3443         usually  the  most  efficient.  Jeffrey Friedl's book contains a lot of         usually  the  most  efficient.  Jeffrey Friedl's book contains a lot of
3444         discussion about optimizing regular expressions for  efficient  perfor-         useful general discussion  about  optimizing  regular  expressions  for
3445         mance.         efficient  performance. This document contains a few observations about
3446           PCRE.
3447    
3448           Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
3449           slow,  because PCRE has to scan a structure that contains data for over
3450           fifteen thousand characters whenever it needs a  character's  property.
3451           If  you  can  find  an  alternative pattern that does not use character
3452           properties, it will probably be faster.
3453    
3454         When  a  pattern  begins  with .* not in parentheses, or in parentheses         When a pattern begins with .* not in  parentheses,  or  in  parentheses
3455         that are not the subject of a backreference, and the PCRE_DOTALL option         that are not the subject of a backreference, and the PCRE_DOTALL option
3456         is  set, the pattern is implicitly anchored by PCRE, since it can match         is set, the pattern is implicitly anchored by PCRE, since it can  match
3457         only at the start of a subject string. However, if PCRE_DOTALL  is  not         only  at  the start of a subject string. However, if PCRE_DOTALL is not
3458         set,  PCRE  cannot  make this optimization, because the . metacharacter         set, PCRE cannot make this optimization, because  the  .  metacharacter
3459         does not then match a newline, and if the subject string contains  new-         does  not then match a newline, and if the subject string contains new-
3460         lines,  the  pattern may match from the character immediately following         lines, the pattern may match from the character  immediately  following
3461         one of them instead of from the very start. For example, the pattern         one of them instead of from the very start. For example, the pattern
3462    
3463           .*second           .*second
3464    
3465         matches the subject "first\nand second" (where \n stands for a  newline         matches  the subject "first\nand second" (where \n stands for a newline
3466         character),  with the match starting at the seventh character. In order         character), with the match starting at the seventh character. In  order
3467         to do this, PCRE has to retry the match starting after every newline in         to do this, PCRE has to retry the match starting after every newline in
3468         the subject.         the subject.
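The retry-after-each-newline behaviour can be modelled in a few lines of plain C. This is only a sketch of the idea, not PCRE's implementation: the function name dot_star_match_start is hypothetical, and it handles just the shape ".*literal" with the dot not matching newlines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the offset at which ".*literal" would start matching when "."
   does not match a newline, or -1 if there is no match. Candidate start
   points are the start of the subject and the character after each
   newline. */
long dot_star_match_start(const char *subject, const char *literal)
{
    const char *line = subject;
    size_t litlen;

    assert(subject != NULL && literal != NULL);
    litlen = strlen(literal);
    while (line != NULL) {
        const char *eol = strchr(line, '\n');
        size_t len = (eol != NULL) ? (size_t)(eol - line) : strlen(line);
        size_t i;

        /* ".*" cannot cross a newline, so look for the literal only
           within the current line. */
        for (i = 0; i + litlen <= len; i++)
            if (memcmp(line + i, literal, litlen) == 0)
                return (long)(line - subject);
        line = (eol != NULL) ? eol + 1 : NULL;   /* retry after the newline */
    }
    return -1;
}
```

For the subject "first\nand second" and the literal "second", the function returns offset 6, that is, the seventh character.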
3469    
3470         If  you  are using such a pattern with subject strings that do not con-         If you are using such a pattern with subject strings that do  not  con-
3471         tain newlines, the best performance is obtained by setting PCRE_DOTALL,         tain newlines, the best performance is obtained by setting PCRE_DOTALL,
3472         or  starting  the pattern with ^.* to indicate explicit anchoring. That         or starting the pattern with ^.* to indicate explicit  anchoring.  That
3473         saves PCRE from having to scan along the subject looking for a  newline         saves  PCRE from having to scan along the subject looking for a newline
3474         to restart at.         to restart at.
3475    
3476         Beware  of  patterns  that contain nested indefinite repeats. These can         Beware of patterns that contain nested indefinite  repeats.  These  can
3477         take a long time to run when applied to a string that does  not  match.         take  a  long time to run when applied to a string that does not match.
3478         Consider the pattern fragment         Consider the pattern fragment
3479    
3480           (a+)*           (a+)*
3481    
3482         This  can  match "aaaa" in 16 different ways, and this number increases         This can match "aaaa" in 16 different ways, and this  number  increases
3483         very rapidly as the string gets longer. (The * repeat can match  0,  1,         very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
3484         2,  3,  or  4  times,  and  for each of those cases other than 0, the +         2, 3, or 4 times, and for each of those  cases  other  than  0,  the  +
3485         repeats can match different numbers of times.) When  the  remainder  of         repeats  can  match  different numbers of times.) When the remainder of
3486         the pattern is such that the entire match is going to fail, PCRE has in         the pattern is such that the entire match is going to fail, PCRE has in
3487         principle to try  every  possible  variation,  and  this  can  take  an         principle  to  try  every  possible  variation,  and  this  can take an
3488         extremely long time.         extremely long time.
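The growth in the number of variations can be checked with a short counting program. The sketch below counts one particular thing: for a subject of n "a" characters, the number of distinct ways in which (a+)* can consume some prefix, where each way is a different split of that prefix into non-empty runs (one run per iteration of the outer repeat). The function names are hypothetical, and totals obtained with a different counting convention will differ.

```c
#include <assert.h>

/* Number of ways to split exactly n characters into an ordered sequence
   of non-empty runs; n == 0 counts once (zero iterations of the *). */
unsigned long ways_exact(unsigned n)
{
    unsigned long total = 0;
    unsigned first;

    assert(n < 25);   /* keep the exponential recursion affordable */
    if (n == 0) return 1;
    /* Choose the length of the first a+ run, then split the remainder. */
    for (first = 1; first <= n; first++)
        total += ways_exact(n - first);
    return total;
}

/* Number of ways (a+)* can match any prefix of a string of n "a"s. */
unsigned long ways_any_prefix(unsigned n)
{
    unsigned long total = 0;
    unsigned k;

    for (k = 0; k <= n; k++)
        total += ways_exact(k);
    return total;
}
```

Under this convention the total doubles with every additional character, which is why a failing match against even a moderately long run of "a" characters becomes very slow.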
3489    
3490         An optimization catches some of the more simple cases such as         An optimization catches some of the more simple cases such as
3491    
3492           (a+)*b           (a+)*b
3493    
3494         where  a  literal  character  follows. Before embarking on the standard         where a literal character follows. Before  embarking  on  the  standard
3495         matching procedure, PCRE checks that there is a "b" later in  the  sub-         matching  procedure,  PCRE  checks  that  there  is  a "b" later in the
3496         ject  string, and if there is not, it fails the match immediately. How-         subject string, and if there is not, it fails  the  match  immediately.
3497         ever, when there is no following literal this  optimization  cannot  be         However, when there is no following literal this optimization cannot be
3498         used. You can see the difference by comparing the behaviour of         used. You can see the difference by comparing the behaviour of
3499    
3500           (a+)*\d           (a+)*\d
3501    
3502         with  the  pattern  above.  The former gives a failure almost instantly         with the pattern above. The former gives  a  failure  almost  instantly
3503         when applied to a whole line of  "a"  characters,  whereas  the  latter         when  applied  to  a  whole  line of "a" characters, whereas the latter
3504         takes an appreciable time with strings longer than about 20 characters.         takes an appreciable time with strings longer than about 20 characters.
3505    
3506  Last updated: 03 February 2003         In many cases, the solution to this kind of performance issue is to use
3507  Copyright (c) 1997-2003 University of Cambridge.         an atomic group or a possessive quantifier.
3508    
3509    Last updated: 09 September 2004
3510    Copyright (c) 1997-2004 University of Cambridge.
3511  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
3512    
3513  PCRE(3)                                                                PCRE(3)  PCRE(3)                                                                PCRE(3)
# Line 2929  NAME Line 3518  NAME
3518         PCRE - Perl-compatible regular expressions.         PCRE - Perl-compatible regular expressions.
3519    
3520  SYNOPSIS OF POSIX API  SYNOPSIS OF POSIX API
3521    
3522         #include <pcreposix.h>         #include <pcreposix.h>
3523    
3524         int regcomp(regex_t *preg, const char *pattern,         int regcomp(regex_t *preg, const char *pattern,
# Line 2947  DESCRIPTION Line 3537  DESCRIPTION
3537    
3538         This  set  of  functions provides a POSIX-style API to the PCRE regular         This  set  of  functions provides a POSIX-style API to the PCRE regular
3539         expression package. See the pcreapi documentation for a description  of         expression package. See the pcreapi documentation for a description  of
3540         the native API, which contains additional functionality.         PCRE's native API, which contains additional functionality.
3541    
3542         The functions described here are just wrapper functions that ultimately         The functions described here are just wrapper functions that ultimately
3543         call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the         call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
3544         pcreposix.h  header  file,  and  on  Unix systems the library itself is         pcreposix.h  header  file,  and  on  Unix systems the library itself is
3545         called pcreposix.a, so can be accessed by  adding  -lpcreposix  to  the         called pcreposix.a, so can be accessed by  adding  -lpcreposix  to  the
3546         command  for  linking an application which uses them. Because the POSIX         command  for  linking  an application that uses them. Because the POSIX
3547         functions call the native ones, it is also necessary to add -lpcre.         functions call the native ones, it is also necessary to add -lpcre.
3548    
3549         I have implemented only those option bits that can be reasonably mapped         I have implemented only those option bits that can be reasonably mapped
# Line 2985  COMPILING A PATTERN Line 3575  COMPILING A PATTERN
3575         The  function regcomp() is called to compile a pattern into an internal         The  function regcomp() is called to compile a pattern into an internal
3576         form. The pattern is a C string terminated by a  binary  zero,  and  is         form. The pattern is a C string terminated by a  binary  zero,  and  is
3577         passed  in  the  argument  pattern. The preg argument is a pointer to a         passed  in  the  argument  pattern. The preg argument is a pointer to a
3578         regex_t structure which is used as a base for storing information about         regex_t structure that is used as a base for storing information  about
3579         the compiled expression.         the compiled expression.
3580    
3581         The argument cflags is either zero, or contains one or more of the bits         The argument cflags is either zero, or contains one or more of the bits
# Line 3036  MATCHING NEWLINE CHARACTERS Line 3626  MATCHING NEWLINE CHARACTERS
3626    
3627                                   Default   Change with                                   Default   Change with
3628    
3629           . matches newline          yes      REG_NEWLINE           . matches newline          yes    REG_NEWLINE
3630           newline matches [^a]       yes      REG_NEWLINE           newline matches [^a]       yes    REG_NEWLINE
3631           $ matches \n at end        no       REG_NEWLINE           $ matches \n at end        no     REG_NEWLINE
3632           $ matches \n in middle     no       REG_NEWLINE           $ matches \n in middle     no     REG_NEWLINE
3633           ^ matches \n in middle     no       REG_NEWLINE           ^ matches \n in middle     no     REG_NEWLINE
3634    
3635         PCRE's behaviour is the same as Perl's, except that there is no equiva-         PCRE's behaviour is the same as Perl's, except that there is no equiva-
3636         lent for PCRE_DOLLARENDONLY in Perl. In both PCRE and Perl, there is no         lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,  there  is
3637         way to stop newline from matching [^a].         no way to stop newline from matching [^a].
3638    
3639         The   default  POSIX  newline  handling  can  be  obtained  by  setting         The   default  POSIX  newline  handling  can  be  obtained  by  setting
3640         PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way  to  make  PCRE         PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to  make  PCRE
3641         behave exactly as for the REG_NEWLINE action.         behave exactly as for the REG_NEWLINE action.
3642    
3643    
3644  MATCHING A PATTERN  MATCHING A PATTERN
3645    
3646         The  function  regexec() is called to match a pre-compiled pattern preg         The  function  regexec()  is  called  to  match a compiled pattern preg
3647         against a given string, which is terminated by a zero byte, subject  to         against a given string, which is terminated by a zero byte, subject  to
3648         the options in eflags. These can be:         the options in eflags. These can be:
3649    
# Line 3092  ERROR MESSAGES Line 3682  ERROR MESSAGES
3682         tion is the size of buffer needed to hold the whole message.         tion is the size of buffer needed to hold the whole message.
3683    
3684    
3685  STORAGE  MEMORY USAGE
3686    
3687         Compiling a regular expression causes memory to be allocated and  asso-         Compiling a regular expression causes memory to be allocated and  asso-
3688         ciated  with  the preg structure. The function regfree() frees all such         ciated  with  the preg structure. The function regfree() frees all such
# Line 3106  AUTHOR Line 3696  AUTHOR
3696         University Computing Service,         University Computing Service,
3697         Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
3698    
3699  Last updated: 03 February 2003  Last updated: 07 September 2004
3700  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
3701  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
3702    
3703  PCRE(3)                                                                PCRE(3)  PCRE(3)                                                                PCRE(3)
# Line 3134  PCRE SAMPLE PROGRAM Line 3724  PCRE SAMPLE PROGRAM
3724         bility  of  matching an empty string. Comments in the code explain what         bility  of  matching an empty string. Comments in the code explain what
3725         is going on.         is going on.
3726    
3727         On a Unix system that has PCRE installed in /usr/local, you can compile         If PCRE is installed in the standard include  and  library  directories
3728         the demonstration program using a command like this:         for  your  system, you should be able to compile the demonstration pro-
3729           gram using this command:
3730    
3731             gcc -o pcredemo pcredemo.c -lpcre
3732    
3733           If PCRE is installed elsewhere, you may need to add additional  options
3734           to  the  command line. For example, on a Unix-like system that has PCRE
3735           installed in /usr/local, you  can  compile  the  demonstration  program
3736           using a command like this:
3737    
3738           gcc -o pcredemo pcredemo.c -I/usr/local/include \           gcc -o pcredemo -I/usr/local/include pcredemo.c \
3739               -L/usr/local/lib -lpcre               -L/usr/local/lib -lpcre
3740    
3741         Then you can run simple tests like this:         Once  you  have  compiled the demonstration program, you can run simple
3742           tests like this:
3743    
3744           ./pcredemo 'cat|dog' 'the cat sat on the mat'           ./pcredemo 'cat|dog' 'the cat sat on the mat'
3745           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3746    
3747         Note  that  there  is  a  much  more comprehensive test program, called         Note that there is a  much  more  comprehensive  test  program,  called
3748         pcretest, which supports  many  more  facilities  for  testing  regular         pcretest,  which  supports  many  more  facilities  for testing regular
3749         expressions and the PCRE library. The pcredemo program is provided as a         expressions and the PCRE library. The pcredemo program is provided as a
3750         simple coding example.         simple coding example.
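
       As a rough companion to the command-line tests above, the following is
       a minimal sketch of the compile / match / free cycle. It is not taken
       from pcredemo.c, which uses the native PCRE API. It uses the POSIX-style
       calls regcomp(), regexec() and regfree() described in the preceding
       section, and it assumes the system <regex.h> header with REG_EXTENDED
       rather than pcreposix.h, so the pattern is interpreted as a POSIX
       extended expression, not with Perl semantics. The helper name
       first_match() and its return conventions are hypothetical, chosen only
       for illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of PCRE or pcredemo.c).
 * Compiles `pattern`, matches it once against the zero-terminated
 * `subject`, and releases the compiled pattern again.
 *
 * Returns the byte offset of the first match and stores its length
 * in *len; returns -1 if there is no match, or -2 if the pattern
 * fails to compile. *len is left untouched in both failure cases.
 */
static int first_match(const char *pattern, const char *subject, int *len)
{
    regex_t preg;          /* compiled pattern; owns allocated memory */
    regmatch_t pmatch[1];  /* slot 0 receives the whole-match offsets */
    int start = -1;

    /* Assumption: system <regex.h>, so extended syntax is requested
       explicitly to make the | alternation operator available. */
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -2;

    if (regexec(&preg, subject, 1, pmatch, 0) == 0) {
        start = (int)pmatch[0].rm_so;
        *len = (int)(pmatch[0].rm_eo - pmatch[0].rm_so);
    }

    /* Compiling allocated memory associated with preg; free it. */
    regfree(&preg);
    return start;
}
```

       Calling first_match("cat|dog", "the cat sat on the mat", &len) mirrors
       the first pcredemo test shown above. Pairing every successful regcomp()
       with a regfree() before returning keeps the helper from leaking the
       memory associated with the preg structure.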
3751    
3752         On some operating systems (e.g. Solaris) you may get an error like this         On some operating systems (e.g. Solaris), when PCRE is not installed in
3753         when you try to run pcredemo:         the standard library directory, you may get an error like this when you
3754           try to run pcredemo:
3755    
3756           ld.so.1:  a.out:  fatal:  libpcre.so.0:  open failed: No such file or           ld.so.1:  a.out:  fatal:  libpcre.so.0:  open failed: No such file or
3757         directory         directory
# Line 3161  PCRE SAMPLE PROGRAM Line 3761  PCRE SAMPLE PROGRAM
3761    
3762           -R/usr/local/lib           -R/usr/local/lib
3763    
3764         to the compile command to get round this problem.         (for example) to the compile command to get round this problem.
3765    
3766  Last updated: 28 January 2003  Last updated: 09 September 2004
3767  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
3768  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
3769    
