Diff of /code/trunk/doc/pcre.txt

revision 41 by nigel, Sat Feb 24 21:39:17 2007 UTC revision 71 by nigel, Sat Feb 24 21:40:24 2007 UTC
# Line 1  Line 1 
1    This file contains a concatenation of the PCRE man pages, converted to plain
2    text format for ease of searching with a text editor, or for use on systems
3    that do not have a man page processor. The small individual files that give
4    synopses of each function in the library have not been included. There are
5    separate text files for the pcregrep and pcretest commands.
6    -----------------------------------------------------------------------------
7    
8    NAME
9         PCRE - Perl-compatible regular expressions
10    
11    
12    DESCRIPTION
13    
14         The PCRE library is a set of functions that implement  regu-
15         lar  expression  pattern  matching using the same syntax and
16         semantics as Perl, with just a few differences. The  current
17         implementation  of  PCRE  (release 4.x) corresponds approxi-
18         mately with Perl 5.8, including support  for  UTF-8  encoded
19         strings.    However,  this  support  has  to  be  explicitly
20         enabled; it is not the default.
21    
22         PCRE is written in C and released as a C library. However, a
23         number  of  people  have  written wrappers and interfaces of
24         various kinds. A C++ class is included  in  these  contribu-
25         tions,  which  can  be found in the Contrib directory at the
26         primary FTP site, which is:
27    
28         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
29    
30         Details of exactly which Perl  regular  expression  features
31         are  and  are  not  supported  by PCRE are given in separate
32         documents. See the pcrepattern and pcrecompat pages.
33    
34         Some features of PCRE can be included, excluded, or  changed
35         when  the library is built. The pcre_config() function makes
36         it possible for a client  to  discover  which  features  are
37         available.  Documentation  about  building  PCRE for various
38         operating systems can be found in the  README  file  in  the
39         source distribution.
40    
41    
42    USER DOCUMENTATION
43    
44         The user documentation for PCRE has been  split  up  into  a
45         number  of  different sections. In the "man" format, each of
46         these is a separate "man page". In the HTML format, each  is
47         a  separate  page,  linked from the index page. In the plain
48         text format, all the sections are concatenated, for ease  of
49         searching. The sections are as follows:
50    
51           pcre              this document
52           pcreapi           details of PCRE's native API
53           pcrebuild         options for building PCRE
54           pcrecallout       details of the callout feature
55           pcrecompat        discussion of Perl compatibility
56           pcregrep          description of the pcregrep command
57           pcrepattern       syntax and semantics of supported
58                               regular expressions
59           pcreperform       discussion of performance issues
60           pcreposix         the POSIX-compatible API
61           pcresample        discussion of the sample program
62           pcretest          the pcretest testing command
63    
64         In addition, in the "man" and HTML formats, there is a short
65         page  for  each  library function, listing its arguments and
66         results.
67    
68    
69    LIMITATIONS
70    
71         There are some size limitations in PCRE but it is hoped that
72         they will never in practice be relevant.
73    
74         The maximum length of a  compiled  pattern  is  65539  (sic)
75         bytes  if PCRE is compiled with the default internal linkage
76         size of 2. If you want to process regular  expressions  that
77         are  truly  enormous,  you can compile PCRE with an internal
78         linkage size of 3 or 4 (see the README file  in  the  source
79         distribution  and  the pcrebuild documentation for details).
80         In these cases the limit is substantially larger.   However,
81         the speed of execution will be slower.
82    
83         All values in repeating quantifiers must be less than 65536.
84         The maximum number of capturing subpatterns is 65535.
85    
86         There is no limit to the  number  of  non-capturing  subpat-
87         terns,  but  the  maximum  depth  of nesting of all kinds of
88         parenthesized subpattern, including  capturing  subpatterns,
89         assertions, and other types of subpattern, is 200.
90    
91         The maximum length of a subject string is the largest  posi-
92         tive number that an integer variable can hold. However, PCRE
93         uses recursion to handle subpatterns and indefinite  repeti-
94         tion.  This  means  that the available stack space may limit
95         the size of a subject string that can be processed  by  cer-
96         tain patterns.
97    
98    
99    UTF-8 SUPPORT
100    
101         Starting at release 3.3, PCRE has had some support for char-
102         acter  strings  encoded in the UTF-8 format. For release 4.0
103         this has been greatly extended to cover most common require-
104         ments.
105    
106         In order to process UTF-8 strings, you must  build  PCRE  to
107         include  UTF-8  support  in  the code, and, in addition, you
108         must call pcre_compile() with  the  PCRE_UTF8  option  flag.
109         When  you  do this, both the pattern and any subject strings
110         that are matched against it are  treated  as  UTF-8  strings
111         instead of just strings of bytes.
112    
113         If you compile PCRE with UTF-8 support, but do not use it at
114         run  time,  the  library will be a bit bigger, but the addi-
115         tional run time overhead is limited to testing the PCRE_UTF8
116         flag in several places, so it should not be very large.
117    
118         The following comments apply when PCRE is running  in  UTF-8
119         mode:
120    
121         1. When you set the PCRE_UTF8 flag, the  strings  passed  as
122         patterns  and  subjects are checked for validity on entry to
123         the relevant  functions.  If  an  invalid  UTF-8  string  is
124         passed,  an  error  return is given. In some situations, you
125         may already know that your strings are valid, and  therefore
126         want  to  skip these checks in order to improve performance.
127         If you set the PCRE_NO_UTF8_CHECK flag at compile time or at
128         run  time,  PCRE  assumes  that the pattern or subject it is
129         given (respectively) contains only  valid  UTF-8  codes.  In
130         this  case, it does not diagnose an invalid UTF-8 string. If
131         you  pass   an   invalid   UTF-8   string   to   PCRE   when
132         PCRE_NO_UTF8_CHECK  is  set, the results are undefined. Your
133         program may crash.
134    
135         2. In a pattern, the escape sequence \x{...}, where the con-
136         tents  of  the  braces is a string of hexadecimal digits, is
137         interpreted as a UTF-8 character whose code  number  is  the
138         given  hexadecimal  number, for example: \x{1234}. If a non-
139         hexadecimal digit appears between the braces,  the  item  is
140         not  recognized.  This escape sequence can be used either as
141         a literal, or within a character class.
142    
143         3. The original hexadecimal escape sequence, \xhh, matches a
144         two-byte UTF-8 character if the value is greater than 127.
145    
146         4. Repeat quantifiers apply to  complete  UTF-8  characters,
147         not to individual bytes, for example: \x{100}{3}.
148    
149         5. The dot metacharacter matches one UTF-8 character instead
150         of a single byte.
151    
152         6. The escape sequence \C can be used to match a single byte
153         in UTF-8 mode, but its use can lead to some strange effects.
154    
155         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W
156         correctly test characters of any code value, but the charac-
157         ters that PCRE recognizes as digits, spaces, or word charac-
158         ters  remain  the  same  set as before, all with values less
159         than 256.
160    
161         8. Case-insensitive  matching  applies  only  to  characters
162         whose  values  are  less than 256. PCRE does not support the
163         notion of "case" for higher-valued characters.
164    
165         9. PCRE does not support the use of Unicode tables and  pro-
166         perties or the Perl escapes \p, \P, and \X.
167    
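     As an illustration of comments 2 and 3 above, the following is a
     minimal sketch of how a code number such as the one in \x{1234} maps
     onto UTF-8 bytes. The helper name demo_utf8_encode is hypothetical
     and is not part of PCRE; the sketch also assumes the modern UTF-8
     range (up to 0x10FFFF), which may be narrower than what PCRE itself
     accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point (0..0x10FFFF) as UTF-8 into out[], which must
   have room for at least 4 bytes. Returns the number of bytes written.
   demo_utf8_encode is a hypothetical helper, not a PCRE function. */
size_t demo_utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL && cp <= 0x10FFFFUL);

    if (cp < 0x80UL) {                      /* 7 bits: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                     /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {                   /* 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 21 bits: four bytes */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

     Under this encoding every value from 128 to 255 occupies two bytes,
     which is why \xhh with a value greater than 127 corresponds to a
     two-byte character in UTF-8 mode, and why quantifiers and the dot
     have to count whole characters rather than bytes.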
168    
169    AUTHOR
170    
171         Philip Hazel <ph10@cam.ac.uk>
172         University Computing Service,
173         Cambridge CB2 3QG, England.
174         Phone: +44 1223 334714
175    
176    Last updated: 20 August 2003
177    Copyright (c) 1997-2003 University of Cambridge.
178    -----------------------------------------------------------------------------
179    
180    NAME
181         PCRE - Perl-compatible regular expressions
182    
183    
184    PCRE BUILD-TIME OPTIONS
185    
186         This document describes the optional features of  PCRE  that
187         can  be  selected when the library is compiled. They are all
188         selected, or deselected, by providing options to the config-
189         ure  script  which  is run before the make command. The com-
190         plete list of options  for  configure  (which  includes  the
191         standard  ones  such  as  the  selection of the installation
192         directory) can be obtained by running
193    
194           ./configure --help
195    
196         The following sections describe certain options whose  names
197         begin  with  --enable  or  --disable. These settings specify
198         changes to the defaults for the configure  command.  Because
199         of  the  way  that  configure  works, --enable and --disable
200         always come in pairs, so  the  complementary  option  always
201         exists  as  well, but as it specifies the default, it is not
202         described.
203    
204    
205    UTF-8 SUPPORT
206    
207         To build PCRE with support for UTF-8 character strings, add
208    
209           --enable-utf8
210    
211         to the configure command. Of itself, this does not make PCRE
212         treat  strings as UTF-8. As well as compiling PCRE with this
213         option, you also have to  set  the  PCRE_UTF8  option  when
214         you call the pcre_compile() function.
215    
216    
217    CODE VALUE OF NEWLINE
218    
219         By default, PCRE treats character 10 (linefeed) as the  new-
220         line  character.  This  is  the  normal newline character on
221         Unix-like systems. You can compile PCRE to use character  13
222         (carriage return) instead by adding
223    
224           --enable-newline-is-cr
225    
226         to the configure command. For completeness there is  also  a
227         --enable-newline-is-lf  option,  which  explicitly specifies
228         linefeed as the newline character.
229    
230    
231    BUILDING SHARED AND STATIC LIBRARIES
232    
233         The PCRE building process uses libtool to build both  shared
234         and  static  Unix libraries by default. You can suppress one
235         of these by adding one of
236    
237           --disable-shared
238           --disable-static
239    
240         to the configure command, as required.
241    
242    
243    POSIX MALLOC USAGE
244    
245         When PCRE is called through the  POSIX  interface  (see  the
246         pcreposix  documentation),  additional  working  storage  is
247         required for holding the pointers  to  capturing  substrings
248         because  PCRE requires three integers per substring, whereas
249         the POSIX interface provides only  two.  If  the  number  of
250         expected  substrings  is  small,  the  wrapper function uses
251         space on the stack, because this is faster than  using  mal-
252         loc()  for  each call. The default threshold above which the
253         stack is no longer used is 10; it can be changed by adding a
254         setting such as
255    
256           --with-posix-malloc-threshold=20
257    
258         to the configure command.
259    
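     The stack-versus-malloc() decision described above can be sketched
     as follows. This is only an illustration of the technique, not the
     code of the real POSIX wrapper: the names demo_posix_wrapper and
     DEMO_POSIX_MALLOC_THRESHOLD are hypothetical, and the matching step
     is replaced by a placeholder loop.

```c
#include <assert.h>
#include <stdlib.h>

/* Mirrors the documented default threshold of 10 substrings. */
#define DEMO_POSIX_MALLOC_THRESHOLD 10

/* Hypothetical wrapper: needs three ints of working space per substring.
   Returns 0 if the stack buffer was used, 1 if heap storage was used,
   and -1 if the heap allocation failed. */
int demo_posix_wrapper(size_t nmatch)
{
    int stack_ovector[DEMO_POSIX_MALLOC_THRESHOLD * 3];
    int *ovector = stack_ovector;
    int used_heap = 0;
    size_t i;

    /* Guard against overflow in the size calculation below. */
    assert(nmatch <= (size_t)-1 / (3 * sizeof(int)));

    if (nmatch > DEMO_POSIX_MALLOC_THRESHOLD) {
        ovector = (int *)malloc(nmatch * 3 * sizeof(int));
        if (ovector == NULL) return -1;
        used_heap = 1;
    }

    /* Placeholder for the real match, which would fill the vector. */
    for (i = 0; i < nmatch * 3; i++) ovector[i] = -1;

    if (used_heap) free(ovector);
    return used_heap;
}
```

     A call with nmatch of 10 or fewer stays on the stack; anything
     larger pays for one malloc() and one free() per call, which is the
     cost the threshold setting lets you trade against stack usage.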
260    
261    LIMITING PCRE RESOURCE USAGE
262    
263         Internally, PCRE has a  function  called  match()  which  it
264         calls  repeatedly  (possibly  recursively) when performing a
265         matching operation. By limiting the  number  of  times  this
266         function  may  be  called,  a  limit  can  be  placed on the
267         resources used by a single call to  pcre_exec().  The  limit
268         can  be  changed  at  run  time, as described in the pcreapi
269         documentation. The default is 10 million, but  this  can  be
270         changed by adding a setting such as
271    
272           --with-match-limit=500000
273    
274         to the configure command.
275    
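     To show the idea of bounding work by counting calls, here is a toy
     recursive matcher with a call limit. It is a sketch of the general
     technique only: it is not PCRE's match() function, it supports just
     literal characters, "." and a trailing "*", and all the demo_ names
     and DEMO_ return codes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MATCH       1
#define DEMO_NOMATCH     0
#define DEMO_LIMIT_HIT (-1)

typedef struct {
    unsigned long calls;   /* calls made so far */
    unsigned long limit;   /* maximum number of calls permitted */
} demo_match_state;

/* Match the whole of sub against pat, counting every call. */
static int demo_match(const char *pat, const char *sub, demo_match_state *st)
{
    if (++st->calls > st->limit) return DEMO_LIMIT_HIT;

    if (*pat == '\0')
        return (*sub == '\0') ? DEMO_MATCH : DEMO_NOMATCH;

    if (pat[1] == '*') {
        /* Try zero repetitions first, then one more each time round. */
        for (;;) {
            int rc = demo_match(pat + 2, sub, st);
            if (rc != DEMO_NOMATCH) return rc;   /* match, or limit hit */
            if (*sub == '\0' || (*pat != '.' && *pat != *sub))
                return DEMO_NOMATCH;
            sub++;
        }
    }

    if (*sub != '\0' && (*pat == '.' || *pat == *sub))
        return demo_match(pat + 1, sub + 1, st);

    return DEMO_NOMATCH;
}

/* Entry point: run one match with the given call limit. */
int demo_exec(const char *pat, const char *sub, unsigned long limit)
{
    demo_match_state st;
    assert(pat != NULL && sub != NULL);
    st.calls = 0;
    st.limit = limit;
    return demo_match(pat, sub, &st);
}
```

     The important detail is that the limit result is propagated
     unchanged through every level of recursion, so a runaway match is
     reported as a distinct outcome rather than as a plain failure.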
276    
277    HANDLING VERY LARGE PATTERNS
278    
279         Within a compiled pattern, offset values are used  to  point
280         from  one  part  to  another  (for  example, from an opening
281         parenthesis to an  alternation  metacharacter).  By  default
282         two-byte  values  are  used  for these offsets, leading to a
283         maximum size for a compiled pattern of around 64K.  This  is
284         sufficient  to  handle  all  but the most gigantic patterns.
285         Nevertheless, some people do want to process  enormous  pat-
286         terns,  so  it is possible to compile PCRE to use three-byte
287         or four-byte offsets by adding a setting such as
288    
289           --with-link-size=3
290    
291         to the configure command. The value given must be 2,  3,  or
292         4.  Using  longer  offsets  slows down the operation of PCRE
293         because it has to load additional bytes when handling them.
294    
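     The trade-off between link size and speed can be seen in a small
     sketch of storing and loading an n-byte offset. The helper names
     are hypothetical, and the big-endian byte order used here is an
     assumption made for the illustration; it is not taken from the PCRE
     source.

```c
#include <assert.h>

/* Store value as a link_size-byte big-endian offset at p.
   Bits that do not fit in link_size bytes are silently lost. */
void demo_put_link(unsigned char *p, int link_size, unsigned long value)
{
    int i;
    assert(p != 0 && link_size >= 2 && link_size <= 4);
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Load a link_size-byte big-endian offset from p: one byte load,
   one shift and one OR per byte, so longer links cost more work. */
unsigned long demo_get_link(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    assert(p != 0 && link_size >= 2 && link_size <= 4);
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset that fits in link_size bytes. */
unsigned long demo_max_link(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (link_size == 4) ? 0xFFFFFFFFUL
                            : (1UL << (8 * link_size)) - 1UL;
}
```

     With two bytes the largest representable offset is 65535, which is
     consistent with the "around 64K" figure above; an offset of 70000
     only survives a round trip once the link size is raised to 3.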
295         If you build PCRE with an increased link size, test  2  (and
296         test 5 if you are using UTF-8) will fail. Part of the output
297         of these tests is a representation of the compiled  pattern,
298         and this changes with the link size.
299    
300    Last updated: 21 January 2003
301    Copyright (c) 1997-2003 University of Cambridge.
302    -----------------------------------------------------------------------------
303    
304  NAME  NAME
305       pcre - Perl-compatible regular expressions.       PCRE - Perl-compatible regular expressions
306    
307    
308    SYNOPSIS OF PCRE API
309    
 SYNOPSIS  
310       #include <pcre.h>       #include <pcre.h>
311    
312       pcre *pcre_compile(const char *pattern, int options,       pcre *pcre_compile(const char *pattern, int options,
# Line 17  SYNOPSIS Line 320  SYNOPSIS
320            const char *subject, int length, int startoffset,            const char *subject, int length, int startoffset,
321            int options, int *ovector, int ovecsize);            int options, int *ovector, int ovecsize);
322    
323         int pcre_copy_named_substring(const pcre *code,
324              const char *subject, int *ovector,
325              int stringcount, const char *stringname,
326              char *buffer, int buffersize);
327    
328       int pcre_copy_substring(const char *subject, int *ovector,       int pcre_copy_substring(const char *subject, int *ovector,
329            int stringcount, int stringnumber, char *buffer,            int stringcount, int stringnumber, char *buffer,
330            int buffersize);            int buffersize);
331    
332         int pcre_get_named_substring(const pcre *code,
333              const char *subject, int *ovector,
334              int stringcount, const char *stringname,
335              const char **stringptr);
336    
337         int pcre_get_stringnumber(const pcre *code,
338              const char *name);
339    
340       int pcre_get_substring(const char *subject, int *ovector,       int pcre_get_substring(const char *subject, int *ovector,
341            int stringcount, int stringnumber,            int stringcount, int stringnumber,
342            const char **stringptr);            const char **stringptr);
# Line 28  SYNOPSIS Line 344  SYNOPSIS
344       int pcre_get_substring_list(const char *subject,       int pcre_get_substring_list(const char *subject,
345            int *ovector, int stringcount, const char ***listptr);            int *ovector, int stringcount, const char ***listptr);
346    
347         void pcre_free_substring(const char *stringptr);
348    
349         void pcre_free_substring_list(const char **stringptr);
350    
351       const unsigned char *pcre_maketables(void);       const unsigned char *pcre_maketables(void);
352    
353         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
354              int what, void *where);
355    
356    
357       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
358    
359         int pcre_config(int what, void *where);
360    
361       char *pcre_version(void);       char *pcre_version(void);
362    
363       void *(*pcre_malloc)(size_t);       void *(*pcre_malloc)(size_t);
364    
365       void (*pcre_free)(void *);       void (*pcre_free)(void *);
366    
367         int (*pcre_callout)(pcre_callout_block *);
368    
369    
370    PCRE API
 DESCRIPTION  
      The PCRE library is a set of functions that implement  regu-  
      lar  expression  pattern  matching using the same syntax and  
      semantics as Perl  5,  with  just  a  few  differences  (see  
      below).  The  current  implementation  corresponds  to  Perl  
      5.005.  
371    
372       PCRE has its own native API,  which  is  described  in  this       PCRE has its own native API,  which  is  described  in  this
373       document.  There  is  also  a  set of wrapper functions that       document.  There  is  also  a  set of wrapper functions that
374       correspond to the POSIX API.  These  are  described  in  the       correspond to the POSIX regular expression API.   These  are
375       pcreposix documentation.       described in the pcreposix documentation.
376    
377       The native API function prototypes are defined in the header       The native API function prototypes are defined in the header
378       file  pcre.h,  and  on  Unix  systems  the library itself is       file  pcre.h,  and  on  Unix  systems  the library itself is
379       called libpcre.a, so can be accessed by adding -lpcre to the       called libpcre.a, so can be accessed by adding -lpcre to the
380       command for linking an application which calls it.       command  for  linking  an  application  which  calls it. The
381         header file defines the macros PCRE_MAJOR and PCRE_MINOR  to
382         contain the major and minor release numbers for the library.
383         Applications can use these to include support for  different
384         releases.
385    
386       The functions pcre_compile(), pcre_study(), and  pcre_exec()       The functions pcre_compile(), pcre_study(), and  pcre_exec()
387       are  used  for  compiling  and matching regular expressions,       are  used  for compiling and matching regular expressions. A
388       while   pcre_copy_substring(),   pcre_get_substring(),   and       sample program that demonstrates the simplest way  of  using
389       pcre_get_substring_list()   are  convenience  functions  for       them  is  given in the file pcredemo.c. The pcresample docu-
390       extracting  captured  substrings  from  a  matched   subject       mentation describes how to run it.
391       string.  The function pcre_maketables() is used (optionally)  
392       to build a set of character tables in the current locale for       There are convenience functions for extracting captured sub-
393       passing to pcre_compile().       strings from a matched subject string. They are:
394    
395       The function pcre_info() is used  to  find  out  information         pcre_copy_substring()
396       about  a compiled pattern, while the function pcre_version()         pcre_copy_named_substring()
397       returns a pointer to a string containing the version of PCRE         pcre_get_substring()
398       and its date of release.         pcre_get_named_substring()
399           pcre_get_substring_list()
400    
401         pcre_free_substring()  and  pcre_free_substring_list()   are
402         also  provided,  to  free  the  memory  used  for  extracted
403         strings.
404    
405         The function pcre_maketables() is used (optionally) to build
406         a  set of character tables in the current locale for passing
407         to pcre_compile().
408    
409         The function pcre_fullinfo() is used to find out information
410         about a compiled pattern; pcre_info() is an obsolete version
411         which returns only some of the available information, but is
412         retained   for   backwards   compatibility.    The  function
413         pcre_version() returns a pointer to a string containing  the
414         version of PCRE and its date of release.
415    
416       The global variables  pcre_malloc  and  pcre_free  initially       The global variables  pcre_malloc  and  pcre_free  initially
417       contain the entry points of the standard malloc() and free()       contain the entry points of the standard malloc() and free()
# Line 78  DESCRIPTION Line 420  DESCRIPTION
420       replace them if it  wishes  to  intercept  the  calls.  This       replace them if it  wishes  to  intercept  the  calls.  This
421       should be done before calling any PCRE functions.       should be done before calling any PCRE functions.
422    
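     The pattern of replacing allocator entry points held in global
     function pointers can be sketched without the library itself. In
     this illustration demo_malloc and demo_free are hypothetical
     stand-ins with the same types as pcre_malloc and pcre_free, and the
     counting functions are an invented example of something a caller
     might install.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the library's global hooks. Like the real ones, they
   initially hold the standard malloc() and free(). */
void *(*demo_malloc)(size_t) = malloc;
void (*demo_free)(void *) = free;

static unsigned long demo_live_blocks = 0;

/* Replacement allocator that counts blocks currently in use. */
void *demo_counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) demo_live_blocks++;
    return p;
}

/* Replacement free function that keeps the count in step. */
void demo_counting_free(void *p)
{
    if (p != NULL) {
        assert(demo_live_blocks > 0);
        demo_live_blocks--;
    }
    free(p);
}

/* Install the replacements; do this before any allocation is made. */
void demo_install_hooks(void)
{
    demo_malloc = demo_counting_malloc;
    demo_free = demo_counting_free;
}

unsigned long demo_live(void)
{
    return demo_live_blocks;
}
```

     With the real library the equivalent assignments would target
     pcre_malloc and pcre_free. Installing them before the first PCRE
     call matters because a block obtained through one allocator must
     not be released through a different one.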
423         The global variable pcre_callout initially contains NULL. It
424         can be set by the caller to a "callout" function, which PCRE
425         will then call at specified points during a matching  opera-
426         tion. Details are given in the pcrecallout documentation.
427    
428    
429  MULTI-THREADING  MULTITHREADING
430    
431       The PCRE functions can be used in  multi-threading  applica-       The PCRE functions can be used in  multi-threading  applica-
432       tions, with the proviso that the memory management functions       tions, with the proviso that the memory management functions
433       pointed to by pcre_malloc and pcre_free are  shared  by  all       pointed to by pcre_malloc and  pcre_free,  and  the  callout
434         function  pointed  to  by  pcre_callout,  are  shared by all
435       threads.       threads.
436    
437       The compiled form of a regular  expression  is  not  altered       The compiled form of a regular  expression  is  not  altered
# Line 91  MULTI-THREADING Line 439  MULTI-THREADING
439       used by several threads at once.       used by several threads at once.
440    
441    
442    CHECKING BUILD-TIME OPTIONS
443    
444         int pcre_config(int what, void *where);
445    
446         The function pcre_config() makes  it  possible  for  a  PCRE
447         client  to  discover  which optional features have been com-
448         piled into the PCRE library. The pcrebuild documentation has
449         more details about these optional features.
450    
451         The first argument for pcre_config() is an integer, specify-
452         ing  which information is required; the second argument is a
453         pointer to a variable into which the information is  placed.
454         The following information is available:
455    
456           PCRE_CONFIG_UTF8
457    
458         The output is an integer that is set to one if UTF-8 support
459         is available; otherwise it is set to zero.
460    
461           PCRE_CONFIG_NEWLINE
462    
463         The output is an integer that is set to  the  value  of  the
464         code  that  is  used for the newline character. It is either
465         linefeed (10) or carriage return (13), and  should  normally
466         be the standard character for your operating system.
467    
468           PCRE_CONFIG_LINK_SIZE
469    
470         The output is an integer that contains the number  of  bytes
471         used  for  internal linkage in compiled regular expressions.
472         The value is 2, 3, or 4. Larger values allow larger  regular
473         expressions  to be compiled, at the expense of slower match-
474         ing. The default value of 2 is sufficient for  all  but  the
475         most  massive patterns, since it allows the compiled pattern
476         to be up to 64K in size.
477    
478           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
479    
480         The output is an integer that contains the  threshold  above
481         which  the POSIX interface uses malloc() for output vectors.
482         Further details are given in the pcreposix documentation.
483    
484           PCRE_CONFIG_MATCH_LIMIT
485    
486         The output is an integer that gives the  default  limit  for
487         the   number  of  internal  matching  function  calls  in  a
488         pcre_exec()  execution.  Further  details  are  given   with
489         pcre_exec() below.
490    
491    
492  COMPILING A PATTERN  COMPILING A PATTERN
493    
494         pcre *pcre_compile(const char *pattern, int options,
495              const char **errptr, int *erroffset,
496              const unsigned char *tableptr);
497    
498       The function pcre_compile() is called to compile  a  pattern       The function pcre_compile() is called to compile  a  pattern
499       into  an internal form. The pattern is a C string terminated       into  an internal form. The pattern is a C string terminated
500       by a binary zero, and is passed in the argument  pattern.  A       by a binary zero, and is passed in the argument  pattern.  A
501       pointer  to  a  single  block of memory that is obtained via       pointer  to  a  single  block of memory that is obtained via
502       pcre_malloc is returned. This contains the compiled code and       pcre_malloc is returned. This contains the compiled code and
503       related data. The pcre type is defined for this for conveni-       related  data.  The  pcre  type  is defined for the returned
504       ence, but in fact pcre is just a typedef for void, since the       block; this is a typedef for a structure whose contents  are
505       contents  of  the block are not externally defined. It is up       not  externally  defined. It is up to the caller to free the
506       to the caller to free  the  memory  when  it  is  no  longer       memory when it is no longer required.
507       required.  
508         Although the compiled code of a PCRE regex  is  relocatable,
509       The size of a compiled pattern is  roughly  proportional  to       that is, it does not depend on memory location, the complete
510       the length of the pattern string, except that each character       pcre data block is not fully relocatable,  because  it  con-
511       class (other than those containing just a single  character,       tains  a  copy of the tableptr argument, which is an address
512       negated  or  not)  requires 33 bytes, and repeat quantifiers       (see below).
      with a minimum greater than one or a bounded  maximum  cause  
      the  relevant  portions of the compiled pattern to be repli-  
      cated.  
   
513       The options argument contains independent bits  that  affect       The options argument contains independent bits  that  affect
514       the  compilation.  It  should  be  zero  if  no  options are       the  compilation.  It  should  be  zero  if  no  options are
515       required. Some of the options, in particular, those that are       required. Some of the options, in particular, those that are
516       compatible  with Perl, can also be set and unset from within       compatible  with Perl, can also be set and unset from within
517       the pattern (see the detailed description of regular expres-       the pattern (see the detailed description of regular expres-
518       sions below). For these options, the contents of the options       sions  in the pcrepattern documentation). For these options,
519       argument specifies their initial settings at  the  start  of       the contents of the options argument specifies their initial
520       compilation  and  execution. The PCRE_ANCHORED option can be       settings  at  the  start  of  compilation and execution. The
521       set at the time of matching as well as at compile time.       PCRE_ANCHORED option can be set at the time of  matching  as
522         well as at compile time.
523    
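     Because the options argument is a set of independent bits, values
     are combined with bitwise OR and tested with bitwise AND. The
     sketch below uses DEMO_ constants with illustrative values only;
     real programs should use the PCRE_xxx macros from pcre.h, whose
     values are not assumed here. The helper functions are hypothetical.

```c
#include <assert.h>

/* Illustrative stand-ins; NOT the values defined in pcre.h. */
#define DEMO_CASELESS  0x01
#define DEMO_DOTALL    0x04
#define DEMO_EXTENDED  0x08

/* Return nonzero if exactly one bit is set in bit. */
static int demo_is_single_bit(int bit)
{
    return bit > 0 && (bit & (bit - 1)) == 0;
}

/* Test whether an option bit is present in an options word. */
int demo_has_option(int options, int bit)
{
    assert(demo_is_single_bit(bit));
    return (options & bit) != 0;
}

/* Return options with one bit added. */
int demo_set_option(int options, int bit)
{
    assert(demo_is_single_bit(bit));
    return options | bit;
}

/* Return options with one bit removed. */
int demo_clear_option(int options, int bit)
{
    assert(demo_is_single_bit(bit));
    return options & ~bit;
}
```

     An options word of zero means that no options are requested, which
     matches the "default options" argument in a plain call to
     pcre_compile().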
524       If errptr is NULL, pcre_compile() returns NULL  immediately.       If errptr is NULL, pcre_compile() returns NULL  immediately.
525       Otherwise, if compilation of a pattern fails, pcre_compile()       Otherwise, if compilation of a pattern fails, pcre_compile()
# Line 137  COMPILING A PATTERN Line 536  COMPILING A PATTERN
536       must  be  the result of a call to pcre_maketables(). See the       must  be  the result of a call to pcre_maketables(). See the
537       section on locale support below.       section on locale support below.
538    
539       The following option bits are defined in the header file:       This code fragment shows a typical straightforward  call  to
540         pcre_compile():
541    
542           pcre *re;
543           const char *error;
544           int erroffset;
545           re = pcre_compile(
546             "^A.*Z",          /* the pattern */
547             0,                /* default options */
548             &error,           /* for error message */
549             &erroffset,       /* for error offset */
550             NULL);            /* use default character tables */
551    
552         The following option bits are defined:
553    
554         PCRE_ANCHORED         PCRE_ANCHORED
555    
556       If this bit is set, the pattern is forced to be  "anchored",       If this bit is set, the pattern is forced to be  "anchored",
557       that is, it is constrained to match only at the start of the       that is, it is constrained to match only at the first match-
558       string which is being searched (the "subject string").  This       ing point in the string which is being searched  (the  "sub-
559       effect can also be achieved by appropriate constructs in the       ject string"). This effect can also be achieved by appropri-
560       pattern itself, which is the only way to do it in Perl.       ate constructs in the pattern itself, which is the only  way
561         to do it in Perl.
562    
563         PCRE_CASELESS         PCRE_CASELESS
564    
565       If this bit is set, letters in the pattern match both  upper       If this bit is set, letters in the pattern match both  upper
566       and  lower  case  letters.  It  is  equivalent  to Perl's /i       and  lower  case  letters.  It  is  equivalent  to Perl's /i
567       option.       option, and it can be changed within a  pattern  by  a  (?i)
568         option setting.
569    
570         PCRE_DOLLAR_ENDONLY         PCRE_DOLLAR_ENDONLY
571    
# Line 161  COMPILING A PATTERN Line 575  COMPILING A PATTERN
575       character  if it is a newline (but not before any other new-       character  if it is a newline (but not before any other new-
576       lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if       lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if
577       PCRE_MULTILINE is set. There is no equivalent to this option       PCRE_MULTILINE is set. There is no equivalent to this option
578       in Perl.       in Perl, and no way to set it within a pattern.
579    
580         PCRE_DOTALL         PCRE_DOTALL
581    
582       If this bit is  set,  a  dot  metacharacter  in  the  pattern       If this bit is  set,  a  dot  metacharacter  in  the  pattern
583       matches all characters, including newlines. Without it, new-       matches all characters, including newlines. Without it, new-
584       lines are excluded. This option is equivalent to  Perl's  /s       lines are excluded. This option is equivalent to  Perl's  /s
585       option.  A negative class such as [^a] always matches a new-       option,  and  it  can  be changed within a pattern by a (?s)
586       line character, independent of the setting of this option.       option setting. A negative class such as [^a] always matches
587         a  newline  character,  independent  of  the setting of this
588         option.
589    
590         PCRE_EXTENDED         PCRE_EXTENDED
591    
592       If this bit is set, whitespace data characters in  the  pat-       If this bit is set, whitespace data characters in  the  pat-
593       tern  are  totally  ignored  except when escaped or inside a       tern  are  totally  ignored  except when escaped or inside a
594       character class, and characters between an unescaped #  out-       character class. Whitespace does not include the VT  charac-
595       side  a  character  class  and  the  next newline character,       ter  (code 11). In addition, characters between an unescaped
596         # outside a character class and the next newline  character,
597       inclusive, are also ignored. This is equivalent to Perl's /x       inclusive, are also ignored. This is equivalent to Perl's /x
598       option,  and  makes  it  possible to include comments inside       option, and it can be changed within a  pattern  by  a  (?x)
599       complicated patterns. Note, however, that this applies  only       option setting.
600       to  data  characters. Whitespace characters may never appear  
601         This option makes it possible  to  include  comments  inside
602         complicated patterns.  Note, however, that this applies only
603         to data characters. Whitespace characters may  never  appear
604       within special character sequences in a pattern, for example       within special character sequences in a pattern, for example
605       within  the sequence (?( which introduces a conditional sub-       within the sequence (?( which introduces a conditional  sub-
606       pattern.       pattern.
607    
608         PCRE_EXTRA         PCRE_EXTRA
609    
610       This option turns on additional functionality of  PCRE  that       This option was invented in  order  to  turn  on  additional
611       is  incompatible  with Perl. Any backslash in a pattern that       functionality of PCRE that is incompatible with Perl, but it
612       is followed by a letter that has no special  meaning  causes       is currently of very little use. When set, any backslash  in
613       an  error,  thus  reserving  these  combinations  for future       a  pattern  that is followed by a letter that has no special
614       expansion. By default, as in Perl, a backslash followed by a       meaning causes an error, thus reserving  these  combinations
615       letter  with  no  special  meaning  is treated as a literal.       for  future  expansion.  By default, as in Perl, a backslash
616       There are at present no other features  controlled  by  this       followed by a letter with no special meaning is treated as a
617       option.       literal.  There  are at present no other features controlled
618         by this option. It can also be set by a (?X) option  setting
619         within a pattern.
620    
621         PCRE_MULTILINE         PCRE_MULTILINE
622    
# Line 207  COMPILING A PATTERN Line 629  COMPILING A PATTERN
629       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
630    
631       When PCRE_MULTILINE is set, the "start of line" and  "end       When PCRE_MULTILINE is set, the "start of line" and  "end
632       of   line"   constructs   match   immediately  following  or       of  line"  constructs match immediately following or immedi-
633       immediately  before  any  newline  in  the  subject  string,       ately before any newline  in  the  subject  string,  respec-
634       respectively,  as well as at the very start and end. This is       tively,  as  well  as  at  the  very  start and end. This is
635       equivalent to Perl's /m option. If there are no "\n" charac-       equivalent to Perl's /m option, and it can be changed within
636       ters  in  a subject string, or no occurrences of ^ or $ in a       a  pattern  by  a  (?m) option setting. If there are no "\n"
637       pattern, setting PCRE_MULTILINE has no effect.       characters in a subject string, or no occurrences of ^ or  $
638         in a pattern, setting PCRE_MULTILINE has no effect.
639    
640           PCRE_NO_AUTO_CAPTURE
641    
642         If this option is set, it disables the use of numbered  cap-
643         turing  parentheses  in the pattern. Any opening parenthesis
644         that is not followed by ? behaves as if it were followed  by
645         ?:  but  named  parentheses  can still be used for capturing
646         (and they acquire numbers in the usual  way).  There  is  no
647         equivalent of this option in Perl.
648    
649         PCRE_UNGREEDY         PCRE_UNGREEDY
650    
# Line 221  COMPILING A PATTERN Line 653  COMPILING A PATTERN
653       followed by "?". It is not compatible with Perl. It can also       followed by "?". It is not compatible with Perl. It can also
654       be set by a (?U) option setting within the pattern.       be set by a (?U) option setting within the pattern.
655    
656           PCRE_UTF8
657    
658         This option causes PCRE to regard both the pattern  and  the
659         subject  as  strings  of UTF-8 characters instead of single-
660         byte character strings. However, it  is  available  only  if
661         PCRE  has  been  built to include UTF-8 support. If not, the
662         use of this option provokes an error. Details  of  how  this
663         option  changes  the behaviour of PCRE are given in the sec-
664         tion on UTF-8 support in the main pcre page.
665    
666           PCRE_NO_UTF8_CHECK
667    
668         When PCRE_UTF8 is set, the validity  of  the  pattern  as  a
669         UTF-8  string  is automatically checked. If an invalid UTF-8
670         sequence of bytes is found, pcre_compile() returns an error.
671         If you already know that your pattern is valid, and you want
672         to skip this check for performance reasons, you can set  the
673         PCRE_NO_UTF8_CHECK  option.  When  it  is set, the effect of
674         passing an invalid UTF-8 string as a pattern  is  undefined.
675         It  may  cause  your program to crash.  Note that there is a
676         similar option  for  suppressing  the  checking  of  subject
677         strings passed to pcre_exec().
678    
679    
680    
681  STUDYING A PATTERN  STUDYING A PATTERN
682    
683         pcre_extra *pcre_study(const pcre *code, int options,
684              const char **errptr);
685    
686       When a pattern is going to be  used  several  times,  it  is       When a pattern is going to be  used  several  times,  it  is
687       worth  spending  more time analyzing it in order to speed up       worth  spending  more time analyzing it in order to speed up
688       the time taken for matching. The function pcre_study() takes       the time taken for matching. The function pcre_study() takes
689       a  pointer  to a compiled pattern as its first argument, and       a  pointer  to  a compiled pattern as its first argument. If
690       returns a  pointer  to  a  pcre_extra  block  (another  void       studying the pattern  produces  additional  information  that
691       typedef)  containing  additional  information about the pat-       will  help speed up matching, pcre_study() returns a pointer
692       tern; this can be passed to pcre_exec().  If  no  additional       to a pcre_extra block, in which the study_data field  points
693       information is available, NULL is returned.       to the results of the study.
694    
695         The  returned  value  from  a  pcre_study()  can  be  passed
696         directly  to pcre_exec(). However, the pcre_extra block also
697         contains other fields that can be set by the  caller  before
698         the  block is passed; these are described below. If studying
699         the pattern does not  produce  any  additional  information,
700         pcre_study() returns NULL. In that circumstance, if the cal-
701         ling program wants to pass  some  of  the  other  fields  to
702         pcre_exec(), it must set up its own pcre_extra block.
703    
704       The second argument contains option  bits.  At  present,  no       The second argument contains option  bits.  At  present,  no
705       options  are  defined  for  pcre_study(),  and this argument       options  are  defined  for  pcre_study(),  and this argument
706       should always be zero.       should always be zero.
707    
708       The third argument for pcre_study() is a pointer to an error       The third argument for pcre_study()  is  a  pointer  for  an
709       message. If studying succeeds (even if no data is returned),       error  message.  If  studying  succeeds  (even if no data is
710       the variable it points to  is  set  to  NULL.  Otherwise  it       returned), the variable it points to is set to NULL.  Other-
711       points to a textual error message.       wise it points to a textual error message. You should there-
712         fore  test  the  error  pointer  for  NULL   after   calling
713         pcre_study(), to be sure that it has run successfully.
714    
715         This is a typical call to pcre_study():
716    
717           pcre_extra *pe;
718           pe = pcre_study(
719             re,             /* result of pcre_compile() */
720             0,              /* no options exist */
721             &error);        /* set to NULL or points to a message */
722    
723       At present, studying a  pattern  is  useful  only  for  non-       At present, studying a  pattern  is  useful  only  for  non-
724       anchored  patterns  that do not have a single fixed starting       anchored  patterns  that do not have a single fixed starting
# Line 248  STUDYING A PATTERN Line 726  STUDYING A PATTERN
726       created.       created.
727    
728    
   
729  LOCALE SUPPORT  LOCALE SUPPORT
730    
731       PCRE handles caseless matching, and determines whether char-       PCRE handles caseless matching, and determines whether char-
732       acters  are  letters, digits, or whatever, by reference to a       acters  are  letters, digits, or whatever, by reference to a
733       set of tables. The library contains a default set of  tables       set of tables. When running in UTF-8 mode, this applies only
734       which  is  created in the default C locale when PCRE is com-       to characters with codes less than 256. The library contains
735       piled.  This  is   used   when   the   final   argument   of       a default set of tables that is created  in  the  default  C
736       pcre_compile()  is NULL, and is sufficient for many applica-       locale  when  PCRE  is compiled. This is used when the final
737       tions.       argument of pcre_compile() is NULL, and  is  sufficient  for
738         many applications.
739    
740       An alternative set of tables can, however, be supplied. Such       An alternative set of tables can, however, be supplied. Such
741       tables  are built by calling the pcre_maketables() function,       tables  are built by calling the pcre_maketables() function,
# Line 274  LOCALE SUPPORT Line 753  LOCALE SUPPORT
753       The  tables  are  built  in  memory  that  is  obtained  via       The  tables  are  built  in  memory  that  is  obtained  via
754       pcre_malloc.  The  pointer that is passed to pcre_compile is       pcre_malloc.  The  pointer that is passed to pcre_compile is
755       saved with the compiled pattern, and  the  same  tables  are       saved with the compiled pattern, and  the  same  tables  are
756       used  via this pointer by pcre_study() and pcre_exec(). Thus       used via this pointer by pcre_study() and pcre_exec(). Thus,
757       for any single pattern, compilation, studying  and  matching       for any single pattern, compilation, studying  and  matching
758       all happen in the same locale, but different patterns can be       all happen in the same locale, but different patterns can be
759       compiled in different locales. It is the caller's  responsi-       compiled in different locales. It is the caller's  responsi-
# Line 282  LOCALE SUPPORT Line 761  LOCALE SUPPORT
761       remains available for as long as it is needed.       remains available for as long as it is needed.
762    
763    
   
764  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
765       The pcre_info() function returns information  about  a  com-  
766       piled pattern.  Its yield is the number of capturing subpat-       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
767       terns, or one of the following negative numbers:            int what, void *where);
768    
769         The pcre_fullinfo() function  returns  information  about  a
770         compiled pattern. It replaces the obsolete pcre_info()
771         function, which is nevertheless retained for backwards
772         compatibility (and is documented below).
773         The first argument for pcre_fullinfo() is a pointer  to  the
774         compiled  pattern.  The  second  argument  is  the result of
775         pcre_study(), or NULL if the pattern was  not  studied.  The
776         third  argument  specifies  which  piece  of  information is
777         required, and the fourth argument is a pointer to a variable
778         to  receive  the data. The yield of the function is zero for
779         success, or one of the following negative numbers:
780    
781         PCRE_ERROR_NULL       the argument code was NULL         PCRE_ERROR_NULL       the argument code was NULL
782                                 the argument where was NULL
783         PCRE_ERROR_BADMAGIC   the "magic number" was not found         PCRE_ERROR_BADMAGIC   the "magic number" was not found
784           PCRE_ERROR_BADOPTION  the value of what was invalid
785    
786       If the optptr argument is not NULL, a copy  of  the  options       Here is a typical call of  pcre_fullinfo(),  to  obtain  the
787       with which the pattern was compiled is placed in the integer       length of the compiled pattern:
      it points to. These option bits are those specified  in  the  
      call  to  pcre_compile(),  modified  by any top-level option  
      settings  within  the   pattern   itself,   and   with   the  
      PCRE_ANCHORED  bit  set  if  the form of the pattern implies  
      that it can match only at the start of a subject string.  
788    
789       If the pattern is not anchored and the firstcharptr argument         int rc;
790       is  not  NULL, it is used to pass back information about the         size_t length;
791       first character of any matched string. If there is  a  fixed         rc = pcre_fullinfo(
792       first    character,    e.g.   from   a   pattern   such   as           re,               /* result of pcre_compile() */
793       (cat|cow|coyote), then it is returned in the integer pointed           pe,               /* result of pcre_study(), or NULL */
794       to by firstcharptr. Otherwise, if either           PCRE_INFO_SIZE,   /* what is required */
795             &length);         /* where to put the data */
796    
797         The possible values for the third argument  are  defined  in
798         pcre.h, and are as follows:
799    
800           PCRE_INFO_BACKREFMAX
801    
802         Return the number of the highest back reference in the  pat-
803         tern.  The  fourth argument should point to an int variable.
804         Zero is returned if there are no back references.
805    
806           PCRE_INFO_CAPTURECOUNT
807    
808         Return the number of capturing subpatterns in  the  pattern.
809         The fourth argument should point to an int variable.
810    
811           PCRE_INFO_FIRSTBYTE
812    
813         Return information about  the  first  byte  of  any  matched
814         string,  for a non-anchored pattern. (This option used to be
815         called PCRE_INFO_FIRSTCHAR; the old name is still recognized
816         for backwards compatibility.)
817    
818         If there is a fixed first byte, e.g. from a pattern such  as
819         (cat|cow|coyote),  it  is returned in the integer pointed to
820         by where. Otherwise, if either
821    
822       (a) the pattern was compiled with the PCRE_MULTILINE option,       (a) the pattern was compiled with the PCRE_MULTILINE option,
823       and every branch starts with "^", or       and every branch starts with "^", or
# Line 312  INFORMATION ABOUT A PATTERN Line 825  INFORMATION ABOUT A PATTERN
825       (b) every  branch  of  the  pattern  starts  with  ".*"  and       (b) every  branch  of  the  pattern  starts  with  ".*"  and
826       PCRE_DOTALL is not set (if it were set, the pattern would be       PCRE_DOTALL is not set (if it were set, the pattern would be
827       anchored),       anchored),
      then -1 is returned, indicating  that  the  pattern  matches  
      only  at  the  start  of  a subject string or after any "\n"  
      within the string. Otherwise -2 is returned.  
828    
829         -1 is returned, indicating that the pattern matches only  at
830         the  start  of  a subject string or after any newline within
831         the string. Otherwise -2 is returned. For anchored patterns,
832         -2 is returned.
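
The three kinds of result described above for PCRE_INFO_FIRSTBYTE can be told apart with a small helper. The following is a minimal sketch; the enum and the function name classify_first_byte are hypothetical and are not part of PCRE. It assumes the caller has already obtained the integer via pcre_fullinfo().

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value is the fixed first byte itself */
    FIRST_BYTE_LINE_START,  /* -1: match only at start or after a newline */
    FIRST_BYTE_UNKNOWN      /* -2: no first-byte information */
};

/* Map the integer returned for PCRE_INFO_FIRSTBYTE to one of the
   three cases listed in the text. */
static enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_UNKNOWN;
}
```

For a pattern such as (cat|cow|coyote) the value would be the byte 'c', which this helper classifies as FIRST_BYTE_FIXED.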
833    
834           PCRE_INFO_FIRSTTABLE
835    
836         If the pattern was studied, and this resulted  in  the  con-
837         struction of a 256-bit table indicating a fixed set of bytes
838         for the first byte in any matching string, a pointer to  the
839         table  is  returned.  Otherwise NULL is returned. The fourth
840         argument should point to an unsigned char * variable.
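
A 256-bit table occupies 32 bytes, one bit per possible byte value. The sketch below shows one way to test and set a bit in such a table. The bit order is an assumption (bit c & 7 of byte c / 8 stands for byte value c), since the text above does not state it, and both function names are hypothetical.

```c
/* Return 1 if byte value c is flagged in a 32-byte (256-bit) table.
   ASSUMPTION: bit (c & 7) of table[c / 8] represents byte value c. */
static int first_table_has(const unsigned char *table, unsigned char c)
{
    return (table[c / 8] >> (c & 7)) & 1;
}

/* Flag byte value c in the table, using the same assumed layout.
   Only useful for building test data; PCRE builds the real table. */
static void first_table_set(unsigned char *table, unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```

A caller would pass the pointer obtained for PCRE_INFO_FIRSTTABLE to first_table_has() after checking that it is not NULL.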
841    
842           PCRE_INFO_LASTLITERAL
843    
844         Return the value of the rightmost  literal  byte  that  must
845         exist  in  any  matched  string, other than at its start, if
846         such a byte has been recorded. The  fourth  argument  should
847         point  to  an  int variable. If there is no such byte, -1 is
848         returned. For anchored patterns,  a  last  literal  byte  is
849         recorded  only  if  it follows something of variable length.
850         For example, for the pattern /^a\d+z\d+/ the returned  value
851         is "z", but for /^a\dz\d/ the returned value is -1.
852    
853           PCRE_INFO_NAMECOUNT
854           PCRE_INFO_NAMEENTRYSIZE
855           PCRE_INFO_NAMETABLE
856    
857         PCRE supports the use of named as well as numbered capturing
858         parentheses. The names are just an additional way of identi-
859         fying the parentheses,  which  still  acquire  a  number.  A
860         caller  that  wants  to extract data from a named subpattern
861         must convert the name to a number in  order  to  access  the
862         correct  pointers  in  the  output  vector  (described  with
863         pcre_exec() below). In order to do this, it must  first  use
864         these  three  values  to  obtain  the name-to-number mapping
865         table for the pattern.
866    
867         The  map  consists  of  a  number  of  fixed-size   entries.
868         PCRE_INFO_NAMECOUNT   gives   the  number  of  entries,  and
869         PCRE_INFO_NAMEENTRYSIZE gives the size of each  entry;  both
870         of  these return an int value. The entry size depends on the
871         length of the longest name.  PCRE_INFO_NAMETABLE  returns  a
872         pointer to the first entry of the table (a pointer to char).
873         The first two bytes of each entry are the number of the cap-
874         turing parenthesis, most significant byte first. The rest of
875         the entry is the corresponding name,  zero  terminated.  The
876         names  are  in alphabetical order. For example, consider the
877         following pattern (assume PCRE_EXTENDED  is  set,  so  white
878         space - including newlines - is ignored):
879    
880           (?P<date> (?P<year>(\d\d)?\d\d) -
881           (?P<month>\d\d) - (?P<day>\d\d) )
882    
883         There are four named subpatterns,  so  the  table  has  four
884         entries,  and  each  entry in the table is eight bytes long.
885         The table is as follows, with non-printing  bytes  shown  in
886         hex, and undefined bytes shown as ??:
887    
888           00 01 d  a  t  e  00 ??
889           00 05 d  a  y  00 ?? ??
890           00 04 m  o  n  t  h  00
891           00 02 y  e  a  r  00 ??
892    
893         When writing code to extract data  from  named  subpatterns,
894         remember  that the length of each entry may be different for
895         each compiled pattern.
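
The name-to-number conversion described above can be sketched in plain C. This is an illustration only: find_named_group is a hypothetical helper, not a PCRE function, and example_table reproduces the four-entry table shown above with the undefined bytes set to zero. In real use the table pointer, entry count and entry size would come from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <string.h>

/* Search a name table of entry_count entries, each entry_size bytes.
   Each entry starts with the group number in two bytes (most
   significant byte first), followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not in the table. */
static int find_named_group(const unsigned char *table, int entry_count,
                            int entry_size, const char *name)
{
    int i;
    for (i = 0; i < entry_count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The example table from the text; undefined (??) bytes are zeroed. */
static const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

A linear search is used for simplicity. Because the names are stored in alphabetical order, a binary search would also work for patterns with many named subpatterns.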
896    
897           PCRE_INFO_OPTIONS
898    
899         Return a copy of the options with which the pattern was com-
900         piled.  The fourth argument should point to an unsigned long
901         int variable. These option bits are those specified  in  the
902         call  to  pcre_compile(),  modified  by any top-level option
903         settings within the pattern itself.
904    
905         A pattern is automatically anchored by PCRE if  all  of  its
906         top-level alternatives begin with one of the following:
907    
908           ^     unless PCRE_MULTILINE is set
909           \A    always
910           \G    always
911           .*    if PCRE_DOTALL is set and there are no back
912                   references to the subpattern in which .* appears
913    
914         For such patterns, the  PCRE_ANCHORED  bit  is  set  in  the
915         options returned by pcre_fullinfo().
916    
917           PCRE_INFO_SIZE
918    
919         Return the size of the compiled pattern, that is, the  value
920         that  was  passed as the argument to pcre_malloc() when PCRE
921         was getting memory in which to place the compiled data.  The
922         fourth argument should point to a size_t variable.
923    
924           PCRE_INFO_STUDYSIZE
925    
926         Return the size  of  the  data  block  pointed  to  by  the
927         study_data  field  in a pcre_extra block. That is, it is the
928         value that was passed to pcre_malloc() when PCRE was getting
929         memory into which to place the data created by pcre_study().
930         The fourth argument should point to a size_t variable.
931    
932    
933    OBSOLETE INFO FUNCTION
934    
935         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
936    
937         The pcre_info() function is now obsolete because its  inter-
938         face  is  too  restrictive  to return all the available data
939         about  a  compiled  pattern.   New   programs   should   use
940         pcre_fullinfo()  instead.  The  yield  of pcre_info() is the
941         number of capturing subpatterns, or  one  of  the  following
942         negative numbers:
943    
944           PCRE_ERROR_NULL       the argument code was NULL
945           PCRE_ERROR_BADMAGIC   the "magic number" was not found
946    
947         If the optptr argument is not NULL, a copy  of  the  options
948         with which the pattern was compiled is placed in the integer
949         it points to (see PCRE_INFO_OPTIONS above).
950    
951         If the pattern is not anchored and the firstcharptr argument
952         is  not  NULL, it is used to pass back information about the
953         first    character    of    any    matched    string    (see
954         PCRE_INFO_FIRSTBYTE above).
955    
956    
957  MATCHING A PATTERN  MATCHING A PATTERN
958    
959         int pcre_exec(const pcre *code, const pcre_extra *extra,
960              const char *subject, int length, int startoffset,
961              int options, int *ovector, int ovecsize);
962    
963       The function pcre_exec() is called to match a subject string       The function pcre_exec() is called to match a subject string
964       against  a pre-compiled pattern, which is passed in the code       against  a pre-compiled pattern, which is passed in the code
965       argument. If the pattern has been studied, the result of the       argument. If the pattern has been studied, the result of the
966       study should be passed in the extra argument. Otherwise this       study should be passed in the extra argument.
967       must be NULL.  
968         Here is an example of a simple call to pcre_exec():
969    
970           int rc;
971           int ovector[30];
972           rc = pcre_exec(
973             re,             /* result of pcre_compile() */
974             NULL,           /* we didn't study the pattern */
975             "some string",  /* the subject string */
976             11,             /* the length of the subject string */
977             0,              /* start at offset 0 in the subject */
978             0,              /* default options */
979             ovector,        /* vector for substring information */
980             30);            /* number of elements in the vector */
981    
982         If the extra argument is  not  NULL,  it  must  point  to  a
983         pcre_extra  data  block.  The  pcre_study() function returns
984         such a block (when it doesn't return NULL), but you can also
985         create  one for yourself, and pass additional information in
986         it. The fields in the block are as follows:
987    
988           unsigned long int flags;
989           void *study_data;
990           unsigned long int match_limit;
991           void *callout_data;
992    
993         The flags field is a bitmap  that  specifies  which  of  the
994         other fields are set. The flag bits are:
995    
996           PCRE_EXTRA_STUDY_DATA
997           PCRE_EXTRA_MATCH_LIMIT
998           PCRE_EXTRA_CALLOUT_DATA
999    
1000         Other flag bits should be set to zero. The study_data  field
1001         is   set  in  the  pcre_extra  block  that  is  returned  by
1002         pcre_study(), together with the appropriate  flag  bit.  You
1003         should  not  set this yourself, but you can add to the block
1004         by setting the other fields.
1005    
1006         The match_limit field provides a means  of  preventing  PCRE
1007         from  using  up a vast amount of resources when running pat-
1008         terns that are not going to match, but  which  have  a  very
1009         large  number  of  possibilities  in their search trees. The
1010         classic example is the  use  of  nested  unlimited  repeats.
1011         Internally,  PCRE  uses  a  function called match() which it
1012         calls  repeatedly  (sometimes  recursively).  The  limit  is
1013         imposed  on the number of times this function is called dur-
1014         ing a match, which has the effect of limiting the amount  of
1015         recursion and backtracking that can take place. For patterns
1016         that are not anchored, the count starts from zero  for  each
1017         position in the subject string.
1018    
1019         The default limit for the library can be set  when  PCRE  is
1020         built;  the default default is 10 million, which handles all
1021         but the most extreme cases. You can reduce  the  default  by
1022         supplying  pcre_exec()  with  a  pcre_extra  block  in  which
1023         match_limit   is   set   to    a    smaller    value,    and
1024         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the flags field. If the
1025         limit      is      exceeded,       pcre_exec()       returns
1026         PCRE_ERROR_MATCHLIMIT.
1027    
1028         The callout_data field is used in conjunction with the
1029         "callout" feature, which is described in the pcrecallout
1030         documentation.
1031    
1032       The PCRE_ANCHORED option can be passed in the options  argu-       The PCRE_ANCHORED option can be passed in the options  argu-
1033       ment,  whose unused bits must be zero. However, if a pattern       ment,   whose   unused   bits  must  be  zero.  This  limits
1034       was  compiled  with  PCRE_ANCHORED,  or  turned  out  to  be       pcre_exec() to matching at the first matching position. How-
1035       anchored  by  virtue  of  its  contents,  it  cannot be made       ever,  if  a  pattern  was  compiled  with PCRE_ANCHORED, or
1036       unanchored at matching time.       turned out to be anchored by virtue of its contents, it
1037         cannot be made unanchored at matching time.
1038    
1039         When PCRE_UTF8 was set at compile time, the validity of  the
1040         subject  as  a  UTF-8 string is automatically checked. If an
1041         invalid  UTF-8  sequence  of  bytes  is  found,  pcre_exec()
1042         returns  the  error  PCRE_ERROR_BADUTF8. If you already know
1043         that your subject is valid, and you want to skip this  check
1044         for  performance reasons, you can set the PCRE_NO_UTF8_CHECK
1045         option when calling pcre_exec(). When this  option  is  set,
1046         the  effect  of passing an invalid UTF-8 string as a subject
1047         is undefined. It may cause your program to crash.
1048    
1049       There are also three further options that can be set only at       There are also three further options that can be set only at
1050       matching time:       matching time:
# Line 373  MATCHING A PATTERN Line 1088  MATCHING A PATTERN
1088       advancing the starting offset  (see  below)  and  trying  an       advancing the starting offset  (see  below)  and  trying  an
1089       ordinary match again.       ordinary match again.
1090    
1091       The subject string is passed as  a  pointer  in  subject,  a       The subject string is passed to pcre_exec() as a pointer  in
1092       length  in  length,  and  a  starting offset in startoffset.       subject,  a length in length, and a starting offset in star-
1093       Unlike the pattern string, it may contain binary zero  char-       toffset. Unlike the pattern string, the subject may  contain
1094       acters.  When  the starting offset is zero, the search for a       binary  zero  bytes.  When  the starting offset is zero, the
1095       match starts at the beginning of the subject, and this is by       search for a match starts at the beginning of  the  subject,
1096       far the most common case.       and this is by far the most common case.
1097    
1098         If the pattern was compiled with the PCRE_UTF8  option,  the
1099         subject  must  be  a sequence of bytes that is a valid UTF-8
1100         string.  If  an  invalid  UTF-8  string  is  passed,  PCRE's
1101         behaviour is not defined.
1102    
1103       A non-zero starting offset  is  useful  when  searching  for       A non-zero starting offset  is  useful  when  searching  for
1104       another  match  in  the  same subject by calling pcre_exec()       another  match  in  the  same subject by calling pcre_exec()
# Line 415  MATCHING A PATTERN Line 1135  MATCHING A PATTERN
1135       used for a fragment of a pattern that picks out a substring.       used for a fragment of a pattern that picks out a substring.
1136       PCRE supports several other kinds of  parenthesized  subpat-       PCRE supports several other kinds of  parenthesized  subpat-
1137       tern that do not cause substrings to be captured.       tern that do not cause substrings to be captured.
   
1138       Captured substrings are returned to the caller via a  vector       Captured substrings are returned to the caller via a  vector
1139       of  integer  offsets whose address is passed in ovector. The       of  integer  offsets whose address is passed in ovector. The
1140       number of elements in the vector is passed in ovecsize.  The       number of elements in the vector is passed in ovecsize.  The
# Line 471  MATCHING A PATTERN Line 1190  MATCHING A PATTERN
1190       Note that pcre_info() can be used to find out how many  cap-       Note that pcre_info() can be used to find out how many  cap-
1191       turing  subpatterns  there  are  in  a compiled pattern. The       turing  subpatterns  there  are  in  a compiled pattern. The
1192       smallest size for ovector that will  allow  for  n  captured       smallest size for ovector that will  allow  for  n  captured
1193       substrings  in  addition  to  the  offsets  of the substring       substrings,  in  addition  to  the  offsets of the substring
1194       matched by the whole pattern is (n+1)*3.       matched by the whole pattern, is (n+1)*3.
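
As a rough illustration of the sizing rule above, the following C sketch computes the smallest ovecsize for a pattern with n capturing subpatterns. The helper name ovector_size_for is hypothetical and is not part of the PCRE API.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function: smallest ovecsize that allows
   for n captured substrings in addition to the offsets of the substring
   matched by the whole pattern, using the (n+1)*3 rule quoted above. */
static int ovector_size_for(int n_captures)
{
    return (n_captures + 1) * 3;
}
```

For a pattern with nine capturing subpatterns this gives a vector of 30 integers.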
1195    
1196       If pcre_exec() fails, it returns a negative number. The fol-       If pcre_exec() fails, it returns a negative number. The fol-
1197       lowing are defined in the header file:       lowing are defined in the header file:
# Line 512  MATCHING A PATTERN Line 1231  MATCHING A PATTERN
1231       pcre_malloc() fails, this error  is  given.  The  memory  is       pcre_malloc() fails, this error  is  given.  The  memory  is
1232       freed at the end of matching.       freed at the end of matching.
1233    
1234           PCRE_ERROR_NOSUBSTRING    (-7)
1235    
1236         This   error   is   used   by   the   pcre_copy_substring(),
1237         pcre_get_substring(),  and  pcre_get_substring_list()  func-
1238         tions (see below). It is never returned by pcre_exec().
1239    
1240           PCRE_ERROR_MATCHLIMIT     (-8)
1241    
1242         The recursion and backtracking limit, as  specified  by  the
1243         match_limit field in a pcre_extra structure (or  defaulted),
1244         was reached. See the description above.
1245    
1246           PCRE_ERROR_CALLOUT        (-9)
1247    
1248         This error is never generated by pcre_exec() itself.  It  is
1249         provided  for  use by callout functions that want to yield a
1250         distinctive error code. See  the  pcrecallout  documentation
1251         for details.
1252    
1253           PCRE_ERROR_BADUTF8       (-10)
1254    
1255         A string that contains an invalid UTF-8  byte  sequence  was
1256         passed as a subject.
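
The error codes listed above can be turned into readable messages with a small lookup function. The sketch below is only an illustration: describe_exec_error is a hypothetical name, the message strings are invented here, and the constants are defined locally with the values given in the text so that the fragment compiles without pcre.h.

```c
#include <stddef.h>

/* Local copies of the values quoted in the text above. Real code should
   take these definitions from pcre.h instead. */
#ifndef PCRE_ERROR_NOSUBSTRING
#define PCRE_ERROR_NOSUBSTRING  (-7)
#define PCRE_ERROR_MATCHLIMIT   (-8)
#define PCRE_ERROR_CALLOUT      (-9)
#define PCRE_ERROR_BADUTF8      (-10)
#endif

/* Hypothetical helper: map a negative return value to a short message.
   Codes not listed in this part of the text fall through to the default. */
static const char *describe_exec_error(int rc)
{
    switch (rc) {
    case PCRE_ERROR_NOSUBSTRING: return "no such substring";
    case PCRE_ERROR_MATCHLIMIT:  return "match limit reached";
    case PCRE_ERROR_CALLOUT:     return "error raised by callout function";
    case PCRE_ERROR_BADUTF8:     return "invalid UTF-8 in subject";
    default:                     return "other code";
    }
}
```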
1257    
1258    
1259    EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1260    
1261         int pcre_copy_substring(const char *subject, int *ovector,
1262              int stringcount, int stringnumber, char *buffer,
1263              int buffersize);
1264    
1265         int pcre_get_substring(const char *subject, int *ovector,
1266              int stringcount, int stringnumber,
1267              const char **stringptr);
1268    
1269         int pcre_get_substring_list(const char *subject,
1270              int *ovector, int stringcount, const char ***listptr);
1271    
 EXTRACTING CAPTURED SUBSTRINGS  
1272       Captured substrings can be accessed directly  by  using  the       Captured substrings can be accessed directly  by  using  the
1273       offsets returned by pcre_exec() in ovector. For convenience,       offsets returned by pcre_exec() in ovector. For convenience,
1274       the functions  pcre_copy_substring(),  pcre_get_substring(),       the functions  pcre_copy_substring(),  pcre_get_substring(),
1275       and  pcre_get_substring_list()  are  provided for extracting       and  pcre_get_substring_list()  are  provided for extracting
1276       captured  substrings  as  new,   separate,   zero-terminated       captured  substrings  as  new,   separate,   zero-terminated
1277         strings.  These functions identify substrings by number. The
1278         next section describes functions for extracting  named  sub-
1279       strings.   A  substring  that  contains  a  binary  zero  is       strings.   A  substring  that  contains  a  binary  zero  is
1280       correctly extracted and has a further zero added on the end,       correctly extracted and has a further zero added on the end,
1281       but the result does not, of course, function as a C string.       but the result is not, of course, a C string.
1282    
1283       The first three arguments are the same for all  three  func-       The first three arguments are the  same  for  all  three  of
1284       tions:  subject  is  the  subject string which has just been       these  functions:   subject  is the subject string which has
1285       successfully matched, ovector is a pointer to the vector  of       just been successfully matched, ovector is a pointer to  the
1286       integer   offsets   that  was  passed  to  pcre_exec(),  and       vector  of  integer  offsets that was passed to pcre_exec(),
1287       stringcount is the number of substrings that  were  captured       and stringcount is the number of substrings that  were  cap-
1288       by  the  match,  including  the  substring  that matched the       tured by the match, including the substring that matched the
1289       entire regular expression. This is  the  value  returned  by       entire regular expression. This is  the  value  returned  by
1290       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()
1291       returned zero, indicating that it ran out of space in  ovec-       returned zero, indicating that it ran out of space in  ovec-
1292       tor, then the value passed as stringcount should be the size       tor,  the  value passed as stringcount should be the size of
1293       of the vector divided by three.       the vector divided by three.
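
The rule for choosing stringcount can be sketched as a small C helper. The name stringcount_for is hypothetical; it only encodes the two cases described above and treats a negative return value from pcre_exec() as "nothing to extract".

```c
/* Hypothetical helper, not a PCRE function.
   exec_rc  : the value returned by pcre_exec()
   ovecsize : the size of the offsets vector that was passed to pcre_exec()
   Returns the value to pass as stringcount, or -1 if the match failed. */
static int stringcount_for(int exec_rc, int ovecsize)
{
    if (exec_rc > 0)
        return exec_rc;        /* number of substrings actually captured */
    if (exec_rc == 0)
        return ovecsize / 3;   /* ovector was too small: use its capacity */
    return -1;                 /* negative: pcre_exec() failed */
}
```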
   
1294       The functions pcre_copy_substring() and pcre_get_substring()       The functions pcre_copy_substring() and pcre_get_substring()
1295       extract a single substring, whose number is given as string-       extract a single substring, whose number is given as string-
1296       number. A value of zero extracts the substring that  matched       number. A value of zero extracts the substring that  matched
1297       the entire pattern, while higher values extract the captured       the entire pattern, while higher values extract the captured
1298       substrings. For pcre_copy_substring(), the string is  placed       substrings. For pcre_copy_substring(), the string is  placed
1299       in  buffer,  whose  length is given by buffersize, while for       in  buffer,  whose  length is given by buffersize, while for
1300       pcre_get_substring() a new block of store  is  obtained  via       pcre_get_substring() a new block of memory is  obtained  via
1301       pcre_malloc,  and its address is returned via stringptr. The       pcre_malloc,  and its address is returned via stringptr. The
1302       yield of the function is  the  length  of  the  string,  not       yield of the function is  the  length  of  the  string,  not
1303       including the terminating zero, or one of       including the terminating zero, or one of
# Line 576  EXTRACTING CAPTURED SUBSTRINGS Line 1331  EXTRACTING CAPTURED SUBSTRINGS
1331       inspecting the appropriate offset in ovector, which is nega-       inspecting the appropriate offset in ovector, which is nega-
1332       tive for unset substrings.       tive for unset substrings.
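
As an illustration of using the offsets directly, the following C sketch copies one captured substring out of the subject. It is not the library's implementation: copy_by_offsets and its return codes are hypothetical, and it assumes the layout described for pcre_exec(), in which substring n occupies ovector[2*n] (start) and ovector[2*n+1] (end).

```c
#include <string.h>

/* Hypothetical helper, not a PCRE function. Copies captured substring
   number n from subject into buffer, adding a terminating zero.
   Returns the length copied, -1 if the substring is unset (negative
   offset), or -2 if buffer is too small. */
static int copy_by_offsets(const char *subject, const int *ovector,
                           int n, char *buffer, int buffersize)
{
    int start = ovector[2 * n];       /* offset of the first byte */
    int end   = ovector[2 * n + 1];   /* offset just past the last byte */
    int len;

    if (start < 0 || end < 0)
        return -1;                    /* unset substring */
    len = end - start;
    if (len + 1 > buffersize)
        return -2;                    /* no room for string plus zero */

    memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

Because the length is returned separately, a substring that contains a binary zero can still be handled by the caller, although the buffer contents are then not usable as an ordinary C string.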
1333    
1334         The  two  convenience  functions  pcre_free_substring()  and
1335         pcre_free_substring_list()  can  be  used to free the memory
1336         returned by  a  previous  call  of  pcre_get_substring()  or
1337         pcre_get_substring_list(),  respectively.  They  do  nothing
1338         more than call the function pointed to by  pcre_free,  which
1339         of  course  could  be called directly from a C program. How-
1340         ever, PCRE is used in some situations where it is linked via
1341         a  special  interface  to another programming language which
1342         cannot use pcre_free directly; it is for  these  cases  that
1343         the functions are provided.
1344    
1345    
1346    EXTRACTING CAPTURED SUBSTRINGS BY NAME
1347    
1348         int pcre_copy_named_substring(const pcre *code,
1349              const char *subject, int *ovector,
1350              int stringcount, const char *stringname,
1351              char *buffer, int buffersize);
1352    
1353         int pcre_get_stringnumber(const pcre *code,
1354              const char *name);
1355    
1356         int pcre_get_named_substring(const pcre *code,
1357              const char *subject, int *ovector,
1358              int stringcount, const char *stringname,
1359              const char **stringptr);
1360    
1361         To extract a substring by name, you first have to  find  the
1362         associated   number.   This   can   be   done   by   calling
1363         pcre_get_stringnumber(). The first argument is the  compiled
1364         pattern,  and  the second is the name. For example, for this
1365         pattern
1366    
1367           ab(?<xxx>\d+)...
1368    
1369  LIMITATIONS       the number of the subpattern called "xxx" is  1.  Given  the
1370       There are some size limitations in PCRE but it is hoped that       number,  you can then extract the substring directly, or use
1371       they will never in practice be relevant.  The maximum length       one of the functions described in the previous section.  For
1372       of a compiled pattern is 65539 (sic) bytes.  All  values  in       convenience,  there are also two functions that do the whole
1373       repeating  quantifiers must be less than 65536.  The maximum       job.
1374       number of capturing subpatterns is 99.  The  maximum  number  
1375       of  all  parenthesized subpatterns, including capturing sub-       Most of the  arguments  of  pcre_copy_named_substring()  and
1376       patterns, assertions, and other types of subpattern, is 200.       pcre_get_named_substring()  are  the  same  as those for the
1377         functions that  extract  by  number,  and  so  are  not  re-
1378         described here. There are just two differences.
1379    
1380         First, instead of a substring number, a  substring  name  is
1381         given.  Second,  there  is  an  extra argument, given at the
1382         start, which is a pointer to the compiled pattern.  This  is
1383         needed  in order to gain access to the name-to-number trans-
1384         lation table.
1385    
1386         These functions  call  pcre_get_stringnumber(),  and  if  it
1387         succeeds,    they   then   call   pcre_copy_substring()   or
1388         pcre_get_substring(), as appropriate.
1389    
1390    Last updated: 20 August 2003
1391    Copyright (c) 1997-2003 University of Cambridge.
1392    -----------------------------------------------------------------------------
1393    
1394    NAME
1395         PCRE - Perl-compatible regular expressions
1396    
      The maximum length of a subject string is the largest  posi-  
      tive number that an integer variable can hold. However, PCRE  
      uses recursion to handle subpatterns and indefinite  repeti-  
      tion.  This  means  that the available stack space may limit  
      the size of a subject string that can be processed  by  cer-  
      tain patterns.  
1397    
1398    PCRE CALLOUTS
1399    
1400         int (*pcre_callout)(pcre_callout_block *);
1401    
1402         PCRE provides a feature called "callout", which is  a  means
1403         of  temporarily passing control to the caller of PCRE in the
1404         middle of pattern matching. The caller of PCRE  provides  an
1405         external  function  by putting its entry point in the global
1406         variable pcre_callout. By default,  this  variable  contains
1407         NULL, which disables all calling out.
1408    
1409         Within a regular expression, (?C) indicates  the  points  at
1410         which  the external function is to be called. Different cal-
1411         lout points can be identified by putting a number less  than
1412         256  after  the  letter  C.  The default value is zero.  For
1413         example, this pattern has two callout points:
1414    
1415           (?C1)9abc(?C2)def
1416    
1417         During matching, when PCRE  reaches  a  callout  point  (and
1418         pcre_callout  is  set), the external function is called. Its
1419         only argument is a pointer to  a  pcre_callout  block.  This
1420         contains the following variables:
1421    
1422           int          version;
1423           int          callout_number;
1424           int         *offset_vector;
1425           const char  *subject;
1426           int          subject_length;
1427           int          start_match;
1428           int          current_position;
1429           int          capture_top;
1430           int          capture_last;
1431           void        *callout_data;
1432    
1433         The version field  is  an  integer  containing  the  version
1434         number of the block format. The current version is zero. The
1435         version number may change in future if additional fields are
1436         added,  but  the  intention  is  never  to remove any of the
1437         existing fields.
1438    
1439         The callout_number field contains the number of the callout,
1440         as compiled into the pattern (that is, the number after ?C).
1441    
1442         The offset_vector field  is  a  pointer  to  the  vector  of
1443         offsets  that  was  passed by the caller to pcre_exec(). The
1444         contents can be inspected in  order  to  extract  substrings
1445         that  have  been  matched  so  far,  in  the same way as for
1446         extracting substrings after a match has completed.
1447         The subject and subject_length fields contain copies of  the
1448         values that were passed to pcre_exec().
1449    
1450         The start_match field contains the offset within the subject
1451         at  which  the current match attempt started. If the pattern
1452         is not anchored, the callout function may be called  several
1453         times for different starting points.
1454    
1455         The current_position field contains the  offset  within  the
1456         subject of the current match pointer.
1457    
1458         The capture_top field contains one more than the  number  of
1459         the  highest  numbered captured substring so far. If no sub-
1460         strings have been captured, the value of capture_top is one.
1461    
1462         The capture_last field  contains  the  number  of  the  most
1463         recently captured substring.
1464    
1465         The callout_data field contains a value that  is  passed  to
1466         pcre_exec()  by  the  caller  specifically so that it can be
1467         passed back in callouts. It is passed  in  the  callout_data
1468         field  of the pcre_extra data structure. If no such data was
1469         passed, the value of callout_data in a pcre_callout block is
1470         NULL.  There is a description of the pcre_extra structure in
1471         the pcreapi documentation.
1472    
1473    
1474    
1475    RETURN VALUES
1476    
1477         The callout function returns an integer.  If  the  value  is
1478         zero,  matching  proceeds as normal. If the value is greater
1479         than zero, matching fails at the current  point,  but  back-
1480         tracking  to test other possibilities goes ahead, just as if
1481         a lookahead assertion had failed. If the value is less  than
1482         zero,  the  match  is abandoned, and pcre_exec() returns the
1483         value.
1484    
1485         Negative values should normally be chosen from  the  set  of
1486         PCRE_ERROR_xxx  values.  In  particular,  PCRE_ERROR_NOMATCH
1487         forces a standard "no  match"  failure.   The  error  number
1488         PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1489         it will never be used by PCRE itself.
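
The three kinds of return value can be seen in a small callout function. The sketch below is an illustration under stated assumptions: callout_block_sketch is a local stand-in that mirrors the fields listed earlier so that the fragment compiles without pcre.h, and the policy implemented by example_callout is invented for this example.

```c
#include <stddef.h>

/* Local stand-in for the callout block described above. Real code should
   include pcre.h and use the type declared there. */
typedef struct {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
} callout_block_sketch;

#define SKETCH_ERROR_CALLOUT (-9)   /* same value as PCRE_ERROR_CALLOUT */

/* Invented policy, for illustration only:
   callout 1     : return zero, so matching proceeds as normal;
   callout 2     : return a positive value (fail at this point, but allow
                   backtracking) if fewer than 4 bytes have been matched
                   since the start of the current attempt;
   anything else : return a negative value, abandoning the match. */
static int example_callout(callout_block_sketch *cb)
{
    switch (cb->callout_number) {
    case 1:
        return 0;
    case 2:
        return (cb->current_position - cb->start_match < 4) ? 1 : 0;
    default:
        return SKETCH_ERROR_CALLOUT;
    }
}
```

With the real library, a function of this shape taking a pcre_callout_block pointer would be installed by storing its entry point in pcre_callout before calling pcre_exec().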
1490    
1491    Last updated: 21 January 2003
1492    Copyright (c) 1997-2003 University of Cambridge.
1493    -----------------------------------------------------------------------------
1494    
1495    NAME
1496         PCRE - Perl-compatible regular expressions
1497    
1498    
1499  DIFFERENCES FROM PERL  DIFFERENCES FROM PERL
      The differences described here  are  with  respect  to  Perl  
      5.005.  
1500    
1501       1. By default, a whitespace character is any character  that       This document describes the differences  in  the  ways  that
1502       the  C  library  function isspace() recognizes, though it is       PCRE  and  Perl  handle regular expressions. The differences
1503       possible to compile PCRE  with  alternative  character  type       described here are with respect to Perl 5.8.
      tables. Normally isspace() matches space, formfeed, newline,  
      carriage return, horizontal tab, and vertical tab. Perl 5 no  
      longer  includes vertical tab in its set of whitespace char-  
      acters. The \v escape that was in the Perl documentation for  
      a long time was never in fact recognized. However, the char-  
      acter itself was treated as whitespace at least up to 5.002.  
      In 5.004 and 5.005 it does not match \s.  
1504    
1505       2. PCRE does  not  allow  repeat  quantifiers  on  lookahead       1. PCRE does  not  allow  repeat  quantifiers  on  lookahead
1506       assertions. Perl permits them, but they do not mean what you       assertions. Perl permits them, but they do not mean what you
1507       might think. For example, (?!a){3} does not assert that  the       might think. For example, (?!a){3} does not assert that  the
1508       next  three characters are not "a". It just asserts that the       next  three characters are not "a". It just asserts that the
1509       next character is not "a" three times.       next character is not "a" three times.
1510    
1511       3. Capturing subpatterns that occur inside  negative  looka-       2. Capturing subpatterns that occur inside  negative  looka-
1512       head  assertions  are  counted,  but  their  entries  in the       head  assertions  are  counted,  but  their  entries  in the
1513       offsets vector are never set. Perl sets its numerical  vari-       offsets vector are never set. Perl sets its numerical  vari-
1514       ables  from  any  such  patterns that are matched before the       ables  from  any  such  patterns that are matched before the
# Line 626  DIFFERENCES FROM PERL Line 1516  DIFFERENCES FROM PERL
1516       only  if  the negative lookahead assertion contains just one       only  if  the negative lookahead assertion contains just one
1517       branch.       branch.
1518    
1519       4. Though binary zero characters are supported in  the  sub-       3. Though binary zero characters are supported in  the  sub-
1520       ject  string,  they  are  not  allowed  in  a pattern string       ject  string,  they  are  not  allowed  in  a pattern string
1521       because it is passed as a normal  C  string,  terminated  by       because it is passed as a normal  C  string,  terminated  by
1522       zero. The escape sequence "\0" can be used in the pattern to       zero. The escape sequence "\0" can be used in the pattern to
1523       represent a binary zero.       represent a binary zero.
1524    
1525       5. The following Perl escape sequences  are  not  supported:       4. The following Perl escape sequences  are  not  supported:
1526       \l,  \u,  \L,  \U,  \E, \Q. In fact these are implemented by       \l,  \u,  \L,  \U,  \P, \p, and \X. In fact these are imple-
1527       Perl's general string-handling and are not part of its  pat-       mented by Perl's general string-handling and are not part of
1528       tern matching engine.       its pattern matching engine. If any of these are encountered
1529         by PCRE, an error is generated.
1530       6. The Perl \G assertion is  not  supported  as  it  is  not  
1531       relevant to single pattern matches.       5. PCRE does support the \Q...\E  escape  for  quoting  sub-
1532         strings. Characters in between are treated as literals. This
1533       7. Fairly obviously, PCRE does  not  support  the  (?{code})       is slightly different from Perl in that $  and  @  are  also
1534       construction.       handled  as  literals inside the quotes. In Perl, they cause
1535         variable interpolation (but of course  PCRE  does  not  have
1536       8. There are at the time of writing some  oddities  in  Perl       variables). Note the following examples:
1537       5.005_02  concerned  with  the  settings of captured strings  
1538       when part of a pattern is repeated.  For  example,  matching           Pattern            PCRE matches      Perl matches
1539       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value  
1540       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2           \Qabc$xyz\E        abc$xyz           abc followed by the
1541       unset.    However,    if   the   pattern   is   changed   to                                                  contents of $xyz
1542       /^(aa(b(b))?)+$/ then $2 (and $3) get set.           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
1543             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
1544       In Perl 5.004 $2 is set in both cases, and that is also true  
1545       of PCRE. If in the future Perl changes to a consistent state       In PCRE, the \Q...\E mechanism is not  recognized  inside  a
1546       that is different, PCRE may change to follow.       character class.
1547    
1548       9. Another as yet unresolved discrepancy  is  that  in  Perl       8. Fairly obviously, PCRE does not support the (?{code}) and
1549       5.005_02  the  pattern /^(a)?(?(1)a|b)+$/ matches the string       (?p{code})  constructions. However, there is some experimen-
1550       "a", whereas in PCRE it does not.  However, in both Perl and       tal support for recursive patterns using the non-Perl  items
1551       PCRE /^(a)?a/ matched against "a" leaves $1 unset.       (?R),  (?number)  and  (?P>name).  Also,  the PCRE "callout"
1552         feature allows an external function to be called during pat-
1553         tern matching.
1554    
1555         9. There are some differences that are  concerned  with  the
1556         settings  of  captured  strings  when  part  of a pattern is
1557         repeated. For example, matching "aba"  against  the  pattern
1558         /^(a(b)?)+$/  in Perl leaves $2 unset, but in PCRE it is set
1559         to "b".
1560    
1561       10. PCRE  provides  some  extensions  to  the  Perl  regular       10. PCRE  provides  some  extensions  to  the  Perl  regular
1562       expression facilities:       expression facilities:
1563    
1564       (a) Although lookbehind assertions must match  fixed  length       (a) Although lookbehind assertions must match  fixed  length
1565       strings,  each  alternative branch of a lookbehind assertion       strings,  each  alternative branch of a lookbehind assertion
1566       can match a different length of string. Perl 5.005  requires       can match a different length of string. Perl  requires  them
1567       them all to have the same length.       all to have the same length.
1568    
1569       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not
1570       set,  the  $ meta- character matches only at the very end of       set,  the  $  meta-character matches only at the very end of
1571       the string.       the string.
1572    
1573       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter
1574       with no special meaning is faulted.       with no special meaning is faulted.
1575    
1576       (d)  If  PCRE_UNGREEDY  is  set,  the  greediness   of   the       (d) If PCRE_UNGREEDY is set, the greediness of  the  repeti-
1577       repetition quantifiers is inverted, that is, by default they       tion  quantifiers  is inverted, that is, by default they are
1578       are not greedy, but if followed by a question mark they are.       not greedy, but if followed by a question mark they are.
1579    
1580       (e) PCRE_ANCHORED can be used to force a pattern to be tried       (e) PCRE_ANCHORED can be used to force a pattern to be tried
1581       only at the start of the subject.       only at the first matching position in the subject string.
1582    
1583       (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options       (f)  The  PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   and
1584       for pcre_exec() have no Perl equivalents.       PCRE_NO_AUTO_CAPTURE  options  for  pcre_exec() have no Perl
1585         equivalents.
1586    
1587         (g) The (?R), (?number), and (?P>name) constructs allow  for
1588         recursive  pattern  matching  (Perl  can  do  this using the
1589         (?p{code}) construct, which PCRE cannot support.)
1590    
1591         (h) PCRE supports  named  capturing  substrings,  using  the
1592         Python syntax.
1593    
1594         (i) PCRE supports the  possessive  quantifier  "++"  syntax,
1595         taken from Sun's Java package.
1596    
1597         (j) The (R) condition, for  testing  recursion,  is  a  PCRE
1598         extension.
1599    
1600         (k) The callout facility is PCRE-specific.
1601    
1602    Last updated: 03 February 2003
1603    Copyright (c) 1997-2003 University of Cambridge.
1604    -----------------------------------------------------------------------------
1605    
1606    NAME
1607         PCRE - Perl-compatible regular expressions
1608    
1609    
1610    PCRE REGULAR EXPRESSION DETAILS
1611    
 REGULAR EXPRESSION DETAILS  
1612       The syntax and semantics of  the  regular  expressions  sup-       The syntax and semantics of  the  regular  expressions  sup-
1613       ported  by PCRE are described below. Regular expressions are       ported  by PCRE are described below. Regular expressions are
1614       also described in the Perl documentation and in a number  of       also described in the Perl documentation and in a number  of
1615       other  books,  some  of which have copious examples. Jeffrey       other  books,  some  of which have copious examples. Jeffrey
1616       Friedl's  "Mastering  Regular  Expressions",  published   by       Friedl's  "Mastering  Regular  Expressions",  published   by
1617       O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.       O'Reilly,  covers them in great detail. The description here
1618       The description here is intended as reference documentation.       is intended as reference documentation.
1619    
1620         The basic operation of PCRE is on strings of bytes. However,
1621         there  is  also  support for UTF-8 character strings. To use
1622         this support you must build PCRE to include  UTF-8  support,
1623         and  then call pcre_compile() with the PCRE_UTF8 option. How
1624         this affects the pattern matching is  mentioned  in  several
1625         places  below.  There is also a summary of UTF-8 features in
1626         the section on UTF-8 support in the main pcre page.
1627    
1628       A regular expression is a pattern that is matched against  a       A regular expression is a pattern that is matched against  a
1629       subject string from left to right. Most characters stand for       subject string from left to right. Most characters stand for
# Line 716  REGULAR EXPRESSION DETAILS Line 1645  REGULAR EXPRESSION DETAILS
1645       Outside square brackets, the meta-characters are as follows:       Outside square brackets, the meta-characters are as follows:
1646    
1647         \      general escape character with several uses         \      general escape character with several uses
1648         ^      assert start of  subject  (or  line,  in  multiline         ^      assert start of string (or line, in multiline mode)
1649       mode)         $      assert end of string (or line, in multiline mode)
        $      assert end of subject (or line, in multiline mode)  
1650         .      match any character except newline (by default)         .      match any character except newline (by default)
1651         [      start character class definition         [      start character class definition
1652         |      start of alternative branch         |      start of alternative branch
# Line 729  REGULAR EXPRESSION DETAILS Line 1657  REGULAR EXPRESSION DETAILS
1657                also quantifier minimizer                also quantifier minimizer
1658         *      0 or more quantifier         *      0 or more quantifier
1659         +      1 or more quantifier         +      1 or more quantifier
1660                  also "possessive quantifier"
1661         {      start min/max quantifier         {      start min/max quantifier
1662    
1663       Part of a pattern that is in square  brackets  is  called  a       Part of a pattern that is in square  brackets  is  called  a
# Line 738  REGULAR EXPRESSION DETAILS Line 1667  REGULAR EXPRESSION DETAILS
1667         \      general escape character         \      general escape character
1668         ^      negate the class, but only if the first character         ^      negate the class, but only if the first character
1669         -      indicates character range         -      indicates character range
1670           [      POSIX character class (only if followed by POSIX
1671                    syntax)
1672         ]      terminates the character class         ]      terminates the character class
1673    
1674       The following sections describe  the  use  of  each  of  the       The following sections describe  the  use  of  each  of  the
1675       meta-characters.       meta-characters.
1676    
1677    
   
1678  BACKSLASH  BACKSLASH
1679    
1680       The backslash character has several uses. Firstly, if it  is       The backslash character has several uses. Firstly, if it  is
1681       followed  by  a  non-alphameric character, it takes away any       followed  by  a  non-alphameric character, it takes away any
1682       special  meaning  that  character  may  have.  This  use  of       special  meaning  that  character  may  have.  This  use  of
1683       backslash  as  an  escape  character applies both inside and       backslash  as  an  escape  character applies both inside and
1684       outside character classes.       outside character classes.
1685    
1686       For example, if you want to match a "*" character, you write       For example, if you want to match a * character,  you  write
1687       "\*" in the pattern. This applies whether or not the follow-       \*  in the pattern.  This escaping action applies whether or
1688       ing character would otherwise  be  interpreted  as  a  meta-       not the following character would otherwise  be  interpreted
1689       character,  so it is always safe to precede a non-alphameric       as  a meta-character, so it is always safe to precede a non-
1690       with "\" to specify that it stands for itself.  In  particu-       alphameric with backslash to  specify  that  it  stands  for
1691       lar, if you want to match a backslash, you write "\\".       itself. In particular, if you want to match a backslash, you
1692         write \\.
1693    
1694       If a pattern is compiled with the PCRE_EXTENDED option, whi-       If a pattern is compiled with the PCRE_EXTENDED option, whi-
1695       tespace in the pattern (other than in a character class) and       tespace in the pattern (other than in a character class) and
1696       characters between a "#" outside a character class  and  the       characters between a # outside a  character  class  and  the
1697       next  newline  character  are ignored. An escaping backslash       next  newline  character  are ignored. An escaping backslash
1698       can be used to include a whitespace or "#" character as part       can be used to include a whitespace or # character  as  part
1699       of the pattern.       of the pattern.
1700    
1701         If you want to remove the special meaning from a sequence of
1702         characters, you can do so by putting them between \Q and \E.
1703         This is different from Perl in that $ and @ are  handled  as
1704         literals  in  \Q...\E  sequences in PCRE, whereas in Perl, $
1705         and @ cause variable interpolation. Note the following exam-
1706         ples:
1707    
1708           Pattern            PCRE matches   Perl matches
1709    
1710           \Qabc$xyz\E        abc$xyz        abc followed by the
1711    
1712                                               contents of $xyz
1713           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1714           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1715    
1716         The \Q...\E sequence is recognized both inside  and  outside
1717         character classes.
1718    
1719       A second use of backslash provides a way  of  encoding  non-       A second use of backslash provides a way  of  encoding  non-
1720       printing  characters  in patterns in a visible manner. There       printing  characters  in patterns in a visible manner. There
1721       is no restriction on the appearance of non-printing  charac-       is no restriction on the appearance of non-printing  charac-
# Line 774  BACKSLASH Line 1724  BACKSLASH
1724       usually  easier to use one of the following escape sequences       usually  easier to use one of the following escape sequences
1725       than the binary character it represents:       than the binary character it represents:
1726    
1727         \a     alarm, that is, the BEL character (hex 07)         \a        alarm, that is, the BEL character (hex 07)
1728         \cx    "control-x", where x is any character         \cx       "control-x", where x is any character
1729         \e     escape (hex 1B)         \e        escape (hex 1B)
1730         \f     formfeed (hex 0C)         \f        formfeed (hex 0C)
1731         \n     newline (hex 0A)         \n        newline (hex 0A)
1732         \r     carriage return (hex 0D)         \r        carriage return (hex 0D)
1733           \t        tab (hex 09)
1734         \t     tab (hex 09)         \ddd      character with octal code ddd, or backreference
1735         \xhh   character with hex code hh         \xhh      character with hex code hh
1736         \ddd   character with octal code ddd, or backreference         \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1737    
1738       The precise effect of "\cx" is as follows: if "x" is a lower       The precise effect of \cx is as follows: if  x  is  a  lower
1739       case  letter,  it  is converted to upper case. Then bit 6 of       case  letter,  it  is converted to upper case. Then bit 6 of
1740       the character (hex 40) is inverted.  Thus "\cz" becomes  hex       the character (hex 40) is inverted.  Thus  \cz  becomes  hex
1741       1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.       1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
1742    
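The \cx rule above can be checked with a few lines of C. This is only a sketch of the arithmetic described in the text, not code taken from the PCRE source; the function name ctrl_escape is a hypothetical helper invented for illustration.

```c
/* Hypothetical helper (not part of the PCRE API): returns the byte that
   \cx stands for, following the rule in the text. A lower case letter is
   first converted to upper case, then bit 6 (hex 40) is inverted. */
unsigned char ctrl_escape(unsigned char x)
{
    if (x >= 'a' && x <= 'z')
        x = (unsigned char)(x - 'a' + 'A');   /* fold to upper case (ASCII) */
    return (unsigned char)(x ^ 0x40);         /* invert bit 6 */
}
```

Under these assumptions, ctrl_escape('z') gives hex 1A, ctrl_escape('{') gives hex 3B, and ctrl_escape(';') gives hex 7B, which agrees with the three cases quoted above.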
1743       After "\x", up to two hexadecimal digits are  read  (letters       After \x, from zero  to  two  hexadecimal  digits  are  read
1744       can be in upper or lower case).       (letters  can be in upper or lower case). In UTF-8 mode, any
1745         number of hexadecimal digits may appear between \x{  and  },
1746         but  the value of the character code must be less than 2**31
1747         (that is, the maximum hexadecimal  value  is  7FFFFFFF).  If
1748         characters  other than hexadecimal digits appear between \x{
1749         and }, or if there is no terminating }, this form of  escape
1750         is  not  recognized.  Instead, the initial \x will be inter-
1751         preted as a basic  hexadecimal  escape,  with  no  following
1752         digits, giving a byte whose value is zero.
1753    
1754         Characters whose value is less than 256 can  be  defined  by
1755         either  of  the  two  syntaxes  for \x when PCRE is in UTF-8
1756         mode. There is no difference in the way  they  are  handled.
1757         For example, \xdc is exactly the same as \x{dc}.
1758    
1759       After "\0" up to two further octal digits are read. In  both       After \0 up to two further octal digits are  read.  In  both
1760       cases,  if  there are fewer than two digits, just those that       cases,  if  there are fewer than two digits, just those that
1761       are present are used. Thus the sequence "\0\x\07"  specifies       are present are used. Thus the  sequence  \0\x\07  specifies
1762       two binary zeros followed by a BEL character.  Make sure you       two binary zeros followed by a BEL character (code value 7).
1763       supply two digits after the initial zero  if  the  character       Make sure you supply two digits after the  initial  zero  if
1764       that follows is itself an octal digit.       the character that follows is itself an octal digit.
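As a rough illustration of the "up to two digits" rule for \x and \0, the following C sketch decodes just those two escapes into bytes. It is an assumption-laden toy, not PCRE's actual compiler: the names hex_val and decode_numeric_escapes are invented here, the \x{...} form and back references are ignored, and every other character is copied through unchanged.

```c
#include <stddef.h>

/* Hypothetical helper: value of a hexadecimal digit, or -1 if c is not one. */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decodes \xhh and \0dd in pat, writing at most
   outsize bytes to out. Returns the number of bytes written. */
size_t decode_numeric_escapes(const char *pat, unsigned char *out, size_t outsize)
{
    size_t n = 0;
    while (*pat != '\0' && n < outsize) {
        if (pat[0] == '\\' && pat[1] == 'x') {
            unsigned v = 0;
            int i;
            pat += 2;
            /* read zero, one, or two hexadecimal digits */
            for (i = 0; i < 2 && hex_val((unsigned char)*pat) >= 0; i++)
                v = v * 16 + (unsigned)hex_val((unsigned char)*pat++);
            out[n++] = (unsigned char)v;
        } else if (pat[0] == '\\' && pat[1] == '0') {
            unsigned v = 0;
            int i;
            pat += 2;
            /* read up to two further octal digits after the initial zero */
            for (i = 0; i < 2 && *pat >= '0' && *pat <= '7'; i++)
                v = v * 8 + (unsigned)(*pat++ - '0');
            out[n++] = (unsigned char)v;
        } else {
            out[n++] = (unsigned char)*pat++;   /* literal character */
        }
    }
    return n;
}
```

With this sketch, the pattern text \0\x\07 decodes to the three bytes 0, 0, 7, and \0113 decodes to a tab (octal 011) followed by the character "3", because no more than two digits are consumed after the leading zero.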
1765    
1766       The handling of a backslash followed by a digit other than 0       The handling of a backslash followed by a digit other than 0
1767       is  complicated.   Outside  a character class, PCRE reads it       is  complicated.   Outside  a character class, PCRE reads it
# Line 824  BACKSLASH Line 1787  BACKSLASH
1787                   writing a tab                   writing a tab
1788         \011   is always a tab         \011   is always a tab
1789         \0113  is a tab followed by the character "3"         \0113  is a tab followed by the character "3"
1790         \113   is the character with octal code 113 (since there         \113   might be a back reference, otherwise the
1791                   can be no more than 99 back references)                   character with octal code 113
1792         \377   is a byte consisting entirely of 1 bits         \377   might be a back reference, otherwise
1793                     the byte consisting entirely of 1 bits
1794         \81    is either a back reference, or a binary zero         \81    is either a back reference, or a binary zero
1795                   followed by the two characters "8" and "1"                   followed by the two characters "8" and "1"
1796    
1797       Note that octal values of 100 or greater must not be  intro-       Note that octal values of 100 or greater must not be  intro-
1798       duced  by  a  leading zero, because no more than three octal       duced  by  a  leading zero, because no more than three octal
1799       digits are ever read.       digits are ever read.
1800       All the sequences that define a single  byte  value  can  be  
1801       used both inside and outside character classes. In addition,       All the sequences that define a single byte value or a  sin-
1802       inside a character class, the sequence "\b"  is  interpreted       gle  UTF-8 character (in UTF-8 mode) can be used both inside
1803       as  the  backspace  character  (hex 08). Outside a character       and outside character classes. In addition, inside a charac-
1804       class it has a different meaning (see below).       ter  class,  the sequence \b is interpreted as the backspace
1805         character (hex 08). Outside a character class it has a  dif-
1806         ferent meaning (see below).
1807    
1808       The third use of backslash is for specifying generic charac-       The third use of backslash is for specifying generic charac-
1809       ter types:       ter types:
# Line 847  BACKSLASH Line 1813  BACKSLASH
1813         \s     any whitespace character         \s     any whitespace character
1814         \S     any character that is not a whitespace character         \S     any character that is not a whitespace character
1815         \w     any "word" character         \w     any "word" character
1816         \W     any "non-word" character         \W     any "non-word" character
1817    
1818       Each pair of escape sequences partitions the complete set of       Each pair of escape sequences partitions the complete set of
1819       characters  into  two  disjoint  sets.  Any  given character       characters  into  two  disjoint  sets.  Any  given character
1820       matches one, and only one, of each pair.       matches one, and only one, of each pair.
1821    
1822         In UTF-8 mode, characters with values greater than 255 never
1823         match \d, \s, or \w, and always match \D, \S, and \W.
1824    
1825         For compatibility with Perl, \s does not match the VT  char-
1826         acter (code 11).  This makes it different from the POSIX
1827         "space" class. The \s characters are HT  (9),  LF  (10),  FF
1828         (12), CR (13), and space (32).
1829    
1830       A "word" character is any letter or digit or the  underscore       A "word" character is any letter or digit or the  underscore
1831       character,  that  is,  any  character which can be part of a       character,  that  is,  any  character which can be part of a
1832       Perl "word". The definition of letters and  digits  is  con-       Perl "word". The definition of letters and  digits  is  con-
1833       trolled  by PCRE's character tables, and may vary if locale-       trolled  by PCRE's character tables, and may vary if locale-
1834       specific matching is  taking  place  (see  "Locale  support"       specific matching is taking place (see "Locale  support"  in
1835       above). For example, in the "fr" (French) locale, some char-       the pcreapi page). For example, in the "fr" (French) locale,
1836       acter codes greater than 128 are used for accented  letters,       some character codes greater than 128 are used for  accented
1837       and these are matched by \w.       letters, and these are matched by \w.
1838    
1839       These character type sequences can appear  both  inside  and       These character type sequences can appear  both  inside  and
1840       outside  character classes. They each match one character of       outside  character classes. They each match one character of
# Line 875  BACKSLASH Line 1849  BACKSLASH
1849       for more complicated  assertions  is  described  below.  The       for more complicated  assertions  is  described  below.  The
1850       backslashed assertions are       backslashed assertions are
1851    
1852         \b     word boundary         \b     matches at a word boundary
1853         \B     not a word boundary         \B     matches when not at a word boundary
1854         \A     start of subject (independent of multiline mode)         \A     matches at start of subject
1855         \Z     end of subject or newline at  end  (independent  of         \Z     matches at end of subject or before newline at end
1856       multiline mode)         \z     matches at end of subject
1857         \z     end of subject (independent of multiline mode)         \G     matches at first matching position in subject
1858    
1859       These assertions may not appear in  character  classes  (but       These assertions may not appear in  character  classes  (but
1860       note that "\b" has a different meaning, namely the backspace       note  that  \b has a different meaning, namely the backspace
1861       character, inside a character class).       character, inside a character class).
1862    
1863       A word boundary is a position in the  subject  string  where       A word boundary is a position in the  subject  string  where
1864       the current character and the previous character do not both       the current character and the previous character do not both
1865       match \w or \W (i.e. one matches \w and  the  other  matches       match \w or \W (i.e. one matches \w and  the  other  matches
1866       \W),  or the start or end of the string if the first or last       \W),  or the start or end of the string if the first or last
1867       character matches \w, respectively.       character matches \w, respectively.
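The word boundary definition above can be restated as a small C predicate. This is a sketch only: is_word_char and at_word_boundary are hypothetical names, and the sketch uses the C library's isalnum() in place of PCRE's own character tables, so its idea of a "letter" follows the current C locale rather than the tables a pattern was compiled with.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: approximates \w as letter, digit, or underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: non-zero if position pos (0..len) in subject s is a
   word boundary. A position before the start or after the end of the
   string is treated as a non-word character. */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
    int prev = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    int cur  = pos < len && is_word_char((unsigned char)s[pos]);
    return prev != cur;   /* exactly one side matches \w */
}
```

For the subject "ab cd", the positions 0, 2, 3, and 5 are boundaries under this sketch, while position 1 (between "a" and "b") is not.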
   
1868       The \A, \Z, and \z assertions differ  from  the  traditional       The \A, \Z, and \z assertions differ  from  the  traditional
1869       circumflex  and  dollar  (described below) in that they only       circumflex  and  dollar  (described below) in that they only
1870       ever match at the very start and end of the subject  string,       ever match at the very start and end of the subject  string,
1871       whatever  options  are  set.  They  are  not affected by the       whatever options are set. Thus, they are independent of mul-
1872       PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-       tiline mode.
1873       ment  of  pcre_exec()  is  non-zero, \A can never match. The  
1874         They are not affected  by  the  PCRE_NOTBOL  or  PCRE_NOTEOL
1875         options.  If the startoffset argument of pcre_exec() is non-
1876         zero, indicating that matching is to start at a point  other
1877         than  the  beginning of the subject, \A can never match. The
1878       difference between \Z and \z is that  \Z  matches  before  a       difference between \Z and \z is that  \Z  matches  before  a
1879       newline  that is the last character of the string as well as       newline  that is the last character of the string as well as
1880       at the end of the string, whereas \z  matches  only  at  the       at the end of the string, whereas \z  matches  only  at  the
1881       end.       end.
1882    
1883         The \G assertion is true  only  when  the  current  matching
1884         position is at the start point of the match, as specified by
1885         the startoffset argument of pcre_exec(). It differs from  \A
1886         when  the  value  of  startoffset  is  non-zero.  By calling
1887         pcre_exec() multiple times with appropriate  arguments,  you
1888         can mimic Perl's /g option, and it is in this kind of imple-
1889         mentation where \G can be useful.
1890    
1891         Note, however, that PCRE's  interpretation  of  \G,  as  the
1892         start of the current match, is subtly different from Perl's,
1893         which defines it as the end of the previous match. In  Perl,
1894         these  can  be  different when the previously matched string
1895         was empty. Because PCRE does just one match at  a  time,  it
1896         cannot reproduce this behaviour.
1897    
1898         If all the alternatives of a  pattern  begin  with  \G,  the
1899         expression  is  anchored to the starting match position, and
1900         the "anchored" flag is set in the compiled  regular  expres-
1901         sion.
1902    
1903    
1904  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
1905    
1906       Outside a character class, in the default matching mode, the       Outside a character class, in the default matching mode, the
1907       circumflex  character  is an assertion which is true only if       circumflex  character  is an assertion which is true only if
1908       the current matching point is at the start  of  the  subject       the current matching point is at the start  of  the  subject
1909       string.  If  the startoffset argument of pcre_exec() is non-       string.  If  the startoffset argument of pcre_exec() is non-
1910       zero, circumflex can never match. Inside a character  class,       zero, circumflex  can  never  match  if  the  PCRE_MULTILINE
1911       circumflex has an entirely different meaning (see below).       option is unset. Inside a character class, circumflex has an
1912         entirely different meaning (see below).
1913    
1914       Circumflex need not be the first character of the pattern if       Circumflex need not be the first character of the pattern if
1915       a  number of alternatives are involved, but it should be the       a  number of alternatives are involved, but it should be the
# Line 932  CIRCUMFLEX AND DOLLAR Line 1931  CIRCUMFLEX AND DOLLAR
1931    
1932       The meaning of dollar can be changed so that it matches only       The meaning of dollar can be changed so that it matches only
1933       at   the   very   end   of   the   string,  by  setting  the       at   the   very   end   of   the   string,  by  setting  the
1934       PCRE_DOLLAR_ENDONLY option at compile or matching time. This       PCRE_DOLLAR_ENDONLY option at compile time.  This  does  not
1935       does not affect the \Z assertion.       affect the \Z assertion.
1936    
1937       The meanings of the circumflex  and  dollar  characters  are       The meanings of the circumflex  and  dollar  characters  are
1938       changed  if  the  PCRE_MULTILINE option is set. When this is       changed  if  the  PCRE_MULTILINE option is set. When this is
1939       the case,  they  match  immediately  after  and  immediately       the case,  they  match  immediately  after  and  immediately
1940       before an internal "\n" character, respectively, in addition       before an internal newline character, respectively, in addi-
1941       to matching at the start and end of the subject string.  For       tion to matching at the start and end of the subject string.
1942       example,  the  pattern  /^abc$/  matches  the subject string       For  example, the pattern /^abc$/ matches the subject string
1943       "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-       "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-
1944       quently,  patterns  that  are  anchored  in single line mode       quently,  patterns  that  are  anchored  in single line mode
1945       because all branches start with "^" are not anchored in mul-       because all branches start with ^ are not anchored in multi-
1946       tiline mode, and a match for circumflex is possible when the       line  mode,  and a match for circumflex is possible when the
1947       startoffset  argument  of  pcre_exec()  is   non-zero.   The       startoffset  argument  of  pcre_exec()  is   non-zero.   The
1948       PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is       PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is
1949       set.       set.
1950    
1951       Note that the sequences \A, \Z, and \z can be used to  match       Note that the sequences \A, \Z, and \z can be used to  match
1952       the  start  and end of the subject in both modes, and if all       the  start  and end of the subject in both modes, and if all
1953       branches of a pattern start with \A is it  always  anchored,       branches of a pattern start with \A it is  always  anchored,
1954       whether PCRE_MULTILINE is set or not.       whether PCRE_MULTILINE is set or not.
1955    
1956    
   
1957  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
1958    
1959       Outside a character class, a dot in the pattern matches  any       Outside a character class, a dot in the pattern matches  any
1960       one character in the subject, including a non-printing char-       one character in the subject, including a non-printing char-
1961       acter, but not (by default)  newline.   If  the  PCRE_DOTALL       acter, but not (by default) newline.  In UTF-8 mode,  a  dot
1962       option  is  set,  then dots match newlines as well. The han-       matches  any  UTF-8  character, which might be more than one
1963       dling of dot is entirely independent of the handling of cir-       byte  long,  except  (by  default)  for  newline.   If   the
1964       cumflex  and  dollar,  the only relationship being that they       PCRE_DOTALL  option is set, dots match newlines as well. The
1965       both involve newline characters.  Dot has no special meaning       handling of dot is entirely independent of the  handling  of
1966         circumflex and dollar, the only relationship being that they
1967         both involve newline characters. Dot has no special  meaning
1968       in a character class.       in a character class.
1969    
1970    
1971    
1972    MATCHING A SINGLE BYTE
1973    
1974         Outside a character class, the escape  sequence  \C  matches
1975         any  one  byte, both in and out of UTF-8 mode. Unlike a dot,
1976         it always matches a newline. The feature is provided in Perl
1977         in  order  to match individual bytes in UTF-8 mode.  Because
1978         it breaks up UTF-8 characters into  individual  bytes,  what
1979         remains  in  the string may be a malformed UTF-8 string. For
1980         this reason it is best avoided.
1981    
1982         PCRE does not allow \C to appear  in  lookbehind  assertions
1983         (see below), because in UTF-8 mode it makes it impossible to
1984         calculate the length of the lookbehind.
1985    
1986    
1987  SQUARE BRACKETS  SQUARE BRACKETS
1988    
1989       An opening square bracket introduces a character class, ter-       An opening square bracket introduces a character class, ter-
1990       minated  by  a  closing  square  bracket.  A  closing square       minated  by  a  closing  square  bracket.  A  closing square
1991       bracket on its own is  not  special.  If  a  closing  square       bracket on its own is  not  special.  If  a  closing  square
# Line 976  SQUARE BRACKETS Line 1993  SQUARE BRACKETS
1993       the first data character in the class (after an initial cir-       the first data character in the class (after an initial cir-
1994       cumflex, if present) or escaped with a backslash.       cumflex, if present) or escaped with a backslash.
1995    
1996       A character class matches a single character in the subject;       A character class matches a single character in the subject.
1997       the  character  must  be in the set of characters defined by       In  UTF-8 mode, the character may occupy more than one byte.
1998       the class, unless the first character in the class is a cir-       A matched character must be in the set of characters defined
1999       cumflex,  in which case the subject character must not be in       by the class, unless the first character in the class defin-
2000       the set defined by the class. If a  circumflex  is  actually       ition is a circumflex, in which case the  subject  character
2001       required  as  a  member  of  the class, ensure it is not the       must not be in the set defined by the class. If a circumflex
2002       first character, or escape it with a backslash.       is actually required as a member of the class, ensure it  is
2003         not the first character, or escape it with a backslash.
2004    
2005       For example, the character class [aeiou] matches  any  lower       For example, the character class [aeiou] matches  any  lower
2006       case vowel, while [^aeiou] matches any character that is not       case vowel, while [^aeiou] matches any character that is not
# Line 993  SQUARE BRACKETS Line 2011  SQUARE BRACKETS
2011       string, and fails if the current pointer is at  the  end  of       string, and fails if the current pointer is at  the  end  of
2012       the string.       the string.
2013    
2014         In UTF-8 mode, characters with values greater than  255  can
2015         be  included  in a class as a literal string of bytes, or by
2016         using the \x{ escaping mechanism.
2017    
2018       When caseless matching  is  set,  any  letters  in  a  class       When caseless matching  is  set,  any  letters  in  a  class
2019       represent  both their upper case and lower case versions, so       represent  both their upper case and lower case versions, so
2020       for example, a caseless [aeiou] matches "A" as well as  "a",       for example, a caseless [aeiou] matches "A" as well as  "a",
2021       and  a caseless [^aeiou] does not match "A", whereas a case-       and  a caseless [^aeiou] does not match "A", whereas a case-
2022       ful version would.       ful version would. PCRE does not support the concept of case
2023         for characters with values greater than 255.
2024       The newline character is never treated in any special way in       The newline character is never treated in any special way in
2025       character  classes,  whatever the setting of the PCRE_DOTALL       character  classes,  whatever the setting of the PCRE_DOTALL
2026       or PCRE_MULTILINE options is. A  class  such  as  [^a]  will       or PCRE_MULTILINE options is. A  class  such  as  [^a]  will
# Line 1022  SQUARE BRACKETS Line 2044  SQUARE BRACKETS
2044       separate characters. The octal or hexadecimal representation       separate characters. The octal or hexadecimal representation
2045       of "]" can also be used to end a range.       of "]" can also be used to end a range.
2046    
2047       Ranges operate in ASCII collating sequence. They can also be       Ranges  operate  in  the  collating  sequence  of  character
2048       used  for  characters  specified  numerically,  for  example       values.  They  can  also  be  used  for characters specified
2049       [\000-\037]. If a range that includes letters is  used  when       numerically, for example [\000-\037]. In UTF-8 mode,  ranges
2050       caseless  matching  is set, it matches the letters in either       can  include  characters  whose values are greater than 255,
2051       case. For example, [W-c] is equivalent  to  [][\^_`wxyzabc],       for example [\x{100}-\x{2ff}].
2052       matched  caselessly,  and  if  character tables for the "fr"  
2053       locale are in use, [\xc8-\xcb] matches accented E characters       If a range that  includes  letters  is  used  when  caseless
2054       in both cases.       matching  is set, it matches the letters in either case. For
2055         example, [W-c] is  equivalent  to  [][\^_`wxyzabc],  matched
2056         caselessly,  and if character tables for the "fr" locale are
2057         in use, [\xc8-\xcb] matches accented E  characters  in  both
2058         cases.
2059    
2060       The character types \d, \D, \s, \S,  \w,  and  \W  may  also       The character types \d, \D, \s, \S,  \w,  and  \W  may  also
2061       appear  in  a  character  class, and add the characters that       appear  in  a  character  class, and add the characters that
# Line 1045  SQUARE BRACKETS Line 2071  SQUARE BRACKETS
2071       classes, but it does no harm if they are escaped.       classes, but it does no harm if they are escaped.
2072    
2073    
2074    POSIX CHARACTER CLASSES
2075    
2076         Perl supports the  POSIX  notation  for  character  classes,
2077         which  uses names enclosed by [: and :] within the enclosing
2078         square brackets. PCRE also supports this notation. For exam-
2079         ple,
2080    
2081           [01[:alpha:]%]
2082    
2083         matches "0", "1", any alphabetic character, or "%". The sup-
2084         ported class names are
2085    
2086           alnum    letters and digits
2087           alpha    letters
2088           ascii    character codes 0 - 127
2089           blank    space or tab only
2090           cntrl    control characters
2091           digit    decimal digits (same as \d)
2092           graph    printing characters, excluding space
2093           lower    lower case letters
2094           print    printing characters, including space
2095           punct    printing characters, excluding letters and digits
2096           space    white space (not quite the same as \s)
2097           upper    upper case letters
2098           word     "word" characters (same as \w)
2099           xdigit   hexadecimal digits
2100    
2101         The "space" characters are HT (9),  LF  (10),  VT  (11),  FF
2102         (12),  CR  (13),  and  space  (32).  Notice  that  this list
2103         includes the VT character (code 11). This makes "space" dif-
2104         ferent  to  \s, which does not include VT (for Perl compati-
2105         bility).
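The difference between \s and the POSIX "space" class, as described for this release, comes down to one character code. The following C sketch spells out the two sets; the function names are hypothetical and the character codes are taken directly from the lists in the text.

```c
/* Hypothetical helper: the \s set as listed in the text, namely
   HT (9), LF (10), FF (12), CR (13), and space (32). VT (11) is excluded. */
int is_backslash_s(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Hypothetical helper: the [:space:] set as listed in the text, which is
   the \s set plus VT (11). */
int is_posix_space_class(int c)
{
    return is_backslash_s(c) || c == 11;
}
```

Under these assumptions the two predicates agree on every character code except 11, so a pattern that must treat VT as white space would need [[:space:]] rather than \s.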
2106    
2107         The name "word" is a Perl extension, and "blank"  is  a  GNU
2108         extension from Perl 5.8. Another Perl extension is negation,
2109         which is indicated by a ^ character  after  the  colon.  For
2110         example,
2111    
2112           [12[:^digit:]]
2113    
2114         matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also
2115         recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
2116         "collating element", but these are  not  supported,  and  an
2117         error is given if they are encountered.
2118    
2119         In UTF-8 mode, characters with values greater  than  255  do
2120         not match any of the POSIX character classes.
2121    
2122    
2123  VERTICAL BAR  VERTICAL BAR
2124    
2125       Vertical bar characters are  used  to  separate  alternative       Vertical bar characters are  used  to  separate  alternative
2126       patterns. For example, the pattern       patterns. For example, the pattern
2127    
# Line 1062  VERTICAL BAR Line 2137  VERTICAL BAR
2137       subpattern.       subpattern.
2138    
2139    
   
2140  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
2141       The settings of PCRE_CASELESS, PCRE_MULTILINE,  PCRE_DOTALL,  
2142       and  PCRE_EXTENDED can be changed from within the pattern by       The   settings   of   the   PCRE_CASELESS,   PCRE_MULTILINE,
2143       a sequence of Perl option letters enclosed between "(?"  and       PCRE_DOTALL,  and  PCRE_EXTENDED options can be changed from
2144       ")". The option letters are       within the pattern by a  sequence  of  Perl  option  letters
2145         enclosed between "(?" and ")". The option letters are
2146    
2147         i  for PCRE_CASELESS         i  for PCRE_CASELESS
2148         m  for PCRE_MULTILINE         m  for PCRE_MULTILINE
# Line 1082  INTERNAL OPTION SETTING Line 2157  INTERNAL OPTION SETTING
2157       If  a  letter  appears both before and after the hyphen, the       If  a  letter  appears both before and after the hyphen, the
2158       option is unset.       option is unset.
2159    
2160       The scope of these option changes depends on  where  in  the       When an option change occurs at  top  level  (that  is,  not
2161       pattern  the  setting  occurs. For settings that are outside       inside  subpattern  parentheses),  the change applies to the
2162       any subpattern (defined below), the effect is the same as if       remainder of the pattern that follows.   If  the  change  is
2163       the  options were set or unset at the start of matching. The       placed  right  at  the  start of a pattern, PCRE extracts it
2164       following patterns all behave in exactly the same way:       into the global options (and it will therefore  show  up  in
2165         data extracted by the pcre_fullinfo() function).
2166         (?i)abc  
2167         a(?i)bc       An option change within a subpattern affects only that  part
2168         ab(?i)c       of the current pattern that follows it, so
        abc(?i)  
   
      which in turn is the same as compiling the pattern abc  with  
      PCRE_CASELESS  set.   In  other words, such "top level" set-  
      tings apply to the whole pattern  (unless  there  are  other  
      changes  inside subpatterns). If there is more than one set-  
      ting of the same option at top level, the rightmost  setting  
      is used.  
   
      If an option change occurs inside a subpattern,  the  effect  
      is  different.  This is a change of behaviour in Perl 5.005.  
      An option change inside a subpattern affects only that  part  
      of the subpattern that follows it, so  
2169    
2170         (a(?i)b)c         (a(?i)b)c
2171    
# Line 1130  INTERNAL OPTION SETTING Line 2192  INTERNAL OPTION SETTING
2192       even when it is at top level. It is best put at the start.       even when it is at top level. It is best put at the start.
2193    
2194    
   
2195  SUBPATTERNS  SUBPATTERNS
2196    
2197       Subpatterns are delimited by parentheses  (round  brackets),       Subpatterns are delimited by parentheses  (round  brackets),
2198       which can be nested.  Marking part of a pattern as a subpat-       which can be nested.  Marking part of a pattern as a subpat-
2199       tern does two things:       tern does two things:
# Line 1159  SUBPATTERNS Line 2221  SUBPATTERNS
2221         the ((red|white) (king|queen))         the ((red|white) (king|queen))
2222    
2223       the captured substrings are "red king", "red",  and  "king",       the captured substrings are "red king", "red",  and  "king",
2224       and are numbered 1, 2, and 3.       and are numbered 1, 2, and 3, respectively.
2225    
2226       The fact that plain parentheses fulfil two functions is  not       The fact that plain parentheses fulfil two functions is  not
2227       always  helpful.  There are often times when a grouping sub-       always  helpful.  There are often times when a grouping sub-
2228       pattern is required without a capturing requirement.  If  an       pattern is required without a capturing requirement.  If  an
2229       opening parenthesis is followed by "?:", the subpattern does       opening  parenthesis  is  followed  by a question mark and a
2230       not do any capturing, and is not counted when computing  the       colon, the subpattern does not do any capturing, and is  not
2231       number of any subsequent capturing subpatterns. For example,       counted  when computing the number of any subsequent captur-
2232       if the string "the white queen" is matched against the  pat-       ing subpatterns. For  example,  if  the  string  "the  white
2233       tern       queen" is matched against the pattern
2234    
2235         the ((?:red|white) (king|queen))         the ((?:red|white) (king|queen))
2236    
2237       the captured substrings are "white queen" and  "queen",  and       the captured substrings are "white queen" and  "queen",  and
2238       are  numbered  1  and 2. The maximum number of captured sub-       are  numbered  1 and 2. The maximum number of capturing sub-
2239       strings is 99, and the maximum number  of  all  subpatterns,       patterns is 65535, and the maximum depth of nesting  of  all
2240       both capturing and non-capturing, is 200.       subpatterns, both capturing and non-capturing, is 200.
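
     The numbering rules above can be sketched with Python's re module,
     used here only as a stand-in on the assumption that it numbers
     groups the same way as PCRE for these two patterns. The variable
     names are illustrative.

```python
import re

subject = "the white queen"

# Plain parentheses both group and capture; groups are numbered by the
# position of their opening parenthesis, counting from 1.
capturing = re.match(r"the ((red|white) (king|queen))", subject)

# (?:...) groups without capturing, so later groups move down by one.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", subject)

print(capturing.groups())      # ('white queen', 'white', 'queen')
print(non_capturing.groups())  # ('white queen', 'queen')
```

     With the non-capturing form, "queen" is group 2 rather than group 3.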
2241    
2242       As a  convenient  shorthand,  if  any  option  settings  are       As a  convenient  shorthand,  if  any  option  settings  are
2243       required  at  the  start  of a non-capturing subpattern, the       required  at  the  start  of a non-capturing subpattern, the
# Line 1192  SUBPATTERNS Line 2254  SUBPATTERNS
2254       the above patterns match "SUNDAY" as well as "Saturday".       the above patterns match "SUNDAY" as well as "Saturday".
2255    
2256    
2257    NAMED SUBPATTERNS
2258    
2259         Identifying capturing parentheses by number is  simple,  but
2260         it  can be very hard to keep track of the numbers in compli-
2261         cated regular expressions. Furthermore, if an expression  is
2262         modified,  the  numbers  may change. To help with the diffi-
2263         culty, PCRE supports the naming  of  subpatterns,  something
2264         that  Perl does not provide. The Python syntax (?P<name>...)
2265         is used. Names consist of alphanumeric characters and under-
2266         scores, and must be unique within a pattern.
2267    
2268         Named capturing parentheses are still allocated  numbers  as
2269         well  as  names.  The  PCRE  API provides function calls for
2270         extracting the name-to-number translation table from a  com-
2271         piled  pattern. For further details see the pcreapi documen-
2272         tation.
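
     As a rough illustration of names being allocated numbers as well,
     the sketch below uses Python's re module, which shares the
     (?P<name>...) syntax. Its groupindex attribute is Python's own
     name-to-number table and is only analogous to the PCRE API calls
     described in pcreapi; the pattern and names are made up for the
     example.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = pattern.match("2007-02")

# Named groups are still numbered; groupindex maps each name to its number.
name_to_number = dict(pattern.groupindex)

print(name_to_number)                        # {'year': 1, 'month': 2}
print(match.group("year"), match.group(1))   # 2007 2007
```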
2273    
2274    
2275  REPETITION  REPETITION
2276    
2277       Repetition is specified by quantifiers, which can follow any       Repetition is specified by quantifiers, which can follow any
2278       of the following items:       of the following items:
2279    
2280           a literal data character
        a single character, possibly escaped  
2281         the . metacharacter         the . metacharacter
2282           the \C escape sequence
2283           escapes such as \d that match single characters
2284         a character class         a character class
2285         a back reference (see next section)         a back reference (see next section)
2286         a parenthesized subpattern (unless it is  an  assertion  -         a parenthesized subpattern (unless it is an assertion)
      see below)  
2287    
2288       The general repetition quantifier specifies  a  minimum  and       The general repetition quantifier specifies  a  minimum  and
2289       maximum  number  of  permitted  matches,  by  giving the two       maximum  number  of  permitted  matches,  by  giving the two
# Line 1232  REPETITION Line 2312  REPETITION
2312       as  a literal character. For example, {,6} is not a quantif-       as  a literal character. For example, {,6} is not a quantif-
2313       ier, but a literal string of four characters.       ier, but a literal string of four characters.
2314    
2315         In UTF-8 mode, quantifiers apply to UTF-8 characters  rather
2316         than  to  individual  bytes.  Thus,  for example, \x{100}{2}
2317         matches two UTF-8 characters, each of which  is  represented
2318         by a two-byte sequence.
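
     The point about characters versus bytes can be checked with a short
     Python sketch. Python string patterns always work on characters, so
     this illustrates the semantics rather than PCRE's UTF-8 mode itself.

```python
import re

ch = "\u0100"                        # the character written \x{100} above
utf8_len = len(ch.encode("utf-8"))   # number of bytes in its UTF-8 encoding

# The {2} quantifier counts characters, not bytes.
two_chars = re.fullmatch("\u0100{2}", ch * 2)
one_char = re.fullmatch("\u0100{2}", ch)

print(utf8_len)                # 2
print(two_chars is not None)   # True
print(one_char is not None)    # False
```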
2319    
2320       The quantifier {0} is permitted, causing the  expression  to       The quantifier {0} is permitted, causing the  expression  to
2321       behave  as  if the previous item and the quantifier were not       behave  as  if the previous item and the quantifier were not
2322       present.       present.
# Line 1270  REPETITION Line 2355  REPETITION
2355    
2356         /* first command */  not comment  /* second comment */         /* first command */  not comment  /* second comment */
2357    
2358       fails, because it matches  the  entire  string  due  to  the       fails, because it matches the entire  string  owing  to  the
2359       greediness of the .*  item.       greediness of the .*  item.
2360    
2361       However, if a quantifier is followed  by  a  question  mark,       However, if a quantifier is followed by a question mark,  it
2362       then it ceases to be greedy, and instead matches the minimum       ceases  to be greedy, and instead matches the minimum number
2363       number of times possible, so the pattern       of times possible, so the pattern
2364    
2365         /\*.*?\*/         /\*.*?\*/
2366    
# Line 1292  REPETITION Line 2377  REPETITION
2377       that is the only way the rest of the pattern matches.       that is the only way the rest of the pattern matches.
2378    
2379       If the PCRE_UNGREEDY option is set (an option which  is  not       If the PCRE_UNGREEDY option is set (an option which  is  not
2380       available  in  Perl)  then the quantifiers are not greedy by       available  in  Perl),  the  quantifiers  are  not  greedy by
2381       default, but individual ones can be made greedy by following       default, but individual ones can be made greedy by following
2382       them  with  a  question mark. In other words, it inverts the       them  with  a  question mark. In other words, it inverts the
2383       default behaviour.       default behaviour.
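
     The difference between greedy and minimizing quantifiers on the C
     comment example can be sketched in Python. Python's re module has
     no counterpart of PCRE_UNGREEDY, so only the per-quantifier form
     with a trailing question mark is shown.

```python
import re

subject = "/* first command */  not comment  /* second comment */"

# Greedy .* runs to the last "*/", so the whole subject is one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/" that lets the pattern match.
lazy = re.findall(r"/\*.*?\*/", subject)

print(len(greedy))   # 1
print(lazy)          # ['/* first command */', '/* second comment */']
```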
# Line 1301  REPETITION Line 2386  REPETITION
2386       repeat  count  that is greater than 1 or with a limited max-       repeat  count  that is greater than 1 or with a limited max-
2387       imum, more store is required for the  compiled  pattern,  in       imum, more store is required for the  compiled  pattern,  in
2388       proportion to the size of the minimum or maximum.       proportion to the size of the minimum or maximum.
   
2389       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
2390       option (equivalent to Perl's /s) is set, thus allowing the .       option (equivalent to Perl's /s) is set, thus allowing the .
2391       to match newlines, then the pattern is implicitly  anchored,       to match  newlines,  the  pattern  is  implicitly  anchored,
2392       because whatever follows will be tried against every charac-       because whatever follows will be tried against every charac-
2393       ter position in the subject string, so there is no point  in       ter position in the subject string, so there is no point  in
2394       retrying  the overall match at any position after the first.       retrying  the overall match at any position after the first.
2395       PCRE treats such a pattern as though it were preceded by \A.       PCRE normally treats such a pattern as though it  were  pre-
2396       In  cases where it is known that the subject string contains       ceded by \A.
2397       no newlines, it is worth setting PCRE_DOTALL when  the  pat-  
2398       tern begins with .* in order to obtain this optimization, or       In cases where it is known that the subject string  contains
2399       alternatively using ^ to indicate anchoring explicitly.       no  newlines,  it  is  worth setting PCRE_DOTALL in order to
2400         obtain this optimization, or alternatively using ^ to  indi-
2401         cate anchoring explicitly.
2402    
2403         However, there is one situation where the optimization  can-
2404         not  be  used. When .*  is inside capturing parentheses that
2405         are the subject of a backreference elsewhere in the pattern,
2406         a match at the start may fail, and a later one succeed. Con-
2407         sider, for example:
2408    
2409           (.*)abc\1
2410    
2411         If the subject is "xyz123abc123"  the  match  point  is  the
2412         fourth  character.  For  this  reason, such a pattern is not
2413         implicitly anchored.
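
     The match point in this example can be confirmed with Python's re
     module, assuming it treats this pattern as PCRE does. Offsets in
     Python are zero-based, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

print(m.start())    # 3
print(m.group(1))   # 123
print(m.group(0))   # 123abc123
```

     Attempts starting at offsets 0, 1 and 2 fail because the captured
     text ("xyz123", "yz123", "z123") does not recur after "abc".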
2414    
2415       When a capturing subpattern is repeated, the value  captured       When a capturing subpattern is repeated, the value  captured
2416       is the substring that matched the final iteration. For exam-       is the substring that matched the final iteration. For exam-
# Line 1332  REPETITION Line 2430  REPETITION
2430       "b".       "b".
2431    
2432    
2433    ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2434    
2435         With both maximizing and minimizing repetition,  failure  of
2436         what  follows  normally  causes  the repeated item to be re-
2437         evaluated to see if a different number of repeats allows the
2438         rest  of  the  pattern  to  match. Sometimes it is useful to
2439         prevent this, either to change the nature of the  match,  or
2440         to cause it to fail earlier than it otherwise might, when the
2441         author of the pattern knows there is no  point  in  carrying
2442         on.
2443    
2444         Consider, for example, the pattern \d+foo  when  applied  to
2445         the subject line
2446    
2447           123456bar
2448    
2449         After matching all 6 digits and then failing to match "foo",
2450         the normal action of the matcher is to try again with only 5
2451         digits matching the \d+ item, and then with 4,  and  so  on,
2452         before  ultimately  failing. "Atomic grouping" (a term taken
2453         from Jeffrey Friedl's book) provides the means for  specify-
2454         ing  that once a subpattern has matched, it is not to be re-
2455         evaluated in this way.
2456    
2457         If we use atomic grouping  for  the  previous  example,  the
2458         matcher  would give up immediately on failing to match "foo"
2459         the  first  time.  The  notation  is  a  kind   of   special
2460         parenthesis, starting with (?> as in this example:
2461    
2462           (?>\d+)foo
2463    
2464         This kind of parenthesis "locks up" the  part of the pattern
2465         it  contains once it has matched, and a failure further into
2466         the pattern is prevented from backtracking  into  it.  Back-
2467         tracking  past  it to previous items, however, works as nor-
2468         mal.
2469    
2470         An alternative description is that a subpattern of this type
2471         matches  the  string  of  characters that an identical stan-
2472         dalone pattern would match, if anchored at the current point
2473         in the subject string.
2474    
2475         Atomic grouping subpatterns are not  capturing  subpatterns.
2476         Simple  cases such as the above example can be thought of as
2477         a maximizing repeat that must swallow everything it can. So,
2478         while both \d+ and \d+? are prepared to adjust the number of
2479         digits they match in order to make the rest of  the  pattern
2480         match, (?>\d+) can only match an entire sequence of digits.
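
     The following Python sketch shows the effect of refusing to give
     back digits. It does not use (?>...) directly, because Python's re
     module only accepts that syntax from version 3.11; instead it uses
     a well-known emulation, a lookahead that captures the run followed
     by a back reference that consumes it. The subject and the trailing
     \d are chosen for illustration.

```python
import re

subject = "123456bar"

# Ordinary greedy repeat: \d+ gives back one digit so the final \d can match.
backtracking = re.match(r"\d+\dbar", subject)

# Emulated atomic group: the lookahead captures the longest run of digits
# and \1 consumes exactly that run, leaving no shorter run to retry.
atomic_like = re.match(r"(?=(\d+))\1\dbar", subject)

print(backtracking is not None)   # True
print(atomic_like is not None)    # False
```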
2481    
2482         Atomic groups in general can of course  contain  arbitrarily
2483         complicated  subpatterns,  and  can be nested. However, when
2484         the subpattern for an atomic group is just a single repeated
2485         item,  as in the example above, a simpler notation, called a
2486         "possessive quantifier" can be used.  This  consists  of  an
2487         additional  +  character  following a quantifier. Using this
2488         notation, the previous example can be rewritten as
2489    
2490           \d++foo
2491    
2492         Possessive quantifiers are always greedy; the setting of the
2493         PCRE_UNGREEDY option is ignored. They are a convenient nota-
2494         tion for the simpler forms of atomic group.  However,  there
2495         is  no  difference in the meaning or processing of a posses-
2496         sive quantifier and the equivalent atomic group.
2497    
2498         The possessive quantifier syntax is an extension to the Perl
2499         syntax. It originates in Sun's Java package.
2500    
2501         When a pattern contains an unlimited repeat inside a subpat-
2502         tern  that  can  itself  be  repeated an unlimited number of
2503         times, the use of an atomic group is the only way  to  avoid
2504         some  failing  matches  taking  a very long time indeed. The
2505         pattern
2506    
2507           (\D+|<\d+>)*[!?]
2508    
2509         matches an unlimited number of substrings that  either  con-
2510         sist  of  non-digits,  or digits enclosed in <>, followed by
2511         either ! or ?. When it matches, it runs quickly. However, if
2512         it is applied to
2513    
2514           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2515    
2516         it takes a long  time  before  reporting  failure.  This  is
2517         because the string can be divided between the two repeats in
2518         a large number of ways, and all have to be tried. (The exam-
2519         ple  uses  [!?]  rather  than a single character at the end,
2520         because both PCRE and Perl have an optimization that  allows
2521         for  fast  failure  when  a  single  character is used. They
2522         remember the last single character that is  required  for  a
2523         match,  and  fail early if it is not present in the string.)
2524         If the pattern is changed to
2525    
2526           ((?>\D+)|<\d+>)*[!?]
2527    
2528         sequences of non-digits cannot be broken, and  failure  hap-
2529         pens quickly.
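
     A sketch of the fast-failing form in Python follows. As an
     assumption for portability, the atomic group is emulated with a
     capturing lookahead plus a back reference rather than written as
     (?>...), which Python accepts only from version 3.11. The
     unprotected pattern is deliberately not run against the long
     subject, since that is the slow case described above.

```python
import re

subject = "a" * 52   # contains neither "!" nor "?", so no match is possible

# Emulation of ((?>\D+)|<\d+>)*[!?]: group 2 holds the run of non-digits,
# which \2 then consumes in one step that cannot be split up.
atomic_like = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")

failed = atomic_like.search(subject)
ok = atomic_like.search("<12>!")

print(failed)        # None
print(ok.group(0))   # <12>!
```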
2530    
2531    
2532  BACK REFERENCES  BACK REFERENCES
2533    
2534       Outside a character class, a backslash followed by  a  digit       Outside a character class, a backslash followed by  a  digit
2535       greater  than  0  (and  possibly  further  digits) is a back       greater  than  0  (and  possibly  further  digits) is a back
2536       reference to a capturing subpattern  earlier  (i.e.  to  its       reference to a capturing subpattern earlier (that is, to its
2537       left)  in  the  pattern,  provided there have been that many       left)  in  the  pattern,  provided there have been that many
2538       previous capturing left parentheses.       previous capturing left parentheses.
2539    
# Line 1351  BACK REFERENCES Line 2548  BACK REFERENCES
2548    
2549       A back reference matches whatever actually matched the  cap-       A back reference matches whatever actually matched the  cap-
2550       turing subpattern in the current subject string, rather than       turing subpattern in the current subject string, rather than
2551       anything matching the subpattern itself. So the pattern       anything matching the subpattern itself (see "Subpatterns as
2552         subroutines" below for a way of doing that). So the pattern
2553    
2554         (sens|respons)e and \1ibility         (sens|respons)e and \1ibility
2555    
2556       matches "sense and sensibility" and "response and  responsi-       matches "sense and sensibility" and "response and  responsi-
2557       bility",  but  not  "sense  and  responsibility". If caseful       bility",  but  not  "sense  and  responsibility". If caseful
2558       matching is in force at the time of the back reference, then       matching is in force at the time of the back reference,  the
2559       the case of letters is relevant. For example,       case of letters is relevant. For example,
2560    
2561         ((?i)rah)\s+\1         ((?i)rah)\s+\1
2562    
# Line 1366  BACK REFERENCES Line 2564  BACK REFERENCES
2564       though  the  original  capturing subpattern is matched case-       though  the  original  capturing subpattern is matched case-
2565       lessly.       lessly.
2566    
2567         Back references to named subpatterns use the  Python  syntax
2568         (?P=name). We could rewrite the above example as follows:
2569    
2570           (?P<p1>(?i)rah)\s+(?P=p1)
2571    
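     The case-sensitivity of numbered and named back references can be
     sketched in Python, which shares the (?P<name>...) and (?P=name)
     syntax. One assumption: recent Python versions reject (?i) in the
     middle of a pattern, so the scoped form (?i:rah) is used instead.

```python
import re

# The group is matched caselessly; the back reference outside it is not.
numbered = re.compile(r"((?i:rah))\s+\1")
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

results = {
    s: (bool(numbered.fullmatch(s)), bool(named.fullmatch(s)))
    for s in ("rah rah", "RAH RAH", "RAH rah")
}

print(results["rah rah"])   # (True, True)
print(results["RAH RAH"])   # (True, True)
print(results["RAH rah"])   # (False, False)
```
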
2572       There may be more than one back reference to the  same  sub-       There may be more than one back reference to the  same  sub-
2573       pattern.  If  a  subpattern  has not actually been used in a       pattern.  If  a  subpattern  has not actually been used in a
2574       particular match, then any  back  references  to  it  always       particular match, any back references to it always fail. For
2575       fail. For example, the pattern       example, the pattern
2576    
2577         (a|(bc))\2         (a|(bc))\2
2578    
2579       always fails if it starts to match  "a"  rather  than  "bc".       always fails if it starts to match  "a"  rather  than  "bc".
2580       Because  there  may  be up to 99 back references, all digits       Because  there  may  be many capturing parentheses in a pat-
2581       following the backslash are taken as  part  of  a  potential       tern, all digits following the backslash are taken  as  part
2582       back reference number. If the pattern continues with a digit       of a potential back reference number. If the pattern contin-
2583       character, then some delimiter must be used to terminate the       ues with a digit character, some delimiter must be  used  to
2584       back reference. If the PCRE_EXTENDED option is set, this can       terminate the back reference. If the PCRE_EXTENDED option is
2585       be whitespace.  Otherwise an empty comment can be used.       set, this can be whitespace.  Otherwise an empty comment can
2586         be used.
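
     Both points in this paragraph, an unused subpattern and the empty
     comment as a delimiter, can be tried in Python, on the assumption
     that its re module behaves like PCRE for these small patterns.

```python
import re

# Group 2 is set only when the "bc" alternative is taken, so \2 fails
# whenever the match starts with "a".
unset = re.match(r"(a|(bc))\2", "aa")
used = re.match(r"(a|(bc))\2", "bcbc")

# An empty comment (?#) separates the back reference \1 from a literal "1".
delimited = re.fullmatch(r"(a)\1(?#)1", "aa1")

print(unset)                   # None
print(used.group(0))           # bcbc
print(delimited is not None)   # True
```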
2587    
2588       A back reference that occurs inside the parentheses to which       A back reference that occurs inside the parentheses to which
2589       it  refers  fails when the subpattern is first used, so, for       it  refers  fails when the subpattern is first used, so, for
# Line 1389  BACK REFERENCES Line 2593  BACK REFERENCES
2593    
2594         (a|b\1)+         (a|b\1)+
2595    
2596       matches any number of "a"s and also "aba", "ababaa" etc.  At       matches any number of "a"s and also "aba", "ababbaa" etc. At
2597       each iteration of the subpattern, the back reference matches       each iteration of the subpattern, the back reference matches
2598       the character string corresponding to  the  previous  itera-       the character string corresponding to  the  previous  itera-
2599       tion.  In  order  for this to work, the pattern must be such       tion.  In  order  for this to work, the pattern must be such
# Line 1398  BACK REFERENCES Line 2602  BACK REFERENCES
2602       example above, or by a quantifier with a minimum of zero.       example above, or by a quantifier with a minimum of zero.
2603    
2604    
   
2605  ASSERTIONS  ASSERTIONS
2606    
2607       An assertion is  a  test  on  the  characters  following  or       An assertion is  a  test  on  the  characters  following  or
2608       preceding  the current matching point that does not actually       preceding  the current matching point that does not actually
2609       consume any characters. The simple assertions coded  as  \b,       consume any characters. The simple assertions coded  as  \b,
2610       \B,  \A,  \Z,  \z, ^ and $ are described above. More compli-       \B,  \A, \G, \Z, \z, ^ and $ are described above.  More com-
2611       cated assertions are coded as  subpatterns.  There  are  two       plicated assertions are coded as subpatterns. There are  two
2612       kinds:  those that look ahead of the current position in the       kinds:  those that look ahead of the current position in the
2613       subject string, and those that look behind it.       subject string, and those that look behind it.
2614    
2615       An assertion subpattern is matched in the normal way, except       An assertion subpattern is matched in the normal way, except
2616       that  it  does not cause the current matching position to be       that  it  does not cause the current matching position to be
2617       changed. Lookahead assertions start with  (?=  for  positive       changed. Lookahead assertions start with  (?=  for  positive
# Line 1430  ASSERTIONS Line 2635  ASSERTIONS
2635       when  the  next  three  characters  are  "bar". A lookbehind       when  the  next  three  characters  are  "bar". A lookbehind
2636       assertion is needed to achieve this effect.       assertion is needed to achieve this effect.
2637    
2638         If you want to force a matching failure at some point  in  a
2639         pattern,  the  most  convenient  way  to  do it is with (?!)
2640         because an empty string always matches, so an assertion that
2641         requires there not to be an empty string must always fail.
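
     A small Python sketch of (?!) as a forced failure, assuming the
     same semantics as PCRE. The subjects are illustrative.

```python
import re

# Every branch that reaches (?!) fails at that point.
always_fails = re.search(r"a(?!)", "aaa")

# Other alternatives are still free to match.
other_branch = re.search(r"a(?!)|b", "ab")

print(always_fails)            # None
print(other_branch.group(0))   # b
```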
2642    
2643       Lookbehind assertions start with (?<=  for  positive  asser-       Lookbehind assertions start with (?<=  for  positive  asser-
2644       tions and (?<! for negative assertions. For example,       tions and (?<! for negative assertions. For example,
2645    
# Line 1450  ASSERTIONS Line 2660  ASSERTIONS
2660       causes an error at compile time. Branches  that  match  dif-       causes an error at compile time. Branches  that  match  dif-
2661       ferent length strings are permitted only at the top level of       ferent length strings are permitted only at the top level of
2662       a lookbehind assertion. This is an extension  compared  with       a lookbehind assertion. This is an extension  compared  with
2663       Perl  5.005,  which  requires all branches to match the same       Perl  (at  least  for  5.8),  which requires all branches to
2664       length of string. An assertion such as       match the same length of string. An assertion such as
2665    
2666         (?<=ab(c|de))         (?<=ab(c|de))
2667    
# Line 1465  ASSERTIONS Line 2675  ASSERTIONS
2675       alternative,  to  temporarily move the current position back       alternative,  to  temporarily move the current position back
2676       by the fixed width and then  try  to  match.  If  there  are       by the fixed width and then  try  to  match.  If  there  are
2677       insufficient  characters  before  the  current position, the       insufficient  characters  before  the  current position, the
2678       match is deemed to fail.  Lookbehinds  in  conjunction  with       match is deemed to fail.
2679       once-only  subpatterns can be particularly useful for match-  
2680       ing at the ends of strings; an example is given at  the  end       PCRE does not allow the \C escape (which  matches  a  single
2681       of the section on once-only subpatterns.       byte  in  UTF-8  mode)  to  appear in lookbehind assertions,
2682         because it makes it impossible to calculate  the  length  of
2683         the lookbehind.
2684    
2685         Atomic groups can be used  in  conjunction  with  lookbehind
2686         assertions  to  specify efficient matching at the end of the
2687         subject string. Consider a simple pattern such as
2688    
2689           abcd$
2690    
2691         when applied to a long string that does not  match.  Because
2692         matching  proceeds  from  left  to right, PCRE will look for
2693         each "a" in the subject and then see if what follows matches
2694         the rest of the pattern. If the pattern is specified as
2695    
2696           ^.*abcd$
2697    
2698         the initial .* matches the entire string at first, but  when
2699         this  fails  (because  there  is no following "a"), it back-
2700         tracks to match all but the last character, then all but the
2701         last  two  characters,  and so on. Once again the search for
2702         "a" covers the entire string, from right to left, so we  are
2703         no better off. However, if the pattern is written as
2704    
2705           ^(?>.*)(?<=abcd)
2706    
2707         or, equivalently,
2708    
2709           ^.*+(?<=abcd)
2710    
2711         there can be no backtracking for the .* item; it  can  match
2712         only  the entire string. The subsequent lookbehind assertion
2713         does a single test on the last four characters. If it fails,
2714         the match fails immediately. For long strings, this approach
2715         makes a significant difference to the processing time.
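
     The end-of-string technique can be sketched in Python. As an
     assumption for portability, the atomic group is emulated with a
     capturing lookahead followed by a back reference, since Python's
     re module accepts (?>...) and .*+ only from version 3.11. No
     timing figures are claimed here.

```python
import re

# Emulation of ^(?>.*)(?<=abcd): group 1 is the whole line, consumed in
# a single step, after which the lookbehind tests the last four characters.
ends_with_abcd = re.compile(r"^(?=(.*))\1(?<=abcd)")

hit = ends_with_abcd.search("x" * 1000 + "abcd")
miss = ends_with_abcd.search("abcd" + "x" * 1000)

print(hit is not None)   # True
print(miss)              # None
```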
2716    
2717       Several assertions (of any sort) may  occur  in  succession.       Several assertions (of any sort) may  occur  in  succession.
2718       For example,       For example,
# Line 1478  ASSERTIONS Line 2722  ASSERTIONS
2722       matches "foo" preceded by three digits that are  not  "999".       matches "foo" preceded by three digits that are  not  "999".
2723       Notice  that each of the assertions is applied independently       Notice  that each of the assertions is applied independently
2724       at the same point in the subject string. First  there  is  a       at the same point in the subject string. First  there  is  a
2725       check  that  the  previous  three characters are all digits,       check that the previous three characters are all digits, and
2726       then there is a check that the same three characters are not       then there is a check that the same three characters are not
2727       "999".   This  pattern  does not match "foo" preceded by six       "999".   This  pattern  does not match "foo" preceded by six
2728       characters, the first of which are digits and the last three       characters, the first of which are digits and the last three
# Line 1513  ASSERTIONS Line 2757  ASSERTIONS
2757       for positive assertions, because it does not make sense  for       for positive assertions, because it does not make sense  for
2758       negative assertions.       negative assertions.
2759    
      Assertions count towards the maximum  of  200  parenthesized  
      subpatterns.  
   
   
   
 ONCE-ONLY SUBPATTERNS  
      With both maximizing and minimizing repetition,  failure  of  
      what  follows  normally  causes  the repeated item to be re-  
      evaluated to see if a different number of repeats allows the  
      rest  of  the  pattern  to  match. Sometimes it is useful to  
      prevent this, either to change the nature of the  match,  or  
      to  cause  it fail earlier than it otherwise might, when the  
      author of the pattern knows there is no  point  in  carrying  
      on.  
   
      Consider, for example, the pattern \d+foo  when  applied  to  
      the subject line  
   
        123456bar  
   
      After matching all 6 digits and then failing to match "foo",  
      the normal action of the matcher is to try again with only 5  
      digits matching the \d+ item, and then with 4,  and  so  on,  
      before ultimately failing. Once-only subpatterns provide the  
      means for specifying that once a portion of the pattern  has  
      matched,  it  is  not to be re-evaluated in this way, so the  
      matcher would give up immediately on failing to match  "foo"  
      the  first  time.  The  notation  is another kind of special  
      parenthesis, starting with (?> as in this example:  
   
        (?>\d+)bar  
   
      This kind of parenthesis "locks up" the  part of the pattern  
      it  contains once it has matched, and a failure further into  
      the pattern is prevented from backtracking  into  it.  Back-  
      tracking  past  it to previous items, however, works as nor-  
      mal.  
   
      An alternative description is that a subpattern of this type  
      matches  the  string  of  characters that an identical stan-  
      dalone pattern would match, if anchored at the current point  
      in the subject string.  
   
      Once-only subpatterns are not capturing subpatterns.  Simple  
      cases  such as the above example can be thought of as a max-  
      imizing repeat that must  swallow  everything  it  can.  So,  
      while both \d+ and \d+? are prepared to adjust the number of  
      digits they match in order to make the rest of  the  pattern  
      match, (?>\d+) can only match an entire sequence of digits.  
   
      This construction can of course contain arbitrarily  compli-  
      cated subpatterns, and it can be nested.  
   
      Once-only subpatterns can be used in conjunction with  look-  
      behind  assertions  to specify efficient matching at the end  
      of the subject string. Consider a simple pattern such as  
   
        abcd$  
   
      when applied to a long  string  which  does  not  match  it.  
      Because matching proceeds from left to right, PCRE will look  
      for each "a" in the subject and then  see  if  what  follows  
      matches the rest of the pattern. If the pattern is specified  
      as  
   
        ^.*abcd$  
   
      then the initial .* matches the entire string at first,  but  
      when  this  fails,  it  backtracks to match all but the last  
      character, then all but the last two characters, and so  on.  
      Once again the search for "a" covers the entire string, from  
      right to left, so we are no better off. However, if the pat-  
      tern is written as  
   
        ^(?>.*)(?<=abcd)  
   
      then there can be no backtracking for the .*  item;  it  can  
      match  only  the  entire  string.  The subsequent lookbehind  
      assertion does a single test on the last four characters. If  
      it  fails,  the  match  fails immediately. For long strings,  
      this approach makes a significant difference to the process-  
      ing time.  
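The effect of ^(?>.*)(?<=abcd) on a subject without newlines can be pictured as one comparison at the end of the string. The following is a minimal sketch in plain C, not PCRE's implementation; the helper name ends_with is hypothetical and is used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if subject ends with suffix, else 0.
   It mirrors the single lookbehind test that is made once (?>.*) has
   swallowed the whole subject. There is no loop over candidate start
   positions, which is why the atomic form fails fast. */
int ends_with(const char *subject, const char *suffix)
{
    size_t n = strlen(subject);
    size_t m = strlen(suffix);

    if (m > n)
        return 0;                       /* subject is too short to match */
    return memcmp(subject + n - m, suffix, m) == 0;
}
```

Calling ends_with(subject, "abcd") does the same amount of work however long the subject is, apart from finding its end.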
   
   
2760    
2761  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
2762    
2763       It is possible to cause the matching process to obey a  sub-       It is possible to cause the matching process to obey a  sub-
2764       pattern  conditionally  or to choose between two alternative       pattern  conditionally  or to choose between two alternative
2765       subpatterns, depending on the result  of  an  assertion,  or       subpatterns, depending on the result  of  an  assertion,  or
# Line 1613  CONDITIONAL SUBPATTERNS Line 2774  CONDITIONAL SUBPATTERNS
2774       more than two alternatives in the subpattern, a compile-time       more than two alternatives in the subpattern, a compile-time
2775       error occurs.       error occurs.
2776    
2777       There are two kinds of condition. If the  text  between  the       There are three kinds of condition. If the text between  the
2778       parentheses  consists  of  a  sequence  of  digits, then the       parentheses  consists of a sequence of digits, the condition
2779       condition is satisfied if the capturing subpattern  of  that       is satisfied if the capturing subpattern of that number  has
2780       number  has  previously matched. Consider the following pat-       previously  matched.  The  number must be greater than zero.
2781       tern, which contains non-significant white space to make  it       Consider  the  following  pattern,   which   contains   non-
2782       more  readable  (assume  the  PCRE_EXTENDED  option)  and to       significant white space to make it more readable (assume the
2783       divide it into three parts for ease of discussion:       PCRE_EXTENDED option) and to divide it into three parts  for
2784         ease of discussion:
2785    
2786         ( \( )?    [^()]+    (?(1) \) )         ( \( )?    [^()]+    (?(1) \) )
2787    
# Line 1636  CONDITIONAL SUBPATTERNS Line 2798  CONDITIONAL SUBPATTERNS
2798       matches a sequence of non-parentheses,  optionally  enclosed       matches a sequence of non-parentheses,  optionally  enclosed
2799       in parentheses.       in parentheses.
2800    
2801       If the condition is not a sequence of digits, it must be  an       If the condition is the string (R), it  is  satisfied  if  a
2802       assertion.  This  may be a positive or negative lookahead or       recursive  call  to the pattern or subpattern has been made.
2803       lookbehind assertion. Consider this pattern, again  contain-       At "top level", the condition is  false.   This  is  a  PCRE
2804       ing  non-significant  white space, and with the two alterna-       extension.  Recursive  patterns  are  described  in the next
2805       tives on the second line:       section.
2806    
2807         If the condition is not a sequence of digits or (R), it must
2808         be  an assertion.  This may be a positive or negative looka-
2809         head or lookbehind assertion. Consider this  pattern,  again
2810         containing  non-significant  white  space,  and with the two
2811         alternatives on the second line:
2812    
2813         (?(?=[^a-z]*[a-z])         (?(?=[^a-z]*[a-z])
2814         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
# Line 1655  CONDITIONAL SUBPATTERNS Line 2823  CONDITIONAL SUBPATTERNS
2823       letters and dd are digits.       letters and dd are digits.
2824    
2825    
   
2826  COMMENTS  COMMENTS
2827    
2828       The sequence (?# marks the start of a comment which  contin-       The sequence (?# marks the start of a comment which  contin-
2829       ues  up  to the next closing parenthesis. Nested parentheses       ues  up  to the next closing parenthesis. Nested parentheses
2830       are not permitted. The characters that  make  up  a  comment       are not permitted. The characters that  make  up  a  comment
# Line 1667  COMMENTS Line 2835  COMMENTS
2835       ues up to the next newline character in the pattern.       ues up to the next newline character in the pattern.
2836    
2837    
2838    RECURSIVE PATTERNS
2839    
2840  PERFORMANCE       Consider the problem of matching a  string  in  parentheses,
2841       Certain items that may appear in patterns are more efficient       allowing  for  unlimited nested parentheses. Without the use
2842       than  others.  It is more efficient to use a character class       of recursion, the best that can be done is to use a  pattern
2843       like [aeiou] than a set of alternatives such as (a|e|i|o|u).       that  matches  up  to some fixed depth of nesting. It is not
2844       In  general,  the  simplest  construction  that provides the       possible to handle an arbitrary nesting depth. Perl has pro-
2845       required behaviour is usually the  most  efficient.  Jeffrey       vided  an  experimental facility that allows regular expres-
2846       Friedl's  book contains a lot of discussion about optimizing       sions to recurse (amongst other things).  It  does  this  by
2847       regular expressions for efficient performance.       interpolating  Perl  code in the expression at run time, and
2848         the code can refer to the expression itself. A Perl  pattern
2849       When a pattern begins with .* and the PCRE_DOTALL option  is       to solve the parentheses problem can be created like this:
2850       set,  the  pattern  is implicitly anchored by PCRE, since it  
2851       can match only at the start of a subject string. However, if         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2852       PCRE_DOTALL  is not set, PCRE cannot make this optimization,  
2853       because the . metacharacter does not then match  a  newline,       The (?p{...}) item interpolates Perl code at run  time,  and
2854       and if the subject string contains newlines, the pattern may       in  this  case refers recursively to the pattern in which it
2855       match from the character immediately following one  of  them       appears. Obviously, PCRE cannot support the interpolation of
2856       instead of from the very start. For example, the pattern       Perl  code.  Instead,  it  supports  some special syntax for
2857         recursion of the entire pattern,  and  also  for  individual
2858         subpattern recursion.
2859    
2860         The special item that consists of (? followed  by  a  number
2861         greater  than  zero and a closing parenthesis is a recursive
2862         call of the subpattern of the given number, provided that it
2863         occurs inside that subpattern. (If not, it is a "subroutine"
2864         call, which is described in the next section.)  The  special
2865         item  (?R) is a recursive call of the entire regular expres-
2866         sion.
2867    
2868         For example, this PCRE pattern solves the nested parentheses
2869         problem  (assume  the  PCRE_EXTENDED  option  is set so that
2870         white space is ignored):
2871    
2872           \( ( (?>[^()]+) | (?R) )* \)
2873    
2874         First it matches an opening parenthesis. Then it matches any
2875         number  of substrings which can either be a sequence of non-
2876         parentheses, or a recursive  match  of  the  pattern  itself
2877         (that  is  a  correctly  parenthesized  substring).  Finally
2878         there is a closing parenthesis.
2879    
2880         If this were part of a larger pattern, you would not want to
2881         recurse the entire pattern, so instead you could use this:
2882    
2883           ( \( ( (?>[^()]+) | (?1) )* \) )
2884    
2885         We have put the pattern into  parentheses,  and  caused  the
2886         recursion  to refer to them instead of the whole pattern. In
2887         a larger pattern, keeping track of parenthesis  numbers  can
2888         be   tricky.   It  may  be  more  convenient  to  use  named
2889         parentheses instead. For this, PCRE uses (?P>name), which is
2890         an  extension  to the Python syntax that PCRE uses for named
2891         parentheses (Perl does not provide  named  parentheses).  We
2892         could rewrite the above example as follows:
2893    
2894           (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
2895    
2896         This particular example pattern  contains  nested  unlimited
2897         repeats,  and  so  the  use  of atomic grouping for matching
2898         strings of non-parentheses is important  when  applying  the
2899         pattern to strings that do not match. For example, when this
2900         pattern is applied to
2901    
2902           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2903    
2904         it yields "no match" quickly. However, if atomic grouping is
2905         not used, the match runs for a very long time indeed because
2906         there are so many different ways the +  and  *  repeats  can
2907         carve  up  the  subject,  and  all  have to be tested before
2908         failure can be reported.
2909         At the end of a match, the values set for any capturing sub-
2910         patterns are those from the outermost level of the recursion
2911         at which the subpattern value is set.  If you want to obtain
2912         intermediate  values,  a  callout  function can be used (see
2913         below and the pcrecallout  documentation).  If  the  pattern
2914         above is matched against
2915    
2916           (ab(cd)ef)
2917    
2918         the value for the capturing parentheses is  "ef",  which  is
2919         the  last  value  taken  on  at the top level. If additional
2920         parentheses are added, giving
2921    
2922           \( ( ( (?>[^()]+) | (?R) )* ) \)
2923              ^                        ^
2924              ^                        ^
2925    
2926         the string they capture is "ab(cd)ef", the contents  of  the
2927         top  level  parentheses. If there are more than 15 capturing
2928         parentheses in a pattern, PCRE has to obtain extra memory to
2929         store  data  during  a  recursion,  which  it  does by using
2930         pcre_malloc, freeing it  via  pcre_free  afterwards.  If  no
2931         memory   can   be   obtained,   the  match  fails  with  the
2932         PCRE_ERROR_NOMEMORY error.
2933    
2934         Do not confuse the (?R) item with the condition  (R),  which
2935         tests  for  recursion.  Consider this pattern, which matches
2936         text in angle brackets, allowing for arbitrary nesting. Only
2937         digits are allowed in nested brackets (that is, when recurs-
2938         ing), whereas any characters  are  permitted  at  the  outer
2939         level.
2940    
2941           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
2942    
2943         In this pattern, (?(R) is the start of a conditional subpat-
2944         tern,  with two different alternatives for the recursive and
2945         non-recursive cases. The (?R) item is the  actual  recursive
2946         call.
2947    
2948    
2949    SUBPATTERNS AS SUBROUTINES
2950    
2951         If the syntax for a recursive subpattern  reference  (either
2952         by  number  or  by  name) is used outside the parentheses to
2953         which it refers, it operates like a subroutine in a program-
2954         ming  language. An earlier example pointed out that the pat-
2955         tern
2956    
2957         (.*) second         (sens|respons)e and \1ibility
2958    
2959         matches "sense and sensibility" and "response and  responsi-
2960         bility",  but not "sense and responsibility". If instead the
2961         pattern
2962    
2963           (sens|respons)e and (?1)ibility
2964    
2965         is used, it does match "sense and responsibility" as well as
2966         the other two strings. Such references must, however, follow
2967         the subpattern to which they refer.
2968    
2969    
2970    CALLOUTS
2971    
2972         Perl has a  feature  whereby  using  the  sequence  (?{...})
2973         causes  arbitrary  Perl  code  to be obeyed in the middle of
2974         matching a  regular  expression.  This  makes  it  possible,
2975         amongst  other  things, to extract different substrings that
2976         match the same pair of parentheses when there is  a  repeti-
2977         tion.
2978    
2979         PCRE provides a similar feature, but  of  course  it  cannot
2980         obey  arbitrary  Perl code. The feature is called "callout".
2981         The caller of PCRE provides an external function by  putting
2982         its  entry  point  in  the global variable pcre_callout.  By
2983         default, this variable contains  NULL,  which  disables  all
2984         calling out.
2985    
2986         Within a regular expression, (?C) indicates  the  points  at
2987         which  the external function is to be called. If you want to
2988         identify different callout points, you can put a number less
2989         than 256 after the letter C. The default value is zero.  For
2990         example, this pattern has two callout points:
2991    
2992           (?C1)9abc(?C2)def
2993    
2994         During matching, when PCRE  reaches  a  callout  point  (and
2995         pcre_callout is set), the external function is called. It is
2996         provided with the number of the  callout,  and,  optionally,
2997         one  item  of  data  originally  supplied  by  the caller of
2998         pcre_exec(). The callout  function  may  cause  matching  to
2999         backtrack,  or to fail altogether. A complete description of
3000         the interface to the callout function is given in the  pcre-
3001         callout documentation.
3002    
3003    Last updated: 03 February 2003
3004    Copyright (c) 1997-2003 University of Cambridge.
3005    -----------------------------------------------------------------------------
3006    
3007    NAME
3008         PCRE - Perl-compatible regular expressions
3009    
3010    
3011    PCRE PERFORMANCE
3012    
3013         Certain items that may appear in regular expression patterns
3014         are  more efficient than others. It is more efficient to use
3015         a character class like [aeiou] than a  set  of  alternatives
3016         such  as  (a|e|i|o|u). In general, the simplest construction
3017         that provides the required behaviour  is  usually  the  most
3018         efficient.  Jeffrey  Friedl's book contains a lot of discus-
3019         sion about optimizing regular expressions for efficient per-
3020         formance.
3021    
3022         When a pattern begins with .*  not  in  parentheses,  or  in
3023         parentheses that are not the subject of a backreference, and
3024         the PCRE_DOTALL option is set,  the  pattern  is  implicitly
3025         anchored  by PCRE, since it can match only at the start of a
3026         subject string. However, if PCRE_DOTALL  is  not  set,  PCRE
3027         cannot  make  this optimization, because the . metacharacter
3028         does not then match a newline, and  if  the  subject  string
3029         contains  newlines, the pattern may match from the character
3030         immediately following one of them instead of from  the  very
3031         start. For example, the pattern
3032    
3033           .*second
3034    
3035       matches the subject "first\nand second" (where \n stands for       matches the subject "first\nand second" (where \n stands for
3036       a newline character) with the first captured substring being       a newline character), with the match starting at the seventh
3037       "and". In order to do this, PCRE  has  to  retry  the  match       character. In order to do this, PCRE has to retry the  match
3038       starting after every newline in the subject.       starting after every newline in the subject.
3039    
3040       If you are using such a pattern with subject strings that do       If you are using such a pattern with subject strings that do
# Line 1713  PERFORMANCE Line 3057  PERFORMANCE
3057       that  the entire match is going to fail, PCRE has in princi-       that  the entire match is going to fail, PCRE has in princi-
3058       ple to try every possible variation, and this  can  take  an       ple to try every possible variation, and this  can  take  an
3059       extremely long time.       extremely long time.
   
3060       An optimization catches some of the more simple  cases  such       An optimization catches some of the more simple  cases  such
3061       as       as
3062    
# Line 1733  PERFORMANCE Line 3076  PERFORMANCE
3076       whereas the latter takes an appreciable  time  with  strings       whereas the latter takes an appreciable  time  with  strings
3077       longer than about 20 characters.       longer than about 20 characters.
3078    
3079    Last updated: 03 February 2003
3080    Copyright (c) 1997-2003 University of Cambridge.
3081    -----------------------------------------------------------------------------
3082    
3083    NAME
3084         PCRE - Perl-compatible regular expressions.
3085    
3086    
3087    SYNOPSIS OF POSIX API
3088         #include <pcreposix.h>
3089    
3090         int regcomp(regex_t *preg, const char *pattern,
3091              int cflags);
3092    
3093         int regexec(regex_t *preg, const char *string,
3094              size_t nmatch, regmatch_t pmatch[], int eflags);
3095    
3096         size_t regerror(int errcode, const regex_t *preg,
3097              char *errbuf, size_t errbuf_size);
3098    
3099         void regfree(regex_t *preg);
3100    
3101    
3102    DESCRIPTION
3103    
3104         This set of functions provides a POSIX-style API to the PCRE
3105         regular  expression  package.  See the pcreapi documentation
3106         for a description of the native API,  which  contains  addi-
3107         tional functionality.
3108    
3109         The functions described here are just wrapper functions that
3110         ultimately  call  the  PCRE native API. Their prototypes are
3111         defined in the pcreposix.h header file, and on Unix  systems
3112         the library itself is called pcreposix.a, so can be accessed
3113         by adding -lpcreposix to the command for linking an applica-
3114         tion  which  uses them. Because the POSIX functions call the
3115         native ones, it is also necessary to add -lpcre.
3116    
3117         I have implemented only those option bits that can  be  rea-
3118         sonably  mapped  to  PCRE  native  options. In addition, the
3119         options REG_EXTENDED and  REG_NOSUB  are  defined  with  the
3120         value zero. They have no effect, but since programs that are
3121         written to the POSIX interface often use them, this makes it
3122         easier to slot in PCRE as a replacement library. Other POSIX
3123         options are not even defined.
3124    
3125         When PCRE is called via these functions, it is only the  API
3126         that is POSIX-like in style. The syntax and semantics of the
3127         regular expressions themselves are still those of Perl, sub-
3128         ject  to  the  setting of various PCRE options, as described
3129         below. "POSIX-like in style" means that the API approximates
3130         to  the  POSIX definition; it is not fully POSIX-compatible,
3131         and in multi-byte encoding domains it is probably even  less
3132         compatible.
3133    
3134         The header for these functions is supplied as pcreposix.h to
3135         avoid  any  potential  clash  with other POSIX libraries. It
3136         can, of course, be renamed or aliased as regex.h,  which  is
3137         the "correct" name. It provides two structure types, regex_t
3138         for compiled internal forms, and  regmatch_t  for  returning
3139         captured  substrings.  It  also defines some constants whose
3140         names start with "REG_"; these are used for setting  options
3141         and identifying error codes.
3142    
3143    
3144    COMPILING A PATTERN
3145    
3146         The function regcomp() is called to compile a  pattern  into
3147         an  internal form. The pattern is a C string terminated by a
3148         binary zero, and is passed in the argument pattern. The preg
3149         argument  is  a pointer to a regex_t structure which is used
3150         as a base for storing information about the compiled expres-
3151         sion.
3152    
3153         The argument cflags is either zero, or contains one or  more
3154         of the bits defined by the following macros:
3155    
3156           REG_ICASE
3157    
3158         The PCRE_CASELESS option  is  set  when  the  expression  is
3159         passed for compilation to the native function.
3160    
3161           REG_NEWLINE
3162    
3163         The PCRE_MULTILINE option is  set  when  the  expression  is
3164         passed  for  compilation  to  the native function. Note that
3165         this  does  not  mimic  the  defined  POSIX  behaviour   for
3166         REG_NEWLINE (see the following section).
3167    
3168         In the absence of these flags, no options are passed to  the
3169         native  function.  This means that the regex is compiled with
3170         PCRE default semantics. In particular, the  way  it  handles
3171         newline  characters  in  the subject string is the Perl way,
3172         not the POSIX way. Note that setting PCRE_MULTILINE has only
3173         some  of  the effects specified for REG_NEWLINE. It does not
3174         affect the way newlines are matched by . (they aren't) or by
3175         a negative class such as [^a] (they are).
3176    
3177         The yield of regcomp() is zero on success, and non-zero oth-
3178         erwise.  The preg structure is filled in on success, and one
3179         member of the structure  is  public:  re_nsub  contains  the
3180         number  of  capturing subpatterns in the regular expression.
3181         Various error codes are defined in the header file.
3182    
3183    
3184    MATCHING NEWLINE CHARACTERS
3185    
3186         This area is not simple, because POSIX and  Perl  take  dif-
3187         ferent  views  of things.  It is not possible to get PCRE to
3188         obey POSIX semantics, but then PCRE was never intended to be
3189         a POSIX engine. The following table lists the different pos-
3190         sibilities for matching newline characters in PCRE:
3191    
3192                                   Default   Change with
3193    
3194           . matches newline          no     PCRE_DOTALL
3195           newline matches [^a]       yes    not changeable
3196           $ matches \n at end        yes    PCRE_DOLLARENDONLY
3197           $ matches \n in middle     no     PCRE_MULTILINE
3198           ^ matches \n in middle     no     PCRE_MULTILINE
3199    
3200         This is the equivalent table for POSIX:
3201    
3202                                   Default   Change with
3203    
3204           . matches newline          yes      REG_NEWLINE
3205           newline matches [^a]       yes      REG_NEWLINE
3206           $ matches \n at end        no       REG_NEWLINE
3207           $ matches \n in middle     no       REG_NEWLINE
3208           ^ matches \n in middle     no       REG_NEWLINE
3209    
3210         PCRE's behaviour is the same as Perl's, except that there is
3211         no  equivalent  for PCRE_DOLLARENDONLY in Perl. In both PCRE
3212         and Perl, there is no way  to  stop  newline  from  matching
3213         [^a].
3214    
3215         The default POSIX newline handling can be obtained  by  set-
3216         ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
3217         to make PCRE behave exactly as for the REG_NEWLINE action.
3218    
3219    
3220    MATCHING A PATTERN
3221    
3222         The function regexec() is called  to  match  a  pre-compiled
3223         pattern  preg against a given string, which is terminated by
3224         a zero byte, subject to the options in eflags. These can be:
3225    
3226           REG_NOTBOL
3227    
3228         The PCRE_NOTBOL option is set when  calling  the  underlying
3229         PCRE matching function.
3230    
3231           REG_NOTEOL
3232    
3233         The PCRE_NOTEOL option is set when  calling  the  underlying
3234         PCRE matching function.
3235    
3236         The portion of the string that was  matched,  and  also  any
3237         captured  substrings,  are returned via the pmatch argument,
3238         which points to  an  array  of  nmatch  structures  of  type
3239         regmatch_t,  containing  the  members rm_so and rm_eo. These
3240         contain the offset to the first character of each  substring
3241         and  the offset to the first character after the end of each
3242         substring, respectively.  The  0th  element  of  the  vector
3243         relates  to  the  entire portion of string that was matched;
3244         subsequent elements relate to the capturing  subpatterns  of
3245         the  regular  expression.  Unused  entries in the array have
3246         both structure members set to -1.
3247    
3248         A successful match yields a zero return; various error codes
3249         are  defined in the header file, of which REG_NOMATCH is the
3250         "expected" failure code.
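The call sequence for a single match can be sketched as below. As before, this assumes the system <regex.h> in place of pcreposix.h, and the helper name first_match is hypothetical. Only element 0 of the pmatch vector is requested, so the offsets describe the entire matched portion of the string.

```c
#include <stddef.h>
#include <regex.h>  /* assumption: system header; pcreposix.h has the same API */

/* Hypothetical helper: find the first match of pattern in subject.
   On success returns 0 and stores the start offset in *so and the
   offset just past the end in *eo. Returns REG_NOMATCH if there is no
   match, or another non-zero code on error. */
int first_match(const char *pattern, const char *subject, long *so, long *eo)
{
    regex_t preg;
    regmatch_t pmatch[1];
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;

    rc = regexec(&preg, subject, 1, pmatch, 0);
    if (rc == 0) {
        *so = (long)pmatch[0].rm_so;
        *eo = (long)pmatch[0].rm_eo;
    }
    regfree(&preg);
    return rc;
}
```

For the pattern cat|dog and the subject "the cat sat on the mat", the offsets are 4 and 7.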
3251    
3252    
3253    ERROR MESSAGES
3254    
3255         The regerror()  function  maps  a  non-zero  errorcode  from
3256         either  regcomp()  or  regexec()  to a printable message. If
3257         preg is not NULL, the error should have arisen from the  use
3258         of  that structure. A message terminated by a binary zero is
3259         placed in errbuf. The length of the message,  including  the
3260         zero,  is  limited to errbuf_size. The yield of the function
3261         is the size of buffer needed to hold the whole message.
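Because regerror() yields the size of buffer needed, it can be called twice: once to size the buffer and once to fill it. The sketch below assumes the system <regex.h> in place of pcreposix.h; the helper name compile_error_message is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <regex.h>  /* assumption: system header; pcreposix.h has the same API */

/* Hypothetical helper: returns a malloc'd message describing why
   pattern failed to compile, or NULL if it compiled successfully (or
   if memory could not be obtained). The caller frees the result. */
char *compile_error_message(const char *pattern)
{
    regex_t preg;
    size_t needed;
    char *buf;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&preg);
        return NULL;
    }

    /* First call: a zero-sized buffer just reports the size needed,
       including the terminating zero. */
    needed = regerror(rc, &preg, NULL, 0);
    buf = malloc(needed);
    if (buf != NULL)
        regerror(rc, &preg, buf, needed);   /* second call fills it */
    return buf;
}
```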
3262    
3263    
3264    STORAGE
3265    
3266         Compiling a regular expression causes memory to be allocated
3267         and  associated  with  the preg structure. The function reg-
3268         free() frees all such memory, after which preg may no longer
3269         be used as a compiled expression.
3270    
3271    
3272  AUTHOR  AUTHOR
3273    
3274       Philip Hazel <ph10@cam.ac.uk>       Philip Hazel <ph10@cam.ac.uk>
3275       University Computing Service,       University Computing Service,
      New Museums Site,  
3276       Cambridge CB2 3QG, England.       Cambridge CB2 3QG, England.
      Phone: +44 1223 334714  
3277    
3278       Last updated: 29 July 1999  Last updated: 03 February 2003
3279       Copyright (c) 1997-1999 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
3280    -----------------------------------------------------------------------------
3281    
3282    NAME
3283         PCRE - Perl-compatible regular expressions
3284    
3285    
3286    PCRE SAMPLE PROGRAM
3287    
3288         A simple, complete demonstration program, to get you started
3289         with  using  PCRE, is supplied in the file pcredemo.c in the
3290         PCRE distribution.
3291    
3292         The program compiles the  regular  expression  that  is  its
3293         first argument, and matches it against the subject string in
3294         its second argument. No PCRE options are  set,  and  default
3295         character tables are used. If matching succeeds, the program
3296         outputs the portion of the subject  that  matched,  together
3297         with the contents of any captured substrings.
3298    
3299         If the -g option is given on the command line,  the  program
3300         then  goes on to check for further matches of the same regu-
3301         lar expression in the same subject string. The  logic  is  a
3302         little  bit tricky because of the possibility of matching an
3303         empty string. Comments in the code explain what is going on.
3304    
3305         On a Unix system that has PCRE installed in /usr/local,  you
3306         can  compile  the demonstration program using a command like
3307         this:
3308    
3309           gcc -o pcredemo pcredemo.c -I/usr/local/include \
3310               -L/usr/local/lib -lpcre
3311    
3312         Then you can run simple tests like this:
3313    
3314           ./pcredemo 'cat|dog' 'the cat sat on the mat'
3315           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3316    
3317         Note that there is a much more comprehensive  test  program,
3318         called  pcretest,  which  supports  many more facilities for
3319         testing  regular  expressions  and  the  PCRE  library.  The
3320         pcredemo program is provided as a simple coding example.
3321    
3322         On some operating systems (e.g.  Solaris)  you  may  get  an
3323         error like this when you try to run pcredemo:
3324    
3325           ld.so.1: a.out: fatal: libpcre.so.0: open failed: No  such
3326         file or directory
3327    
3328         This is caused by the way shared library  support  works  on
3329         those systems. You need to add
3330    
3331           -R/usr/local/lib
3332    
3333         to the compile command to get round this problem.
3334    
3335    Last updated: 28 January 2003
3336    Copyright (c) 1997-2003 University of Cambridge.
3337    -----------------------------------------------------------------------------
3338    
