Diff of /code/trunk/doc/pcre.txt

revision 47 by nigel, Sat Feb 24 21:39:29 2007 UTC
revision 69 by nigel, Sat Feb 24 21:40:18 2007 UTC
# Line 1  Line 1 
1    This file contains a concatenation of the PCRE man pages, converted to plain
2    text format for ease of searching with a text editor, or for use on systems
3    that do not have a man page processor. The small individual files that give
4    synopses of each function in the library have not been included. There are
5    separate text files for the pcregrep and pcretest commands.
6    -----------------------------------------------------------------------------
7    
8    NAME
9         PCRE - Perl-compatible regular expressions
10    
11    
12    DESCRIPTION
13    
14         The PCRE library is a set of functions that implement  regu-
15         lar  expression  pattern  matching using the same syntax and
16         semantics as Perl, with just a few differences. The  current
17         implementation  of  PCRE  (release 4.x) corresponds approxi-
18         mately with Perl 5.8, including support  for  UTF-8  encoded
19         strings.    However,  this  support  has  to  be  explicitly
20         enabled; it is not the default.
21    
22         PCRE is written in C and released as a C library. However, a
23         number  of  people  have  written wrappers and interfaces of
24         various kinds. A C++ class is included  in  these  contribu-
25         tions,  which  can  be found in the Contrib directory at the
26         primary FTP site, which is:
27    
28         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
29    
30         Details of exactly which Perl  regular  expression  features
31         are  and  are  not  supported  by PCRE are given in separate
32         documents. See the pcrepattern and pcrecompat pages.
33    
34         Some features of PCRE can be included, excluded, or  changed
35         when  the library is built. The pcre_config() function makes
36         it possible for a client  to  discover  which  features  are
37         available.  Documentation  about  building  PCRE for various
38         operating systems can be found in the  README  file  in  the
39         source distribution.
40    
41    
42    USER DOCUMENTATION
43    
44         The user documentation for PCRE has been  split  up  into  a
45         number  of  different sections. In the "man" format, each of
46         these is a separate "man page". In the HTML format, each  is
47         a  separate  page,  linked from the index page. In the plain
48         text format, all the sections are concatenated, for ease  of
49         searching. The sections are as follows:
50    
51           pcre              this document
52           pcreapi           details of PCRE's native API
53           pcrebuild         options for building PCRE
54           pcrecallout       details of the callout feature
55           pcrecompat        discussion of Perl compatibility
56           pcregrep          description of the pcregrep command
57           pcrepattern       syntax and semantics of supported
58                               regular expressions
59           pcreperform       discussion of performance issues
60           pcreposix         the POSIX-compatible API
61           pcresample        discussion of the sample program
62           pcretest          the pcretest testing command
63    
64         In addition, in the "man" and HTML formats, there is a short
65         page  for  each  library function, listing its arguments and
66         results.
67    
68    
69    LIMITATIONS
70    
71         There are some size limitations in PCRE but it is hoped that
72         they will never in practice be relevant.
73    
74         The maximum length of a  compiled  pattern  is  65539  (sic)
75         bytes  if PCRE is compiled with the default internal linkage
76         size of 2. If you want to process regular  expressions  that
77         are  truly  enormous,  you can compile PCRE with an internal
78         linkage size of 3 or 4 (see the README file  in  the  source
79         distribution  and  the pcrebuild documentation for details).
80         In these cases the limit is substantially larger.   However,
81         the speed of execution will be slower.
82    
83         All values in repeating quantifiers must be less than 65536.
84         The maximum number of capturing subpatterns is 65535.
85    
86         There is no limit to the  number  of  non-capturing  subpat-
87         terns,  but  the  maximum  depth  of nesting of all kinds of
88         parenthesized subpattern, including  capturing  subpatterns,
89         assertions, and other types of subpattern, is 200.
90    
91         The maximum length of a subject string is the largest  posi-
92         tive number that an integer variable can hold. However, PCRE
93         uses recursion to handle subpatterns and indefinite  repeti-
94         tion.  This  means  that the available stack space may limit
95         the size of a subject string that can be processed  by  cer-
96         tain patterns.
97    
98    
99    UTF-8 SUPPORT
100    
101         Starting at release 3.3, PCRE has had some support for char-
102         acter  strings  encoded in the UTF-8 format. For release 4.0
103         this has been greatly extended to cover most common require-
104         ments.
105    
106         In order to process UTF-8 strings, you  must  build  PCRE  to
107         include  UTF-8  support  in  the code, and, in addition, you
108         must call pcre_compile() with  the  PCRE_UTF8  option  flag.
109         When  you  do this, both the pattern and any subject strings
110         that are matched against it are  treated  as  UTF-8  strings
111         instead of just strings of bytes.
112    
113         If you compile PCRE with UTF-8 support, but do not use it at
114         run  time,  the  library will be a bit bigger, but the addi-
115         tional run time overhead is limited to testing the PCRE_UTF8
116         flag in several places, so should not be very large.
117    
118         The following comments apply when PCRE is running  in  UTF-8
119         mode:
120    
121         1. PCRE assumes that the strings it is given  contain  valid
122         UTF-8  codes. It does not diagnose invalid UTF-8 strings. If
123         you pass invalid UTF-8 strings  to  PCRE,  the  results  are
124         undefined.
125    
126         2. In a pattern, the escape sequence \x{...}, where the con-
127         tents  of  the  braces is a string of hexadecimal digits, is
128         interpreted as a UTF-8 character whose code  number  is  the
129         given  hexadecimal  number, for example: \x{1234}. If a non-
130         hexadecimal digit appears between the braces,  the  item  is
131         not  recognized.  This escape sequence can be used either as
132         a literal, or within a character class.
133    
134         3. The original hexadecimal escape sequence, \xhh, matches a
135         two-byte UTF-8 character if the value is greater than 127.
136    
137         4. Repeat quantifiers apply to  complete  UTF-8  characters,
138         not to individual bytes, for example: \x{100}{3}.
139    
140         5. The dot metacharacter matches one UTF-8 character instead
141         of a single byte.
142    
143         6. The escape sequence \C can be used to match a single byte
144         in UTF-8 mode, but its use can lead to some strange effects.
145    
146         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W
147         correctly test characters of any code value, but the charac-
148         ters that PCRE recognizes as digits, spaces, or word charac-
149         ters  remain  the  same  set as before, all with values less
150         than 256.
151    
152         8. Case-insensitive  matching  applies  only  to  characters
153         whose  values  are  less than 256. PCRE does not support the
154         notion of "case" for higher-valued characters.
155    
156         9. PCRE does not support the use of Unicode tables and  pro-
157         perties or the Perl escapes \p, \P, and \X.
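
As a rough illustration of items 2 to 4 above, the following sketch encodes a character code number into the UTF-8 bytes that stand for it in a subject string. The function name utf8_encode is hypothetical and is not part of the PCRE API; the sketch assumes code numbers no greater than 0x10FFFF and does not reject surrogate values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode the code number cp into
   buf (which must hold at least 4 bytes). Returns the number of bytes
   written, or 0 if cp is outside the range this sketch handles. */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp < 0x80) {                       /* one byte: 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                   /* four bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* not handled by this sketch */
}
```

Under these assumptions \x{100} corresponds to the two bytes 0xC4 0x80, so \x{100}{3} matches six bytes of subject, and \x{1234} corresponds to the three bytes 0xE1 0x88 0xB4.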
158    
159    
160    AUTHOR
161    
162         Philip Hazel <ph10@cam.ac.uk>
163         University Computing Service,
164         Cambridge CB2 3QG, England.
165         Phone: +44 1223 334714
166    
167    Last updated: 04 February 2003
168    Copyright (c) 1997-2003 University of Cambridge.
169    -----------------------------------------------------------------------------
170    
171    NAME
172         PCRE - Perl-compatible regular expressions
173    
174    
175    PCRE BUILD-TIME OPTIONS
176    
177         This document describes the optional features of  PCRE  that
178         can  be  selected when the library is compiled. They are all
179         selected, or deselected, by providing options to the config-
180         ure  script  which  is run before the make command. The com-
181         plete list of options  for  configure  (which  includes  the
182         standard  ones  such  as  the  selection of the installation
183         directory) can be obtained by running
184    
185           ./configure --help
186    
187         The following sections describe certain options whose  names
188         begin  with  --enable  or  --disable. These settings specify
189         changes to the defaults for the configure  command.  Because
190         of  the  way  that  configure  works, --enable and --disable
191         always come in pairs, so  the  complementary  option  always
192         exists  as  well, but as it specifies the default, it is not
193         described.
194    
195    
196    UTF-8 SUPPORT
197    
198         To build PCRE with support for UTF-8 character strings, add
199    
200           --enable-utf8
201    
202         to the configure command. Of itself, this does not make PCRE
203         treat  strings as UTF-8. As well as compiling PCRE with this
204         option, you also have to set  the  PCRE_UTF8  option  when
205         you call the pcre_compile() function.
206    
207    
208    CODE VALUE OF NEWLINE
209    
210         By default, PCRE treats character 10 (linefeed) as the  new-
211         line  character.  This  is  the  normal newline character on
212         Unix-like systems. You can compile PCRE to use character  13
213         (carriage return) instead by adding
214    
215           --enable-newline-is-cr
216    
217         to the configure command. For completeness there is  also  a
218         --enable-newline-is-lf  option,  which  explicitly specifies
219         linefeed as the newline character.
220    
221    
222    BUILDING SHARED AND STATIC LIBRARIES
223    
224         The PCRE building process uses libtool to build both  shared
225         and  static  Unix libraries by default. You can suppress one
226         of these by adding one of
227    
228           --disable-shared
229           --disable-static
230    
231         to the configure command, as required.
232    
233    
234    POSIX MALLOC USAGE
235    
236         When PCRE is called through the  POSIX  interface  (see  the
237         pcreposix  documentation),  additional  working  storage  is
238         required for holding the pointers  to  capturing  substrings
239         because  PCRE requires three integers per substring, whereas
240         the POSIX interface provides only  two.  If  the  number  of
241         expected  substrings  is  small,  the  wrapper function uses
242         space on the stack, because this is faster than  using  mal-
243         loc()  for  each call. The default threshold above which the
244         stack is no longer used is 10; it can be changed by adding a
245         setting such as
246    
247           --with-posix-malloc-threshold=20
248    
249         to the configure command.
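
The stack-or-malloc() choice described above might look roughly like the sketch below. This is not PCRE's wrapper code: the function run_with_ovector and the macro POSIX_MALLOC_THRESHOLD are hypothetical names, and the matching call itself is replaced by a placeholder loop.

```c
#include <assert.h>
#include <stdlib.h>

#define POSIX_MALLOC_THRESHOLD 10   /* the documented default */

/* Hypothetical sketch. Obtains room for three integers per substring,
   on the stack when nmatch is at or below the threshold and from
   malloc() otherwise. Returns 1 if the heap was used, 0 if the stack
   sufficed, and -1 if malloc() failed. */
int run_with_ovector(size_t nmatch)
{
    int small[POSIX_MALLOC_THRESHOLD * 3];
    int *ovector = small;
    int used_heap = 0;
    size_t i;

    /* Guard the size calculation below against overflow. */
    assert(nmatch <= (size_t)-1 / (3 * sizeof(int)));

    if (nmatch > POSIX_MALLOC_THRESHOLD) {
        ovector = malloc(nmatch * 3 * sizeof *ovector);
        if (ovector == NULL) return -1;
        used_heap = 1;
    }

    /* Placeholder for the real work: a wrapper would pass ovector to
       the matching function here. */
    for (i = 0; i < nmatch * 3; i++) ovector[i] = -1;

    if (used_heap) free(ovector);
    return used_heap;
}
```

Raising the threshold trades a larger stack frame on every call for fewer calls to malloc().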
250    
251    
252    LIMITING PCRE RESOURCE USAGE
253    
254         Internally, PCRE has a  function  called  match()  which  it
255         calls  repeatedly  (possibly  recursively) when performing a
256         matching operation. By limiting the  number  of  times  this
257         function  may  be  called,  a  limit  can  be  placed on the
258         resources used by a single call to  pcre_exec().  The  limit
259         can  be  changed  at  run  time, as described in the pcreapi
260         documentation. The default is 10 million, but  this  can  be
261         changed by adding a setting such as
262    
263           --with-match-limit=500000
264    
265         to the configure command.
266    
267    
268    HANDLING VERY LARGE PATTERNS
269    
270         Within a compiled pattern, offset values are used  to  point
271         from  one  part  to  another  (for  example, from an opening
272         parenthesis to an  alternation  metacharacter).  By  default
273         two-byte  values  are  used  for these offsets, leading to a
274         maximum size for a compiled pattern of around 64K.  This  is
275         sufficient  to  handle  all  but the most gigantic patterns.
276         Nevertheless, some people do want to process  enormous  pat-
277         terns,  so  it is possible to compile PCRE to use three-byte
278         or four-byte offsets by adding a setting such as
279    
280           --with-link-size=3
281    
282         to the configure command. The value given must be 2,  3,  or
283         4.  Using  longer  offsets  slows down the operation of PCRE
284         because it has to load additional bytes when handling them.
285    
286         If you build PCRE with an increased link size, test  2  (and
287         test 5 if you are using UTF-8) will fail. Part of the output
288         of these tests is a representation of the compiled  pattern,
289         and this changes with the link size.
290    
291    Last updated: 21 January 2003
292    Copyright (c) 1997-2003 University of Cambridge.
293    -----------------------------------------------------------------------------
294    
295  NAME  NAME
296       pcre - Perl-compatible regular expressions.       PCRE - Perl-compatible regular expressions
297    
298    
299    SYNOPSIS OF PCRE API
300    
 SYNOPSIS  
301       #include <pcre.h>       #include <pcre.h>
302    
303       pcre *pcre_compile(const char *pattern, int options,       pcre *pcre_compile(const char *pattern, int options,
# Line 17  SYNOPSIS Line 311  SYNOPSIS
311            const char *subject, int length, int startoffset,            const char *subject, int length, int startoffset,
312            int options, int *ovector, int ovecsize);            int options, int *ovector, int ovecsize);
313    
314         int pcre_copy_named_substring(const pcre *code,
315              const char *subject, int *ovector,
316              int stringcount, const char *stringname,
317              char *buffer, int buffersize);
318    
319       int pcre_copy_substring(const char *subject, int *ovector,       int pcre_copy_substring(const char *subject, int *ovector,
320            int stringcount, int stringnumber, char *buffer,            int stringcount, int stringnumber, char *buffer,
321            int buffersize);            int buffersize);
322    
323         int pcre_get_named_substring(const pcre *code,
324              const char *subject, int *ovector,
325              int stringcount, const char *stringname,
326              const char **stringptr);
327    
328         int pcre_get_stringnumber(const pcre *code,
329              const char *name);
330    
331       int pcre_get_substring(const char *subject, int *ovector,       int pcre_get_substring(const char *subject, int *ovector,
332            int stringcount, int stringnumber,            int stringcount, int stringnumber,
333            const char **stringptr);            const char **stringptr);
# Line 28  SYNOPSIS Line 335  SYNOPSIS
335       int pcre_get_substring_list(const char *subject,       int pcre_get_substring_list(const char *subject,
336            int *ovector, int stringcount, const char ***listptr);            int *ovector, int stringcount, const char ***listptr);
337    
338         void pcre_free_substring(const char *stringptr);
339    
340         void pcre_free_substring_list(const char **stringptr);
341    
342       const unsigned char *pcre_maketables(void);       const unsigned char *pcre_maketables(void);
343    
344       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
345            int what, void *where);            int what, void *where);
346    
347    
348       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
349    
350         int pcre_config(int what, void *where);
351    
352       char *pcre_version(void);       char *pcre_version(void);
353    
354       void *(*pcre_malloc)(size_t);       void *(*pcre_malloc)(size_t);
355    
356       void (*pcre_free)(void *);       void (*pcre_free)(void *);
357    
358         int (*pcre_callout)(pcre_callout_block *);
359    
360    
361    PCRE API
 DESCRIPTION  
      The PCRE library is a set of functions that implement  regu-  
      lar  expression  pattern  matching using the same syntax and  
      semantics as Perl  5,  with  just  a  few  differences  (see  
      below).  The  current  implementation  corresponds  to  Perl  
      5.005, with some additional features from the Perl  develop-  
      ment release.  
362    
363       PCRE has its own native API,  which  is  described  in  this       PCRE has its own native API,  which  is  described  in  this
364       document.  There  is  also  a  set of wrapper functions that       document.  There  is  also  a  set of wrapper functions that
# Line 67  DESCRIPTION Line 375  DESCRIPTION
375       releases.       releases.
376    
377       The functions pcre_compile(), pcre_study(), and  pcre_exec()       The functions pcre_compile(), pcre_study(), and  pcre_exec()
378       are  used  for  compiling  and matching regular expressions,       are  used  for compiling and matching regular expressions. A
379       while   pcre_copy_substring(),   pcre_get_substring(),   and       sample program that demonstrates the simplest way  of  using
380       pcre_get_substring_list()   are  convenience  functions  for       them  is  given in the file pcredemo.c. The pcresample docu-
381       extracting  captured  substrings  from  a  matched   subject       mentation describes how to run it.
382       string.  The function pcre_maketables() is used (optionally)  
383       to build a set of character tables in the current locale for       There are convenience functions for extracting captured sub-
384       passing to pcre_compile().       strings from a matched subject string. They are:
385    
386           pcre_copy_substring()
387           pcre_copy_named_substring()
388           pcre_get_substring()
389           pcre_get_named_substring()
390           pcre_get_substring_list()
391    
392         pcre_free_substring()  and  pcre_free_substring_list()   are
393         also  provided,  to  free  the  memory  used  for  extracted
394         strings.
395    
396         The function pcre_maketables() is used (optionally) to build
397         a  set of character tables in the current locale for passing
398         to pcre_compile().
399    
400       The function pcre_fullinfo() is used to find out information       The function pcre_fullinfo() is used to find out information
401       about a compiled pattern; pcre_info() is an obsolete version       about a compiled pattern; pcre_info() is an obsolete version
# Line 89  DESCRIPTION Line 411  DESCRIPTION
411       replace them if it  wishes  to  intercept  the  calls.  This       replace them if it  wishes  to  intercept  the  calls.  This
412       should be done before calling any PCRE functions.       should be done before calling any PCRE functions.
413    
414         The global variable pcre_callout initially contains NULL. It
415         can be set by the caller to a "callout" function, which PCRE
416         will then call at specified points during a matching  opera-
417         tion. Details are given in the pcrecallout documentation.
418    
419    
420  MULTI-THREADING  MULTITHREADING
421    
422       The PCRE functions can be used in  multi-threading  applica-       The PCRE functions can be used in  multi-threading  applica-
423       tions, with the proviso that the memory management functions       tions, with the proviso that the memory management functions
424       pointed to by pcre_malloc and pcre_free are  shared  by  all       pointed to by pcre_malloc and  pcre_free,  and  the  callout
425         function  pointed  to  by  pcre_callout,  are  shared by all
426       threads.       threads.
427    
428       The compiled form of a regular  expression  is  not  altered       The compiled form of a regular  expression  is  not  altered
# Line 102  MULTI-THREADING Line 430  MULTI-THREADING
430       used by several threads at once.       used by several threads at once.
431    
432    
433    CHECKING BUILD-TIME OPTIONS
434    
435         int pcre_config(int what, void *where);
436    
437         The function pcre_config() makes  it  possible  for  a  PCRE
438         client  to  discover  which optional features have been com-
439         piled into the PCRE library. The pcrebuild documentation has
440         more details about these optional features.
441    
442         The first argument for pcre_config() is an integer, specify-
443         ing  which information is required; the second argument is a
444         pointer to a variable into which the information is  placed.
445         The following information is available:
446    
447           PCRE_CONFIG_UTF8
448    
449         The output is an integer that is set to one if UTF-8 support
450         is available; otherwise it is set to zero.
451    
452           PCRE_CONFIG_NEWLINE
453    
454         The output is an integer that is set to  the  value  of  the
455         code  that  is  used for the newline character. It is either
456         linefeed (10) or carriage return (13), and  should  normally
457         be the standard character for your operating system.
458    
459           PCRE_CONFIG_LINK_SIZE
460    
461         The output is an integer that contains the number  of  bytes
462         used  for  internal linkage in compiled regular expressions.
463         The value is 2, 3, or 4. Larger values allow larger  regular
464         expressions  to be compiled, at the expense of slower match-
465         ing. The default value of 2 is sufficient for  all  but  the
466         most  massive patterns, since it allows the compiled pattern
467         to be up to 64K in size.
468    
469           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
470    
471         The output is an integer that contains the  threshold  above
472         which  the POSIX interface uses malloc() for output vectors.
473         Further details are given in the pcreposix documentation.
474    
475           PCRE_CONFIG_MATCH_LIMIT
476    
477         The output is an integer that gives the  default  limit  for
478         the   number  of  internal  matching  function  calls  in  a
479         pcre_exec()  execution.  Further  details  are  given   with
480         pcre_exec() below.
481    
482    
483  COMPILING A PATTERN  COMPILING A PATTERN
484    
485         pcre *pcre_compile(const char *pattern, int options,
486              const char **errptr, int *erroffset,
487              const unsigned char *tableptr);
488    
489       The function pcre_compile() is called to compile  a  pattern       The function pcre_compile() is called to compile  a  pattern
490       into  an internal form. The pattern is a C string terminated       into  an internal form. The pattern is a C string terminated
491       by a binary zero, and is passed in the argument  pattern.  A       by a binary zero, and is passed in the argument  pattern.  A
492       pointer  to  a  single  block of memory that is obtained via       pointer  to  a  single  block of memory that is obtained via
493       pcre_malloc is returned. This contains the compiled code and       pcre_malloc is returned. This contains the compiled code and
494       related data. The pcre type is defined for this for conveni-       related  data.  The  pcre  type  is defined for the returned
495       ence, but in fact pcre is just a typedef for void, since the       block; this is a typedef for a structure whose contents  are
496       contents  of  the block are not externally defined. It is up       not  externally  defined. It is up to the caller to free the
497       to the caller to free  the  memory  when  it  is  no  longer       memory when it is no longer required.
498       required.  
499         Although the compiled code of a PCRE regex  is  relocatable,
500       The size of a compiled pattern is  roughly  proportional  to       that is, it does not depend on memory location, the complete
501       the length of the pattern string, except that each character       pcre data block is not fully relocatable,  because  it  con-
502       class (other than those containing just a single  character,       tains  a  copy of the tableptr argument, which is an address
503       negated  or  not)  requires 33 bytes, and repeat quantifiers       (see below).
      with a minimum greater than one or a bounded  maximum  cause  
      the  relevant  portions of the compiled pattern to be repli-  
      cated.  
   
504       The options argument contains independent bits  that  affect       The options argument contains independent bits  that  affect
505       the  compilation.  It  should  be  zero  if  no  options are       the  compilation.  It  should  be  zero  if  no  options are
506       required. Some of the options, in particular, those that are       required. Some of the options, in particular, those that are
507       compatible  with Perl, can also be set and unset from within       compatible  with Perl, can also be set and unset from within
508       the pattern (see the detailed description of regular expres-       the pattern (see the detailed description of regular expres-
509       sions below). For these options, the contents of the options       sions  in the pcrepattern documentation). For these options,
510       argument specifies their initial settings at  the  start  of       the contents of the options argument specifies their initial
511       compilation  and  execution. The PCRE_ANCHORED option can be       settings  at  the  start  of  compilation and execution. The
512       set at the time of matching as well as at compile time.       PCRE_ANCHORED option can be set at the time of  matching  as
513         well as at compile time.
514    
515       If errptr is NULL, pcre_compile() returns NULL  immediately.       If errptr is NULL, pcre_compile() returns NULL  immediately.
516       Otherwise, if compilation of a pattern fails, pcre_compile()       Otherwise, if compilation of a pattern fails, pcre_compile()
# Line 149  COMPILING A PATTERN Line 527  COMPILING A PATTERN
527       must  be  the result of a call to pcre_maketables(). See the       must  be  the result of a call to pcre_maketables(). See the
528       section on locale support below.       section on locale support below.
529    
530       The following option bits are defined in the header file:       This code fragment shows a typical straightforward  call  to
531         pcre_compile():
532    
533           pcre *re;
534           const char *error;
535           int erroffset;
536           re = pcre_compile(
537             "^A.*Z",          /* the pattern */
538             0,                /* default options */
539             &error,           /* for error message */
540             &erroffset,       /* for error offset */
541             NULL);            /* use default character tables */
542    
543         The following option bits are defined:
544    
545         PCRE_ANCHORED         PCRE_ANCHORED
546    
547       If this bit is set, the pattern is forced to be  "anchored",       If this bit is set, the pattern is forced to be  "anchored",
548       that is, it is constrained to match only at the start of the       that is, it is constrained to match only at the first match-
549       string which is being searched (the "subject string").  This       ing point in the string which is being searched  (the  "sub-
550       effect can also be achieved by appropriate constructs in the       ject string"). This effect can also be achieved by appropri-
551       pattern itself, which is the only way to do it in Perl.       ate constructs in the pattern itself, which is the only  way
552         to do it in Perl.
553    
554         PCRE_CASELESS         PCRE_CASELESS
555    
556       If this bit is set, letters in the pattern match both  upper       If this bit is set, letters in the pattern match both  upper
557       and  lower  case  letters.  It  is  equivalent  to Perl's /i       and  lower  case  letters.  It  is  equivalent  to Perl's /i
558       option.       option, and it can be changed within a  pattern  by  a  (?i)
559         option setting.
560    
561         PCRE_DOLLAR_ENDONLY         PCRE_DOLLAR_ENDONLY
562    
# Line 173  COMPILING A PATTERN Line 566  COMPILING A PATTERN
566       character  if it is a newline (but not before any other new-       character  if it is a newline (but not before any other new-
567       lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if       lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if
568       PCRE_MULTILINE is set. There is no equivalent to this option       PCRE_MULTILINE is set. There is no equivalent to this option
569       in Perl.       in Perl, and no way to set it within a pattern.
570    
571         PCRE_DOTALL         PCRE_DOTALL
572    
573       If this bit is  set,  a  dot  metacharacter in  the  pattern       If this bit is  set,  a  dot  metacharacter in  the  pattern
574       matches all characters, including newlines. Without it, new-       matches all characters, including newlines. Without it, new-
575       lines are excluded. This option is equivalent to  Perl's  /s       lines are excluded. This option is equivalent to  Perl's  /s
576       option.  A negative class such as [^a] always matches a new-       option,  and  it  can  be changed within a pattern by a (?s)
577       line character, independent of the setting of this option.       option setting. A negative class such as [^a] always matches
578         a  newline  character,  independent  of  the setting of this
579         option.
580    
581         PCRE_EXTENDED         PCRE_EXTENDED
582    
583       If this bit is set, whitespace data characters in  the  pat-       If this bit is set, whitespace data characters in  the  pat-
584       tern  are  totally  ignored  except when escaped or inside a       tern  are  totally  ignored  except when escaped or inside a
585       character class, and characters between an unescaped #  out-       character class. Whitespace does not include the VT  charac-
586       side  a  character  class  and  the  next newline character,       ter  (code 11). In addition, characters between an unescaped
587         # outside a character class and the next newline  character,
588       inclusive, are also ignored. This is equivalent to Perl's /x       inclusive, are also ignored. This is equivalent to Perl's /x
589       option,  and  makes  it  possible to include comments inside       option, and it can be changed within a  pattern  by  a  (?x)
590       complicated patterns. Note, however, that this applies  only       option setting.
591       to  data  characters. Whitespace characters may never appear  
592         This option makes it possible  to  include  comments  inside
593         complicated patterns.  Note, however, that this applies only
594         to data characters. Whitespace characters may  never  appear
595       within special character sequences in a pattern, for example       within special character sequences in a pattern, for example
596       within  the sequence (?( which introduces a conditional sub-       within the sequence (?( which introduces a conditional  sub-
597       pattern.       pattern.
598    
599         PCRE_EXTRA         PCRE_EXTRA
# Line 224  COMPILING A PATTERN Line 623  COMPILING A PATTERN
623       of  line"  constructs match immediately following or immedi-       of  line"  constructs match immediately following or immedi-
624       ately before any newline  in  the  subject  string,  respec-       ately before any newline  in  the  subject  string,  respec-
625       tively,  as  well  as  at  the  very  start and end. This is       tively,  as  well  as  at  the  very  start and end. This is
626       equivalent to Perl's /m option. If there are no "\n" charac-       equivalent to Perl's /m option, and it can be changed within
627       ters  in  a subject string, or no occurrences of ^ or $ in a       a  pattern  by  a  (?m) option setting. If there are no "\n"
628       pattern, setting PCRE_MULTILINE has no effect.       characters in a subject string, or no occurrences of ^ or  $
629         in a pattern, setting PCRE_MULTILINE has no effect.
630    
631           PCRE_NO_AUTO_CAPTURE
632    
633         If this option is set, it disables the use of numbered  cap-
634         turing  parentheses  in the pattern. Any opening parenthesis
635         that is not followed by ? behaves as if it were followed  by
636         ?:  but  named  parentheses  can still be used for capturing
637         (and they acquire numbers in the usual  way).  There  is  no
638         equivalent of this option in Perl.
639    
640         PCRE_UNGREEDY         PCRE_UNGREEDY
641    
# Line 235  COMPILING A PATTERN Line 644  COMPILING A PATTERN
644       followed by "?". It is not compatible with Perl. It can also       followed by "?". It is not compatible with Perl. It can also
645       be set by a (?U) option setting within the pattern.       be set by a (?U) option setting within the pattern.
646    
647           PCRE_UTF8
648    
649         This option causes PCRE to regard both the pattern  and  the
650         subject  as  strings  of UTF-8 characters instead of single-
651         byte character strings. However, it  is  available  only  if
652         PCRE  has  been  built to include UTF-8 support. If not, the
653         use of this option provokes an error. Details  of  how  this
654         option  changes  the behaviour of PCRE are given in the sec-
655         tion on UTF-8 support in the main pcre page.
656    
657    
658  STUDYING A PATTERN  STUDYING A PATTERN
659    
660         pcre_extra *pcre_study(const pcre *code, int options,
661              const char **errptr);
662    
663       When a pattern is going to be  used  several  times,  it  is       When a pattern is going to be  used  several  times,  it  is
664       worth  spending  more time analyzing it in order to speed up       worth  spending  more time analyzing it in order to speed up
665       the time taken for matching. The function pcre_study() takes       the time taken for matching. The function pcre_study() takes
666       a  pointer  to a compiled pattern as its first argument, and       a  pointer  to  a compiled pattern as its first argument. If
667       returns a  pointer  to  a  pcre_extra  block  (another  void       studying the pattern produces  additional  information  that
668       typedef)  containing  additional  information about the pat-       will  help speed up matching, pcre_study() returns a pointer
669       tern; this can be passed to pcre_exec().  If  no  additional       to a pcre_extra block, in which the study_data field  points
670       information is available, NULL is returned.       to the results of the study.
671    
672         The  returned  value  from  pcre_study()   can   be   passed
673         directly  to pcre_exec(). However, the pcre_extra block also
674         contains other fields that can be set by the  caller  before
675         the  block is passed; these are described below. If studying
676         the pattern does not  produce  any  additional  information,
677         pcre_study() returns NULL. In that circumstance, if the cal-
678         ling program wants to pass  some  of  the  other  fields  to
679         pcre_exec(), it must set up its own pcre_extra block.
680    
681       The second argument contains option  bits.  At  present,  no       The second argument contains option  bits.  At  present,  no
682       options  are  defined  for  pcre_study(),  and this argument       options  are  defined  for  pcre_study(),  and this argument
683       should always be zero.       should always be zero.
684    
685       The third argument for pcre_study() is a pointer to an error       The third argument for pcre_study()  is  a  pointer  for  an
686       message. If studying succeeds (even if no data is returned),       error  message.  If  studying  succeeds  (even if no data is
687       the variable it points to  is  set  to  NULL.  Otherwise  it       returned), the variable it points to is set to NULL.  Other-
688       points to a textual error message.       wise it points to a textual error message. You should there-
689         fore  test  the  error  pointer  for  NULL   after   calling
690         pcre_study(), to be sure that it has run successfully.
691    
692         This is a typical call to pcre_study():
693    
694           pcre_extra *pe;
695           pe = pcre_study(
696             re,             /* result of pcre_compile() */
697             0,              /* no options exist */
698             &error);        /* set to NULL or points to a message */
699    
700       At present, studying a  pattern  is  useful  only  for  non-       At present, studying a  pattern  is  useful  only  for  non-
701       anchored  patterns  that do not have a single fixed starting       anchored  patterns  that do not have a single fixed starting
# Line 262  STUDYING A PATTERN Line 703  STUDYING A PATTERN
703       created.       created.
704    
705    
   
706  LOCALE SUPPORT  LOCALE SUPPORT
707    
708       PCRE handles caseless matching, and determines whether char-       PCRE handles caseless matching, and determines whether char-
709       acters  are  letters, digits, or whatever, by reference to a       acters  are  letters, digits, or whatever, by reference to a
710       set of tables. The library contains a default set of  tables       set of tables. When running in UTF-8 mode, this applies only
711       which  is  created in the default C locale when PCRE is com-       to characters with codes less than 256. The library contains
712       piled.  This  is   used   when   the   final   argument   of       a default set of tables that is created  in  the  default  C
713       pcre_compile()  is NULL, and is sufficient for many applica-       locale  when  PCRE  is compiled. This is used when the final
714       tions.       argument of pcre_compile() is NULL, and  is  sufficient  for
715         many applications.
716    
717       An alternative set of tables can, however, be supplied. Such       An alternative set of tables can, however, be supplied. Such
718       tables  are built by calling the pcre_maketables() function,       tables  are built by calling the pcre_maketables() function,
# Line 288  LOCALE SUPPORT Line 730  LOCALE SUPPORT
730       The  tables  are  built  in  memory  that  is  obtained  via       The  tables  are  built  in  memory  that  is  obtained  via
731       pcre_malloc.  The  pointer that is passed to pcre_compile is       pcre_malloc.  The  pointer that is passed to pcre_compile is
732       saved with the compiled pattern, and  the  same  tables  are       saved with the compiled pattern, and  the  same  tables  are
733       used  via this pointer by pcre_study() and pcre_exec(). Thus       used via this pointer by pcre_study() and pcre_exec(). Thus,
734       for any single pattern, compilation, studying  and  matching       for any single pattern, compilation, studying  and  matching
735       all happen in the same locale, but different patterns can be       all happen in the same locale, but different patterns can be
736       compiled in different locales. It is the caller's  responsi-       compiled in different locales. It is the caller's  responsi-
# Line 296  LOCALE SUPPORT Line 738  LOCALE SUPPORT
738       remains available for as long as it is needed.       remains available for as long as it is needed.
739    
740    
   
741  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
742    
743         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
744              int what, void *where);
745    
746       The pcre_fullinfo() function  returns  information  about  a       The pcre_fullinfo() function  returns  information  about  a
747       compiled pattern. It replaces the obsolete pcre_info() func-       compiled pattern. It replaces the obsolete pcre_info() func-
748       tion, which is nevertheless retained for backwards compabil-       tion, which is nevertheless retained for backwards compabil-
# Line 307  INFORMATION ABOUT A PATTERN Line 752  INFORMATION ABOUT A PATTERN
752       compiled  pattern.  The  second  argument  is  the result of       compiled  pattern.  The  second  argument  is  the result of
753       pcre_study(), or NULL if the pattern was  not  studied.  The       pcre_study(), or NULL if the pattern was  not  studied.  The
754       third  argument  specifies  which  piece  of  information is       third  argument  specifies  which  piece  of  information is
755       required, while the fourth argument is a pointer to a  vari-       required, and the fourth argument is a pointer to a variable
756       able  to receive the data. The yield of the function is zero       to  receive  the data. The yield of the function is zero for
757       for success, or one of the following negative numbers:       success, or one of the following negative numbers:
758    
759         PCRE_ERROR_NULL       the argument code was NULL         PCRE_ERROR_NULL       the argument code was NULL
760                               the argument where was NULL                               the argument where was NULL
761         PCRE_ERROR_BADMAGIC   the "magic number" was not found         PCRE_ERROR_BADMAGIC   the "magic number" was not found
762         PCRE_ERROR_BADOPTION  the value of what was invalid         PCRE_ERROR_BADOPTION  the value of what was invalid
763    
764       The possible values for the third argument  are  defined  in       Here is a typical call of  pcre_fullinfo(),  to  obtain  the
765       pcre.h, and are as follows:       length of the compiled pattern:
766    
767         PCRE_INFO_OPTIONS         int rc;
768           size_t length;
769           rc = pcre_fullinfo(
770             re,               /* result of pcre_compile() */
771             pe,               /* result of pcre_study(), or NULL */
772             PCRE_INFO_SIZE,   /* what is required */
773             &length);         /* where to put the data */
774    
775       Return a copy of the options with which the pattern was com-       The possible values for the third argument  are  defined  in
776       piled.  The fourth argument should point to an unsigned long       pcre.h, and are as follows:
      int variable. These option bits are those specified  in  the  
      call  to  pcre_compile(),  modified  by any top-level option  
      settings  within  the   pattern   itself,   and   with   the  
      PCRE_ANCHORED  bit  forcibly  set if the form of the pattern  
      implies that it can match only at the  start  of  a  subject  
      string.  
777    
778         PCRE_INFO_SIZE         PCRE_INFO_BACKREFMAX
779    
780       Return the size of the compiled pattern, that is, the  value       Return the number of the highest back reference in the  pat-
781       that  was  passed as the argument to pcre_malloc() when PCRE       tern.  The  fourth argument should point to an int variable.
782       was getting memory in which to place the compiled data.  The       Zero is returned if there are no back references.
      fourth argument should point to a size_t variable.  
783    
784         PCRE_INFO_CAPTURECOUNT         PCRE_INFO_CAPTURECOUNT
785    
786       Return the number of capturing subpatterns in  the  pattern.       Return the number of capturing subpatterns in  the  pattern.
787       The fourth argument should point to an int variable.       The fourth argument should point to an int variable.
788    
789         PCRE_INFO_BACKREFMAX         PCRE_INFO_FIRSTBYTE
   
      Return the number of the highest back reference in the  pat-  
      tern.  The  fourth argument should point to an int variable.  
      Zero is returned if there are no back references.  
790    
791         PCRE_INFO_FIRSTCHAR       Return information about  the  first  byte  of  any  matched
792         string,  for a non-anchored pattern. (This option used to be
793         called PCRE_INFO_FIRSTCHAR; the old name is still recognized
794         for backwards compatibility.)
795    
796       Return information about the first character of any  matched       If there is a fixed first byte, e.g. from a pattern such  as
      string,  for  a  non-anchored  pattern.  If there is a fixed  
      first   character,   e.g.   from   a   pattern    such    as  
797       (cat|cow|coyote),  it  is returned in the integer pointed to       (cat|cow|coyote),  it  is returned in the integer pointed to
798       by where. Otherwise, if either       by where. Otherwise, if either
799    
# Line 364  INFORMATION ABOUT A PATTERN Line 805  INFORMATION ABOUT A PATTERN
805       anchored),       anchored),
806    
807       -1 is returned, indicating that the pattern matches only  at       -1 is returned, indicating that the pattern matches only  at
808       the  start  of a subject string or after any "\n" within the       the  start  of  a subject string or after any newline within
809       string. Otherwise -2 is returned.  For anchored patterns, -2       the string. Otherwise -2 is returned. For anchored patterns,
810       is returned.       -2 is returned.
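
     The three kinds of result described above can  be  told  apart
     with a small helper. This is only a sketch: the enum  and  the
     function name classify_first_byte are hypothetical and are not
     part of PCRE. It assumes the  integer  was  already  filled in
     by pcre_fullinfo() with PCRE_INFO_FIRSTBYTE.

```c
/* Hypothetical helper, not part of PCRE: interpret the integer that
   pcre_fullinfo() stores for PCRE_INFO_FIRSTBYTE. */
typedef enum {
    FB_FIXED,       /* value is the fixed first byte (0..255) */
    FB_LINE_START,  /* -1: matches only at the start of the subject
                       or after a newline */
    FB_NONE         /* -2: no first-byte information, or the
                       pattern is anchored */
} first_byte_kind;

first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FB_FIXED;
    if (value == -1)
        return FB_LINE_START;
    return FB_NONE;
}
```

     A caller might use the result to decide  whether  a  quick scan
     for the fixed byte is worthwhile before calling pcre_exec().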
811    
812         PCRE_INFO_FIRSTTABLE         PCRE_INFO_FIRSTTABLE
813    
814       If the pattern was studied, and this resulted  in  the  con-       If the pattern was studied, and this resulted  in  the  con-
815       struction of a 256-bit table indicating a fixed set of char-       struction of a 256-bit table indicating a fixed set of bytes
816       acters for the first character in  any  matching  string,  a       for the first byte in any matching string, a pointer to  the
817       pointer   to  the  table  is  returned.  Otherwise  NULL  is       table  is  returned.  Otherwise NULL is returned. The fourth
818       returned. The fourth argument should point  to  an  unsigned       argument should point to an unsigned char * variable.
      char * variable.  
819    
820         PCRE_INFO_LASTLITERAL         PCRE_INFO_LASTLITERAL
821    
822       For a non-anchored pattern, return the value of  the  right-       Return the value of the rightmost  literal  byte  that  must
823       most  literal  character  which  must  exist  in any matched       exist  in  any  matched  string, other than at its start, if
824       string, other than at its start. The fourth argument  should       such a byte has been recorded. The  fourth  argument  should
825       point  to an int variable. If there is no such character, or       point  to  an  int variable. If there is no such byte, -1 is
826       if the pattern is anchored, -1 is returned. For example, for       returned. For anchored patterns,  a  last  literal  byte  is
827       the pattern /a\d+z\d+/ the returned value is 'z'.       recorded  only  if  it follows something of variable length.
828         For example, for the pattern /^a\d+z\d+/ the returned  value
829         is "z", but for /^a\dz\d/ the returned value is -1.
830    
831           PCRE_INFO_NAMECOUNT
832           PCRE_INFO_NAMEENTRYSIZE
833           PCRE_INFO_NAMETABLE
834    
835         PCRE supports the use of named as well as numbered capturing
836         parentheses. The names are just an additional way of identi-
837         fying the parentheses,  which  still  acquire  a  number.  A
838         caller  that  wants  to extract data from a named subpattern
839         must convert the name to a number in  order  to  access  the
840         correct  pointers  in  the  output  vector  (described  with
841         pcre_exec() below). In order to do this, it must  first  use
842         these  three  values  to  obtain  the name-to-number mapping
843         table for the pattern.
844    
845         The  map  consists  of  a  number  of  fixed-size   entries.
846         PCRE_INFO_NAMECOUNT   gives   the  number  of  entries,  and
847         PCRE_INFO_NAMEENTRYSIZE gives the size of each  entry;  both
848         of  these return an int value. The entry size depends on the
849         length of the longest name.  PCRE_INFO_NAMETABLE  returns  a
850         pointer to the first entry of the table (a pointer to char).
851         The first two bytes of each entry are the number of the cap-
852         turing parenthesis, most significant byte first. The rest of
853         the entry is the corresponding name,  zero  terminated.  The
854         names  are  in alphabetical order. For example, consider the
855         following pattern (assume PCRE_EXTENDED  is  set,  so  white
856         space - including newlines - is ignored):
857    
858           (?P<date> (?P<year>(\d\d)?\d\d) -
859           (?P<month>\d\d) - (?P<day>\d\d) )
860    
861         There are four named subpatterns,  so  the  table  has  four
862         entries,  and  each  entry in the table is eight bytes long.
863         The table is as follows, with non-printing  bytes  shown  in
864         hex, and undefined bytes shown as ??:
865    
866           00 01 d  a  t  e  00 ??
867           00 05 d  a  y  00 ?? ??
868           00 04 m  o  n  t  h  00
869           00 02 y  e  a  r  00 ??
870    
871         When writing code to extract data  from  named  subpatterns,
872         remember  that the length of each entry may be different for
873         each compiled pattern.
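
     As a sketch of how the table layout described  above  might  be
     used, the function below  searches  the  entries  for  a  given
     name and returns  the  parenthesis  number.  The  function name
     name_lookup is hypothetical and is not part  of  PCRE;  it  only
     assumes the entry format given above  (two  bytes  of  number,
     most significant first, then a zero-terminated name).

```c
#include <string.h>

/* Hypothetical helper, not part of PCRE. Returns the number of the
   capturing parenthesis called name, or -1 if there is no such name.
   table, name_count and entry_size are the values obtained for
   PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and
   PCRE_INFO_NAMEENTRYSIZE respectively. */
int name_lookup(const char *table, int name_count, int entry_size,
                const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;
        /* The name starts after the two-byte number. */
        if (strcmp((const char *)entry + 2, name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

     A linear search is used here for simplicity. Because the names
     are in alphabetical order, a binary search would also work for
     patterns with many named subpatterns. The entry  size  is  read
     from the pattern each time rather than being hard-coded.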
874    
875           PCRE_INFO_OPTIONS
876    
877         Return a copy of the options with which the pattern was com-
878         piled.  The fourth argument should point to an unsigned long
879         int variable. These option bits are those specified  in  the
880         call  to  pcre_compile(),  modified  by any top-level option
881         settings within the pattern itself.
882    
883         A pattern is automatically anchored by PCRE if  all  of  its
884         top-level alternatives begin with one of the following:
885    
886           ^     unless PCRE_MULTILINE is set
887           \A    always
888           \G    always
889           .*    if PCRE_DOTALL is set and there are no back
890                   references to the subpattern in which .* appears
891    
892         For such patterns, the  PCRE_ANCHORED  bit  is  set  in  the
893         options returned by pcre_fullinfo().
894    
895           PCRE_INFO_SIZE
896    
897         Return the size of the compiled pattern, that is, the  value
898         that  was  passed as the argument to pcre_malloc() when PCRE
899         was getting memory in which to place the compiled data.  The
900         fourth argument should point to a size_t variable.
901    
902           PCRE_INFO_STUDYSIZE
903    
904         Returns the size  of  the  data  block  pointed  to  by  the
905         study_data  field  in a pcre_extra block. That is, it is the
906         value that was passed to pcre_malloc() when PCRE was getting
907         memory into which to place the data created by pcre_study().
908         The fourth argument should point to a size_t variable.
909    
910    
911    OBSOLETE INFO FUNCTION
912    
913         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
914    
915       The pcre_info() function is now obsolete because its  inter-       The pcre_info() function is now obsolete because its  inter-
916       face  is  too  restrictive  to return all the available data       face  is  too  restrictive  to return all the available data
# Line 403  INFORMATION ABOUT A PATTERN Line 929  INFORMATION ABOUT A PATTERN
929       If the pattern is not anchored and the firstcharptr argument       If the pattern is not anchored and the firstcharptr argument
930       is  not  NULL, it is used to pass back information about the       is  not  NULL, it is used to pass back information about the
931       first    character    of    any    matched    string    (see       first    character    of    any    matched    string    (see
932       PCRE_INFO_FIRSTCHAR above).       PCRE_INFO_FIRSTBYTE above).
   
933    
934    
935  MATCHING A PATTERN  MATCHING A PATTERN
936    
937         int pcre_exec(const pcre *code, const pcre_extra *extra,
938              const char *subject, int length, int startoffset,
939              int options, int *ovector, int ovecsize);
940    
941       The function pcre_exec() is called to match a subject string       The function pcre_exec() is called to match a subject string
942       against  a pre-compiled pattern, which is passed in the code       against  a pre-compiled pattern, which is passed in the code
943       argument. If the pattern has been studied, the result of the       argument. If the pattern has been studied, the result of the
944       study should be passed in the extra argument. Otherwise this       study should be passed in the extra argument.
945       must be NULL.  
946         Here is an example of a simple call to pcre_exec():
947    
948           int rc;
949           int ovector[30];
950           rc = pcre_exec(
951             re,             /* result of pcre_compile() */
952             NULL,           /* we didn't study the pattern */
953             "some string",  /* the subject string */
954             11,             /* the length of the subject string */
955             0,              /* start at offset 0 in the subject */
956             0,              /* default options */
957             ovector,        /* vector for substring information */
958             30);            /* number of elements in the vector */
959    
960         If the extra argument is  not  NULL,  it  must  point  to  a
961         pcre_extra  data  block.  The  pcre_study() function returns
962         such a block (when it doesn't return NULL), but you can also
963         create  one for yourself, and pass additional information in
964         it. The fields in the block are as follows:
965    
966           unsigned long int flags;
967           void *study_data;
968           unsigned long int match_limit;
969           void *callout_data;
970    
971         The flags field is a bitmap  that  specifies  which  of  the
972         other fields are set. The flag bits are:
973    
974           PCRE_EXTRA_STUDY_DATA
975           PCRE_EXTRA_MATCH_LIMIT
976           PCRE_EXTRA_CALLOUT_DATA
977    
978         Other flag bits should be set to zero. The study_data  field
979         is   set  in  the  pcre_extra  block  that  is  returned  by
980         pcre_study(), together with the appropriate  flag  bit.  You
981         should  not  set this yourself, but you can add to the block
982         by setting the other fields.
983    
984         The match_limit field provides a means  of  preventing  PCRE
985         from  using  up a vast amount of resources when running pat-
986         terns that are not going to match, but  which  have  a  very
987         large  number  of  possibilities  in their search trees. The
988         classic example is the  use  of  nested  unlimited  repeats.
989         Internally,  PCRE  uses  a  function called match() which it
990         calls  repeatedly  (sometimes  recursively).  The  limit  is
991         imposed  on the number of times this function is called dur-
992         ing a match, which has the effect of limiting the amount  of
993         recursion and backtracking that can take place. For patterns
994         that are not anchored, the count starts from zero  for  each
995         position in the subject string.
996    
997         The default limit for the library can be set  when  PCRE  is
998         built; if it is not set, the limit is 10 million, which handles all
999         but the most extreme cases. You can reduce  the  default  by
1000         supplying pcre_exec()  with  a  pcre_extra  block  in  which
1001         match_limit   is   set   to    a    smaller    value,    and
1002         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the flags field. If the
1003         limit      is      exceeded,       pcre_exec()       returns
1004         PCRE_ERROR_MATCHLIMIT.
1005    
1006         The callout_data field is used in conjunction with the "cal-
1007         lout"  feature,  which is described in the pcrecallout docu-
1008         mentation.
1009    
1010       The PCRE_ANCHORED option can be passed in the options  argu-       The PCRE_ANCHORED option can be passed in the options  argu-
1011       ment,  whose unused bits must be zero. However, if a pattern       ment,   whose   unused   bits  must  be  zero.  This  limits
1012       was  compiled  with  PCRE_ANCHORED,  or  turned  out  to  be       pcre_exec() to matching at the first matching position. How-
1013       anchored  by  virtue  of  its  contents,  it  cannot be made       ever,  if  a  pattern  was  compiled  with PCRE_ANCHORED, or
1014       unanchored at matching time.       turned out to be anchored by virtue of its contents, it can-
1015         not be made unanchored at matching time.
1016    
1017       There are also three further options that can be set only at       There are also three further options that can be set only at
1018       matching time:       matching time:
# Line 462  MATCHING A PATTERN Line 1056  MATCHING A PATTERN
1056       advancing the starting offset  (see  below)  and  trying  an       advancing the starting offset  (see  below)  and  trying  an
1057       ordinary match again.       ordinary match again.
1058    
1059       The subject string is passed as  a  pointer  in  subject,  a       The subject string is passed to pcre_exec() as a pointer  in
1060       length  in  length,  and  a  starting offset in startoffset.       subject,  a length in length, and a starting offset in star-
1061       Unlike the pattern string, it may contain binary zero  char-       toffset. Unlike the pattern string, the subject may  contain
1062       acters.  When  the starting offset is zero, the search for a       binary  zero  bytes.  When  the starting offset is zero, the
1063       match starts at the beginning of the subject, and this is by       search for a match starts at the beginning of  the  subject,
1064       far the most common case.       and this is by far the most common case.
1065    
1066         If the pattern was compiled with the PCRE_UTF8  option,  the
1067         subject  must  be  a sequence of bytes that is a valid UTF-8
1068         string.  If  an  invalid  UTF-8  string  is  passed,  PCRE's
1069         behaviour is not defined.
1070    
1071       A non-zero starting offset  is  useful  when  searching  for       A non-zero starting offset  is  useful  when  searching  for
1072       another  match  in  the  same subject by calling pcre_exec()       another  match  in  the  same subject by calling pcre_exec()
# Line 560  MATCHING A PATTERN Line 1159  MATCHING A PATTERN
1159       Note that pcre_info() can be used to find out how many  cap-       Note that pcre_info() can be used to find out how many  cap-
1160       turing  subpatterns  there  are  in  a compiled pattern. The       turing  subpatterns  there  are  in  a compiled pattern. The
1161       smallest size for ovector that will  allow  for  n  captured       smallest size for ovector that will  allow  for  n  captured
1162       substrings  in  addition  to  the  offsets  of the substring       substrings,  in  addition  to  the  offsets of the substring
1163       matched by the whole pattern is (n+1)*3.       matched by the whole pattern, is (n+1)*3.
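
     The (n+1)*3 rule can be applied when the vector  is  allocated
     at run time. The sketch below is illustrative  only:  the  name
     alloc_ovector is hypothetical and not part of PCRE, and  it  is
     assumed  that  the  capture  count  was   obtained   beforehand
     (for example with PCRE_INFO_CAPTURECOUNT).

```c
#include <stdlib.h>

/* Hypothetical helper, not part of PCRE. Allocates an offsets vector
   big enough for capture_count captured substrings plus the whole
   match, and stores the element count in *ovecsize. Returns NULL if
   the allocation fails. The caller frees the vector with free(). */
int *alloc_ovector(int capture_count, int *ovecsize)
{
    int elements = (capture_count + 1) * 3;
    int *ovector = malloc((size_t)elements * sizeof *ovector);
    if (ovector != NULL)
        *ovecsize = elements;
    return ovector;
}
```

     With nine capturing subpatterns this  gives  30  elements,  the
     same size as the fixed vector in the pcre_exec() example.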
1164    
1165       If pcre_exec() fails, it returns a negative number. The fol-       If pcre_exec() fails, it returns a negative number. The fol-
1166       lowing are defined in the header file:       lowing are defined in the header file:
# Line 601  MATCHING A PATTERN Line 1200  MATCHING A PATTERN
1200       pcre_malloc() fails, this error  is  given.  The  memory  is       pcre_malloc() fails, this error  is  given.  The  memory  is
1201       freed at the end of matching.       freed at the end of matching.
1202    
1203           PCRE_ERROR_NOSUBSTRING    (-7)
1204    
1205         This   error   is   used   by   the   pcre_copy_substring(),
1206         pcre_get_substring(),  and  pcre_get_substring_list()  func-
1207         tions (see below). It is never returned by pcre_exec().
1208    
1209           PCRE_ERROR_MATCHLIMIT     (-8)
1210    
1211         The recursion and backtracking limit, as  specified  by  the
1212         match_limit  field in a pcre_extra structure (or defaulted),
1213         was reached. See the description above.
1214    
1215           PCRE_ERROR_CALLOUT        (-9)
1216    
1217         This error is never generated by pcre_exec() itself.  It  is
1218         provided  for  use by callout functions that want to yield a
1219         distinctive error code. See  the  pcrecallout  documentation
1220         for details.
1221    
1222    
1223    EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1224    
1225         int pcre_copy_substring(const char *subject, int *ovector,
1226              int stringcount, int stringnumber, char *buffer,
1227              int buffersize);
1228    
1229         int pcre_get_substring(const char *subject, int *ovector,
1230              int stringcount, int stringnumber,
1231              const char **stringptr);
1232    
1233         int pcre_get_substring_list(const char *subject,
1234              int *ovector, int stringcount, const char ***listptr);
1235    
 EXTRACTING CAPTURED SUBSTRINGS  
1236       Captured substrings can be accessed directly  by  using  the       Captured substrings can be accessed directly  by  using  the
1237       offsets returned by pcre_exec() in ovector. For convenience,       offsets returned by pcre_exec() in ovector. For convenience,
1238       the functions  pcre_copy_substring(),  pcre_get_substring(),       the functions  pcre_copy_substring(),  pcre_get_substring(),
1239       and  pcre_get_substring_list()  are  provided for extracting       and  pcre_get_substring_list()  are  provided for extracting
1240       captured  substrings  as  new,   separate,   zero-terminated       captured  substrings  as  new,   separate,   zero-terminated
1241         strings.  These functions identify substrings by number. The
1242         next section describes functions for extracting  named  sub-
1243       strings.   A  substring  that  contains  a  binary  zero  is       strings.   A  substring  that  contains  a  binary  zero  is
1244       correctly extracted and has a further zero added on the end,       correctly extracted and has a further zero added on the end,
1245       but the result does not, of course, function as a C string.       but the result is not, of course, a C string.
1246    
1247       The first three arguments are the same for all  three  func-       The first three arguments are the  same  for  all  three  of
1248       tions:  subject  is  the  subject string which has just been       these  functions:   subject  is the subject string which has
1249       successfully matched, ovector is a pointer to the vector  of       just been successfully matched, ovector is a pointer to  the
1250       integer   offsets   that  was  passed  to  pcre_exec(),  and       vector  of  integer  offsets that was passed to pcre_exec(),
1251       stringcount is the number of substrings that  were  captured       and stringcount is the number of substrings that  were  cap-
1252       by  the  match,  including  the  substring  that matched the       tured by the match, including the substring that matched the
1253       entire regular expression. This is  the  value  returned  by       entire regular expression. This is  the  value  returned  by
1254       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()
1255       returned zero, indicating that it ran out of space in  ovec-       returned zero, indicating that it ran out of space in  ovec-
# Line 631  EXTRACTING CAPTURED SUBSTRINGS Line 1262  EXTRACTING CAPTURED SUBSTRINGS
1262       the entire pattern, while higher values extract the captured       the entire pattern, while higher values extract the captured
1263       substrings. For pcre_copy_substring(), the string is  placed       substrings. For pcre_copy_substring(), the string is  placed
1264       in  buffer,  whose  length is given by buffersize, while for       in  buffer,  whose  length is given by buffersize, while for
1265       pcre_get_substring() a new block of store  is  obtained  via       pcre_get_substring() a new block of memory is  obtained  via
1266       pcre_malloc,  and its address is returned via stringptr. The       pcre_malloc,  and its address is returned via stringptr. The
1267       yield of the function is  the  length  of  the  string,  not       yield of the function is  the  length  of  the  string,  not
1268       including the terminating zero, or one of       including the terminating zero, or one of
# Line 665  EXTRACTING CAPTURED SUBSTRINGS Line 1296  EXTRACTING CAPTURED SUBSTRINGS
1296       inspecting the appropriate offset in ovector, which is nega-       inspecting the appropriate offset in ovector, which is nega-
1297       tive for unset substrings.       tive for unset substrings.
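
The direct use of ovector mentioned at the start of this section can be sketched as follows. This is a minimal illustration, not part of the PCRE API: it assumes the offset-pair layout given in the pcre_exec() description (ovector[2n] is the start of substring n and ovector[2n+1] is the offset just past its end), and the function name copy_by_offsets and its return codes are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy captured substring number n out of subject, using the offset
 * pairs in ovector. Returns the length copied, -1 if n is out of range
 * or the substring is unset (negative offset), or -2 if the buffer is
 * too small. These codes are local to this sketch; they are not
 * PCRE_ERROR_xxx values. */
int copy_by_offsets(const char *subject, const int *ovector,
                    int stringcount, int n, char *buffer, int buffersize)
{
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (n < 0 || n >= stringcount)
        return -1;

    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    if (start < 0 || end < start)      /* unset substring */
        return -1;

    int length = end - start;
    if (length + 1 > buffersize)       /* need room for the final zero */
        return -2;

    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

A caller would pass the subject, the ovector filled in by pcre_exec(), and the positive value that pcre_exec() returned as stringcount.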
1298    
1299         The  two  convenience  functions  pcre_free_substring()  and
1300         pcre_free_substring_list()  can  be  used to free the memory
1301         returned by  a  previous  call  of  pcre_get_substring()  or
1302         pcre_get_substring_list(),  respectively.  They  do  nothing
1303         more than call the function pointed to by  pcre_free,  which
1304         of  course  could  be called directly from a C program. How-
1305         ever, PCRE is used in some situations where it is linked via
1306         a  special  interface  to another programming language which
1307         cannot use pcre_free directly; it is for  these  cases  that
1308         the functions are provided.
1309    
1310    
1311    EXTRACTING CAPTURED SUBSTRINGS BY NAME
1312    
1313         int pcre_copy_named_substring(const pcre *code,
1314              const char *subject, int *ovector,
1315              int stringcount, const char *stringname,
1316              char *buffer, int buffersize);
1317    
1318         int pcre_get_stringnumber(const pcre *code,
1319              const char *name);
1320    
1321         int pcre_get_named_substring(const pcre *code,
1322              const char *subject, int *ovector,
1323              int stringcount, const char *stringname,
1324              const char **stringptr);
1325    
1326         To extract a substring by name, you first have to  find  the
1327         associated  number.   This   can   be   done   by    calling
1328         pcre_get_stringnumber(). The first argument is the  compiled
1329         pattern,  and  the second is the name. For example, for this
1330         pattern
1331    
1332           ab(?P<xxx>\d+)...
1333    
1334  LIMITATIONS       the number of the subpattern called "xxx" is  1.  Given  the
1335       There are some size limitations in PCRE but it is hoped that       number,  you can then extract the substring directly, or use
1336       they will never in practice be relevant.  The maximum length       one of the functions described in the previous section.  For
1337       of a compiled pattern is 65539 (sic) bytes.  All  values  in       convenience,  there are also two functions that do the whole
1338       repeating  quantifiers must be less than 65536.  The maximum       job.
1339       number of capturing subpatterns is 99.  The  maximum  number  
1340       of  all  parenthesized subpatterns, including capturing sub-       Most of the  arguments  of  pcre_copy_named_substring()  and
1341       patterns, assertions, and other types of subpattern, is 200.       pcre_get_named_substring()  are  the  same  as those for the
1342         functions that  extract  by  number,  and  so  are  not  re-
1343         described here. There are just two differences.
1344    
1345         First, instead of a substring number, a  substring  name  is
1346         given.  Second,  there  is  an  extra argument, given at the
1347         start, which is a pointer to the compiled pattern.  This  is
1348         needed  in order to gain access to the name-to-number trans-
1349         lation table.
1350    
1351         These functions  call  pcre_get_stringnumber(),  and  if  it
1352         succeeds,    they   then   call   pcre_copy_substring()   or
1353         pcre_get_substring(), as appropriate.
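
The two-step idea (name to number, then number to offsets) can be sketched without the library. Everything here is hypothetical: struct name_entry stands in for the name-to-number translation table held in the compiled pattern, whose real layout is not described in this section, and the function names are invented for the illustration. The offset-pair layout of ovector is the one given in the pcre_exec() description.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for one row of the name-to-number table. */
struct name_entry {
    const char *name;
    int number;
};

/* Step 1: find the number for a name, or -1 if the name is unknown. */
int sketch_stringnumber(const struct name_entry *table, int count,
                        const char *name)
{
    assert(table != NULL && name != NULL);
    for (int i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0)
            return table[i].number;
    }
    return -1;
}

/* Step 2: use the number to read the offset pair from ovector.
 * Returns the substring number on success and stores its start offset
 * and length; returns -1 if the name is unknown, the number is out of
 * range, or the substring is unset. */
int sketch_named_span(const struct name_entry *table, int count,
                      const int *ovector, int stringcount,
                      const char *name, int *start, int *length)
{
    int n = sketch_stringnumber(table, count, name);
    if (n < 0 || n >= stringcount || ovector[2 * n] < 0)
        return -1;
    *start = ovector[2 * n];
    *length = ovector[2 * n + 1] - ovector[2 * n];
    return n;
}
```

With a table holding the single entry "xxx" = 1 and the offsets from matching the example pattern against "ab123", the span for "xxx" starts at offset 2 and has length 3.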
1354    
1355    Last updated: 03 February 2003
1356    Copyright (c) 1997-2003 University of Cambridge.
1357    -----------------------------------------------------------------------------
1358    
1359       The maximum length of a subject string is the largest  posi-  NAME
1360       tive number that an integer variable can hold. However, PCRE       PCRE - Perl-compatible regular expressions
1361       uses recursion to handle subpatterns and indefinite  repeti-  
1362       tion.  This  means  that the available stack space may limit  
1363       the size of a subject string that can be processed  by  cer-  PCRE CALLOUTS
1364       tain patterns.  
1365         int (*pcre_callout)(pcre_callout_block *);
1366    
1367         PCRE provides a feature called "callout", which is  a  means
1368         of  temporarily passing control to the caller of PCRE in the
1369         middle of pattern matching. The caller of PCRE  provides  an
1370         external  function  by putting its entry point in the global
1371         variable pcre_callout. By default,  this  variable  contains
1372         NULL, which disables all calling out.
1373    
1374         Within a regular expression, (?C) indicates  the  points  at
1375         which  the external function is to be called. Different cal-
1376         lout points can be identified by putting a number less  than
1377         256  after  the  letter  C.  The default value is zero.  For
1378         example, this pattern has two callout points:
1379    
1380           (?C1)9abc(?C2)def
1381    
1382         During matching, when PCRE  reaches  a  callout  point  (and
1383         pcre_callout  is  set), the external function is called. Its
1384         only argument is a pointer to  a  pcre_callout  block.  This
1385         contains the following variables:
1386    
1387           int          version;
1388           int          callout_number;
1389           int         *offset_vector;
1390           const char  *subject;
1391           int          subject_length;
1392           int          start_match;
1393           int          current_position;
1394           int          capture_top;
1395           int          capture_last;
1396           void        *callout_data;
1397    
1398         The version field  is  an  integer  containing  the  version
1399         number of the block format. The current version is zero. The
1400         version number may change in future if additional fields are
1401         added,  but  the  intention  is  never  to remove any of the
1402         existing fields.
1403    
1404         The callout_number field contains the number of the callout,
1405         as compiled into the pattern (that is, the number after ?C).
1406    
1407         The offset_vector field  is  a  pointer  to  the  vector  of
1408         offsets  that  was  passed by the caller to pcre_exec(). The
1409         contents can be inspected in  order  to  extract  substrings
1410         that  have  been  matched  so  far,  in  the same way as for
1411         extracting substrings after a match has completed.
1412         The subject and subject_length fields contain copies of  the
1413         values that were passed to pcre_exec().
1414    
1415         The start_match field contains the offset within the subject
1416         at  which  the current match attempt started. If the pattern
1417         is not anchored, the callout function may be called  several
1418         times for different starting points.
1419    
1420         The current_position field contains the  offset  within  the
1421         subject of the current match pointer.
1422    
1423         The capture_top field contains the  number  of  the  highest
1424         captured substring so far.
1425    
1426         The capture_last field  contains  the  number  of  the  most
1427         recently captured substring.
1428    
1429         The callout_data field contains a value that  is  passed  to
1430         pcre_exec()  by  the  caller  specifically so that it can be
1431         passed back in callouts. It is passed  in  the  callout_data
1432         field  of the pcre_extra data structure. If no such data was
1433         passed, the value of callout_data in a pcre_callout block is
1434         NULL.  There is a description of the pcre_extra structure in
1435         the pcreapi documentation.
1436    
1437    
1438    
1439    RETURN VALUES
1440    
1441         The callout function returns an integer.  If  the  value  is
1442         zero,  matching  proceeds as normal. If the value is greater
1443         than zero, matching fails at the current  point,  but  back-
1444         tracking  to test other possibilities goes ahead, just as if
1445         a lookahead assertion had failed. If the value is less  than
1446         zero,  the  match  is abandoned, and pcre_exec() returns the
1447         value.
1448    
1449         Negative values should normally be chosen from  the  set  of
1450         PCRE_ERROR_xxx  values.  In  particular,  PCRE_ERROR_NOMATCH
1451         forces a standard "no  match"  failure.   The  error  number
1452         PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1453         it will never be used by PCRE itself.
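
The three kinds of return value can be illustrated with a small callout function. This is a sketch under stated assumptions, not code from the PCRE distribution: struct callout_block_sketch is a local stand-in that copies the field list given above (real code would include pcre.h and use the library's own block type), SKETCH_ERROR_CALLOUT repeats the documented value -9, and treating callout_data as a pointer to an int "budget" is purely an invented convention.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the callout block; fields as listed above. */
struct callout_block_sketch {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
};

#define SKETCH_ERROR_CALLOUT (-9)   /* documented PCRE_ERROR_CALLOUT value */

int sketch_callout(struct callout_block_sketch *block)
{
    assert(block != NULL);

    /* Invented convention: callout_data, if set, points to a counter
     * of callouts still allowed. When it runs out, return a negative
     * value so that the whole match is abandoned. */
    int *budget = block->callout_data;
    if (budget != NULL) {
        if (*budget <= 0)
            return SKETCH_ERROR_CALLOUT;
        --*budget;
    }

    /* Positive: fail at this point but let backtracking continue.
     * Here callout 2 rejects attempts that did not start at offset 0. */
    if (block->callout_number == 2 && block->start_match != 0)
        return 1;

    /* Zero: matching proceeds as normal. */
    return 0;
}
```

In a real program a function of this shape would be assigned to the global variable pcre_callout before calling pcre_exec().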
1454    
1455    Last updated: 21 January 2003
1456    Copyright (c) 1997-2003 University of Cambridge.
1457    -----------------------------------------------------------------------------
1458    
1459    NAME
1460         PCRE - Perl-compatible regular expressions
1461    
1462    
1463  DIFFERENCES FROM PERL  DIFFERENCES FROM PERL
      The differences described here  are  with  respect  to  Perl  
      5.005.  
1464    
1465       1. By default, a whitespace character is any character  that       This document describes the differences  in  the  ways  that
1466       the  C  library  function isspace() recognizes, though it is       PCRE  and  Perl  handle regular expressions. The differences
1467       possible to compile PCRE  with  alternative  character  type       described here are with respect to Perl 5.8.
      tables. Normally isspace() matches space, formfeed, newline,  
      carriage return, horizontal tab, and vertical tab. Perl 5 no  
      longer  includes vertical tab in its set of whitespace char-  
      acters. The \v escape that was in the Perl documentation for  
      a long time was never in fact recognized. However, the char-  
      acter itself was treated as whitespace at least up to 5.002.  
      In 5.004 and 5.005 it does not match \s.  
1468    
1469       2. PCRE does  not  allow  repeat  quantifiers  on  lookahead       1. PCRE does  not  allow  repeat  quantifiers  on  lookahead
1470       assertions. Perl permits them, but they do not mean what you       assertions. Perl permits them, but they do not mean what you
1471       might think. For example, (?!a){3} does not assert that  the       might think. For example, (?!a){3} does not assert that  the
1472       next  three characters are not "a". It just asserts that the       next  three characters are not "a". It just asserts that the
1473       next character is not "a" three times.       next character is not "a" three times.
1474    
1475       3. Capturing subpatterns that occur inside  negative  looka-       2. Capturing subpatterns that occur inside  negative  looka-
1476       head  assertions  are  counted,  but  their  entries  in the       head  assertions  are  counted,  but  their  entries  in the
1477       offsets vector are never set. Perl sets its numerical  vari-       offsets vector are never set. Perl sets its numerical  vari-
1478       ables  from  any  such  patterns that are matched before the       ables  from  any  such  patterns that are matched before the
# Line 715  DIFFERENCES FROM PERL Line 1480  DIFFERENCES FROM PERL
1480       only  if  the negative lookahead assertion contains just one       only  if  the negative lookahead assertion contains just one
1481       branch.       branch.
1482    
1483       4. Though binary zero characters are supported in  the  sub-       3. Though binary zero characters are supported in  the  sub-
1484       ject  string,  they  are  not  allowed  in  a pattern string       ject  string,  they  are  not  allowed  in  a pattern string
1485       because it is passed as a normal  C  string,  terminated  by       because it is passed as a normal  C  string,  terminated  by
1486       zero. The escape sequence "\0" can be used in the pattern to       zero. The escape sequence "\0" can be used in the pattern to
1487       represent a binary zero.       represent a binary zero.
1488    
1489       5. The following Perl escape sequences  are  not  supported:       4. The following Perl escape sequences  are  not  supported:
1490       \l,  \u,  \L,  \U,  \E, \Q. In fact these are implemented by       \l,  \u,  \L,  \U,  \P, \p, and \X. In fact these are imple-
1491       Perl's general string-handling and are not part of its  pat-       mented by Perl's general string-handling and are not part of
1492       tern matching engine.       its pattern matching engine. If any of these are encountered
1493         by PCRE, an error is generated.
1494    
1495         5. PCRE does support the \Q...\E  escape  for  quoting  sub-
1496         strings. Characters in between are treated as literals. This
1497         is slightly different from Perl in that $  and  @  are  also
1498         handled  as  literals inside the quotes. In Perl, they cause
1499         variable interpolation (but of course  PCRE  does  not  have
1500         variables). Note the following examples:
1501    
1502             Pattern            PCRE matches      Perl matches
1503    
1504             \Qabc$xyz\E        abc$xyz           abc followed by the
1505                                                    contents of $xyz
1506             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
1507             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
1508    
1509       6. The Perl \G assertion is  not  supported  as  it  is  not       In PCRE, the \Q...\E mechanism is not  recognized  inside  a
1510       relevant to single pattern matches.       character class.
1511    
1512       7. Fairly obviously, PCRE does not support the (?{code}) and       8. Fairly obviously, PCRE does not support the (?{code}) and
1513       (?p{code})  constructions. However, there is some experimen-       (?p{code})  constructions. However, there is some experimen-
1514       tal support for recursive patterns using the  non-Perl  item       tal support for recursive patterns using the non-Perl  items
1515       (?R).       (?R),  (?number)  and  (?P>name).  Also,  the PCRE "callout"
1516       8. There are at the time of writing some  oddities  in  Perl       feature allows an external function to be called during pat-
1517       5.005_02  concerned  with  the  settings of captured strings       tern matching.
1518       when part of a pattern is repeated.  For  example,  matching  
1519       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value       9. There are some differences that are  concerned  with  the
1520       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2       settings  of  captured  strings  when  part  of a pattern is
1521       unset.    However,    if   the   pattern   is   changed   to       repeated. For example, matching "aba"  against  the  pattern
1522       /^(aa(b(b))?)+$/ then $2 (and $3) are set.       /^(a(b)?)+$/  in Perl leaves $2 unset, but in PCRE it is set
1523         to "b".
      In Perl 5.004 $2 is set in both cases, and that is also true  
      of PCRE. If in the future Perl changes to a consistent state  
      that is different, PCRE may change to follow.  
   
      9. Another as yet unresolved discrepancy  is  that  in  Perl  
      5.005_02  the  pattern /^(a)?(?(1)a|b)+$/ matches the string  
      "a", whereas in PCRE it does not.  However, in both Perl and  
      PCRE /^(a)?a/ matched against "a" leaves $1 unset.  
1524    
1525       10. PCRE  provides  some  extensions  to  the  Perl  regular       10. PCRE  provides  some  extensions  to  the  Perl  regular
1526       expression facilities:       expression facilities:
1527    
1528       (a) Although lookbehind assertions must match  fixed  length       (a) Although lookbehind assertions must match  fixed  length
1529       strings,  each  alternative branch of a lookbehind assertion       strings,  each  alternative branch of a lookbehind assertion
1530       can match a different length of string. Perl 5.005  requires       can match a different length of string. Perl  requires  them
1531       them all to have the same length.       all to have the same length.
1532    
1533       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not
1534       set,  the  $ meta- character matches only at the very end of       set,  the  $  meta-character matches only at the very end of
1535       the string.       the string.
1536    
1537       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter
# Line 770  DIFFERENCES FROM PERL Line 1542  DIFFERENCES FROM PERL
1542       not greedy, but if followed by a question mark they are.       not greedy, but if followed by a question mark they are.
1543    
1544       (e) PCRE_ANCHORED can be used to force a pattern to be tried       (e) PCRE_ANCHORED can be used to force a pattern to be tried
1545       only at the start of the subject.       only at the first matching position in the subject string.
1546    
1547         (f)  The  PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   and
1548         PCRE_NO_AUTO_CAPTURE  options  for  pcre_exec() have no Perl
1549         equivalents.
1550    
1551         (g) The (?R), (?number), and (?P>name) constructs allow  for
1552         recursive  pattern  matching  (Perl  can  do  this using the
1553         (?p{code}) construct, which PCRE cannot support.)
1554    
1555         (h) PCRE supports  named  capturing  substrings,  using  the
1556         Python syntax.
1557    
1558         (i) PCRE supports the  possessive  quantifier  "++"  syntax,
1559         taken from Sun's Java package.
1560    
1561         (j) The (R) condition, for  testing  recursion,  is  a  PCRE
1562         extension.
1563    
1564         (k) The callout facility is PCRE-specific.
1565    
1566       (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options  Last updated: 03 February 2003
1567       for pcre_exec() have no Perl equivalents.  Copyright (c) 1997-2003 University of Cambridge.
1568    -----------------------------------------------------------------------------
1569    
1570       (g) The (?R) construct allows for recursive pattern matching  NAME
1571       (Perl  5.6 can do this using the (?p{code}) construct, which       PCRE - Perl-compatible regular expressions
      PCRE cannot of course support.)  
1572    
1573    
1574    PCRE REGULAR EXPRESSION DETAILS
1575    
 REGULAR EXPRESSION DETAILS  
1576       The syntax and semantics of  the  regular  expressions  sup-       The syntax and semantics of  the  regular  expressions  sup-
1577       ported  by PCRE are described below. Regular expressions are       ported  by PCRE are described below. Regular expressions are
1578       also described in the Perl documentation and in a number  of       also described in the Perl documentation and in a number  of
   
1579       other  books,  some  of which have copious examples. Jeffrey       other  books,  some  of which have copious examples. Jeffrey
1580       Friedl's  "Mastering  Regular  Expressions",  published   by       Friedl's  "Mastering  Regular  Expressions",  published   by
1581       O'Reilly  (ISBN  1-56592-257),  covers them in great detail.       O'Reilly,  covers them in great detail. The description here
1582       The description here is intended as reference documentation.       is intended as reference documentation.
1583    
1584         The basic operation of PCRE is on strings of bytes. However,
1585         there  is  also  support for UTF-8 character strings. To use
1586         this support you must build PCRE to include  UTF-8  support,
1587         and  then call pcre_compile() with the PCRE_UTF8 option. How
1588         this affects the pattern matching is  mentioned  in  several
1589         places  below.  There is also a summary of UTF-8 features in
1590         the section on UTF-8 support in the main pcre page.
1591    
1592       A regular expression is a pattern that is matched against  a       A regular expression is a pattern that is matched against  a
1593       subject string from left to right. Most characters stand for       subject string from left to right. Most characters stand for
# Line 811  REGULAR EXPRESSION DETAILS Line 1609  REGULAR EXPRESSION DETAILS
1609       Outside square brackets, the meta-characters are as follows:       Outside square brackets, the meta-characters are as follows:
1610    
1611         \      general escape character with several uses         \      general escape character with several uses
1612         ^      assert start of  subject  (or  line,  in  multiline         ^      assert start of string (or line, in multiline mode)
1613       mode)         $      assert end of string (or line, in multiline mode)
        $      assert end of subject (or line, in multiline mode)  
1614         .      match any character except newline (by default)         .      match any character except newline (by default)
1615         [      start character class definition         [      start character class definition
1616         |      start of alternative branch         |      start of alternative branch
# Line 824  REGULAR EXPRESSION DETAILS Line 1621  REGULAR EXPRESSION DETAILS
1621                also quantifier minimizer                also quantifier minimizer
1622         *      0 or more quantifier         *      0 or more quantifier
1623         +      1 or more quantifier         +      1 or more quantifier
1624                  also "possessive quantifier"
1625         {      start min/max quantifier         {      start min/max quantifier
1626    
1627       Part of a pattern that is in square  brackets  is  called  a       Part of a pattern that is in square  brackets  is  called  a
# Line 833  REGULAR EXPRESSION DETAILS Line 1631  REGULAR EXPRESSION DETAILS
1631         \      general escape character         \      general escape character
1632         ^      negate the class, but only if the first character         ^      negate the class, but only if the first character
1633         -      indicates character range         -      indicates character range
1634           [      POSIX character class (only if followed by POSIX
1635                    syntax)
1636         ]      terminates the character class         ]      terminates the character class
1637    
1638       The following sections describe  the  use  of  each  of  the       The following sections describe  the  use  of  each  of  the
1639       meta-characters.       meta-characters.
1640    
1641    
   
1642  BACKSLASH  BACKSLASH
1643    
1644       The backslash character has several uses. Firstly, if it  is       The backslash character has several uses. Firstly, if it  is
1645       followed  by  a  non-alphameric character, it takes away any       followed  by  a  non-alphameric character, it takes away any
1646       special  meaning  that  character  may  have.  This  use  of       special  meaning  that  character  may  have.  This  use  of
1647       backslash  as  an  escape  character applies both inside and       backslash  as  an  escape  character applies both inside and
1648       outside character classes.       outside character classes.
1649    
1650       For example, if you want to match a "*" character, you write       For example, if you want to match a * character,  you  write
1651       "\*" in the pattern. This applies whether or not the follow-       \*  in the pattern.  This escaping action applies whether or
1652       ing character would otherwise  be  interpreted  as  a  meta-       not the following character would otherwise  be  interpreted
1653       character,  so it is always safe to precede a non-alphameric       as  a meta-character, so it is always safe to precede a non-
1654       with "\" to specify that it stands for itself.  In  particu-       alphameric with backslash to  specify  that  it  stands  for
1655       lar, if you want to match a backslash, you write "\\".       itself. In particular, if you want to match a backslash, you
1656         write \\.
1657    
1658       If a pattern is compiled with the PCRE_EXTENDED option, whi-       If a pattern is compiled with the PCRE_EXTENDED option, whi-
1659       tespace in the pattern (other than in a character class) and       tespace in the pattern (other than in a character class) and
1660       characters between a "#" outside a character class  and  the       characters between a # outside a  character  class  and  the
1661       next  newline  character  are ignored. An escaping backslash       next  newline  character  are ignored. An escaping backslash
1662       can be used to include a whitespace or "#" character as part       can be used to include a whitespace or # character  as  part
1663       of the pattern.       of the pattern.
1664    
1665         If you want to remove the special meaning from a sequence of
1666         characters, you can do so by putting them between \Q and \E.
1667         This is different from Perl in that $ and @ are  handled  as
1668         literals  in  \Q...\E  sequences in PCRE, whereas in Perl, $
1669         and @ cause variable interpolation. Note the following exam-
1670         ples:
1671    
1672           Pattern            PCRE matches   Perl matches
1673    
1674           \Qabc$xyz\E        abc$xyz        abc followed by the
1675    
1676                                               contents of $xyz
1677           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1678           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1679    
1680         The \Q...\E sequence is recognized both inside  and  outside
1681         character classes.
1682    
1683       A second use of backslash provides a way  of  encoding  non-       A second use of backslash provides a way  of  encoding  non-
1684       printing  characters  in patterns in a visible manner. There       printing  characters  in patterns in a visible manner. There
1685       is no restriction on the appearance of non-printing  charac-       is no restriction on the appearance of non-printing  charac-
# Line 869  BACKSLASH Line 1688  BACKSLASH
1688       usually  easier to use one of the following escape sequences       usually  easier to use one of the following escape sequences
1689       than the binary character it represents:       than the binary character it represents:
1690    
1691         \a     alarm, that is, the BEL character (hex 07)         \a        alarm, that is, the BEL character (hex 07)
1692         \cx    "control-x", where x is any character         \cx       "control-x", where x is any character
1693         \e     escape (hex 1B)         \e        escape (hex 1B)
1694         \f     formfeed (hex 0C)         \f        formfeed (hex 0C)
1695         \n     newline (hex 0A)         \n        newline (hex 0A)
1696         \r     carriage return (hex 0D)         \r        carriage return (hex 0D)
1697         \t     tab (hex 09)         \t        tab (hex 09)
1698         \xhh   character with hex code hh         \ddd      character with octal code ddd, or backreference
1699         \ddd   character with octal code ddd, or backreference         \xhh      character with hex code hh
1700           \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1701    
1702       The precise effect of "\cx" is as follows: if "x" is a lower       The precise effect of \cx is as follows: if  x  is  a  lower
1703       case  letter,  it  is converted to upper case. Then bit 6 of       case  letter,  it  is converted to upper case. Then bit 6 of
1704       the character (hex 40) is inverted.  Thus "\cz" becomes  hex       the character (hex 40) is inverted.  Thus  \cz  becomes  hex
1705       1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.       1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
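
The \cx rule can be written out in a few lines of C. This is only an illustration of the rule as stated (convert lower case to upper case, then invert bit 6); the function names are invented here and the code assumes an ASCII character set and the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>

/* Return the character that \cx stands for, following the rule above. */
unsigned char control_escape(unsigned char x)
{
    if (islower(x))
        x = (unsigned char)toupper(x);
    return (unsigned char)(x ^ 0x40);   /* invert bit 6 (hex 40) */
}

/* Check the three examples given in the text. */
void control_escape_examples(void)
{
    assert(control_escape('z') == 0x1A);
    assert(control_escape('{') == 0x3B);
    assert(control_escape(';') == 0x7B);
}
```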
1706    
1707       After "\x", up to two hexadecimal digits are  read  (letters       After \x, from zero  to  two  hexadecimal  digits  are  read
1708       can be in upper or lower case).       (letters  can be in upper or lower case). In UTF-8 mode, any
1709         number of hexadecimal digits may appear between \x{  and  },
1710         but  the value of the character code must be less than 2**31
1711         (that is, the maximum hexadecimal  value  is  7FFFFFFF).  If
1712         characters  other than hexadecimal digits appear between \x{
1713         and }, or if there is no terminating }, this form of  escape
1714         is  not  recognized.  Instead, the initial \x will be inter-
1715         preted as a basic  hexadecimal  escape,  with  no  following
1716         digits, giving a byte whose value is zero.
1717    
1718         Characters whose value is less than 256 can  be  defined  by
1719         either  of  the  two  syntaxes  for \x when PCRE is in UTF-8
1720         mode. There is no difference in the way  they  are  handled.
1721         For example, \xdc is exactly the same as \x{dc}.
1722    
1723       After "\0" up to two further octal digits are read. In  both       After \0 up to two further octal digits are  read.  In  both
1724       cases,  if  there are fewer than two digits, just those that       cases,  if  there are fewer than two digits, just those that
1725       are present are used. Thus the sequence "\0\x\07"  specifies       are present are used. Thus the  sequence  \0\x\07  specifies
1726       two binary zeros followed by a BEL character.  Make sure you       two binary zeros followed by a BEL character (code value 7).
1727       supply two digits after the initial zero  if  the  character       Make sure you supply two digits after the  initial  zero  if
1728       that follows is itself an octal digit.       the character that follows is itself an octal digit.
1729    
1730       The handling of a backslash followed by a digit other than 0       The handling of a backslash followed by a digit other than 0
1731       is  complicated.   Outside  a character class, PCRE reads it       is  complicated.   Outside  a character class, PCRE reads it
# Line 918  BACKSLASH Line 1751  BACKSLASH
1751                   writing a tab                   writing a tab
1752         \011   is always a tab         \011   is always a tab
1753         \0113  is a tab followed by the character "3"         \0113  is a tab followed by the character "3"
1754         \113   is the character with octal code 113 (since there         \113   might be a back reference, otherwise the
1755                   can be no more than 99 back references)                   character with octal code 113
1756         \377   is a byte consisting entirely of 1 bits         \377   might be a back reference, otherwise
1757                     the byte consisting entirely of 1 bits
1758         \81    is either a back reference, or a binary zero         \81    is either a back reference, or a binary zero
1759                   followed by the two characters "8" and "1"                   followed by the two characters "8" and "1"
1760    
# Line 928  BACKSLASH Line 1762  BACKSLASH
1762       duced  by  a  leading zero, because no more than three octal       duced  by  a  leading zero, because no more than three octal
1763       digits are ever read.       digits are ever read.
1764    
1765       All the sequences that define a single  byte  value  can  be       All the sequences that define a single byte value or a  sin-
1766       used both inside and outside character classes. In addition,       gle  UTF-8 character (in UTF-8 mode) can be used both inside
1767       inside a character class, the sequence "\b"  is  interpreted       and outside character classes. In addition, inside a charac-
1768       as  the  backspace  character  (hex 08). Outside a character       ter  class,  the sequence \b is interpreted as the backspace
1769       class it has a different meaning (see below).       character (hex 08). Outside a character class it has a  dif-
1770         ferent meaning (see below).
1771    
1772       The third use of backslash is for specifying generic charac-       The third use of backslash is for specifying generic charac-
1773       ter types:       ter types:
# Line 942  BACKSLASH Line 1777  BACKSLASH
1777         \s     any whitespace character         \s     any whitespace character
1778         \S     any character that is not a whitespace character         \S     any character that is not a whitespace character
1779         \w     any "word" character         \w     any "word" character
1780         \W     any "non-word" character         \W     any "non-word" character
1781    
1782       Each pair of escape sequences partitions the complete set of       Each pair of escape sequences partitions the complete set of
1783       characters  into  two  disjoint  sets.  Any  given character       characters  into  two  disjoint  sets.  Any  given character
1784       matches one, and only one, of each pair.       matches one, and only one, of each pair.
1785    
1786         In UTF-8 mode, characters with values greater than 255 never
1787         match \d, \s, or \w, and always match \D, \S, and \W.
1788    
1789         For compatibility with Perl, \s does not match the VT  char-
1790         acter (code 11).  This makes it different  from  the  POSIX
1791         "space" class. The \s characters are HT  (9),  LF  (10),  FF
1792         (12), CR (13), and space (32).
1793    
1794       A "word" character is any letter or digit or the  underscore       A "word" character is any letter or digit or the  underscore
1795       character,  that  is,  any  character which can be part of a       character,  that  is,  any  character which can be part of a
1796       Perl "word". The definition of letters and  digits  is  con-       Perl "word". The definition of letters and  digits  is  con-
1797       trolled  by PCRE's character tables, and may vary if locale-       trolled  by PCRE's character tables, and may vary if locale-
1798       specific matching is  taking  place  (see  "Locale  support"       specific matching is taking place (see "Locale  support"  in
1799       above). For example, in the "fr" (French) locale, some char-       the pcreapi page). For example, in the "fr" (French) locale,
1800       acter codes greater than 128 are used for accented  letters,       some character codes greater than 128 are used for  accented
1801       and these are matched by \w.       letters, and these are matched by \w.
1802    
1803       These character type sequences can appear  both  inside  and       These character type sequences can appear  both  inside  and
1804       outside  character classes. They each match one character of       outside  character classes. They each match one character of
# Line 970  BACKSLASH Line 1813  BACKSLASH
1813       for more complicated  assertions  is  described  below.  The       for more complicated  assertions  is  described  below.  The
1814       backslashed assertions are       backslashed assertions are
1815    
1816         \b     word boundary         \b     matches at a word boundary
1817         \B     not a word boundary         \B     matches when not at a word boundary
1818         \A     start of subject (independent of multiline mode)         \A     matches at start of subject
1819         \Z     end of subject or newline at  end  (independent  of         \Z     matches at end of subject or before newline at end
1820       multiline mode)         \z     matches at end of subject
1821         \z     end of subject (independent of multiline mode)         \G     matches at first matching position in subject
1822    
1823       These assertions may not appear in  character  classes  (but       These assertions may not appear in  character  classes  (but
1824       note that "\b" has a different meaning, namely the backspace       note  that  \b has a different meaning, namely the backspace
1825       character, inside a character class).       character, inside a character class).
1826    
1827       A word boundary is a position in the  subject  string  where       A word boundary is a position in the  subject  string  where
# Line 986  BACKSLASH Line 1829  BACKSLASH
1829       match \w or \W (i.e. one matches \w and  the  other  matches       match \w or \W (i.e. one matches \w and  the  other  matches
1830       \W),  or the start or end of the string if the first or last       \W),  or the start or end of the string if the first or last
1831       character matches \w, respectively.       character matches \w, respectively.
   
1832       The \A, \Z, and \z assertions differ  from  the  traditional       The \A, \Z, and \z assertions differ  from  the  traditional
1833       circumflex  and  dollar  (described below) in that they only       circumflex  and  dollar  (described below) in that they only
1834       ever match at the very start and end of the subject  string,       ever match at the very start and end of the subject  string,
1835       whatever  options  are  set.  They  are  not affected by the       whatever options are set. Thus, they are independent of mul-
1836       PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-       tiline mode.
1837       ment  of  pcre_exec()  is  non-zero, \A can never match. The  
1838         They are not affected  by  the  PCRE_NOTBOL  or  PCRE_NOTEOL
1839         options.  If the startoffset argument of pcre_exec() is non-
1840         zero, indicating that matching is to start at a point  other
1841         than  the  beginning of the subject, \A can never match. The
1842       difference between \Z and \z is that  \Z  matches  before  a       difference between \Z and \z is that  \Z  matches  before  a
1843       newline  that is the last character of the string as well as       newline  that is the last character of the string as well as
1844       at the end of the string, whereas \z  matches  only  at  the       at the end of the string, whereas \z  matches  only  at  the
1845       end.       end.
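
The difference between \Z and \z can be stated as a set of subject positions. The sketch below is an illustration in Python of the rule just described, not PCRE's implementation; the function name end_assertion_positions is hypothetical, and newline is assumed to be the single character "\n".

```python
def end_assertion_positions(subject: str):
    """Return (offsets where \\Z may match, offsets where \\z may match)."""
    end = len(subject)
    lower_z = [end]          # \z: only at the very end of the subject
    upper_z = [end]          # \Z: at the very end ...
    if subject.endswith("\n"):
        # ... and also just before a newline that is the last character
        upper_z.insert(0, end - 1)
    return upper_z, lower_z


print(end_assertion_positions("abc\n"))  # ([3, 4], [4])
print(end_assertion_positions("abc"))    # ([3], [3])
```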
1846    
1847         The \G assertion is true  only  when  the  current  matching
1848         position is at the start point of the match, as specified by
1849         the startoffset argument of pcre_exec(). It differs from  \A
1850         when  the  value  of  startoffset  is  non-zero.  By calling
1851         pcre_exec() multiple times with appropriate  arguments,  you
1852         can mimic Perl's /g option, and it is in this kind of imple-
1853         mentation where \G can be useful.
1854    
1855         Note, however, that PCRE's  interpretation  of  \G,  as  the
1856         start of the current match, is subtly different from Perl's,
1857         which defines it as the end of the previous match. In  Perl,
1858         these  can  be  different when the previously matched string
1859         was empty. Because PCRE does just one match at  a  time,  it
1860         cannot reproduce this behaviour.
1861    
1862         If all the alternatives of a  pattern  begin  with  \G,  the
1863         expression  is  anchored to the starting match position, and
1864         the "anchored" flag is set in the compiled  regular  expres-
1865         sion.
1866    
1867    
1868  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
1869    
1870       Outside a character class, in the default matching mode, the       Outside a character class, in the default matching mode, the
1871       circumflex  character  is an assertion which is true only if       circumflex  character  is an assertion which is true only if
1872       the current matching point is at the start  of  the  subject       the current matching point is at the start  of  the  subject
1873       string.  If  the startoffset argument of pcre_exec() is non-       string.  If  the startoffset argument of pcre_exec() is non-
1874       zero, circumflex can never match. Inside a character  class,       zero, circumflex  can  never  match  if  the  PCRE_MULTILINE
1875       circumflex has an entirely different meaning (see below).       option is unset. Inside a character class, circumflex has an
1876         entirely different meaning (see below).
1877    
1878       Circumflex need not be the first character of the pattern if       Circumflex need not be the first character of the pattern if
1879       a  number of alternatives are involved, but it should be the       a  number of alternatives are involved, but it should be the
# Line 1028  CIRCUMFLEX AND DOLLAR Line 1895  CIRCUMFLEX AND DOLLAR
1895    
1896       The meaning of dollar can be changed so that it matches only       The meaning of dollar can be changed so that it matches only
1897       at   the   very   end   of   the   string,  by  setting  the       at   the   very   end   of   the   string,  by  setting  the
1898       PCRE_DOLLAR_ENDONLY option at compile or matching time. This       PCRE_DOLLAR_ENDONLY option at compile time.  This  does  not
1899       does not affect the \Z assertion.       affect the \Z assertion.
1900    
1901       The meanings of the circumflex  and  dollar  characters  are       The meanings of the circumflex  and  dollar  characters  are
1902       changed  if  the  PCRE_MULTILINE option is set. When this is       changed  if  the  PCRE_MULTILINE option is set. When this is
1903       the case,  they  match  immediately  after  and  immediately       the case,  they  match  immediately  after  and  immediately
1904       before an internal "\n" character, respectively, in addition       before an internal newline character, respectively, in addi-
1905       to matching at the start and end of the subject string.  For       tion to matching at the start and end of the subject string.
1906       example,  the  pattern  /^abc$/  matches  the subject string       For  example, the pattern /^abc$/ matches the subject string
1907       "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-       "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-
1908       quently,  patterns  that  are  anchored  in single line mode       quently,  patterns  that  are  anchored  in single line mode
1909       because all branches start with "^" are not anchored in mul-       because all branches start with ^ are not anchored in multi-
1910       tiline mode, and a match for circumflex is possible when the       line  mode,  and a match for circumflex is possible when the
1911       startoffset  argument  of  pcre_exec()  is   non-zero.   The       startoffset  argument  of  pcre_exec()  is   non-zero.   The
1912       PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is       PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is
1913       set.       set.
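
The /^abc$/ example can be tried out with Python's re module, used here only as a stand-in: for this particular pattern its re.MULTILINE flag behaves like PCRE_MULTILINE, and the pos argument of search() plays a role similar to startoffset. Both correspondences are assumptions of this sketch; it does not call PCRE.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)
print(single)  # None

# Multiline mode: ^ also matches immediately after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)
print(multi.span())  # (4, 7)

# With a non-zero starting position, ^ can still match in multiline mode.
offset = re.compile(r"^abc$", re.MULTILINE).search(subject, 4)
print(offset.span())  # (4, 7)
```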
1914    
1915       Note that the sequences \A, \Z, and \z can be used to  match       Note that the sequences \A, \Z, and \z can be used to  match
1916       the  start  and end of the subject in both modes, and if all       the  start  and end of the subject in both modes, and if all
1917       branches of a pattern start with \A is it  always  anchored,       branches of a pattern start with \A it is  always  anchored,
1918       whether PCRE_MULTILINE is set or not.       whether PCRE_MULTILINE is set or not.
1919    
1920    
   
1921  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
1922    
1923       Outside a character class, a dot in the pattern matches  any       Outside a character class, a dot in the pattern matches  any
1924       one character in the subject, including a non-printing char-       one character in the subject, including a non-printing char-
1925       acter, but not (by default)  newline.   If  the  PCRE_DOTALL       acter, but not (by default) newline.  In UTF-8 mode,  a  dot
1926       option  is set, dots match newlines as well. The handling of       matches  any  UTF-8  character, which might be more than one
1927       dot is entirely independent of the  handling  of  circumflex       byte  long,  except  (by  default)  for  newline.   If   the
1928       and  dollar,  the  only  relationship  being  that they both       PCRE_DOTALL  option is set, dots match newlines as well. The
1929       involve newline characters. Dot has no special meaning in  a       handling of dot is entirely independent of the  handling  of
1930       character class.       circumflex and dollar, the only relationship being that they
1931         both involve newline characters. Dot has no special  meaning
1932         in a character class.
1933    
1934    
1935    
1936    MATCHING A SINGLE BYTE
1937    
1938         Outside a character class, the escape  sequence  \C  matches
1939         any  one  byte, both in and out of UTF-8 mode. Unlike a dot,
1940         it always matches a newline. The feature is provided in Perl
1941         in  order  to match individual bytes in UTF-8 mode.  Because
1942         it breaks up UTF-8 characters into  individual  bytes,  what
1943         remains  in  the string may be a malformed UTF-8 string. For
1944         this reason it is best avoided.
1945    
1946         PCRE does not allow \C to appear  in  lookbehind  assertions
1947         (see below), because in UTF-8 mode it makes it impossible to
1948         calculate the length of the lookbehind.
1949    
1950    
1951  SQUARE BRACKETS  SQUARE BRACKETS
1952    
1953       An opening square bracket introduces a character class, ter-       An opening square bracket introduces a character class, ter-
1954       minated  by  a  closing  square  bracket.  A  closing square       minated  by  a  closing  square  bracket.  A  closing square
1955       bracket on its own is  not  special.  If  a  closing  square       bracket on its own is  not  special.  If  a  closing  square
# Line 1072  SQUARE BRACKETS Line 1957  SQUARE BRACKETS
1957       the first data character in the class (after an initial cir-       the first data character in the class (after an initial cir-
1958       cumflex, if present) or escaped with a backslash.       cumflex, if present) or escaped with a backslash.
1959    
1960       A character class matches a single character in the subject;       A character class matches a single character in the subject.
1961       the  character  must  be in the set of characters defined by       In  UTF-8 mode, the character may occupy more than one byte.
1962       the class, unless the first character in the class is a cir-       A matched character must be in the set of characters defined
1963       cumflex,  in which case the subject character must not be in       by the class, unless the first character in the class defin-
1964       the set defined by the class. If a  circumflex  is  actually       ition is a circumflex, in which case the  subject  character
1965       required  as  a  member  of  the class, ensure it is not the       must not be in the set defined by the class. If a circumflex
1966       first character, or escape it with a backslash.       is actually required as a member of the class, ensure it  is
1967         not the first character, or escape it with a backslash.
1968    
1969       For example, the character class [aeiou] matches  any  lower       For example, the character class [aeiou] matches  any  lower
1970       case vowel, while [^aeiou] matches any character that is not       case vowel, while [^aeiou] matches any character that is not
# Line 1089  SQUARE BRACKETS Line 1975  SQUARE BRACKETS
1975       string, and fails if the current pointer is at  the  end  of       string, and fails if the current pointer is at  the  end  of
1976       the string.       the string.
1977    
1978         In UTF-8 mode, characters with values greater than  255  can
1979         be  included  in a class as a literal string of bytes, or by
1980         using the \x{ escaping mechanism.
1981    
1982       When caseless matching  is  set,  any  letters  in  a  class       When caseless matching  is  set,  any  letters  in  a  class
1983       represent  both their upper case and lower case versions, so       represent  both their upper case and lower case versions, so
1984       for example, a caseless [aeiou] matches "A" as well as  "a",       for example, a caseless [aeiou] matches "A" as well as  "a",
1985       and  a caseless [^aeiou] does not match "A", whereas a case-       and  a caseless [^aeiou] does not match "A", whereas a case-
1986       ful version would.       ful version would. PCRE does not support the concept of case
1987         for characters with values greater than 255.
1988       The newline character is never treated in any special way in       The newline character is never treated in any special way in
1989       character  classes,  whatever the setting of the PCRE_DOTALL       character  classes,  whatever the setting of the PCRE_DOTALL
1990       or PCRE_MULTILINE options is. A  class  such  as  [^a]  will       or PCRE_MULTILINE options is. A  class  such  as  [^a]  will
# Line 1118  SQUARE BRACKETS Line 2008  SQUARE BRACKETS
2008       separate characters. The octal or hexadecimal representation       separate characters. The octal or hexadecimal representation
2009       of "]" can also be used to end a range.       of "]" can also be used to end a range.
2010    
2011       Ranges operate in ASCII collating sequence. They can also be       Ranges  operate  in  the  collating  sequence  of  character
2012       used  for  characters  specified  numerically,  for  example       values.  They  can  also  be  used  for characters specified
2013       [\000-\037]. If a range that includes letters is  used  when       numerically, for example [\000-\037]. In UTF-8 mode,  ranges
2014       caseless  matching  is set, it matches the letters in either       can  include  characters  whose values are greater than 255,
2015       case. For example, [W-c] is equivalent  to  [][\^_`wxyzabc],       for example [\x{100}-\x{2ff}].
2016       matched  caselessly,  and  if  character tables for the "fr"  
2017       locale are in use, [\xc8-\xcb] matches accented E characters       If a range that  includes  letters  is  used  when  caseless
2018       in both cases.       matching  is set, it matches the letters in either case. For
2019         example, [W-c] is  equivalent  to  [][\\^_`wxyzabc],  matched
2020         caselessly,  and if character tables for the "fr" locale are
2021         in use, [\xc8-\xcb] matches accented E  characters  in  both
2022         cases.
2023    
2024       The character types \d, \D, \s, \S,  \w,  and  \W  may  also       The character types \d, \D, \s, \S,  \w,  and  \W  may  also
2025       appear  in  a  character  class, and add the characters that       appear  in  a  character  class, and add the characters that
# Line 1141  SQUARE BRACKETS Line 2035  SQUARE BRACKETS
2035       classes, but it does no harm if they are escaped.       classes, but it does no harm if they are escaped.
2036    
2037    
   
2038  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
2039       Perl 5.6 (not yet released at the time of writing) is  going  
2040       to  support  the POSIX notation for character classes, which       Perl supports the  POSIX  notation  for  character  classes,
2041       uses names enclosed by  [:  and  :]   within  the  enclosing       which  uses names enclosed by [: and :] within the enclosing
2042       square brackets. PCRE supports this notation. For example,       square brackets. PCRE also supports this notation. For exam-
2043         ple,
2044    
2045         [01[:alpha:]%]         [01[:alpha:]%]
2046    
# Line 1156  POSIX CHARACTER CLASSES Line 2050  POSIX CHARACTER CLASSES
2050         alnum    letters and digits         alnum    letters and digits
2051         alpha    letters         alpha    letters
2052         ascii    character codes 0 - 127         ascii    character codes 0 - 127
2053           blank    space or tab only
2054         cntrl    control characters         cntrl    control characters
2055         digit    decimal digits (same as \d)         digit    decimal digits (same as \d)
2056         graph    printing characters, excluding space         graph    printing characters, excluding space
2057         lower    lower case letters         lower    lower case letters
2058         print    printing characters, including space         print    printing characters, including space
2059         punct    printing characters, excluding letters and digits         punct    printing characters, excluding letters and digits
2060         space    white space (same as \s)         space    white space (not quite the same as \s)
2061         upper    upper case letters         upper    upper case letters
2062         word     "word" characters (same as \w)         word     "word" characters (same as \w)
2063         xdigit   hexadecimal digits         xdigit   hexadecimal digits
2064    
2065       The names "ascii" and "word" are  Perl  extensions.  Another       The "space" characters are HT (9),  LF  (10),  VT  (11),  FF
2066       Perl  extension is negation, which is indicated by a ^ char-       (12),  CR  (13),  and  space  (32).  Notice  that  this list
2067       acter after the colon. For example,       includes the VT character (code 11). This makes "space" dif-
2068         ferent  to  \s, which does not include VT (for Perl compati-
2069         bility).
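
The relationship between \s and [:space:] amounts to a one-element set difference. The sketch below writes out the two lists of character codes given in this document as Python sets; the names PCRE_BACKSLASH_S and POSIX_SPACE are hypothetical, and no regular expression engine is involved.

```python
# Character codes matched by \s, as listed in this document.
PCRE_BACKSLASH_S = {9, 10, 12, 13, 32}      # HT, LF, FF, CR, space

# Character codes in the POSIX "space" class, as listed above.
POSIX_SPACE = {9, 10, 11, 12, 13, 32}       # HT, LF, VT, FF, CR, space

# The only code in [:space:] that \s does not match is VT.
print(sorted(POSIX_SPACE - PCRE_BACKSLASH_S))  # [11]
print(PCRE_BACKSLASH_S <= POSIX_SPACE)         # True
```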
2070    
2071         The name "word" is a Perl extension, and "blank"  is  a  GNU
2072         extension from Perl 5.8. Another Perl extension is negation,
2073         which is indicated by a ^ character  after  the  colon.  For
2074         example,
2075    
2076         [12[:^digit:]]         [12[:^digit:]]
2077    
2078       matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also       matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also
2079       recogize  the POSIX syntax [.ch.] and [=ch=] where "ch" is a       recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
2080       "collating element", but these are  not  supported,  and  an       "collating element", but these are  not  supported,  and  an
2081       error is given if they are encountered.       error is given if they are encountered.
2082    
2083         In UTF-8 mode, characters with values greater  than  255  do
2084         not match any of the POSIX character classes.
2085    
2086    
2087  VERTICAL BAR  VERTICAL BAR
2088    
2089       Vertical bar characters are  used  to  separate  alternative       Vertical bar characters are  used  to  separate  alternative
2090       patterns. For example, the pattern       patterns. For example, the pattern
2091    
# Line 1196  VERTICAL BAR Line 2101  VERTICAL BAR
2101       subpattern.       subpattern.
2102    
2103    
   
2104  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
2105       The settings of PCRE_CASELESS, PCRE_MULTILINE,  PCRE_DOTALL,  
2106       and  PCRE_EXTENDED can be changed from within the pattern by       The   settings   of   the   PCRE_CASELESS,   PCRE_MULTILINE,
2107       a sequence of Perl option letters enclosed between "(?"  and       PCRE_DOTALL,  and  PCRE_EXTENDED options can be changed from
2108       ")". The option letters are       within the pattern by a  sequence  of  Perl  option  letters
2109         enclosed between "(?" and ")". The option letters are
2110    
2111         i  for PCRE_CASELESS         i  for PCRE_CASELESS
2112         m  for PCRE_MULTILINE         m  for PCRE_MULTILINE
# Line 1216  INTERNAL OPTION SETTING Line 2121  INTERNAL OPTION SETTING
2121       If  a  letter  appears both before and after the hyphen, the       If  a  letter  appears both before and after the hyphen, the
2122       option is unset.       option is unset.
2123    
2124       The scope of these option changes depends on  where  in  the       When an option change occurs at  top  level  (that  is,  not
2125       pattern  the  setting  occurs. For settings that are outside       inside  subpattern  parentheses),  the change applies to the
2126       any subpattern (defined below), the effect is the same as if       remainder of the pattern that follows.   If  the  change  is
2127       the  options were set or unset at the start of matching. The       placed  right  at  the  start of a pattern, PCRE extracts it
2128       following patterns all behave in exactly the same way:       into the global options (and it will therefore  show  up  in
2129         data extracted by the pcre_fullinfo() function).
2130         (?i)abc  
2131         a(?i)bc       An option change within a subpattern affects only that  part
2132         ab(?i)c       of the current pattern that follows it, so
        abc(?i)  
   
      which in turn is the same as compiling the pattern abc  with  
      PCRE_CASELESS  set.   In  other words, such "top level" set-  
      tings apply to the whole pattern  (unless  there  are  other  
      changes  inside subpatterns). If there is more than one set-  
      ting of the same option at top level, the rightmost  setting  
      is used.  
   
      If an option change occurs inside a subpattern,  the  effect  
      is  different.  This is a change of behaviour in Perl 5.005.  
      An option change inside a subpattern affects only that  part  
      of the subpattern that follows it, so  
2133    
2134         (a(?i)b)c         (a(?i)b)c
2135    
# Line 1264  INTERNAL OPTION SETTING Line 2156  INTERNAL OPTION SETTING
2156       even when it is at top level. It is best put at the start.       even when it is at top level. It is best put at the start.
2157    
2158    
   
2159  SUBPATTERNS  SUBPATTERNS
2160    
2161       Subpatterns are delimited by parentheses  (round  brackets),       Subpatterns are delimited by parentheses  (round  brackets),
2162       which can be nested.  Marking part of a pattern as a subpat-       which can be nested.  Marking part of a pattern as a subpat-
2163       tern does two things:       tern does two things:
# Line 1293  SUBPATTERNS Line 2185  SUBPATTERNS
2185         the ((red|white) (king|queen))         the ((red|white) (king|queen))
2186    
2187       the captured substrings are "red king", "red",  and  "king",       the captured substrings are "red king", "red",  and  "king",
2188       and are numbered 1, 2, and 3.       and are numbered 1, 2, and 3, respectively.
2189    
2190       The fact that plain parentheses fulfil two functions is  not       The fact that plain parentheses fulfil two functions is  not
2191       always  helpful.  There are often times when a grouping sub-       always  helpful.  There are often times when a grouping sub-
2192       pattern is required without a capturing requirement.  If  an       pattern is required without a capturing requirement.  If  an
2193       opening parenthesis is followed by "?:", the subpattern does       opening  parenthesis  is  followed  by a question mark and a
2194       not do any capturing, and is not counted when computing  the       colon, the subpattern does not do any capturing, and is  not
2195       number of any subsequent capturing subpatterns. For example,       counted  when computing the number of any subsequent captur-
2196       if the string "the white queen" is matched against the  pat-       ing subpatterns. For  example,  if  the  string  "the  white
2197       tern       queen" is matched against the pattern
2198    
2199         the ((?:red|white) (king|queen))         the ((?:red|white) (king|queen))
2200    
2201       the captured substrings are "white queen" and  "queen",  and       the captured substrings are "white queen" and  "queen",  and
2202       are  numbered  1  and 2. The maximum number of captured sub-       are  numbered  1 and 2. The maximum number of capturing sub-
2203       strings is 99, and the maximum number  of  all  subpatterns,       patterns is 65535, and the maximum depth of nesting  of  all
2204       both capturing and non-capturing, is 200.       subpatterns, both capturing and non-capturing, is 200.
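
The numbering of captured substrings in the two examples above can be reproduced with Python's re module, which uses the same (...) and (?:...) syntax for these patterns. This is a sketch with a stand-in engine, not a call into PCRE.

```python
import re

# Plain parentheses both group and capture; numbering follows the
# opening parentheses from left to right.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())  # ('red king', 'red', 'king')

# (?:...) groups without capturing and is skipped when numbering.
grouped = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
print(grouped.groups())  # ('white queen', 'queen')
```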
2205    
2206       As a  convenient  shorthand,  if  any  option  settings  are       As a  convenient  shorthand,  if  any  option  settings  are
2207       required  at  the  start  of a non-capturing subpattern, the       required  at  the  start  of a non-capturing subpattern, the
# Line 1326  SUBPATTERNS Line 2218  SUBPATTERNS
2218       the above patterns match "SUNDAY" as well as "Saturday".       the above patterns match "SUNDAY" as well as "Saturday".
2219    
2220    
2221    NAMED SUBPATTERNS
2222    
2223         Identifying capturing parentheses by number is  simple,  but
2224         it  can be very hard to keep track of the numbers in compli-
2225         cated regular expressions. Furthermore, if an expression  is
2226         modified,  the  numbers  may change. To help with the diffi-
2227         culty, PCRE supports the naming  of  subpatterns,  something
2228         that  Perl does not provide. The Python syntax (?P<name>...)
2229         is used. Names consist of alphanumeric characters and under-
2230         scores, and must be unique within a pattern.
2231    
2232         Named capturing parentheses are still allocated  numbers  as
2233         well  as  names.  The  PCRE  API provides function calls for
2234         extracting the name-to-number translation table from a  com-
2235         piled  pattern. For further details see the pcreapi documen-
2236         tation.
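
As a rough illustration of names and numbers coexisting, the sketch below uses Python's re module, which accepts the same (?P<name>...) syntax. The groupindex attribute is Python's own name-to-number mapping and only stands in for the translation table mentioned above; the PCRE calls themselves are described in the pcreapi documentation and are not shown here. The pattern and group names are made up for the example.

```python
import re

# Two named groups and one unnamed group; all three are still numbered.
pattern = re.compile(r"(?P<day>\d{2})-(?P<month>[a-z]{3})-(\d{2})")

# Python's name-to-number mapping (an analogue of PCRE's table, not the PCRE API).
name_to_number = dict(pattern.groupindex)

match = pattern.fullmatch("24-feb-07")
by_name = match.group("month")
by_number = match.group(2)

print(name_to_number, by_name, by_number)
```

Looking a group up by name or by its allocated number returns the same substring.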
2237    
2238    
2239  REPETITION  REPETITION
2240    
2241       Repetition is specified by quantifiers, which can follow any       Repetition is specified by quantifiers, which can follow any
2242       of the following items:       of the following items:
2243    
2244         a single character, possibly escaped         a literal data character
2245         the . metacharacter         the . metacharacter
2246           the \C escape sequence
2247           escapes such as \d that match single characters
2248         a character class         a character class
2249         a back reference (see next section)         a back reference (see next section)
2250         a parenthesized subpattern (unless it is  an  assertion  -         a parenthesized subpattern (unless it is an assertion)
      see below)  
2251    
2252       The general repetition quantifier specifies  a  minimum  and       The general repetition quantifier specifies  a  minimum  and
2253       maximum  number  of  permitted  matches,  by  giving the two       maximum  number  of  permitted  matches,  by  giving the two
# Line 1365  REPETITION Line 2276  REPETITION
2276       as  a literal character. For example, {,6} is not a quantif-       as  a literal character. For example, {,6} is not a quantif-
2277       ier, but a literal string of four characters.       ier, but a literal string of four characters.
2278    
2279         In UTF-8 mode, quantifiers apply to UTF-8 characters  rather
2280         than  to  individual  bytes.  Thus,  for example, \x{100}{2}
2281         matches two UTF-8 characters, each of which  is  represented
2282         by a two-byte sequence.
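
The byte counts in this example can be verified with a small sketch. Python's re module is used as a stand-in (an assumption; it always works on characters, much as PCRE does in UTF-8 mode), and \u0100 is Python's spelling of the character written \x{100} above.

```python
import re

ch = "\u0100"  # the character written \x{100} in PCRE syntax

# Each such character occupies two bytes when encoded as UTF-8.
utf8_len = len(ch.encode("utf-8"))

# The quantifier {2} counts characters, not bytes.
m = re.fullmatch("\u0100{2}", ch * 2)
matched_chars = len(m.group(0))
matched_bytes = len(m.group(0).encode("utf-8"))

print(utf8_len, matched_chars, matched_bytes)
```

Two characters are matched, which amounts to four bytes of UTF-8.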
2283    
2284       The quantifier {0} is permitted, causing the  expression  to       The quantifier {0} is permitted, causing the  expression  to
2285       behave  as  if the previous item and the quantifier were not       behave  as  if the previous item and the quantifier were not
2286       present.       present.
# Line 1403  REPETITION Line 2319  REPETITION
2319    
2320         /* first command */  not comment  /* second comment */         /* first command */  not comment  /* second comment */
2321    
2322       fails, because it matches  the  entire  string  due  to  the       fails, because it matches the entire  string  owing  to  the
2323       greediness of the .*  item.       greediness of the .*  item.
2324    
2325       However, if a quantifier is followed by a question mark,  it       However, if a quantifier is followed by a question mark,  it
# Line 1434  REPETITION Line 2350  REPETITION
2350       repeat  count  that is greater than 1 or with a limited max-       repeat  count  that is greater than 1 or with a limited max-
2351       imum, more store is required for the  compiled  pattern,  in       imum, more store is required for the  compiled  pattern,  in
2352       proportion to the size of the minimum or maximum.       proportion to the size of the minimum or maximum.
   
2353       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
2354       option (equivalent to Perl's /s) is set, thus allowing the .       option (equivalent to Perl's /s) is set, thus allowing the .
2355       to match  newlines,  the  pattern  is  implicitly  anchored,       to match  newlines,  the  pattern  is  implicitly  anchored,
2356       because whatever follows will be tried against every charac-       because whatever follows will be tried against every charac-
2357       ter position in the subject string, so there is no point  in       ter position in the subject string, so there is no point  in
2358       retrying  the overall match at any position after the first.       retrying  the overall match at any position after the first.
2359       PCRE treats such a pattern as though it were preceded by \A.       PCRE normally treats such a pattern as though it  were  pre-
2360       In  cases where it is known that the subject string contains       ceded by \A.
2361       no newlines, it is worth setting PCRE_DOTALL when  the  pat-  
2362       tern begins with .* in order to obtain this optimization, or       In cases where it is known that the subject string  contains
2363       alternatively using ^ to indicate anchoring explicitly.       no  newlines,  it  is  worth setting PCRE_DOTALL in order to
2364         obtain this optimization, or alternatively using ^ to  indi-
2365         cate anchoring explicitly.
2366    
2367         However, there is one situation where the optimization  can-
2368         not  be  used. When .*  is inside capturing parentheses that
2369         are the subject of a backreference elsewhere in the pattern,
2370         a match at the start may fail, and a later one succeed. Con-
2371         sider, for example:
2372    
2373           (.*)abc\1
2374    
2375         If the subject is "xyz123abc123"  the  match  point  is  the
2376         fourth  character.  For  this  reason, such a pattern is not
2377         implicitly anchored.
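
The match point in this example can be confirmed with a short sketch, again using Python's re module as a stand-in for PCRE (an assumption for illustration). re.search tries successive start positions, whereas re.match tries only the start of the subject, which is what implicit anchoring would amount to.

```python
import re

subject = "xyz123abc123"

# Unanchored search: later start positions are tried after the first fails.
found = re.search(r"(.*)abc\1", subject, re.DOTALL)
start_offset = found.start()      # 0-based offset of the match
captured = found.group(1)

# Trying only the start of the subject misses the match altogether.
anchored = re.match(r"(.*)abc\1", subject, re.DOTALL)

print(start_offset, captured, anchored)
```

The search succeeds at offset 3 (the fourth character) with "123" captured, while the anchored attempt finds nothing.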
2378    
2379       When a capturing subpattern is repeated, the value  captured       When a capturing subpattern is repeated, the value  captured
2380       is the substring that matched the final iteration. For exam-       is the substring that matched the final iteration. For exam-
# Line 1465  REPETITION Line 2394  REPETITION
2394       "b".       "b".
2395    
2396    
2397    ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2398    
2399         With both maximizing and minimizing repetition,  failure  of
2400         what  follows  normally  causes  the repeated item to be re-
2401         evaluated to see if a different number of repeats allows the
2402         rest  of  the  pattern  to  match. Sometimes it is useful to
2403         prevent this, either to change the nature of the  match,  or
2404         to  cause  it to fail earlier than it otherwise might, when the
2405         author of the pattern knows there is no  point  in  carrying
2406         on.
2407    
2408         Consider, for example, the pattern \d+foo  when  applied  to
2409         the subject line
2410    
2411           123456bar
2412    
2413         After matching all 6 digits and then failing to match "foo",
2414         the normal action of the matcher is to try again with only 5
2415         digits matching the \d+ item, and then with 4,  and  so  on,
2416         before  ultimately  failing. "Atomic grouping" (a term taken
2417         from Jeffrey Friedl's book) provides the means for  specify-
2418         ing  that once a subpattern has matched, it is not to be re-
2419         evaluated in this way.
2420    
2421         If we use atomic grouping  for  the  previous  example,  the
2422         matcher  would give up immediately on failing to match "foo"
2423         the  first  time.  The  notation  is  a  kind   of   special
2424         parenthesis, starting with (?> as in this example:
2425    
2426           (?>\d+)foo
2427    
2428         This kind of parenthesis "locks up" the  part of the pattern
2429         it  contains once it has matched, and a failure further into
2430         the pattern is prevented from backtracking  into  it.  Back-
2431         tracking  past  it to previous items, however, works as nor-
2432         mal.
2433    
2434         An alternative description is that a subpattern of this type
2435         matches  the  string  of  characters that an identical stan-
2436         dalone pattern would match, if anchored at the current point
2437         in the subject string.
2438    
2439         Atomic grouping subpatterns are not  capturing  subpatterns.
2440         Simple  cases such as the above example can be thought of as
2441         a maximizing repeat that must swallow everything it can. So,
2442         while both \d+ and \d+? are prepared to adjust the number of
2443         digits they match in order to make the rest of  the  pattern
2444         match, (?>\d+) can only match an entire sequence of digits.
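
The difference between an ordinary repeat and an atomic group can be sketched as follows. Python's re module is used as a stand-in for PCRE, and because only Python 3.11 and later accept (?>...) directly, the atomic group is emulated here with a lookahead that captures the digits followed by a back reference that consumes them. That emulation is an assumption of this sketch, not how PCRE implements atomic groups. The group name "run" is made up for the example.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ gives back one digit so that "5" can match.
backtracking = re.fullmatch(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the longest run of digits and
# the back reference then consumes exactly that run. A lookahead is not
# re-entered on backtracking, so no digit is ever given back.
atomic = re.fullmatch(r"(?=(?P<run>\d+))(?P=run)5", subject)

print(backtracking is not None, atomic is None)
```

The first pattern matches because the repeat adjusts itself; the second fails because the whole digit sequence is locked up once it has matched.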
2445    
2446         Atomic groups in general can of course  contain  arbitrarily
2447         complicated  subpatterns,  and  can be nested. However, when
2448         the subpattern for an atomic group is just a single repeated
2449         item,  as in the example above, a simpler notation, called a
2450         "possessive quantifier" can be used.  This  consists  of  an
2451         additional  +  character  following a quantifier. Using this
2452         notation, the previous example can be rewritten as
2453    
2454           \d++foo
2455    
2456         Possessive quantifiers are always greedy; the setting of the
2457         PCRE_UNGREEDY option is ignored. They are a convenient nota-
2458         tion for the simpler forms of atomic group.  However,  there
2459         is  no  difference in the meaning or processing of a posses-
2460         sive quantifier and the equivalent atomic group.
2461    
2462         The possessive quantifier syntax is an extension to the Perl
2463         syntax. It originates in Sun's Java package.
2464    
2465         When a pattern contains an unlimited repeat inside a subpat-
2466         tern  that  can  itself  be  repeated an unlimited number of
2467         times, the use of an atomic group is the only way  to  avoid
2468         some  failing  matches  taking  a very long time indeed. The
2469         pattern
2470    
2471           (\D+|<\d+>)*[!?]
2472    
2473         matches an unlimited number of substrings that  either  con-
2474         sist  of  non-digits,  or digits enclosed in <>, followed by
2475         either ! or ?. When it matches, it runs quickly. However, if
2476         it is applied to
2477    
2478           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2479    
2480         it takes a long  time  before  reporting  failure.  This  is
2481         because the string can be divided between the two repeats in
2482         a large number of ways, and all have to be tried. (The exam-
2483         ple  used  [!?]  rather  than a single character at the end,
2484         because both PCRE and Perl have an optimization that  allows
2485         for  fast  failure  when  a  single  character is used. They
2486         remember the last single character that is  required  for  a
2487         match,  and  fail early if it is not present in the string.)
2488         If the pattern is changed to
2489    
2490           ((?>\D+)|<\d+>)*[!?]
2491    
2492         sequences of non-digits cannot be broken, and  failure  hap-
2493         pens quickly.
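
A hedged sketch of the two patterns follows, using Python's re module as a stand-in for PCRE. The atomic group is emulated with a lookahead capture plus back reference (an assumption of this sketch, chosen so that it also runs on Python versions without (?>...)), and the plain pattern is only tried on a short subject so that the example stays fast.

```python
import re

# Plain version: the run of "a"s can be split between the repeats in many ways.
plain = re.compile(r"(\D+|<\d+>)*[!?]")

# Emulated ((?>\D+)|<\d+>)*[!?]: a run of non-digits cannot be broken up.
atomic = re.compile(r"((?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]")

short_subject = "a" * 12   # kept short: the plain pattern slows down quickly
long_subject = "a" * 52    # the subject used in the text above

plain_result = plain.match(short_subject)
atomic_result = atomic.match(long_subject)
bracketed = atomic.match("<12>!")

print(plain_result, atomic_result, bracketed.group(0))
```

Both versions report failure on the strings of "a"s; the atomic form gets there after a single pass over the subject at each alternative, and it still matches a subject such as "<12>!" through the second alternative.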
2494    
2495    
2496  BACK REFERENCES  BACK REFERENCES
2497    
2498       Outside a character class, a backslash followed by  a  digit       Outside a character class, a backslash followed by  a  digit
2499       greater  than  0  (and  possibly  further  digits) is a back       greater  than  0  (and  possibly  further  digits) is a back
2500       reference to a capturing subpattern  earlier  (i.e.  to  its       reference to a capturing subpattern earlier (that is, to its
2501       left)  in  the  pattern,  provided there have been that many       left)  in  the  pattern,  provided there have been that many
2502       previous capturing left parentheses.       previous capturing left parentheses.
2503    
# Line 1484  BACK REFERENCES Line 2512  BACK REFERENCES
2512    
2513       A back reference matches whatever actually matched the  cap-       A back reference matches whatever actually matched the  cap-
2514       turing subpattern in the current subject string, rather than       turing subpattern in the current subject string, rather than
2515       anything matching the subpattern itself. So the pattern       anything matching the subpattern itself (see "Subpatterns as
2516         subroutines" below for a way of doing that). So the pattern
2517    
2518         (sens|respons)e and \1ibility         (sens|respons)e and \1ibility
2519    
# Line 1499  BACK REFERENCES Line 2528  BACK REFERENCES
2528       though  the  original  capturing subpattern is matched case-       though  the  original  capturing subpattern is matched case-
2529       lessly.       lessly.
2530    
2531         Back references to named subpatterns use the  Python  syntax
2532         (?P=name). We could rewrite the above example as follows:
2533    
2534           (?P<p1>(?i)rah)\s+(?P=p1)
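
The behaviour of a named back reference can be tried out with Python's re module, which shares the (?P<name>...) and (?P=name) syntax described under NAMED SUBPATTERNS. This is only a sketch with Python standing in for PCRE: Python does not accept a bare (?i) in the middle of a pattern, so the scoped form (?i:...) is used to get the same local effect.

```python
import re

# The group is matched caselessly; the back reference is not.
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

same_case = pattern.fullmatch("RAH RAH")
lower_case = pattern.fullmatch("rah rah")
mixed_case = pattern.fullmatch("RAH rah")

print(same_case is not None, lower_case is not None, mixed_case is None)
```

The back reference repeats whatever text the group actually captured, so "RAH RAH" and "rah rah" match but "RAH rah" does not.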
2535    
2536       There may be more than one back reference to the  same  sub-       There may be more than one back reference to the  same  sub-
2537       pattern.  If  a  subpattern  has not actually been used in a       pattern.  If  a  subpattern  has not actually been used in a
2538       particular match, any back references to it always fail. For       particular match, any back references to it always fail. For
# Line 1507  BACK REFERENCES Line 2541  BACK REFERENCES
2541         (a|(bc))\2         (a|(bc))\2
2542    
2543       always fails if it starts to match  "a"  rather  than  "bc".       always fails if it starts to match  "a"  rather  than  "bc".
2544       Because  there  may  be up to 99 back references, all digits       Because  there  may  be many capturing parentheses in a pat-
2545       following the backslash are taken as  part  of  a  potential       tern, all digits following the backslash are taken  as  part
2546       back reference number. If the pattern continues with a digit       of a potential back reference number. If the pattern contin-
2547       character, some delimiter must be used to terminate the back       ues with a digit character, some delimiter must be  used  to
2548       reference.   If the PCRE_EXTENDED option is set, this can be       terminate the back reference. If the PCRE_EXTENDED option is
2549       whitespace. Otherwise an empty comment can be used.       set, this can be whitespace.  Otherwise an empty comment can
2550         be used.
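
Both points in this paragraph can be sketched with Python's re module standing in for PCRE (an assumption for illustration). re.VERBOSE plays the role of PCRE_EXTENDED here, so that white space can terminate the back reference number.

```python
import re

# A back reference to a group that took no part in the match always fails.
unset = re.compile(r"(a|(bc))\2")
starts_with_a = unset.match("aa")       # group 2 was never set
starts_with_bc = unset.match("bcbc")    # group 2 captured "bc"

# Back reference 1 followed by a literal digit 1: the white space (ignored
# in verbose mode) stops the digit from being read as part of the number.
delimited = re.compile(r"(a) \1 1", re.VERBOSE)
with_digit = delimited.fullmatch("aa1")

print(starts_with_a, starts_with_bc.group(0), with_digit.group(0))
```

Without a delimiter, the digits after the backslash would be read together as one reference number.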
2551    
2552       A back reference that occurs inside the parentheses to which       A back reference that occurs inside the parentheses to which
2553       it  refers  fails when the subpattern is first used, so, for       it  refers  fails when the subpattern is first used, so, for
2554       example, (a\1) never matches.  However, such references  can       example, (a\1) never matches.  However, such references  can
2555       be  useful  inside  repeated  subpatterns.  For example, the       be useful inside repeated subpatterns. For example, the pat-
2556       pattern       tern
2557    
2558         (a|b\1)+         (a|b\1)+
2559    
2560       matches any number of "a"s and also "aba", "ababaa" etc.  At       matches any number of "a"s and also "aba", "ababbaa" etc. At
2561       each iteration of the subpattern, the back reference matches       each iteration of the subpattern, the back reference matches
2562       the character string corresponding to  the  previous  itera-       the character string corresponding to  the  previous  itera-
2563       tion.  In  order  for this to work, the pattern must be such       tion.  In  order  for this to work, the pattern must be such
# Line 1531  BACK REFERENCES Line 2566  BACK REFERENCES
2566       example above, or by a quantifier with a minimum of zero.       example above, or by a quantifier with a minimum of zero.
2567    
2568    
   
2569  ASSERTIONS  ASSERTIONS
2570    
2571       An assertion is  a  test  on  the  characters  following  or       An assertion is  a  test  on  the  characters  following  or
2572       preceding  the current matching point that does not actually       preceding  the current matching point that does not actually
2573       consume any characters. The simple assertions coded  as  \b,       consume any characters. The simple assertions coded  as  \b,
2574       \B,  \A,  \Z,  \z, ^ and $ are described above. More compli-       \B,  \A, \G, \Z, \z, ^ and $ are described above.  More com-
2575       cated assertions are coded as  subpatterns.  There  are  two       plicated assertions are coded as subpatterns. There are  two
2576       kinds:  those that look ahead of the current position in the       kinds:  those that look ahead of the current position in the
2577       subject string, and those that look behind it.       subject string, and those that look behind it.
2578    
# Line 1564  ASSERTIONS Line 2599  ASSERTIONS
2599       when  the  next  three  characters  are  "bar". A lookbehind       when  the  next  three  characters  are  "bar". A lookbehind
2600       assertion is needed to achieve this effect.       assertion is needed to achieve this effect.
2601    
2602         If you want to force a matching failure at some point  in  a
2603         pattern,  the  most  convenient  way  to  do it is with (?!)
2604         because an empty string always matches, so an assertion that
2605         requires there not to be an empty string must always fail.
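
A minimal sketch of forcing a failure, using Python's re module as a stand-in for PCRE (an assumption for illustration; both accept the empty negative lookahead):

```python
import re

# (?!) can never succeed, whatever the subject.
always_fails = re.compile(r"(?!)")
no_match = always_fails.search("any subject at all")

# Placed in one branch, it forces that branch to fail so that the
# next alternative is tried instead.
branch = re.compile(r"a(?!)|b")
via_second_branch = branch.search("ab")

print(no_match, via_second_branch.group(0))
```

In the second pattern the "a" branch is abandoned as soon as (?!) is reached, so the match comes from "b".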
2606    
2607       Lookbehind assertions start with (?<=  for  positive  asser-       Lookbehind assertions start with (?<=  for  positive  asser-
2608       tions and (?<! for negative assertions. For example,       tions and (?<! for negative assertions. For example,
2609    
# Line 1584  ASSERTIONS Line 2624  ASSERTIONS
2624       causes an error at compile time. Branches  that  match  dif-       causes an error at compile time. Branches  that  match  dif-
2625       ferent length strings are permitted only at the top level of       ferent length strings are permitted only at the top level of
2626       a lookbehind assertion. This is an extension  compared  with       a lookbehind assertion. This is an extension  compared  with
2627       Perl  5.005,  which  requires all branches to match the same       Perl  (at  least  for  5.8),  which requires all branches to
2628       length of string. An assertion such as       match the same length of string. An assertion such as
2629    
2630         (?<=ab(c|de))         (?<=ab(c|de))
2631    
# Line 1599  ASSERTIONS Line 2639  ASSERTIONS
2639       alternative,  to  temporarily move the current position back       alternative,  to  temporarily move the current position back
2640       by the fixed width and then  try  to  match.  If  there  are       by the fixed width and then  try  to  match.  If  there  are
2641       insufficient  characters  before  the  current position, the       insufficient  characters  before  the  current position, the
2642       match is deemed to fail.  Lookbehinds  in  conjunction  with       match is deemed to fail.
2643       once-only  subpatterns can be particularly useful for match-  
2644       ing at the ends of strings; an example is given at  the  end       PCRE does not allow the \C escape (which  matches  a  single
2645       of the section on once-only subpatterns.       byte  in  UTF-8  mode)  to  appear in lookbehind assertions,
2646         because it makes it impossible to calculate  the  length  of
2647         the lookbehind.
2648    
2649         Atomic groups can be used  in  conjunction  with  lookbehind
2650         assertions  to  specify efficient matching at the end of the
2651         subject string. Consider a simple pattern such as
2652    
2653           abcd$
2654    
2655         when applied to a long string that does not  match.  Because
2656         matching  proceeds  from  left  to right, PCRE will look for
2657         each "a" in the subject and then see if what follows matches
2658         the rest of the pattern. If the pattern is specified as
2659    
2660           ^.*abcd$
2661    
2662         the initial .* matches the entire string at first, but  when
2663         this  fails  (because  there  is no following "a"), it back-
2664         tracks to match all but the last character, then all but the
2665         last  two  characters,  and so on. Once again the search for
2666         "a" covers the entire string, from right to left, so we  are
2667         no better off. However, if the pattern is written as
2668    
2669           ^(?>.*)(?<=abcd)
2670    
2671         or, equivalently,
2672    
2673           ^.*+(?<=abcd)
2674    
2675         there can be no backtracking for the .* item; it  can  match
2676         only  the entire string. The subsequent lookbehind assertion
2677         does a single test on the last four characters. If it fails,
2678         the match fails immediately. For long strings, this approach
2679         makes a significant difference to the processing time.
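
The end-of-string test can be sketched in Python, with the re module standing in for PCRE. The atomic group (?>.*) is emulated by a lookahead capture followed by a back reference, which is an assumption of this sketch so that it runs on Python versions older than 3.11. The group name "all" is made up for the example.

```python
import re

# Emulation of ^(?>.*)(?<=abcd): swallow the whole subject without the
# possibility of backtracking, then test the last four characters once.
ends_with_abcd = re.compile(r"^(?=(?P<all>.*))(?P=all)(?<=abcd)", re.DOTALL)

good = ends_with_abcd.match("xxxxxxxxabcd")
bad = ends_with_abcd.match("abcdxxxxxxxx")

print(good is not None, bad is None)
```

When the subject does not end in "abcd", the lookbehind fails once and there is nothing left to retry.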
2680    
2681       Several assertions (of any sort) may  occur  in  succession.       Several assertions (of any sort) may  occur  in  succession.
2682       For example,       For example,
# Line 1647  ASSERTIONS Line 2721  ASSERTIONS
2721       for positive assertions, because it does not make sense  for       for positive assertions, because it does not make sense  for
2722       negative assertions.       negative assertions.
2723    
      Assertions count towards the maximum  of  200  parenthesized  
      subpatterns.  
   
   
   
 ONCE-ONLY SUBPATTERNS  
      With both maximizing and minimizing repetition,  failure  of  
      what  follows  normally  causes  the repeated item to be re-  
      evaluated to see if a different number of repeats allows the  
      rest  of  the  pattern  to  match. Sometimes it is useful to  
      prevent this, either to change the nature of the  match,  or  
      to  cause  it to fail earlier than it otherwise might, when the
      author of the pattern knows there is no  point  in  carrying  
      on.  
   
      Consider, for example, the pattern \d+foo  when  applied  to  
      the subject line  
   
        123456bar  
   
      After matching all 6 digits and then failing to match "foo",  
      the normal action of the matcher is to try again with only 5  
      digits matching the \d+ item, and then with 4,  and  so  on,  
      before ultimately failing. Once-only subpatterns provide the  
      means for specifying that once a portion of the pattern  has  
      matched,  it  is  not to be re-evaluated in this way, so the  
      matcher would give up immediately on failing to match  "foo"  
      the  first  time.  The  notation  is another kind of special  
      parenthesis, starting with (?> as in this example:  
   
        (?>\d+)bar  
   
      This kind of parenthesis "locks up" the  part of the pattern  
      it  contains once it has matched, and a failure further into  
      the pattern is prevented from backtracking  into  it.  Back-  
      tracking  past  it to previous items, however, works as nor-  
      mal.  
   
      An alternative description is that a subpattern of this type  
      matches  the  string  of  characters that an identical stan-  
      dalone pattern would match, if anchored at the current point  
      in the subject string.  
   
      Once-only subpatterns are not capturing subpatterns.  Simple  
      cases  such as the above example can be thought of as a max-  
      imizing repeat that must  swallow  everything  it  can.  So,  
      while both \d+ and \d+? are prepared to adjust the number of  
      digits they match in order to make the rest of  the  pattern  
      match, (?>\d+) can only match an entire sequence of digits.  
   
      This construction can of course contain arbitrarily  compli-  
      cated subpatterns, and it can be nested.  
   
      Once-only subpatterns can be used in conjunction with  look-  
      behind  assertions  to specify efficient matching at the end  
      of the subject string. Consider a simple pattern such as  
   
        abcd$  
   
      when applied to a long string which does not match.  Because  
      matching  proceeds  from  left  to right, PCRE will look for  
      each "a" in the subject and then see if what follows matches  
      the rest of the pattern. If the pattern is specified as  
   
        ^.*abcd$  
   
      the initial .* matches the entire string at first, but  when  
      this  fails  (because  there  is no following "a"), it back-  
      tracks to match all but the last character, then all but the  
      last  two  characters,  and so on. Once again the search for  
      "a" covers the entire string, from right to left, so we  are  
      no better off. However, if the pattern is written as  
   
        ^(?>.*)(?<=abcd)  
   
      there can be no backtracking for the .* item; it  can  match  
      only  the entire string. The subsequent lookbehind assertion  
      does a single test on the last four characters. If it fails,  
      the match fails immediately. For long strings, this approach  
      makes a significant difference to the processing time.  
   
      When a pattern contains an unlimited repeat inside a subpat-  
      tern  that  can  itself  be  repeated an unlimited number of  
      times, the use of a once-only subpattern is the only way  to  
      avoid  some  failing matches taking a very long time indeed.  
      The pattern  
   
        (\D+|<\d+>)*[!?]  
   
      matches an unlimited number of substrings that  either  con-  
      sist  of  non-digits,  or digits enclosed in <>, followed by  
      either ! or ?. When it matches, it runs quickly. However, if  
      it is applied to  
   
        aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  
   
      it takes a long  time  before  reporting  failure.  This  is  
      because the string can be divided between the two repeats in  
      a large number of ways, and all have to be tried. (The exam-  
      ple  used  [!?]  rather  than a single character at the end,  
      because both PCRE and Perl have an optimization that  allows  
      for  fast  failure  when  a  single  character is used. They  
      remember the last single character that is  required  for  a  
      match,  and  fail early if it is not present in the string.)  
      If the pattern is changed to  
   
        ((?>\D+)|<\d+>)*[!?]  
   
      sequences of non-digits cannot be broken, and  failure  hap-  
      pens quickly.  
   
   
2724    
2725  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
2726    
2727       It is possible to cause the matching process to obey a  sub-       It is possible to cause the matching process to obey a  sub-
2728       pattern  conditionally  or to choose between two alternative       pattern  conditionally  or to choose between two alternative
2729       subpatterns, depending on the result  of  an  assertion,  or       subpatterns, depending on the result  of  an  assertion,  or
# Line 1775  CONDITIONAL SUBPATTERNS Line 2738  CONDITIONAL SUBPATTERNS
2738       more than two alternatives in the subpattern, a compile-time       more than two alternatives in the subpattern, a compile-time
2739       error occurs.       error occurs.
2740    
2741       There are two kinds of condition. If the  text  between  the       There are three kinds of condition. If the text between  the
2742       parentheses  consists of a sequence of digits, the condition       parentheses  consists of a sequence of digits, the condition
2743       is satisfied if the capturing subpattern of that number  has       is satisfied if the capturing subpattern of that number  has
2744       previously  matched.  Consider  the following pattern, which       previously  matched.  The  number must be greater than zero.
2745       contains non-significant white space to make it  more  read-       Consider  the  following  pattern,   which   contains   non-
2746       able (assume the PCRE_EXTENDED option) and to divide it into       significant white space to make it more readable (assume the
2747       three parts for ease of discussion:       PCRE_EXTENDED option) and to divide it into three parts  for
2748         ease of discussion:
2749    
2750         ( \( )?    [^()]+    (?(1) \) )         ( \( )?    [^()]+    (?(1) \) )
2751    
# Line 1798  CONDITIONAL SUBPATTERNS Line 2762  CONDITIONAL SUBPATTERNS
2762       matches a sequence of non-parentheses,  optionally  enclosed       matches a sequence of non-parentheses,  optionally  enclosed
2763       in parentheses.       in parentheses.
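
The three-part pattern can be exercised with Python's re module, which supports the same (?(1)...) condition. Python stands in for PCRE here (an assumption for illustration), with re.VERBOSE playing the role of PCRE_EXTENDED.

```python
import re

# Optional "(" captured as group 1, a run of non-parentheses, and a ")"
# that is required only if group 1 took part in the match.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

enclosed = optional_parens.fullmatch("(abc)")
bare = optional_parens.fullmatch("abc")
unclosed = optional_parens.fullmatch("(abc")
unopened = optional_parens.fullmatch("abc)")

print(enclosed is not None, bare is not None, unclosed, unopened)
```

Only the balanced forms match as a whole; a lone opening or closing parenthesis is rejected.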
2764    
2765       If the condition is not a sequence of digits, it must be  an       If the condition is the string (R), it  is  satisfied  if  a
2766       assertion.  This  may be a positive or negative lookahead or       recursive  call  to the pattern or subpattern has been made.
2767       lookbehind assertion. Consider this pattern, again  contain-       At "top level", the condition is  false.   This  is  a  PCRE
2768       ing  non-significant  white space, and with the two alterna-       extension.  Recursive  patterns  are  described  in the next
2769       tives on the second line:       section.
2770    
2771         If the condition is not a sequence of digits or (R), it must
2772         be  an assertion.  This may be a positive or negative looka-
2773         head or lookbehind assertion. Consider this  pattern,  again
2774         containing  non-significant  white  space,  and with the two
2775         alternatives on the second line:
2776    
2777         (?(?=[^a-z]*[a-z])         (?(?=[^a-z]*[a-z])
2778         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
# Line 1817  CONDITIONAL SUBPATTERNS Line 2787  CONDITIONAL SUBPATTERNS
2787       letters and dd are digits.       letters and dd are digits.
2788    
2789    
   
2790  COMMENTS  COMMENTS
2791    
2792       The sequence (?# marks the start of a comment which  contin-       The sequence (?# marks the start of a comment which  contin-
2793       ues  up  to the next closing parenthesis. Nested parentheses       ues  up  to the next closing parenthesis. Nested parentheses
2794       are not permitted. The characters that  make  up  a  comment       are not permitted. The characters that  make  up  a  comment
# Line 1829  COMMENTS Line 2799  COMMENTS
2799       ues up to the next newline character in the pattern.       ues up to the next newline character in the pattern.
2800    
2801    
   
2802  RECURSIVE PATTERNS  RECURSIVE PATTERNS
2803    
2804       Consider the problem of matching a  string  in  parentheses,       Consider the problem of matching a  string  in  parentheses,
2805       allowing  for  unlimited nested parentheses. Without the use       allowing  for  unlimited nested parentheses. Without the use
2806       of recursion, the best that can be done is to use a  pattern       of recursion, the best that can be done is to use a  pattern
2807       that  matches  up  to some fixed depth of nesting. It is not       that  matches  up  to some fixed depth of nesting. It is not
2808       possible to handle an arbitrary nesting depth. Perl 5.6  has       possible to handle an arbitrary nesting depth. Perl has pro-
2809       provided   an  experimental  facility  that  allows  regular       vided  an  experimental facility that allows regular expres-
2810       expressions to recurse (amongst other things). It does  this       sions to recurse (amongst other things).  It  does  this  by
2811       by  interpolating  Perl  code in the expression at run time,       interpolating  Perl  code in the expression at run time, and
2812       and the code can refer to the expression itself. A Perl pat-       the code can refer to the expression itself. A Perl  pattern
2813       tern  to  solve  the parentheses problem can be created like       to solve the parentheses problem can be created like this:
      this:  
2814    
2815         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2816    
2817       The (?p{...}) item interpolates Perl code at run  time,  and       The (?p{...}) item interpolates Perl code at run  time,  and
2818       in  this  case refers recursively to the pattern in which it       in  this  case refers recursively to the pattern in which it
2819       appears. Obviously, PCRE cannot support the interpolation of       appears. Obviously, PCRE cannot support the interpolation of
2820       Perl  code.  Instead,  the special item (?R) is provided for       Perl  code.  Instead,  it  supports  some special syntax for
2821       the specific case of recursion. This PCRE pattern solves the       recursion of the entire pattern,  and  also  for  individual
2822       parentheses  problem (assume the PCRE_EXTENDED option is set       subpattern recursion.
2823       so that white space is ignored):  
2824         The special item that consists of (? followed  by  a  number
2825         greater  than  zero and a closing parenthesis is a recursive
2826         call of the subpattern of the given number, provided that it
2827         occurs inside that subpattern. (If not, it is a "subroutine"
2828         call, which is described in the next section.)  The  special
2829         item  (?R) is a recursive call of the entire regular expres-
2830         sion.
2831    
2832         For example, this PCRE pattern solves the nested parentheses
2833         problem  (assume  the  PCRE_EXTENDED  option  is set so that
2834         white space is ignored):
2835    
2836         \( ( (?>[^()]+) | (?R) )* \)         \( ( (?>[^()]+) | (?R) )* \)
2837    
2838       First it matches an opening parenthesis. Then it matches any       First it matches an opening parenthesis. Then it matches any
2839       number  of substrings which can either be a sequence of non-       number  of substrings which can either be a sequence of non-
2840       parentheses, or a recursive  match  of  the  pattern  itself       parentheses, or a recursive  match  of  the  pattern  itself
2841       (i.e. a correctly parenthesized substring). Finally there is       (that  is, a  correctly  parenthesized  substring).  Finally
2842       a closing parenthesis.       there is a closing parenthesis.
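
     The shape of this pattern can be mirrored by a small hand-written
     recursive function. The C sketch below does not use PCRE and is not
     how the library works internally. The name match_parens is a
     hypothetical helper introduced only for illustration. It consumes
     runs of non-parentheses without ever giving characters back (the
     role played by the (?>...) group) and calls itself where the
     pattern has (?R).

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns the number of characters
   matched at the start of s by the grammar
       \( ( [^()]+ | <recursive call> )* \)
   or 0 if s does not start with a correctly parenthesized substring. */
size_t match_parens(const char *s)
{
    size_t i = 0;

    if (s[i] != '(')
        return 0;                    /* must begin with an opening parenthesis */
    i++;

    for (;;) {
        if (s[i] == '(') {
            /* Nested group: recurse, as (?R) does in the pattern. */
            size_t n = match_parens(s + i);
            if (n == 0)
                return 0;            /* the nested group never closed */
            i += n;
        } else if (s[i] == ')') {
            return i + 1;            /* closing parenthesis of this level */
        } else if (s[i] == '\0') {
            return 0;                /* subject ended before the group closed */
        } else {
            i++;                     /* one more non-parenthesis character */
        }
    }
}
```

     Called on "(ab(cd)ef)" the function reports that all ten characters
     match; called on a string whose outer parenthesis is never closed
     it returns 0 after a single pass over the subject.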
2843    
2844         If this were part of a larger pattern, you would not want to
2845         recurse the entire pattern, so instead you could use this:
2846    
2847           ( \( ( (?>[^()]+) | (?1) )* \) )
2848    
2849         We have put the pattern into  parentheses,  and  caused  the
2850         recursion  to refer to them instead of the whole pattern. In
2851         a larger pattern, keeping track of parenthesis  numbers  can
2852         be   tricky.   It  may  be  more  convenient  to  use  named
2853         parentheses instead. For this, PCRE uses (?P>name), which is
2854         an  extension  to the Python syntax that PCRE uses for named
2855         parentheses (Perl does not provide  named  parentheses).  We
2856         could rewrite the above example as follows:
2857    
2858           (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
2859    
2860       This particular example pattern  contains  nested  unlimited       This particular example pattern  contains  nested  unlimited
2861       repeats, and so the use of a once-only subpattern for match-       repeats,  and  so  the  use  of atomic grouping for matching
2862       ing strings of non-parentheses is  important  when  applying       strings of non-parentheses is important  when  applying  the
2863       the  pattern to strings that do not match. For example, when       pattern to strings that do not match. For example, when this
2864       it is applied to       pattern is applied to
2865    
2866         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2867    
2868       it yields "no match" quickly. However, if a  once-only  sub-       it yields "no match" quickly. However, if atomic grouping is
2869       pattern  is  not  used,  the match runs for a very long time       not used, the match runs for a very long time indeed because
2870       indeed because there are so many different ways the + and  *       there are so many different ways the +  and  *  repeats  can
2871       repeats  can carve up the subject, and all have to be tested       carve  up  the  subject,  and  all  have to be tested before
2872       before failure can be reported.       failure can be reported.
2873         At the end of a match, the values set for any capturing sub-
2874       The values set for any capturing subpatterns are those  from       patterns are those from the outermost level of the recursion
2875       the outermost level of the recursion at which the subpattern       at which the subpattern value is set.  If you want to obtain
2876       value is set. If the pattern above is matched against       intermediate  values,  a  callout  function can be used (see
2877         below and the pcrecallout  documentation).  If  the  pattern
2878         above is matched against
2879    
2880         (ab(cd)ef)         (ab(cd)ef)
2881    
# Line 1887  RECURSIVE PATTERNS Line 2885  RECURSIVE PATTERNS
2885    
2886         \( ( ( (?>[^()]+) | (?R) )* ) \)         \( ( ( (?>[^()]+) | (?R) )* ) \)
2887            ^                        ^            ^                        ^
2888            ^                        ^ the string they  capture  is            ^                        ^
2889       "ab(cd)ef",  the  contents  of the top level parentheses. If  
2890       there are more than 15 capturing parentheses in  a  pattern,       the string they capture is "ab(cd)ef", the contents  of  the
2891       PCRE  has  to  obtain  extra  memory  to store data during a       top  level  parentheses. If there are more than 15 capturing
2892       recursion, which it does by using  pcre_malloc,  freeing  it       parentheses in a pattern, PCRE has to obtain extra memory to
2893       via  pcre_free  afterwards. If no memory can be obtained, it       store  data  during  a  recursion,  which  it  does by using
2894       saves data for the first 15 capturing parentheses  only,  as       pcre_malloc, freeing it  via  pcre_free  afterwards.  If  no
2895       there is no way to give an out-of-memory error from within a       memory   can   be   obtained,   the  match  fails  with  the
2896       recursion.       PCRE_ERROR_NOMEMORY error.
2897    
2898         Do not confuse the (?R) item with the condition  (R),  which
2899         tests  for  recursion.  Consider this pattern, which matches
2900         text in angle brackets, allowing for arbitrary nesting. Only
2901         digits are allowed in nested brackets (that is, when recurs-
2902         ing), whereas any characters  are  permitted  at  the  outer
2903         level.
2904    
2905           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
2906    
2907         In this pattern, (?(R) is the start of a conditional subpat-
2908         tern,  with two different alternatives for the recursive and
2909         non-recursive cases. The (?R) item is the  actual  recursive
2910         call.
2911    
2912    
2913    SUBPATTERNS AS SUBROUTINES
2914    
2915         If the syntax for a recursive subpattern  reference  (either
2916         by  number  or  by  name) is used outside the parentheses to
2917         which it refers, it operates like a subroutine in a program-
2918         ming  language. An earlier example pointed out that the pat-
2919         tern
2920    
2921           (sens|respons)e and \1ibility
2922    
2923         matches "sense and sensibility" and "response and  responsi-
2924         bility",  but not "sense and responsibility". If instead the
2925         pattern
2926    
2927           (sens|respons)e and (?1)ibility
2928    
2929         is used, it does match "sense and responsibility" as well as
2930         the other two strings. Such references must, however, follow
2931         the subpattern to which they refer.
2932    
2933    
2934    CALLOUTS
2935    
2936         Perl has a  feature  whereby  using  the  sequence  (?{...})
2937         causes  arbitrary  Perl  code  to be obeyed in the middle of
2938         matching a  regular  expression.  This  makes  it  possible,
2939         amongst  other  things, to extract different substrings that
2940         match the same pair of parentheses when there is  a  repeti-
2941         tion.
2942    
2943         PCRE provides a similar feature, but  of  course  it  cannot
2944         obey  arbitrary  Perl code. The feature is called "callout".
2945         The caller of PCRE provides an external function by  putting
2946         its  entry  point  in  the global variable pcre_callout.  By
2947         default, this variable contains  NULL,  which  disables  all
2948         calling out.
2949    
2950         Within a regular expression, (?C) indicates  the  points  at
2951         which  the external function is to be called. If you want to
2952         identify different callout points, you can put a number less
2953         than 256 after the letter C. The default value is zero.  For
2954         example, this pattern has two callout points:
2955    
2956           (?C1)9abc(?C2)def
2957    
2958         During matching, when PCRE  reaches  a  callout  point  (and
2959         pcre_callout is set), the external function is called. It is
2960         provided with the number of the  callout,  and,  optionally,
2961         one  item  of  data  originally  supplied  by  the caller of
2962         pcre_exec(). The callout  function  may  cause  matching  to
2963         backtrack,  or to fail altogether. A complete description of
2964         the interface to the callout function is given in the  pcre-
2965         callout documentation.
2966    
2967    Last updated: 03 February 2003
2968    Copyright (c) 1997-2003 University of Cambridge.
2969    -----------------------------------------------------------------------------
2970    
2971    NAME
2972         PCRE - Perl-compatible regular expressions
2973    
2974    
2975    PCRE PERFORMANCE
2976    
2977  PERFORMANCE       Certain items that may appear in regular expression patterns
2978       Certain items that may appear in patterns are more efficient       are  more efficient than others. It is more efficient to use
2979       than  others.  It is more efficient to use a character class       a character class like [aeiou] than a  set  of  alternatives
2980       like [aeiou] than a set of alternatives such as (a|e|i|o|u).       such  as  (a|e|i|o|u). In general, the simplest construction
2981       In  general,  the  simplest  construction  that provides the       that provides the required behaviour  is  usually  the  most
2982       required behaviour is usually the  most  efficient.  Jeffrey       efficient.  Jeffrey  Friedl's book contains a lot of discus-
2983       Friedl's  book contains a lot of discussion about optimizing       sion about optimizing regular expressions for efficient per-
2984       regular expressions for efficient performance.       formance.
2985    
2986       When a pattern begins with .* and the PCRE_DOTALL option  is       When a pattern begins with .*  not  in  parentheses,  or  in
2987       set,  the  pattern  is implicitly anchored by PCRE, since it       parentheses that are not the subject of a backreference, and
2988       can match only at the start of a subject string. However, if       the PCRE_DOTALL option is set,  the  pattern  is  implicitly
2989       PCRE_DOTALL  is not set, PCRE cannot make this optimization,       anchored  by PCRE, since it can match only at the start of a
2990       because the . metacharacter does not then match  a  newline,       subject string. However, if PCRE_DOTALL  is  not  set,  PCRE
2991       and if the subject string contains newlines, the pattern may       cannot  make  this optimization, because the . metacharacter
2992       match from the character immediately following one  of  them       does not then match a newline, and  if  the  subject  string
2993       instead of from the very start. For example, the pattern       contains  newlines, the pattern may match from the character
2994         immediately following one of them instead of from  the  very
2995         start. For example, the pattern
2996    
2997         (.*) second         .*second
2998    
2999       matches the subject "first\nand second" (where \n stands for       matches the subject "first\nand second" (where \n stands for
3000       a newline character) with the first captured substring being       a newline character), with the match starting at the seventh
3001       "and". In order to do this, PCRE  has  to  retry  the  match       character. In order to do this, PCRE has to retry the  match
3002       starting after every newline in the subject.       starting after every newline in the subject.
3003    
3004       If you are using such a pattern with subject strings that do       If you are using such a pattern with subject strings that do
# Line 1944  PERFORMANCE Line 3021  PERFORMANCE
3021       that  the entire match is going to fail, PCRE has in princi-       that  the entire match is going to fail, PCRE has in princi-
3022       ple to try every possible variation, and this  can  take  an       ple to try every possible variation, and this  can  take  an
3023       extremely long time.       extremely long time.
   
3024       An optimization catches some of the more simple  cases  such       An optimization catches some of the more simple  cases  such
3025       as       as
3026    
# Line 1964  PERFORMANCE Line 3040  PERFORMANCE
3040       whereas the latter takes an appreciable  time  with  strings       whereas the latter takes an appreciable  time  with  strings
3041       longer than about 20 characters.       longer than about 20 characters.
3042    
3043    Last updated: 03 February 2003
3044    Copyright (c) 1997-2003 University of Cambridge.
3045    -----------------------------------------------------------------------------
3046    
3047    NAME
3048         PCRE - Perl-compatible regular expressions.
3049    
3050    
3051    SYNOPSIS OF POSIX API
3052         #include <pcreposix.h>
3053    
3054         int regcomp(regex_t *preg, const char *pattern,
3055              int cflags);
3056    
3057         int regexec(regex_t *preg, const char *string,
3058              size_t nmatch, regmatch_t pmatch[], int eflags);
3059    
3060         size_t regerror(int errcode, const regex_t *preg,
3061              char *errbuf, size_t errbuf_size);
3062    
3063         void regfree(regex_t *preg);
3064    
3065    
3066    DESCRIPTION
3067    
3068         This set of functions provides a POSIX-style API to the PCRE
3069         regular  expression  package.  See the pcreapi documentation
3070         for a description of the native API,  which  contains  addi-
3071         tional functionality.
3072    
3073         The functions described here are just wrapper functions that
3074         ultimately  call  the  PCRE native API. Their prototypes are
3075         defined in the pcreposix.h header file, and on Unix  systems
3076         the library itself is called pcreposix.a, so can be accessed
3077         by adding -lpcreposix to the command for linking an applica-
3078         tion  which  uses them. Because the POSIX functions call the
3079         native ones, it is also necessary to add -lpcre.
3080    
3081         I have implemented only those option bits that can  be  rea-
3082         sonably  mapped  to  PCRE  native  options. In addition, the
3083         options REG_EXTENDED and  REG_NOSUB  are  defined  with  the
3084         value zero. They have no effect, but since programs that are
3085         written to the POSIX interface often use them, this makes it
3086         easier to slot in PCRE as a replacement library. Other POSIX
3087         options are not even defined.
3088    
3089         When PCRE is called via these functions, it is only the  API
3090         that is POSIX-like in style. The syntax and semantics of the
3091         regular expressions themselves are still those of Perl, sub-
3092         ject  to  the  setting of various PCRE options, as described
3093         below. "POSIX-like in style" means that the API approximates
3094         to  the  POSIX definition; it is not fully POSIX-compatible,
3095         and in multi-byte encoding domains it is probably even  less
3096         compatible.
3097    
3098         The header for these functions is supplied as pcreposix.h to
3099         avoid  any  potential  clash  with other POSIX libraries. It
3100         can, of course, be renamed or aliased as regex.h,  which  is
3101         the "correct" name. It provides two structure types, regex_t
3102         for compiled internal forms, and  regmatch_t  for  returning
3103         captured  substrings.  It  also defines some constants whose
3104         names start with "REG_"; these are used for setting  options
3105         and identifying error codes.
3106    
3107    
3108    COMPILING A PATTERN
3109    
3110         The function regcomp() is called to compile a  pattern  into
3111         an  internal form. The pattern is a C string terminated by a
3112         binary zero, and is passed in the argument pattern. The preg
3113         argument  is  a pointer to a regex_t structure which is used
3114         as a base for storing information about the compiled expres-
3115         sion.
3116    
3117         The argument cflags is either zero, or contains one or  more
3118         of the bits defined by the following macros:
3119    
3120           REG_ICASE
3121    
3122         The PCRE_CASELESS option  is  set  when  the  expression  is
3123         passed for compilation to the native function.
3124    
3125           REG_NEWLINE
3126    
3127         The PCRE_MULTILINE option is  set  when  the  expression  is
3128         passed  for  compilation  to  the native function. Note that
3129         this  does  not  mimic  the  defined  POSIX  behaviour   for
3130         REG_NEWLINE (see the following section).
3131    
3132         In the absence of these flags, no options are passed to  the
3133         native  function.  This means that the regex is compiled with
3134         PCRE default semantics. In particular, the  way  it  handles
3135         newline  characters  in  the subject string is the Perl way,
3136         not the POSIX way. Note that setting PCRE_MULTILINE has only
3137         some  of  the effects specified for REG_NEWLINE. It does not
3138         affect the way newlines are matched by . (they aren't) or by
3139         a negative class such as [^a] (they are).
3140    
3141         The yield of regcomp() is zero on success, and non-zero oth-
3142         erwise.  The preg structure is filled in on success, and one
3143         member of the structure  is  public:  re_nsub  contains  the
3144         number  of  capturing subpatterns in the regular expression.
3145         Various error codes are defined in the header file.
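
     As a rough illustration of the compile step, the sketch below
     compiles a pattern and reads the public re_nsub member. It is
     written against the system <regex.h> so that it builds without
     PCRE installed; the assumption is that, with PCRE, the include
     would be <pcreposix.h> and the program would be linked with
     -lpcreposix -lpcre. REG_EXTENDED is passed because the system
     library needs it for parentheses and alternation; under pcreposix
     it is defined as zero and has no effect. The name
     count_subpatterns is hypothetical.

```c
#include <regex.h>   /* assumption: <pcreposix.h> when building against PCRE */

/* Hypothetical helper: compile pattern with the given cflags and return
   the number of capturing subpatterns, or -1 if compilation fails. */
long count_subpatterns(const char *pattern, int cflags)
{
    regex_t re;
    long n;

    /* regcomp() yields zero on success and non-zero otherwise. */
    if (regcomp(&re, pattern, cflags | REG_EXTENDED) != 0)
        return -1;

    n = (long)re.re_nsub;   /* the one public member of regex_t */
    regfree(&re);           /* release the memory tied to the compiled form */
    return n;
}
```

     For example, count_subpatterns("(sens|respons)e and (ibility)",
     REG_ICASE) counts two capturing subpatterns, and an unbalanced
     pattern such as "(" is reported as a failure.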
3146    
3147    
3148    MATCHING NEWLINE CHARACTERS
3149    
3150         This area is not simple, because POSIX and  Perl  take  dif-
3151         ferent  views  of things.  It is not possible to get PCRE to
3152         obey POSIX semantics, but then PCRE was never intended to be
3153         a POSIX engine. The following table lists the different pos-
3154         sibilities for matching newline characters in PCRE:
3155    
3156                                   Default   Change with
3157    
3158           . matches newline          no     PCRE_DOTALL
3159           newline matches [^a]       yes    not changeable
3160           $ matches \n at end        yes    PCRE_DOLLARENDONLY
3161           $ matches \n in middle     no     PCRE_MULTILINE
3162           ^ matches \n in middle     no     PCRE_MULTILINE
3163    
3164         This is the equivalent table for POSIX:
3165    
3166                                   Default   Change with
3167    
3168           . matches newline          yes      REG_NEWLINE
3169           newline matches [^a]       yes      REG_NEWLINE
3170           $ matches \n at end        no       REG_NEWLINE
3171           $ matches \n in middle     no       REG_NEWLINE
3172           ^ matches \n in middle     no       REG_NEWLINE
3173    
3174         PCRE's behaviour is the same as Perl's, except that there is
3175         no  equivalent  for PCRE_DOLLARENDONLY in Perl. In both PCRE
3176         and Perl, there is no way  to  stop  newline  from  matching
3177         [^a].
3178    
3179         The default POSIX newline handling can be obtained  by  set-
3180         ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
3181         to make PCRE behave exactly as for the REG_NEWLINE action.
3182    
3183    
3184    MATCHING A PATTERN
3185    
3186         The function regexec() is called  to  match  a  pre-compiled
3187         pattern  preg against a given string, which is terminated by
3188         a zero byte, subject to the options in eflags. These can be:
3189    
3190           REG_NOTBOL
3191    
3192         The PCRE_NOTBOL option is set when  calling  the  underlying
3193         PCRE matching function.
3194    
3195           REG_NOTEOL
3196    
3197         The PCRE_NOTEOL option is set when  calling  the  underlying
3198         PCRE matching function.
3199    
3200         The portion of the string that was  matched,  and  also  any
3201         captured  substrings,  are returned via the pmatch argument,
3202         which points to  an  array  of  nmatch  structures  of  type
3203         regmatch_t,  containing  the  members rm_so and rm_eo. These
3204         contain the offset to the first character of each  substring
3205         and  the offset to the first character after the end of each
3206         substring, respectively.  The  0th  element  of  the  vector
3207         relates  to  the  entire portion of string that was matched;
3208         subsequent elements relate to the capturing  subpatterns  of
3209         the  regular  expression.  Unused  entries in the array have
3210         both structure members set to -1.
3211    
3212         A successful match yields a zero return; various error codes
3213         are  defined in the header file, of which REG_NOMATCH is the
3214         "expected" failure code.
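
     The following sketch shows how the pmatch vector might be read
     after a call to regexec(). It uses the system <regex.h> so that it
     builds without PCRE; substituting <pcreposix.h> is an assumption
     about how it would be built against PCRE. The name find_offsets is
     a hypothetical helper.

```c
#include <regex.h>   /* assumption: <pcreposix.h> when building against PCRE */

/* Hypothetical helper: compile pattern, match it against subject, and
   leave the match offsets in pmatch[0..nmatch-1]. Returns 0 on a match,
   REG_NOMATCH if there is none, or another non-zero code on error. */
int find_offsets(const char *pattern, const char *subject,
                 regmatch_t pmatch[], size_t nmatch)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;          /* compile error; nothing to free */

    /* No eflags: neither REG_NOTBOL nor REG_NOTEOL is set. */
    rc = regexec(&re, subject, nmatch, pmatch, 0);
    regfree(&re);
    return rc;
}
```

     Matching "(cat|dog) sat" against "the cat sat on the mat" with a
     three-element vector leaves offsets 4 and 11 in element 0 (the
     whole match), 4 and 7 in element 1 (the first capturing
     subpattern), and -1 in both members of the unused element 2.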
3215    
3216    
3217    ERROR MESSAGES
3218    
3219         The regerror()  function  maps  a  non-zero  errorcode  from
3220         either  regcomp()  or  regexec()  to a printable message. If
3221         preg is not NULL, the error should have arisen from the  use
3222         of  that structure. A message terminated by a binary zero is
3223         placed in errbuf. The length of the message,  including  the
3224         zero,  is  limited to errbuf_size. The yield of the function
3225         is the size of buffer needed to hold the whole message.
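
     Because the yield is the buffer size needed, one common approach is
     to call regerror() twice: once to learn the size and once to fetch
     the message. The sketch below does this with the system <regex.h>;
     using <pcreposix.h> instead is an assumption. Passing a NULL buffer
     with a size of zero on the first call follows the POSIX definition
     of regerror(). The name compile_error_message is hypothetical.

```c
#include <regex.h>   /* assumption: <pcreposix.h> when building against PCRE */
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd copy of the error message for a
   pattern that fails to compile. Returns NULL if the pattern compiles
   successfully or if memory cannot be obtained. The caller frees it. */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    size_t needed;
    char *buf;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&re);       /* compiled fine: there is no error to report */
        return NULL;
    }

    /* First call: size zero, so only the required length is returned. */
    needed = regerror(rc, &re, NULL, 0);

    buf = malloc(needed);
    if (buf != NULL)
        regerror(rc, &re, buf, needed);   /* second call fills the buffer */
    return buf;
}
```

     The wording of the message differs between libraries, so callers
     should treat it as text for display rather than something to
     compare against.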
3226    
3227    
3228    STORAGE
3229    
3230         Compiling a regular expression causes memory to be allocated
3231         and  associated  with  the preg structure. The function reg-
3232         free() frees all such memory, after which preg may no longer
3233         be used as a compiled expression.
3234    
3235    
3236  AUTHOR  AUTHOR
3237    
3238       Philip Hazel <ph10@cam.ac.uk>       Philip Hazel <ph10@cam.ac.uk>
3239       University Computing Service,       University Computing Service,
      New Museums Site,  
3240       Cambridge CB2 3QG, England.       Cambridge CB2 3QG, England.
      Phone: +44 1223 334714  
3241    
3242       Last updated: 27 January 2000  Last updated: 03 February 2003
3243       Copyright (c) 1997-2000 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
3244    -----------------------------------------------------------------------------
3245    
3246    NAME
3247         PCRE - Perl-compatible regular expressions
3248    
3249    
3250    PCRE SAMPLE PROGRAM
3251    
3252         A simple, complete demonstration program, to get you started
3253         with  using  PCRE, is supplied in the file pcredemo.c in the
3254         PCRE distribution.
3255    
3256         The program compiles the  regular  expression  that  is  its
3257         first argument, and matches it against the subject string in
3258         its second argument. No PCRE options are  set,  and  default
3259         character tables are used. If matching succeeds, the program
3260         outputs the portion of the subject  that  matched,  together
3261         with the contents of any captured substrings.
3262    
3263         If the -g option is given on the command line,  the  program
3264         then  goes on to check for further matches of the same regu-
3265         lar expression in the same subject string. The  logic  is  a
3266         little  bit tricky because of the possibility of matching an
3267         empty string. Comments in the code explain what is going on.
3268    
3269         On a Unix system that has PCRE installed in /usr/local,  you
3270         can  compile  the demonstration program using a command like
3271         this:
3272    
3273           gcc -o pcredemo pcredemo.c -I/usr/local/include \
3274               -L/usr/local/lib -lpcre
3275    
3276         Then you can run simple tests like this:
3277    
3278           ./pcredemo 'cat|dog' 'the cat sat on the mat'
3279           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3280    
3281         Note that there is a much more comprehensive  test  program,
3282         called  pcretest,  which  supports  many more facilities for
3283         testing  regular  expressions  and  the  PCRE  library.  The
3284         pcredemo program is provided as a simple coding example.
3285    
3286         On some operating systems (e.g.  Solaris)  you  may  get  an
3287         error like this when you try to run pcredemo:
3288    
3289           ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
3291    
3292         This is caused by the way shared library  support  works  on
3293         those systems. You need to add
3294    
3295           -R/usr/local/lib
3296    
3297         to the compile command to get round this problem.
3298    
3299    Last updated: 28 January 2003
3300    Copyright (c) 1997-2003 University of Cambridge.
3301    -----------------------------------------------------------------------------
3302    
