Diff of /code/trunk/doc/pcre.txt

revision 73 by nigel, Sat Feb 24 21:40:30 2007 UTC revision 83 by nigel, Sat Feb 24 21:41:06 2007 UTC
# Line 1  Line 1 
1    -----------------------------------------------------------------------------
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
# Line 5  synopses of each function in the library Line 6  synopses of each function in the library
6  separate text files for the pcregrep and pcretest commands.  separate text files for the pcregrep and pcretest commands.
7  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
8    
 PCRE(3)                                                                PCRE(3)  
9    
10    PCRE(3)                                                                PCRE(3)
11    
12    
13  NAME  NAME
14         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
15    
16  DESCRIPTION  
17    INTRODUCTION
18    
19         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
20         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
21         just  a  few  differences.  The current implementation of PCRE (release         just  a  few  differences.  The current implementation of PCRE (release
22         4.x) corresponds approximately with Perl  5.8,  including  support  for         6.x) corresponds approximately with Perl  5.8,  including  support  for
23         UTF-8  encoded  strings.   However,  this  support has to be explicitly         UTF-8 encoded strings and Unicode general category properties. However,
24         enabled; it is not the default.         this support has to be explicitly enabled; it is not the default.
25    
26         PCRE is written in C and released as a C library. However, a number  of         In addition to the Perl-compatible matching function,  PCRE  also  con-
27         people  have  written  wrappers  and interfaces of various kinds. A C++         tains  an  alternative matching function that matches the same compiled
28         class is included in these contributions, which can  be  found  in  the         patterns in a different way. In certain circumstances, the  alternative
29           function  has  some  advantages.  For  a discussion of the two matching
30           algorithms, see the pcrematching page.
31    
32           PCRE is written in C and released as a C library. A  number  of  people
33           have  written  wrappers and interfaces of various kinds. In particular,
34           Google Inc.  have provided a comprehensive C++  wrapper.  This  is  now
35           included as part of the PCRE distribution. The pcrecpp page has details
36           of this interface. Other people's contributions can  be  found  in  the
37         Contrib directory at the primary FTP site, which is:         Contrib directory at the primary FTP site, which is:
38    
39         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
# Line 34  DESCRIPTION Line 44  DESCRIPTION
44    
45         Some  features  of  PCRE can be included, excluded, or changed when the         Some  features  of  PCRE can be included, excluded, or changed when the
46         library is built. The pcre_config() function makes it  possible  for  a         library is built. The pcre_config() function makes it  possible  for  a
47         client  to  discover  which features are available. Documentation about         client  to  discover  which  features are available. The features them-
48         building PCRE for various operating systems can be found in the  README         selves are described in the pcrebuild page. Documentation about  build-
49         file in the source distribution.         ing  PCRE for various operating systems can be found in the README file
50           in the source distribution.
51    
52           The library contains a number of undocumented  internal  functions  and
53           data  tables  that  are  used by more than one of the exported external
54           functions, but which are not intended  for  use  by  external  callers.
55           Their  names  all begin with "_pcre_", which hopefully will not provoke
56           any name clashes. In some environments, it is possible to control which
57           external  symbols  are  exported when a shared library is built, and in
58           these cases the undocumented symbols are not exported.
59    
60    
61  USER DOCUMENTATION  USER DOCUMENTATION
62    
63         The user documentation for PCRE has been split up into a number of dif-         The user documentation for PCRE comprises a number  of  different  sec-
64         ferent sections. In the "man" format, each of these is a separate  "man         tions.  In the "man" format, each of these is a separate "man page". In
65         page".  In  the  HTML  format, each is a separate page, linked from the         the HTML format, each is a separate page, linked from the  index  page.
66         index page. In the plain text format, all  the  sections  are  concate-         In  the  plain text format, all the sections are concatenated, for ease
67         nated, for ease of searching. The sections are as follows:         of searching. The sections are as follows:
68    
69           pcre              this document           pcre              this document
70           pcreapi           details of PCRE's native API           pcreapi           details of PCRE's native C API
71           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
72           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
73           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
74             pcrecpp           details of the C++ wrapper
75           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
76             pcrematching      discussion of the two matching algorithms
77             pcrepartial       details of the partial matching facility
78           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
79                               regular expressions                               regular expressions
80           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
81           pcreposix         the POSIX-compatible API           pcreposix         the POSIX-compatible C API
82             pcreprecompile    details of saving and re-using precompiled patterns
83           pcresample        discussion of the sample program           pcresample        discussion of the sample program
84           pcretest          the pcretest testing command           pcretest          description of the pcretest testing command
85    
86         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
87         each library function, listing its arguments and results.         each C library function, listing its arguments and results.
88    
89    
90  LIMITATIONS  LIMITATIONS
# Line 74  LIMITATIONS Line 97  LIMITATIONS
97         process  regular  expressions  that are truly enormous, you can compile         process  regular  expressions  that are truly enormous, you can compile
98         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
99         the  source  distribution and the pcrebuild documentation for details).         the  source  distribution and the pcrebuild documentation for details).
100         If these cases the limit is substantially larger.  However,  the  speed         In these cases the limit is substantially larger.  However,  the  speed
101         of execution will be slower.         of execution will be slower.
102    
103         All values in repeating quantifiers must be less than 65536.  The maxi-         All values in repeating quantifiers must be less than 65536.  The maxi-
# Line 86  LIMITATIONS Line 109  LIMITATIONS
109         tern, is 200.         tern, is 200.
110    
111         The  maximum  length of a subject string is the largest positive number         The  maximum  length of a subject string is the largest positive number
112         that an integer variable can hold. However, PCRE uses recursion to han-         that an integer variable can hold. However, when using the  traditional
113         dle  subpatterns  and indefinite repetition. This means that the avail-         matching function, PCRE uses recursion to handle subpatterns and indef-
114         able stack space may limit the size of a subject  string  that  can  be         inite repetition.  This means that the available stack space may  limit
115         processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
   
116    
 UTF-8 SUPPORT  
117    
118         Starting  at  release  3.3,  PCRE  has  had  some support for character  UTF-8 AND UNICODE PROPERTY SUPPORT
119         strings encoded in the UTF-8 format. For  release  4.0  this  has  been  
120         greatly extended to cover most common requirements.         From release 3.3, PCRE has  had  some  support  for  character  strings
121           encoded  in the UTF-8 format. For release 4.0 this was greatly extended
122           to cover most common requirements, and in release 5.0  additional  sup-
123           port for Unicode general category properties was added.
124    
125         In  order  to process UTF-8 strings, you must build PCRE to include UTF-8         In  order  to process UTF-8 strings, you must build PCRE to include UTF-8
126         support in the code, and, in addition,  you  must  call  pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
# Line 109  UTF-8 SUPPORT Line 133  UTF-8 SUPPORT
133         is  limited  to testing the PCRE_UTF8 flag in several places, so should         is  limited  to testing the PCRE_UTF8 flag in several places, so should
134         not be very large.         not be very large.
135    
136           If PCRE is built with Unicode character property support (which implies
137           UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
138           ported.  The available properties that can be tested are limited to the
139           general  category  properties such as Lu for an upper case letter or Nd
140           for a decimal number. A full list is given in the pcrepattern  documen-
141           tation. The PCRE library is increased in size by about 90K when Unicode
142           property support is included.
143    
144         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
145    
146         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
# Line 136  UTF-8 SUPPORT Line 168  UTF-8 SUPPORT
168         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
169         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
170    
171         5.  The  dot  metacharacter  matches  one  UTF-8 character instead of a         5.  The dot metacharacter matches one UTF-8 character instead of a sin-
172         single byte.         gle byte.
173    
174         6. The escape sequence \C can be used to match a single byte  in  UTF-8         6. The escape sequence \C can be used to match a single byte  in  UTF-8
175         mode, but its use can lead to some strange effects.         mode,  but  its  use can lead to some strange effects. This facility is
176           not available in the alternative matching function, pcre_dfa_exec().
        7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly  
        test characters of any code value, but the characters that PCRE  recog-  
        nizes  as  digits,  spaces,  or  word characters remain the same set as  
        before, all with values less than 256.  
   
        8. Case-insensitive matching applies only to  characters  whose  values  
        are  less  than  256.  PCRE  does  not support the notion of "case" for  
        higher-valued characters.  
177    
178         9. PCRE does not support the use of Unicode tables  and  properties  or         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
179         the Perl escapes \p, \P, and \X.         test  characters of any code value, but the characters that PCRE recog-
180           nizes as digits, spaces, or word characters  remain  the  same  set  as
181           before, all with values less than 256. This remains true even when PCRE
182           includes Unicode property support, because to do otherwise  would  slow
183           down  PCRE in many common cases. If you really want to test for a wider
184           sense of, say, "digit", you must use Unicode  property  tests  such  as
185           \p{Nd}.
186    
187           8.  Similarly,  characters that match the POSIX named character classes
188           are all low-valued characters.
189    
190           9. Case-insensitive matching applies only to  characters  whose  values
191           are  less than 128, unless PCRE is built with Unicode property support.
192           Even when Unicode property support is available, PCRE  still  uses  its
193           own  character  tables when checking the case of low-valued characters,
194           so as not to degrade performance.  The Unicode property information  is
195           used only for characters with higher values.
196    
197    
198  AUTHOR  AUTHOR
199    
200         Philip Hazel <ph10@cam.ac.uk>         Philip Hazel
201         University Computing Service,         University Computing Service,
202         Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
        Phone: +44 1223 334714  
203    
204  Last updated: 20 August 2003         Putting  an actual email address here seems to have been a spam magnet,
205  Copyright (c) 1997-2003 University of Cambridge.         so I've taken it away. If you want to email me, use my initial and sur-
206  -----------------------------------------------------------------------------         name, separated by a dot, at the domain ucs.cam.ac.uk.
207    
208    Last updated: 07 March 2005
209    Copyright (c) 1997-2005 University of Cambridge.
210    ------------------------------------------------------------------------------
211    
 PCRE(3)                                                                PCRE(3)  
212    
213    PCREBUILD(3)                                                      PCREBUILD(3)
214    
215    
216  NAME  NAME
217         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
218    
219    
220  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
221    
222         This  document  describes  the  optional  features  of PCRE that can be         This  document  describes  the  optional  features  of PCRE that can be
223         selected when the library is compiled. They are all selected, or  dese-         selected when the library is compiled. They are all selected, or  dese-
224         lected,  by  providing  options  to  the  configure script which is run         lected, by providing options to the configure script that is run before
225         before the make command. The complete list  of  options  for  configure         the make command. The complete list of  options  for  configure  (which
226         (which  includes the standard ones such as the selection of the instal-         includes  the  standard  ones such as the selection of the installation
227         lation directory) can be obtained by running         directory) can be obtained by running
228    
229           ./configure --help           ./configure --help
230    
# Line 192  PCRE BUILD-TIME OPTIONS Line 236  PCRE BUILD-TIME OPTIONS
236         not described.         not described.
237    
238    
239    C++ SUPPORT
240    
241           By default, the configure script will search for a C++ compiler and C++
242           header files. If it finds them, it automatically builds the C++ wrapper
243           library for PCRE. You can disable this by adding
244    
245             --disable-cpp
246    
247           to the configure command.
248    
249    
250  UTF-8 SUPPORT  UTF-8 SUPPORT
251    
252         To build PCRE with support for UTF-8 character strings, add         To build PCRE with support for UTF-8 character strings, add
# Line 204  UTF-8 SUPPORT Line 259  UTF-8 SUPPORT
259         function.         function.
260    
261    
262    UNICODE CHARACTER PROPERTY SUPPORT
263    
264           UTF-8 support allows PCRE to process character values greater than  255
265           in  the  strings that it handles. On its own, however, it does not pro-
266           vide any facilities for accessing the properties of such characters. If
267           you  want  to  be able to use the pattern escapes \P, \p, and \X, which
268           refer to Unicode character properties, you must add
269    
270             --enable-unicode-properties
271    
272           to the configure command. This implies UTF-8 support, even if you  have
273           not explicitly requested it.
274    
275           Including  Unicode  property  support  adds around 90K of tables to the
276           PCRE library, approximately doubling its size. Only the  general  cate-
277           gory  properties  such as Lu and Nd are supported. Details are given in
278           the pcrepattern documentation.
279    
280    
281  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
282    
283         By default, PCRE treats character 10 (linefeed) as the newline  charac-         By default, PCRE treats character 10 (linefeed) as the newline  charac-
# Line 231  BUILDING SHARED AND STATIC LIBRARIES Line 305  BUILDING SHARED AND STATIC LIBRARIES
305    
306  POSIX MALLOC USAGE  POSIX MALLOC USAGE
307    
308         When PCRE is called through the  POSIX  interface  (see  the  pcreposix         When PCRE is called through the POSIX interface (see the pcreposix doc-
309         documentation),  additional working storage is required for holding the         umentation),  additional  working  storage  is required for holding the
310         pointers to capturing substrings because PCRE requires  three  integers         pointers to capturing substrings, because PCRE requires three  integers
311         per  substring,  whereas  the POSIX interface provides only two. If the         per  substring,  whereas  the POSIX interface provides only two. If the
312         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
313         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
# Line 247  POSIX MALLOC USAGE Line 321  POSIX MALLOC USAGE
321    
322  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
323    
324         Internally,  PCRE  has a function called match() which it calls repeat-         Internally,  PCRE has a function called match(), which it calls repeat-
325         edly (possibly recursively) when performing a  matching  operation.  By         edly  (possibly  recursively)  when  matching  a   pattern   with   the
326         limiting  the  number of times this function may be called, a limit can         pcre_exec()  function.  By controlling the maximum number of times this
327         be placed on the resources used by a single call  to  pcre_exec().  The         function may be called during a single matching operation, a limit  can
328         limit  can be changed at run time, as described in the pcreapi documen-         be  placed  on  the resources used by a single call to pcre_exec(). The
329         tation. The default is 10 million, but this can be changed by adding  a         limit can be changed at run time, as described in the pcreapi  documen-
330           tation.  The default is 10 million, but this can be changed by adding a
331         setting such as         setting such as
332    
333           --with-match-limit=500000           --with-match-limit=500000
334    
335         to the configure command.         to  the  configure  command.  This  setting  has  no  effect   on   the
336           pcre_dfa_exec() matching function.
337    
338    
339  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
340    
341         Within  a  compiled  pattern,  offset values are used to point from one         Within  a  compiled  pattern,  offset values are used to point from one
342         part to another (for example, from an opening parenthesis to an  alter-         part to another (for example, from an opening parenthesis to an  alter-
343         nation  metacharacter).  By  default two-byte values are used for these         nation  metacharacter).  By default, two-byte values are used for these
344         offsets, leading to a maximum size for a  compiled  pattern  of  around         offsets, leading to a maximum size for a  compiled  pattern  of  around
345         64K.  This  is sufficient to handle all but the most gigantic patterns.         64K.  This  is sufficient to handle all but the most gigantic patterns.
346         Nevertheless, some people do want to process enormous patterns,  so  it         Nevertheless, some people do want to process enormous patterns,  so  it
# Line 285  HANDLING VERY LARGE PATTERNS Line 361  HANDLING VERY LARGE PATTERNS
361    
362  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
363    
364         PCRE  implements  backtracking while matching by making recursive calls         When matching with the pcre_exec() function, PCRE implements backtrack-
365         to an internal function called match(). In environments where the  size         ing by making recursive calls to an internal function  called  match().
366         of the stack is limited, this can severely limit PCRE's operation. (The         In  environments  where  the size of the stack is limited, this can se-
367         Unix environment does not usually suffer from this problem.) An  alter-         verely limit PCRE's operation. (The Unix environment does  not  usually
368         native  approach  that  uses  memory  from  the  heap to remember data,         suffer  from  this  problem.)  An alternative approach that uses memory
369         instead of using recursive function calls, has been implemented to work         from the heap to remember data, instead  of  using  recursive  function
370         round  this  problem. If you want to build a version of PCRE that works         calls,  has been implemented to work round this problem. If you want to
371         this way, add         build a version of PCRE that works this way, add
372    
373           --disable-stack-for-recursion           --disable-stack-for-recursion
374    
375         to the configure command. With this configuration, PCRE  will  use  the         to the configure command. With this configuration, PCRE  will  use  the
376         pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory         pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
377         management functions. Separate functions are provided because the usage         ment functions. Separate functions are provided because  the  usage  is
378         is very predictable: the block sizes requested are always the same, and         very  predictable:  the  block sizes requested are always the same, and
379         the blocks are always freed in reverse order. A calling  program  might         the blocks are always freed in reverse order. A calling  program  might
380         be  able  to implement optimized functions that perform better than the         be  able  to implement optimized functions that perform better than the
381         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more
382         slowly when built in this way.         slowly when built in this way. This option affects only the pcre_exec()
383           function; it is not relevant for the the pcre_dfa_exec() function.
384    
385    
386  USING EBCDIC CODE  USING EBCDIC CODE
387    
388         PCRE  assumes  by  default that it will run in an environment where the         PCRE assumes by default that it will run in an  environment  where  the
389         character code is ASCII (or UTF-8, which is a superset of ASCII).  PCRE         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).
390         can, however, be compiled to run in an EBCDIC environment by adding         PCRE can, however, be compiled to  run  in  an  EBCDIC  environment  by
391           adding
392    
393           --enable-ebcdic           --enable-ebcdic
394    
395         to the configure command.         to the configure command.
396    
397  Last updated: 09 December 2003  Last updated: 15 August 2005
398  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2005 University of Cambridge.
399  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
400    
401    
402    PCREMATCHING(3)                                                PCREMATCHING(3)
403    
404    
405    NAME
406           PCRE - Perl-compatible regular expressions
407    
408    
409    PCRE MATCHING ALGORITHMS
410    
411           This document describes the two different algorithms that are available
412           in PCRE for matching a compiled regular expression against a given sub-
413           ject  string.  The  "standard"  algorithm  is  the  one provided by the
414           pcre_exec() function.  This works in the same was  as  Perl's  matching
415           function, and provides a Perl-compatible matching operation.
416    
417           An  alternative  algorithm is provided by the pcre_dfa_exec() function;
418           this operates in a different way, and is not  Perl-compatible.  It  has
419           advantages  and disadvantages compared with the standard algorithm, and
420           these are described below.
421    
422           When there is only one possible way in which a given subject string can
423           match  a pattern, the two algorithms give the same answer. A difference
424           arises, however, when there are multiple possibilities. For example, if
425           the pattern
426    
427             ^<.*>
428    
429           is matched against the string
430    
431             <something> <something else> <something further>
432    
433           there are three possible answers. The standard algorithm finds only one
434           of them, whereas the DFA algorithm finds all three.
435    
436    
437    REGULAR EXPRESSIONS AS TREES
438    
439           The set of strings that are matched by a regular expression can be rep-
440           resented  as  a  tree structure. An unlimited repetition in the pattern
441           makes the tree of infinite size, but it is still a tree.  Matching  the
442           pattern  to a given subject string (from a given starting point) can be
443           thought of as a search of the tree.  There are  two  standard  ways  to
444           search  a  tree: depth-first and breadth-first, and these correspond to
445           the two matching algorithms provided by PCRE.
446    
447    
448    THE STANDARD MATCHING ALGORITHM
449    
450           In the terminology of Jeffrey Friedl's book Mastering  Regular  Expres-
451           sions,  the  standard  algorithm  is  an "NFA algorithm". It conducts a
452           depth-first search of the pattern tree. That is, it  proceeds  along  a
453           single path through the tree, checking that the subject matches what is
454           required. When there is a mismatch, the algorithm  tries  any  alterna-
455           tives  at  the  current point, and if they all fail, it backs up to the
456           previous branch point in the  tree,  and  tries  the  next  alternative
457           branch  at  that  level.  This often involves backing up (moving to the
458           left) in the subject string as well.  The  order  in  which  repetition
459           branches  are  tried  is controlled by the greedy or ungreedy nature of
460           the quantifier.
461    
462           If a leaf node is reached, a matching string has  been  found,  and  at
463           that  point the algorithm stops. Thus, if there is more than one possi-
464           ble match, this algorithm returns the first one that it finds.  Whether
465           this  is the shortest, the longest, or some intermediate length depends
466           on the way the greedy and ungreedy repetition quantifiers are specified
467           in the pattern.
468    
469           Because  it  ends  up  with a single path through the tree, it is rela-
470           tively straightforward for this algorithm to keep  track  of  the  sub-
471           strings  that  are  matched  by portions of the pattern in parentheses.
472           This provides support for capturing parentheses and back references.
473    
 PCRE(3)                                                                PCRE(3)  
474    
475    THE DFA MATCHING ALGORITHM
476    
477           DFA stands for "deterministic finite automaton", but you do not need to
478           understand the origins of that name. This algorithm conducts a breadth-
479           first search of the tree. Starting from the first matching point in the
480           subject,  it scans the subject string from left to right, once, charac-
481           ter by character, and as it does  this,  it  remembers  all  the  paths
482           through the tree that represent valid matches.
483    
484           The  scan  continues until either the end of the subject is reached, or
485           there are no more unterminated paths. At this point,  terminated  paths
486           represent  the different matching possibilities (if there are none, the
487           match has failed).  Thus, if there is more  than  one  possible  match,
488           this algorithm finds all of them, and in particular, it finds the long-
489           est. In PCRE, there is an option to stop the algorithm after the  first
490           match (which is necessarily the shortest) has been found.
491    
492           Note that all the matches that are found start at the same point in the
493           subject. If the pattern
494    
495             cat(er(pillar)?)?
496    
497           is matched against the string "the caterpillar catchment",  the  result
498           will  be the three strings "cat", "cater", and "caterpillar" that start
499           at the fifth character of the subject. The algorithm does not  automat-
500           ically move on to find matches that start at later positions.
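
       As a rough illustration of the result described above, the following  C
       sketch models a pattern of the form cat(er(pillar)?)? as the literal
       "caterpillar" with three terminating points. It is a toy that does  not
       call  PCRE;  the  names  scan_all_matches,  literal, and accept_after are
       invented for this example. It scans once from a fixed  starting  offset
       and records the end offset of every match it passes, in the order found
       (shortest first).

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not PCRE code: the literal text of the longest match and the
   lengths at which a path through the pattern terminates. */
static const char literal[] = "caterpillar";
static const size_t accept_after[] = { 3, 5, 11 };

/* Scan the subject once, left to right, from offset start. Each time a
   terminating point is passed, store the end offset of that match in ends[].
   Returns the number of matches recorded (zero means no match at start). */
size_t scan_all_matches(const char *subject, size_t start,
                        size_t *ends, size_t max_ends)
{
    const size_t n_accept = sizeof accept_after / sizeof accept_after[0];
    size_t count = 0;
    size_t consumed = 0;
    size_t next_accept = 0;

    assert(subject != NULL && ends != NULL);

    /* The subject's terminating zero never equals a literal character,
       so the loop cannot run past the end of the subject. */
    while (literal[consumed] != '\0' &&
           subject[start + consumed] == literal[consumed]) {
        consumed++;
        if (next_accept < n_accept && consumed == accept_after[next_accept]) {
            if (count < max_ends)
                ends[count++] = start + consumed;
            next_accept++;
        }
    }
    return count;
}
```

       All recorded matches share the same starting offset; a caller that  wants
       matches at later positions has to call the function again with a larger
       value of start.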
501    
502           There are a number of features of PCRE regular expressions that are not
503           supported by the DFA matching algorithm. They are as follows:
504    
505           1. Because the algorithm finds all  possible  matches,  the  greedy  or
506           ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
507           ungreedy quantifiers are treated in exactly the same way.
508    
509           2. When dealing with multiple paths through the tree simultaneously, it
510           is  not  straightforward  to  keep track of captured substrings for the
511           different matching possibilities, and  PCRE's  implementation  of  this
512           algorithm does not attempt to do this. This means that no captured sub-
513           strings are available.
514    
515           3. Because no substrings are captured, back references within the  pat-
516           tern are not supported, and cause errors if encountered.
517    
518           4.  For  the same reason, conditional expressions that use a backrefer-
519           ence as the condition are not supported.
520    
521           5. Callouts are supported, but the value of the  capture_top  field  is
522           always 1, and the value of the capture_last field is always -1.
523    
524           6.  The \C escape sequence, which (in the standard algorithm) matches a
525           single byte, even in UTF-8 mode, is not supported because the DFA algo-
526           rithm moves through the subject string one character at a time, for all
527           active paths through the tree.
528    
529    
530    ADVANTAGES OF THE DFA ALGORITHM
531    
532           Using the DFA matching algorithm provides the following advantages:
533    
534           1. All possible matches (at a single point in the subject) are automat-
535           ically  found,  and  in particular, the longest match is found. To find
536           more than one match using the standard algorithm, you have to do kludgy
537           things with callouts.
538    
539           2.  There is much better support for partial matching. The restrictions
540           on the content of the pattern that apply when using the standard  algo-
541           rithm  for partial matching do not apply to the DFA algorithm. For non-
542           anchored patterns, the starting position of a partial match  is  avail-
543           able.
544    
545           3.  Because  the  DFA algorithm scans the subject string just once, and
546           never needs to backtrack, it is possible  to  pass  very  long  subject
547           strings  to  the matching function in several pieces, checking for par-
548           tial matching each time.
549    
550    
551    DISADVANTAGES OF THE DFA ALGORITHM
552    
553           The DFA algorithm suffers from a number of disadvantages:
554    
555           1. It is substantially slower than  the  standard  algorithm.  This  is
556           partly  because  it has to search for all possible matches, but is also
557           because it is less susceptible to optimization.
558    
559           2. Capturing parentheses and back references are not supported.
560    
561           3. The "atomic group" feature of PCRE regular expressions is supported,
562           but  does not provide the advantage that it does for the standard algo-
563           rithm.
564    
565    Last updated: 28 February 2005
566    Copyright (c) 1997-2005 University of Cambridge.
567    ------------------------------------------------------------------------------
568    
569    
570    PCREAPI(3)                                                          PCREAPI(3)
571    
572    
573  NAME  NAME
574         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
575    
576  SYNOPSIS OF PCRE API  
577    PCRE NATIVE API
578    
579         #include <pcre.h>         #include <pcre.h>
580    
# Line 335  SYNOPSIS OF PCRE API Line 582  SYNOPSIS OF PCRE API
582              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
583              const unsigned char *tableptr);              const unsigned char *tableptr);
584    
585           pcre *pcre_compile2(const char *pattern, int options,
586                int *errorcodeptr,
587                const char **errptr, int *erroffset,
588                const unsigned char *tableptr);
589    
590         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
591              const char **errptr);              const char **errptr);
592    
# Line 342  SYNOPSIS OF PCRE API Line 594  SYNOPSIS OF PCRE API
594              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
595              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
596    
597           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
598                const char *subject, int length, int startoffset,
599                int options, int *ovector, int ovecsize,
600                int *workspace, int wscount);
601    
602         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
603              const char *subject, int *ovector,              const char *subject, int *ovector,
604              int stringcount, const char *stringname,              int stringcount, const char *stringname,
# Line 377  SYNOPSIS OF PCRE API Line 634  SYNOPSIS OF PCRE API
634    
635         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
636    
637           int pcre_refcount(pcre *code, int adjust);
638    
639         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
640    
641         char *pcre_version(void);         char *pcre_version(void);
# Line 392  SYNOPSIS OF PCRE API Line 651  SYNOPSIS OF PCRE API
651         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
652    
653    
654  PCRE API  PCRE API OVERVIEW
655    
656         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
657         is also a set of wrapper functions that correspond to the POSIX regular         is also a set of wrapper functions that correspond to the POSIX regular
658         expression API.  These are described in the pcreposix documentation.         expression  API.  These  are  described in the pcreposix documentation.
659           Both of these APIs define a set of C function calls. A C++  wrapper  is
660         The  native  API  function  prototypes  are  defined in the header file         distributed with PCRE. It is documented in the pcrecpp page.
661         pcre.h, and on Unix systems the library itself is called libpcre.a,  so  
662         can be accessed by adding -lpcre to the command for linking an applica-         The  native  API  C  function prototypes are defined in the header file
663         tion which calls it. The header file defines the macros PCRE_MAJOR  and         pcre.h, and on Unix systems the library itself is called  libpcre.   It
664         PCRE_MINOR  to  contain  the  major  and  minor release numbers for the         can normally be accessed by adding -lpcre to the command for linking an
665         library. Applications can use these to include  support  for  different         application  that  uses  PCRE.  The  header  file  defines  the  macros
666         releases.         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
667           bers for the library.  Applications can use these  to  include  support
668         The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used         for different releases of PCRE.
669         for compiling and matching regular expressions. A sample  program  that  
670         demonstrates  the simplest way of using them is given in the file pcre-         The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
671         demo.c. The pcresample documentation describes how to run it.         pcre_exec() are used for compiling and matching regular expressions  in
672           a  Perl-compatible  manner. A sample program that demonstrates the sim-
673         There are convenience functions for extracting captured substrings from         plest way of using them is provided in the file  called  pcredemo.c  in
674         a matched subject string. They are:         the  source distribution. The pcresample documentation describes how to
675           run it.
676    
677           A second matching function, pcre_dfa_exec(), which is not Perl-compati-
678           ble,  is  also provided. This uses a different algorithm for the match-
679           ing. This allows it to find all possible matches (at a given  point  in
680           the  subject),  not  just  one. However, this algorithm does not return
681           captured substrings. A description of the two matching  algorithms  and
682           their  advantages  and disadvantages is given in the pcrematching docu-
683           mentation.
684    
685           In addition to the main compiling and  matching  functions,  there  are
686           convenience functions for extracting captured substrings from a subject
687           string that is matched by pcre_exec(). They are:
688    
689           pcre_copy_substring()           pcre_copy_substring()
690           pcre_copy_named_substring()           pcre_copy_named_substring()
691           pcre_get_substring()           pcre_get_substring()
692           pcre_get_named_substring()           pcre_get_named_substring()
693           pcre_get_substring_list()           pcre_get_substring_list()
694             pcre_get_stringnumber()
695    
696         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
697         to free the memory used for extracted strings.         to free the memory used for extracted strings.
698    
699         The function pcre_maketables() is used (optionally) to build a  set  of         The  function  pcre_maketables()  is  used  to build a set of character
700         character tables in the current locale for passing to pcre_compile().         tables  in  the  current  locale   for   passing   to   pcre_compile(),
701           pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
702         The  function  pcre_fullinfo()  is used to find out information about a         provided for specialist use.  Most  commonly,  no  special  tables  are
703         compiled pattern; pcre_info() is an obsolete version which returns only         passed,  in  which case internal tables that are generated when PCRE is
704         some  of  the available information, but is retained for backwards com-         built are used.
705         patibility.  The function pcre_version() returns a pointer to a  string  
706           The function pcre_fullinfo() is used to find out  information  about  a
707           compiled  pattern; pcre_info() is an obsolete version that returns only
708           some of the available information, but is retained for  backwards  com-
709           patibility.   The function pcre_version() returns a pointer to a string
710         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
711    
712         The  global  variables  pcre_malloc and pcre_free initially contain the         The function pcre_refcount() maintains a  reference  count  in  a  data
713         entry points of the standard  malloc()  and  free()  functions  respec-         block  containing  a compiled pattern. This is provided for the benefit
714           of object-oriented applications.
715    
716           The global variables pcre_malloc and pcre_free  initially  contain  the
717           entry  points  of  the  standard malloc() and free() functions, respec-
718         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
719         so a calling program can replace them if it  wishes  to  intercept  the         so  a  calling  program  can replace them if it wishes to intercept the
720         calls. This should be done before calling any PCRE functions.         calls. This should be done before calling any PCRE functions.
721    
722         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
723         indirections to memory management functions.  These  special  functions         indirections  to  memory  management functions. These special functions
724         are  used  only  when  PCRE is compiled to use the heap for remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
725         data, instead of recursive function calls. This is a  non-standard  way         data, instead of recursive function calls, when running the pcre_exec()
726         of  building  PCRE,  for  use in environments that have limited stacks.         function. This is a non-standard way of building PCRE, for use in envi-
727         Because of the greater use of memory management, it runs  more  slowly.         ronments that have limited stacks. Because of the greater use of memory
728         Separate  functions  are provided so that special-purpose external code         management, it runs more slowly.  Separate functions  are  provided  so
729         can be used for this case. When used, these functions are always called         that  special-purpose  external  code  can  be used for this case. When
730         in  a  stack-like  manner  (last obtained, first freed), and always for         used, these functions are always called in a  stack-like  manner  (last
731         memory blocks of the same size.         obtained,  first freed), and always for memory blocks of the same size.
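
       The guarantee that blocks are always the same size makes a very  simple
       allocator  possible.  The  following  C  sketch is one way such a pair of
       functions might look; it is an illustration under that assumption,  not
       code  from  PCRE.  The  names  frame_alloc, frame_free, and
       frame_release_all are hypothetical. Freed blocks are kept  on  a  list  and
       handed  out  again, so repeated obtain/free cycles avoid calls to
       malloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helpers for illustration only. A freed block is reused to
   hold the link to the next free block. */
typedef struct free_block {
    struct free_block *next;
} free_block;

static free_block *free_list = NULL;  /* blocks released and kept for reuse */
static size_t block_size = 0;         /* size of the first request seen */

/* Same shape as malloc(): returns a block of the requested size. */
void *frame_alloc(size_t size)
{
    if (size < sizeof(free_block))
        size = sizeof(free_block);
    if (block_size == 0)
        block_size = size;
    assert(size == block_size);       /* relies on "always the same size" */

    if (free_list != NULL) {
        free_block *b = free_list;    /* most recently freed block first */
        free_list = b->next;
        return b;
    }
    return malloc(size);
}

/* Same shape as free(): keeps the block for the next request. */
void frame_free(void *p)
{
    free_block *b = p;
    if (b == NULL)
        return;
    b->next = free_list;
    free_list = b;
}

/* Hand every cached block back to the system. */
void frame_release_all(void)
{
    while (free_list != NULL) {
        free_block *b = free_list;
        free_list = b->next;
        free(b);
    }
    block_size = 0;
}
```

       In a program that includes pcre.h, these functions have the right  shape
       to  be  assigned to pcre_stack_malloc and pcre_stack_free before any PCRE
       function is called.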
732    
733         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
734         by  the  caller  to  a "callout" function, which PCRE will then call at         by  the  caller  to  a "callout" function, which PCRE will then call at
# Line 467  MULTITHREADING Line 748  MULTITHREADING
748         at once.         at once.
749    
750    
751    SAVING PRECOMPILED PATTERNS FOR LATER USE
752    
753           The compiled form of a regular expression can be saved and re-used at a
754           later time, possibly by a different program, and even on a  host  other
755           than  the  one  on  which  it  was  compiled.  Details are given in the
756           pcreprecompile documentation.
757    
758    
759  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
760    
761         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
762    
763         The  function pcre_config() makes it possible for a PCRE client to dis-         The function pcre_config() makes it possible for a PCRE client to  dis-
764         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
765         The  pcrebuild documentation has more details about these optional fea-         The pcrebuild documentation has more details about these optional  fea-
766         tures.         tures.
767    
768         The first argument for pcre_config() is an  integer,  specifying  which         The  first  argument  for pcre_config() is an integer, specifying which
769         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
770         into which the information is  placed.  The  following  information  is         into  which  the  information  is  placed. The following information is
771         available:         available:
772    
773           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
774    
775         The  output is an integer that is set to one if UTF-8 support is avail-         The output is an integer that is set to one if UTF-8 support is  avail-
776         able; otherwise it is set to zero.         able; otherwise it is set to zero.
777    
778             PCRE_CONFIG_UNICODE_PROPERTIES
779    
780           The  output  is  an  integer  that is set to one if support for Unicode
781           character properties is available; otherwise it is set to zero.
782    
783           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
784    
785         The output is an integer that is set to the value of the code  that  is         The output is an integer that is set to the value of the code  that  is
# Line 516  CHECKING BUILD-TIME OPTIONS Line 810  CHECKING BUILD-TIME OPTIONS
810    
811           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
812    
813         The output is an integer that is set to one if  internal  recursion  is         The output is an integer that is set to one if internal recursion  when
814         implemented  by recursive function calls that use the stack to remember         running pcre_exec() is implemented by recursive function calls that use
815         their state. This is the usual way that PCRE is compiled. The output is         the stack to remember their state. This is the usual way that  PCRE  is
816         zero  if PCRE was compiled to use blocks of data on the heap instead of         compiled. The output is zero if PCRE was compiled to use blocks of data
817         recursive  function  calls.  In  this   case,   pcre_stack_malloc   and         on the  heap  instead  of  recursive  function  calls.  In  this  case,
818         pcre_stack_free  are  called  to manage memory blocks on the heap, thus         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory
819         avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
820    
821    
822  COMPILING A PATTERN  COMPILING A PATTERN
# Line 531  COMPILING A PATTERN Line 825  COMPILING A PATTERN
825              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
826              const unsigned char *tableptr);              const unsigned char *tableptr);
827    
828           pcre *pcre_compile2(const char *pattern, int options,
829                int *errorcodeptr,
830                const char **errptr, int *erroffset,
831                const unsigned char *tableptr);
832    
833         The function pcre_compile() is called to  compile  a  pattern  into  an         Either of the functions pcre_compile() or pcre_compile2() can be called
834         internal  form.  The pattern is a C string terminated by a binary zero,         to compile a pattern into an internal form. The only difference between
835         and is passed in the argument pattern. A pointer to a single  block  of         the two interfaces is that pcre_compile2() has an additional  argument,
836         memory  that is obtained via pcre_malloc is returned. This contains the         errorcodeptr, via which a numerical error code can be returned.
837         compiled code and related data.  The  pcre  type  is  defined  for  the  
838         returned  block;  this  is a typedef for a structure whose contents are         The pattern is a C string terminated by a binary zero, and is passed in
839         not externally defined. It is up to the caller to free the memory  when         the pattern argument. A pointer to a single block  of  memory  that  is
840         it is no longer required.         obtained  via  pcre_malloc is returned. This contains the compiled code
841           and related data. The pcre type is defined for the returned block; this
842           is a typedef for a structure whose contents are not externally defined.
843           It is up to the caller  to  free  the  memory  when  it  is  no  longer
844           required.
845    
846         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although  the compiled code of a PCRE regex is relocatable, that is, it
847         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
848         fully relocatable, because it contains a copy of the tableptr argument,         fully  relocatable, because it may contain a copy of the tableptr argu-
849         which is an address (see below).         ment, which is an address (see below).
850    
851         The options argument contains independent bits that affect the compila-         The options argument contains independent bits that affect the compila-
852         tion.  It  should  be  zero  if  no  options  are required. Some of the         tion.  It  should  be  zero  if  no options are required. The available
853         options, in particular, those that are compatible with Perl,  can  also         options are described below. Some of them, in  particular,  those  that
854         be  set and unset from within the pattern (see the detailed description         are  compatible  with  Perl,  can also be set and unset from within the
855         of regular expressions in the  pcrepattern  documentation).  For  these         pattern (see the detailed description  in  the  pcrepattern  documenta-
856         options,  the  contents of the options argument specifies their initial         tion).  For  these options, the contents of the options argument speci-
857         settings at the start of compilation and execution.  The  PCRE_ANCHORED         fies their initial settings at the start of compilation and  execution.
858         option can be set at the time of matching as well as at compile time.         The  PCRE_ANCHORED option can be set at the time of matching as well as
859           at compile time.
860    
861         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
862         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
863         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
864         sage. The offset from the start of the pattern to the  character  where         sage.  The  offset from the start of the pattern to the character where
865         the  error  was  discovered  is  placed  in  the variable pointed to by         the error was discovered is  placed  in  the  variable  pointed  to  by
866         erroffset, which must not be NULL. If it  is,  an  immediate  error  is         erroffset,  which  must  not  be  NULL. If it is, an immediate error is
867         given.         given.
868    
869           If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
870           codeptr  argument is not NULL, a non-zero error code number is returned
871           via this argument in the event of an error. This is in addition to  the
872           textual error message. Error codes and messages are listed below.
873    
874         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
875         character tables which are built when it is compiled, using the default         character tables that are  built  when  PCRE  is  compiled,  using  the
876         C  locale.  Otherwise,  tableptr  must  be  the  result  of  a  call to         default  C  locale.  Otherwise, tableptr must be an address that is the
877         pcre_maketables(). See the section on locale support below.         result of a call to pcre_maketables(). This value is  stored  with  the
878           compiled  pattern,  and used again by pcre_exec(), unless another table
879           pointer is passed to it. For more discussion, see the section on locale
880           support below.
881    
882         This code fragment shows a typical straightforward  call  to  pcre_com-         This  code  fragment  shows a typical straightforward call to pcre_com-
883         pile():         pile():
884    
885           pcre *re;           pcre *re;
# Line 581  COMPILING A PATTERN Line 892  COMPILING A PATTERN
892             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
893             NULL);            /* use default character tables */             NULL);            /* use default character tables */
894    
895         The following option bits are defined:         The following names for option bits are defined in  the  pcre.h  header
896           file:
897    
898           PCRE_ANCHORED           PCRE_ANCHORED
899    
900         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
901         is constrained to match only at the first matching point in the  string         is constrained to match only at the first matching point in the  string
902         which is being searched (the "subject string"). This effect can also be         that  is being searched (the "subject string"). This effect can also be
903         achieved by appropriate constructs in the pattern itself, which is  the         achieved by appropriate constructs in the pattern itself, which is  the
904         only way to do it in Perl.         only way to do it in Perl.
905    
906             PCRE_AUTO_CALLOUT
907    
908           If this bit is set, pcre_compile() automatically inserts callout items,
909           all with number 255, before each pattern item. For  discussion  of  the
910           callout facility, see the pcrecallout documentation.
911    
912           PCRE_CASELESS           PCRE_CASELESS
913    
914         If  this  bit is set, letters in the pattern match both upper and lower         If  this  bit is set, letters in the pattern match both upper and lower
915         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
916         changed within a pattern by a (?i) option setting.         changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
917           always understands the concept of case for characters whose values  are
918           less  than 128, so caseless matching is always possible. For characters
919           with higher values, the concept of case is supported if  PCRE  is  com-
920           piled  with Unicode property support, but not otherwise. If you want to
921           use caseless matching for characters 128 and  above,  you  must  ensure
922           that  PCRE  is  compiled  with Unicode property support as well as with
923           UTF-8 support.
924    
925           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
926    
927         If  this bit is set, a dollar metacharacter in the pattern matches only         If this bit is set, a dollar metacharacter in the pattern matches  only
928         at the end of the subject string. Without this option,  a  dollar  also         at  the  end  of the subject string. Without this option, a dollar also
929         matches  immediately before the final character if it is a newline (but         matches immediately before the final character if it is a newline  (but
930         not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is         not  before  any  other  newlines).  The  PCRE_DOLLAR_ENDONLY option is
931         ignored if PCRE_MULTILINE is set. There is no equivalent to this option         ignored if PCRE_MULTILINE is set. There is no equivalent to this option
932         in Perl, and no way to set it within a pattern.         in Perl, and no way to set it within a pattern.
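
       The rule can be restated as a small predicate. The following  C  sketch
       is  only  a  model  of  the  behaviour described above, not PCRE code. It
       assumes that PCRE_MULTILINE is not set and that the  newline  character
       is  '\n';  the function name dollar_matches_at and its dollar_endonly
       flag are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Model only: returns 1 if a dollar metacharacter could match at offset pos
   of a subject of the given length, and 0 otherwise. dollar_endonly stands
   for the PCRE_DOLLAR_ENDONLY setting. Assumes PCRE_MULTILINE is not set
   and that newline is '\n'. */
int dollar_matches_at(const char *subject, size_t length, size_t pos,
                      int dollar_endonly)
{
    assert(subject != NULL && pos <= length);

    /* The very end of the subject always qualifies. */
    if (pos == length)
        return 1;

    /* Without the option, the position immediately before a final
       newline also qualifies. */
    if (!dollar_endonly && pos + 1 == length && subject[pos] == '\n')
        return 1;

    return 0;
}
```

       For the subject "abc\n", the model accepts offset 3 only when  the  flag
       is clear, and accepts offset 4 in both cases.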
933    
934           PCRE_DOTALL           PCRE_DOTALL
935    
936         If this bit is set, a dot metacharacter in the pattern matches all char-         If this bit is set, a dot metacharacter in the pattern matches all char-
937         acters,  including  newlines.  Without  it, newlines are excluded. This         acters, including newlines. Without it,  newlines  are  excluded.  This
938         option is equivalent to Perl's /s option, and it can be changed  within         option  is equivalent to Perl's /s option, and it can be changed within
939         a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]         a pattern by a (?s) option setting.  A  negative  class  such  as  [^a]
940         always matches a newline character, independent of the setting of  this         always  matches a newline character, independent of the setting of this
941         option.         option.
942    
943           PCRE_EXTENDED           PCRE_EXTENDED
944    
945         If  this  bit  is  set,  whitespace  data characters in the pattern are         If this bit is set, whitespace  data  characters  in  the  pattern  are
946         totally ignored except  when  escaped  or  inside  a  character  class.         totally ignored except when escaped or inside a character class. White-
947         Whitespace  does  not  include the VT character (code 11). In addition,         space does not include the VT character (code 11). In addition, charac-
948         characters between an unescaped # outside a  character  class  and  the         ters between an unescaped # outside a character class and the next new-
949         next newline character, inclusive, are also ignored. This is equivalent         line character, inclusive, are also  ignored.  This  is  equivalent  to
950         to Perl's /x option, and it can be changed within a pattern by  a  (?x)         Perl's  /x  option,  and  it  can be changed within a pattern by a (?x)
951         option setting.         option setting.
952    
953         This  option  makes  it possible to include comments inside complicated         This option makes it possible to include  comments  inside  complicated
954         patterns.  Note, however, that this applies only  to  data  characters.         patterns.   Note,  however,  that this applies only to data characters.
955         Whitespace   characters  may  never  appear  within  special  character         Whitespace  characters  may  never  appear  within  special   character
956         sequences in a pattern, for  example  within  the  sequence  (?(  which         sequences  in  a  pattern,  for  example  within the sequence (?( which
957         introduces a conditional subpattern.         introduces a conditional subpattern.
958    
959           PCRE_EXTRA           PCRE_EXTRA
960    
961         This  option  was invented in order to turn on additional functionality         This option was invented in order to turn on  additional  functionality
962         of PCRE that is incompatible with Perl, but it  is  currently  of  very         of  PCRE  that  is  incompatible with Perl, but it is currently of very
963         little  use. When set, any backslash in a pattern that is followed by a         little use. When set, any backslash in a pattern that is followed by  a
964         letter that has no special meaning  causes  an  error,  thus  reserving         letter  that  has  no  special  meaning causes an error, thus reserving
965         these  combinations  for  future  expansion.  By default, as in Perl, a         these combinations for future expansion. By  default,  as  in  Perl,  a
966         backslash followed by a letter with no special meaning is treated as  a         backslash  followed by a letter with no special meaning is treated as a
967         literal.  There  are  at  present  no other features controlled by this         literal. There are at present no  other  features  controlled  by  this
968         option. It can also be set by a (?X) option setting within a pattern.         option. It can also be set by a (?X) option setting within a pattern.
969    
970             PCRE_FIRSTLINE
971    
972           If  this  option  is  set,  an  unanchored pattern is required to match
973           before or at the first newline character in the subject string,  though
974           the matched text may continue over the newline.
975    
976           PCRE_MULTILINE           PCRE_MULTILINE
977    
978         By default, PCRE treats the subject string as consisting  of  a  single         By  default,  PCRE  treats the subject string as consisting of a single
979         "line"  of  characters (even if it actually contains several newlines).         line of characters (even if it actually contains newlines). The  "start
980         The "start of line" metacharacter (^) matches only at the start of  the         of  line"  metacharacter  (^)  matches only at the start of the string,
981         string,  while  the "end of line" metacharacter ($) matches only at the         while the "end of line" metacharacter ($) matches only at  the  end  of
982         end of the string, or before a terminating  newline  (unless  PCRE_DOL-         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
983         LAR_ENDONLY is set). This is the same as Perl.         is set). This is the same as Perl.
984    
985         When  PCRE_MULTILINE  is  set,  the "start of line" and "end of line"         When PCRE_MULTILINE is set, the "start of  line"  and  "end  of  line"
986         constructs match immediately following or immediately before  any  new-         constructs  match  immediately following or immediately before any new-
987         line  in the subject string, respectively, as well as at the very start         line in the subject string, respectively, as well as at the very  start
988         and end. This is equivalent to Perl's /m option, and it can be  changed         and  end. This is equivalent to Perl's /m option, and it can be changed
989         within a pattern by a (?m) option setting. If there are no "\n" charac-         within a pattern by a (?m) option setting. If there are no "\n" charac-
990         ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,         ters  in  a  subject  string, or no occurrences of ^ or $ in a pattern,
991         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
992    
993           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
994    
995         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
996         theses in the pattern. Any opening parenthesis that is not followed  by         theses  in the pattern. Any opening parenthesis that is not followed by
997         ?  behaves as if it were followed by ?: but named parentheses can still         ? behaves as if it were followed by ?: but named parentheses can  still
998         be used for capturing (and they acquire  numbers  in  the  usual  way).         be  used  for  capturing  (and  they acquire numbers in the usual way).
999         There is no equivalent of this option in Perl.         There is no equivalent of this option in Perl.
1000    
1001           PCRE_UNGREEDY           PCRE_UNGREEDY
1002    
1003         This  option  inverts  the "greediness" of the quantifiers so that they         This option inverts the "greediness" of the quantifiers  so  that  they
1004         are not greedy by default, but become greedy if followed by "?". It  is         are  not greedy by default, but become greedy if followed by "?". It is
1005         not  compatible  with Perl. It can also be set by a (?U) option setting         not compatible with Perl. It can also be set by a (?U)  option  setting
1006         within the pattern.         within the pattern.
1007    
1008           PCRE_UTF8           PCRE_UTF8
1009    
1010         This option causes PCRE to regard both the pattern and the  subject  as         This  option  causes PCRE to regard both the pattern and the subject as
1011         strings  of  UTF-8 characters instead of single-byte character strings.         strings of UTF-8 characters instead of single-byte  character  strings.
1012         However, it is available only if PCRE has been built to  include  UTF-8         However,  it is available only when PCRE is built to include UTF-8 sup-
1013         support.  If  not, the use of this option provokes an error. Details of         port. If not, the use of this option provokes an error. Details of  how
1014         how this option changes the behaviour of PCRE are given in the  section         this  option  changes the behaviour of PCRE are given in the section on
1015         on UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
1016    
1017           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1018    
1019         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1020         automatically checked. If an invalid UTF-8 sequence of bytes is  found,         automatically  checked. If an invalid UTF-8 sequence of bytes is found,
1021         pcre_compile()  returns an error. If you already know that your pattern         pcre_compile() returns an error. If you already know that your  pattern
1022         is valid, and you want to skip this check for performance reasons,  you         is  valid, and you want to skip this check for performance reasons, you
1023         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of         can set the PCRE_NO_UTF8_CHECK option. When it is set,  the  effect  of
1024         passing an invalid UTF-8 string as a pattern is undefined. It may cause         passing an invalid UTF-8 string as a pattern is undefined. It may cause
1025         your  program  to  crash.  Note that there is a similar option for sup-         your program to crash.  Note that this option can  also  be  passed  to
1026         pressing the checking of subject strings passed to pcre_exec().         pcre_exec()  and pcre_dfa_exec(), to suppress the UTF-8 validity check-
1027           ing of subject strings.
1028    
1029    
1030    COMPILATION ERROR CODES
1031    
1032           The following table lists the error  codes  that  may  be  returned  by
1033           pcre_compile2(),  along with the error messages that may be returned by
1034           both compiling functions.
1035    
1036              0  no error
1037              1  \ at end of pattern
1038              2  \c at end of pattern
1039              3  unrecognized character follows \
1040              4  numbers out of order in {} quantifier
1041              5  number too big in {} quantifier
1042              6  missing terminating ] for character class
1043              7  invalid escape sequence in character class
1044              8  range out of order in character class
1045              9  nothing to repeat
1046             10  operand of unlimited repeat could match the empty string
1047             11  internal error: unexpected repeat
1048             12  unrecognized character after (?
1049             13  POSIX named classes are supported only within a class
1050             14  missing )
1051             15  reference to non-existent subpattern
1052             16  erroffset passed as NULL
1053             17  unknown option bit(s) set
1054             18  missing ) after comment
1055             19  parentheses nested too deeply
1056             20  regular expression too large
1057             21  failed to get memory
1058             22  unmatched parentheses
1059             23  internal error: code overflow
1060             24  unrecognized character after (?<
1061             25  lookbehind assertion is not fixed length
1062             26  malformed number after (?(
1063             27  conditional group contains more than two branches
1064             28  assertion expected after (?(
1065             29  (?R or (?digits must be followed by )
1066             30  unknown POSIX class name
1067             31  POSIX collating elements are not supported
1068             32  this version of PCRE is not compiled with PCRE_UTF8 support
1069             33  spare error
1070             34  character value in \x{...} sequence is too large
1071             35  invalid condition (?(0)
1072             36  \C not allowed in lookbehind assertion
1073             37  PCRE does not support \L, \l, \N, \U, or \u
1074             38  number after (?C is > 255
1075             39  closing ) for (?C expected
1076             40  recursive call could loop indefinitely
1077             41  unrecognized character after (?P
1078             42  syntax error after (?P
1079             43  two named groups have the same name
1080             44  invalid UTF-8 string
1081             45  support for \P, \p, and \X has not been compiled
1082             46  malformed \P or \p sequence
1083             47  unknown property name after \P or \p
1084    
1085    
1086  STUDYING A PATTERN  STUDYING A PATTERN
1087    
1088         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1089              const char **errptr);              const char **errptr);
1090    
1091         When a pattern is going to be used several times, it is worth  spending         If a compiled pattern is going to be used several times,  it  is  worth
1092         more  time  analyzing it in order to speed up the time taken for match-         spending more time analyzing it in order to speed up the time taken for
1093         ing. The function pcre_study() takes a pointer to a compiled pattern as         matching. The function pcre_study() takes a pointer to a compiled  pat-
1094         its first argument. If studing the pattern produces additional informa-         tern as its first argument. If studying the pattern produces additional
1095         tion that will help speed up matching, pcre_study() returns  a  pointer         information that will help speed up matching,  pcre_study()  returns  a
1096         to  a  pcre_extra  block,  in  which the study_data field points to the         pointer  to a pcre_extra block, in which the study_data field points to
1097         results of the study.         the results of the study.
1098    
1099         The returned value from  a  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1100         pcre_exec().  However,  the pcre_extra block also contains other fields         pcre_exec().  However,  a  pcre_extra  block also contains other fields
1101         that can be set by the caller before the block  is  passed;  these  are         that can be set by the caller before the block  is  passed;  these  are
1102         described  below.  If  studying  the pattern does not produce any addi-         described below in the section on matching a pattern.
        tional information, pcre_study() returns NULL. In that circumstance, if  
        the  calling  program  wants  to  pass  some  of  the  other  fields to  
        pcre_exec(), it must set up its own pcre_extra block.  
1103    
1104         The second argument contains option bits. At present,  no  options  are         If  studying  the  pattern does not produce any additional information,
1105         defined for pcre_study(), and this argument should always be zero.         pcre_study() returns NULL. In that circumstance, if the calling program
1106           wants  to  pass  any of the other fields to pcre_exec(), it must set up
1107           its own pcre_extra block.
1108    
1109           The second argument of pcre_study() contains option bits.  At  present,
1110           no options are defined, and this argument should always be zero.
1111    
1112         The  third argument for pcre_study() is a pointer for an error message.         The  third argument for pcre_study() is a pointer for an error message.
1113         If studying succeeds (even if no data is  returned),  the  variable  it         If studying succeeds (even if no data is  returned),  the  variable  it
# Line 736  STUDYING A PATTERN Line 1125  STUDYING A PATTERN
1125    
1126         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1127         that do not have a single fixed starting character. A bitmap of  possi-         that do not have a single fixed starting character. A bitmap of  possi-
1128         ble starting characters is created.         ble starting bytes is created.
1129    
1130    
1131  LOCALE SUPPORT  LOCALE SUPPORT
1132    
1133         PCRE  handles  caseless matching, and determines whether characters are         PCRE  handles  caseless matching, and determines whether characters are
1134         letters, digits, or whatever, by reference to a  set  of  tables.  When         letters, digits, or whatever, by reference to a set of tables,  indexed
1135         running  in UTF-8 mode, this applies only to characters with codes less         by  character  value.  When running in UTF-8 mode, this applies only to
1136         than 256. The library contains a default set of tables that is  created         characters with codes less than 128. Higher-valued  codes  never  match
1137         in  the  default  C locale when PCRE is compiled. This is used when the         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built
1138         final argument of pcre_compile() is NULL, and is  sufficient  for  many         with Unicode character property support.
1139         applications.  
1140           An internal set of tables is created in the default C locale when  PCRE
1141         An alternative set of tables can, however, be supplied. Such tables are         is  built.  This  is  used when the final argument of pcre_compile() is
1142         built by calling the pcre_maketables() function,  which  has  no  argu-         NULL, and is sufficient for many applications. An  alternative  set  of
1143         ments,  in  the  relevant  locale.  The  result  can  then be passed to         tables  can,  however, be supplied. These may be created in a different
1144         pcre_compile() as often as necessary. For example,  to  build  and  use         locale from the default. As more and more applications change to  using
1145         tables that are appropriate for the French locale (where accented char-         Unicode, the need for this locale support is expected to die away.
1146         acters with codes greater than 128 are treated as letters), the follow-  
1147         ing code could be used:         External  tables  are  built by calling the pcre_maketables() function,
1148           which has no arguments, in the relevant locale. The result can then  be
1149           passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1150           example, to build and use tables that are appropriate  for  the  French
1151           locale  (where  accented  characters  with  values greater than 128 are
1152           treated as letters), the following code could be used:
1153    
1154           setlocale(LC_CTYPE, "fr");           setlocale(LC_CTYPE, "fr_FR");
1155           tables = pcre_maketables();           tables = pcre_maketables();
1156           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1157    
1158         The  tables  are  built in memory that is obtained via pcre_malloc. The         When pcre_maketables() runs, the tables are built  in  memory  that  is
1159         pointer that is passed to pcre_compile is saved with the compiled  pat-         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1160         tern, and the same tables are used via this pointer by pcre_study() and         that the memory containing the tables remains available for as long  as
1161         pcre_exec(). Thus, for any single pattern,  compilation,  studying  and         it is needed.
1162         matching  all  happen in the same locale, but different patterns can be  
1163         compiled in different locales. It is  the  caller's  responsibility  to         The pointer that is passed to pcre_compile() is saved with the compiled
1164         ensure  that  the memory containing the tables remains available for as         pattern, and the same tables are used via this pointer by  pcre_study()
1165         long as it is needed.         and normally also by pcre_exec(). Thus, by default, for any single pat-
1166           tern, compilation, studying and matching all happen in the same locale,
1167           but different patterns can be compiled in different locales.
1168    
1169           It  is  possible to pass a table pointer or NULL (indicating the use of
1170           the internal tables) to pcre_exec(). Although  not  intended  for  this
1171           purpose,  this facility could be used to match a pattern in a different
1172           locale from the one in which it was compiled. Passing table pointers at
1173           run time is discussed below in the section on matching a pattern.
1174    
1175    
1176  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
# Line 776  INFORMATION ABOUT A PATTERN Line 1178  INFORMATION ABOUT A PATTERN
1178         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1179              int what, void *where);              int what, void *where);
1180    
1181         The pcre_fullinfo() function returns information about a compiled  pat-         The  pcre_fullinfo() function returns information about a compiled pat-
1182         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1183         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1184    
1185         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled
1186         pattern.  The second argument is the result of pcre_study(), or NULL if         pattern. The second argument is the result of pcre_study(), or NULL  if
1187         the pattern was not studied. The third argument specifies  which  piece         the  pattern  was not studied. The third argument specifies which piece
1188         of  information  is required, and the fourth argument is a pointer to a         of information is required, and the fourth argument is a pointer  to  a
1189         variable to receive the data. The yield of the  function  is  zero  for         variable  to  receive  the  data. The yield of the function is zero for
1190         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1191    
1192           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 792  INFORMATION ABOUT A PATTERN Line 1194  INFORMATION ABOUT A PATTERN
1194           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1195           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1196    
1197         Here  is a typical call of pcre_fullinfo(), to obtain the length of the         The "magic number" is placed at the start of each compiled  pattern  as
1198         compiled pattern:         a  simple  check against passing an arbitrary memory pointer. Here is a
1199           typical call of pcre_fullinfo(), to obtain the length of  the  compiled
1200           pattern:
1201    
1202           int rc;           int rc;
1203           unsigned long int length;           unsigned long int length;
# Line 803  INFORMATION ABOUT A PATTERN Line 1207  INFORMATION ABOUT A PATTERN
1207             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1208             &length);         /* where to put the data */             &length);         /* where to put the data */
1209    
1210         The possible values for the third argument are defined in  pcre.h,  and         The  possible  values for the third argument are defined in pcre.h, and
1211         are as follows:         are as follows:
1212    
1213           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1214    
1215         Return  the  number  of  the highest back reference in the pattern. The         Return the number of the highest back reference  in  the  pattern.  The
1216         fourth argument should point to an int variable. Zero  is  returned  if         fourth  argument  should  point to an int variable. Zero is returned if
1217         there are no back references.         there are no back references.
1218    
1219           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1220    
1221         Return  the  number of capturing subpatterns in the pattern. The fourth         Return the number of capturing subpatterns in the pattern.  The  fourth
1222         argument should point to an int variable.         argument should point to an int variable.
1223    
1224             PCRE_INFO_DEFAULT_TABLES
1225    
1226           Return  a pointer to the internal default character tables within PCRE.
1227           The fourth argument should point to an unsigned char *  variable.  This
1228           information call is provided for internal use by the pcre_study() func-
1229           tion. External callers can cause PCRE to use  its  internal  tables  by
1230           passing a NULL table pointer.
1231    
1232           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1233    
1234         Return information about the first byte of any matched  string,  for  a         Return  information  about  the first byte of any matched string, for a
1235         non-anchored    pattern.    (This    option    used    to   be   called         non-anchored   pattern.   (This    option    used    to    be    called
1236         PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards         PCRE_INFO_FIRSTCHAR;  the  old  name  is still recognized for backwards
1237         compatibility.)         compatibility.)
1238    
1239         If  there  is  a  fixed  first  byte,  e.g.  from  a  pattern  such  as         If there is a fixed first byte, for example, from  a  pattern  such  as
1240         (cat|cow|coyote), it is returned in the integer pointed  to  by  where.         (cat|cow|coyote),  it  is  returned in the integer pointed to by where.
1241         Otherwise, if either         Otherwise, if either
1242    
1243         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1244         branch starts with "^", or         branch starts with "^", or
1245    
1246         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1247         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1248    
1249         -1  is  returned, indicating that the pattern matches only at the start         -1 is returned, indicating that the pattern matches only at  the  start
1250         of a subject string or after any newline within the  string.  Otherwise         of  a  subject string or after any newline within the string. Otherwise
1251         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
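
       The three kinds of PCRE_INFO_FIRSTBYTE result described above (a fixed
       byte value, -1, or -2) can be told apart with a small helper. The
       following is a minimal sketch only: the enum and the function name
       classify_first_byte() are hypothetical and are not part of PCRE.

```c
#include <assert.h>

/* Hypothetical names for illustration; none of these are part of PCRE. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value is the fixed first byte (0..255)        */
    FIRST_BYTE_LINE_START,  /* -1: matches only at start or after a newline  */
    FIRST_BYTE_NONE         /* -2: no fixed first byte, or pattern anchored  */
};

/* Classify the integer that PCRE_INFO_FIRSTBYTE stores via "where". */
enum first_byte_kind classify_first_byte(int value)
{
    assert(value >= -2 && value <= 255);
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

       A caller would pass the int filled in by pcre_fullinfo() to this
       helper and branch on the result, rather than comparing against the
       bare numbers -1 and -2 at each call site.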
1252    
1253           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1254    
1255         If  the pattern was studied, and this resulted in the construction of a         If the pattern was studied, and this resulted in the construction of  a
1256         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1257         matching  string, a pointer to the table is returned. Otherwise NULL is         matching string, a pointer to the table is returned. Otherwise NULL  is
1258         returned. The fourth argument should point to an unsigned char *  vari-         returned.  The fourth argument should point to an unsigned char * vari-
1259         able.         able.
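
       A 256-bit table occupies 32 bytes, one bit per possible byte value.
       The sketch below shows one way such a table could be tested. It
       assumes that byte value c is represented by bit (c & 7) of
       table[c / 8]; the text above does not state the bit layout, so treat
       that as an assumption. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: byte value c maps to bit (c & 7) of table[c / 8].
   The documentation above only says the table holds 256 bits. */

/* Return non-zero if byte c is marked as a possible first byte. */
int byte_can_start_match(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Hypothetical helper for building a 32-byte table by hand
   when experimenting; PCRE itself builds the real table. */
void mark_start_byte(unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```

       In real use the table pointer would come from a PCRE_INFO_FIRSTTABLE
       call, and a NULL result must be checked before the table is read.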
1260    
1261           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1262    
1263         Return  the  value of the rightmost literal byte that must exist in any         Return the value of the rightmost literal byte that must exist  in  any
1264         matched string, other than at its  start,  if  such  a  byte  has  been         matched  string,  other  than  at  its  start,  if such a byte has been
1265         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
1266         is no such byte, -1 is returned. For anchored patterns, a last  literal         is  no such byte, -1 is returned. For anchored patterns, a last literal
1267         byte  is  recorded only if it follows something of variable length. For         byte is recorded only if it follows something of variable  length.  For
1268         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1269         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
1270    
# Line 860  INFORMATION ABOUT A PATTERN Line 1272  INFORMATION ABOUT A PATTERN
1272           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
1273           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
1274    
1275         PCRE  supports the use of named as well as numbered capturing parenthe-         PCRE supports the use of named as well as numbered capturing  parenthe-
1276         ses. The names are just an additional way of identifying the  parenthe-         ses.  The names are just an additional way of identifying the parenthe-
1277         ses,  which still acquire a number. A caller that wants to extract data         ses,  which  still  acquire  numbers.  A  convenience  function  called
1278         from a named subpattern must convert the name to a number in  order  to         pcre_get_named_substring()  is  provided  for  extracting an individual
1279         access  the  correct  pointers  in  the  output  vector (described with         captured substring by name. It is also possible  to  extract  the  data
1280         pcre_exec() below). In order to do this, it must first use these  three         directly,  by  first converting the name to a number in order to access
1281         values to obtain the name-to-number mapping table for the pattern.         the correct pointers in the output vector (described  with  pcre_exec()
1282           below).  To  do the conversion, you need to use the name-to-number map,
1283           which is described by these three values.
1284    
1285         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1286         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1287         of  each  entry;  both  of  these  return  an int value. The entry size         of each entry; both of these  return  an  int  value.  The  entry  size
1288         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns
1289         a  pointer  to  the  first  entry of the table (a pointer to char). The         a pointer to the first entry of the table  (a  pointer  to  char).  The
1290         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1291         sis,  most  significant byte first. The rest of the entry is the corre-         sis, most significant byte first. The rest of the entry is  the  corre-
1292         sponding name, zero terminated. The names are  in  alphabetical  order.         sponding  name,  zero  terminated. The names are in alphabetical order.
1293         For  example,  consider  the following pattern (assume PCRE_EXTENDED is         For example, consider the following pattern  (assume  PCRE_EXTENDED  is
1294         set, so white space - including newlines - is ignored):         set, so white space - including newlines - is ignored):
1295    
1296           (?P<date> (?P<year>(\d\d)?\d\d) -           (?P<date> (?P<year>(\d\d)?\d\d) -
1297           (?P<month>\d\d) - (?P<day>\d\d) )           (?P<month>\d\d) - (?P<day>\d\d) )
1298    
1299         There are four named subpatterns, so the table has  four  entries,  and         There  are  four  named subpatterns, so the table has four entries, and
1300         each  entry  in the table is eight bytes long. The table is as follows,         each entry in the table is eight bytes long. The table is  as  follows,
1301         with non-printing bytes shown in hex, and undefined bytes shown as ??:         with non-printing bytes shown in hexadecimal, and undefined bytes shown
1302           as ??:
1303    
1304           00 01 d  a  t  e  00 ??           00 01 d  a  t  e  00 ??
1305           00 05 d  a  y  00 ?? ??           00 05 d  a  y  00 ?? ??
1306           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1307           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1308    
1309         When writing code to extract data from named subpatterns, remember that         When writing code to extract data  from  named  subpatterns  using  the
1310         the length of each entry may be different for each compiled pattern.         name-to-number map, remember that the length of each entry is likely to
1311           be different for each compiled pattern.
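
       The name-to-number map described above can be searched with a simple
       loop over the fixed-size entries. The following is a minimal sketch:
       lookup_group_number() and example_table are hypothetical names, and
       the table reproduces the four-entry example from the text, with the
       undefined (??) bytes filled with zero.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE.
   table:      first entry, as returned for PCRE_INFO_NAMETABLE
   count:      number of entries (PCRE_INFO_NAMECOUNT)
   entry_size: bytes per entry (PCRE_INFO_NAMEENTRYSIZE)
   Returns the parenthesis number for name, or -1 if it is not found. */
int lookup_group_number(const unsigned char *table, int count,
                        int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* Bytes 0 and 1 hold the number, most significant byte first;
           the zero-terminated name starts at byte 2. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The example table from the text: four entries of eight bytes each. */
const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

       With this table, lookup_group_number(example_table, 4, 8, "day")
       yields 5, and an unknown name yields -1. A linear scan is used for
       clarity; because the names are in alphabetical order, a binary search
       would also work.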
1312    
1313           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1314    
1315         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1316         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1317         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1318         by any top-level option settings within the pattern itself.         by any top-level option settings within the pattern itself.
1319    
1320         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1321         alternatives begin with one of the following:         alternatives begin with one of the following:
1322    
1323           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 915  INFORMATION ABOUT A PATTERN Line 1331  INFORMATION ABOUT A PATTERN
1331    
1332           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1333    
1334         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1335         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1336         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1337         size_t variable.         size_t variable.
1338    
1339           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1340    
1341         Returns  the  size of the data block pointed to by the study_data field         Return the size of the data block pointed to by the study_data field in
1342         in a pcre_extra block. That is, it is the  value  that  was  passed  to         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1343         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1344         created by pcre_study(). The fourth argument should point to  a  size_t         created  by  pcre_study(). The fourth argument should point to a size_t
1345         variable.         variable.
1346    
1347    
# Line 933  OBSOLETE INFO FUNCTION Line 1349  OBSOLETE INFO FUNCTION
1349    
1350         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1351    
1352         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1353         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1354         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1355         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1356         lowing negative numbers:         lowing negative numbers:
1357    
1358           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1359           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1360    
1361         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1362         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1363         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1364    
1365         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1366         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1367         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1368    
1369    
1370  MATCHING A PATTERN  REFERENCE COUNTS
1371    
1372           int pcre_refcount(pcre *code, int adjust);
1373    
1374           The pcre_refcount() function is used to maintain a reference  count  in
1375           the data block that contains a compiled pattern. It is provided for the
1376           benefit of applications that  operate  in  an  object-oriented  manner,
1377           where different parts of the application may be using the same compiled
1378           pattern, but you want to free the block when they are all done.
1379    
1380           When a pattern is compiled, the reference count field is initialized to
1381           zero.   It is changed only by calling this function, whose action is to
1382           add the adjust value (which may be positive or  negative)  to  it.  The
1383           yield of the function is the new value. However, the value of the count
1384           is constrained to lie between 0 and 65535, inclusive. If the new  value
1385           is outside these limits, it is forced to the appropriate limit value.
1386    
1387           Except  when it is zero, the reference count is not correctly preserved
1388           if a pattern is compiled on one host and then  transferred  to  a  host
1389           whose byte-order is different. (This seems a highly unlikely scenario.)
1390    
1391    
1392    MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1393    
1394         int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
1395              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1396              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1397    
1398         The  function pcre_exec() is called to match a subject string against a         The function pcre_exec() is called to match a subject string against  a
1399         pre-compiled pattern, which is passed in the code argument. If the pat-         compiled  pattern, which is passed in the code argument. If the pattern
1400         tern  has been studied, the result of the study should be passed in the         has been studied, the result of the study should be passed in the extra
1401         extra argument.         argument.  This  function is the main matching facility of the library,
1402           and it operates in a Perl-like manner. For specialist use there is also
1403           an  alternative matching function, which is described below in the sec-
1404           tion about the pcre_dfa_exec() function.
1405    
1406           In most applications, the pattern will have been compiled (and  option-
1407           ally  studied)  in the same process that calls pcre_exec(). However, it
1408           is possible to save compiled patterns and study data, and then use them
1409           later  in  different processes, possibly even on different hosts. For a
1410           discussion about this, see the pcreprecompile documentation.
1411    
1412         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
1413    
# Line 973  MATCHING A PATTERN Line 1420  MATCHING A PATTERN
1420             11,             /* the length of the subject string */             11,             /* the length of the subject string */
1421             0,              /* start at offset 0 in the subject */             0,              /* start at offset 0 in the subject */
1422             0,              /* default options */             0,              /* default options */
1423             ovector,        /* vector for substring information */             ovector,        /* vector of integers for substring information */
1424             30);            /* number of elements in the vector */             30);            /* number of elements (NOT size in bytes) */
1425    
1426       Extra data for pcre_exec()
1427    
1428         If the extra argument is not NULL, it must point to a  pcre_extra  data         If the extra argument is not NULL, it must point to a  pcre_extra  data
1429         block.  The pcre_study() function returns such a block (when it doesn't         block.  The pcre_study() function returns such a block (when it doesn't
1430         return NULL), but you can also create one for yourself, and pass  addi-         return NULL), but you can also create one for yourself, and pass  addi-
1431         tional information in it. The fields in the block are as follows:         tional  information in it. The fields in a pcre_extra block are as fol-
1432           lows:
1433    
1434           unsigned long int flags;           unsigned long int flags;
1435           void *study_data;           void *study_data;
1436           unsigned long int match_limit;           unsigned long int match_limit;
1437           void *callout_data;           void *callout_data;
1438             const unsigned char *tables;
1439    
1440         The  flags  field  is a bitmap that specifies which of the other fields         The flags field is a bitmap that specifies which of  the  other  fields
1441         are set. The flag bits are:         are set. The flag bits are:
1442    
1443           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1444           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1445           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1446             PCRE_EXTRA_TABLES
1447    
1448         Other flag bits should be set to zero. The study_data field is  set  in         Other  flag  bits should be set to zero. The study_data field is set in
1449         the  pcre_extra  block  that is returned by pcre_study(), together with         the pcre_extra block that is returned by  pcre_study(),  together  with
1450         the appropriate flag bit. You should not set this yourself, but you can         the appropriate flag bit. You should not set this yourself, but you may
1451         add to the block by setting the other fields.         add to the block by setting the other fields  and  their  corresponding
1452           flag bits.
1453    
1454         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1455         a vast amount of resources when running patterns that are not going  to         a vast amount of resources when running patterns that are not going  to
1456         match,  but  which  have  a very large number of possibilities in their         match,  but  which  have  a very large number of possibilities in their
1457         search trees. The classic  example  is  the  use  of  nested  unlimited         search trees. The classic  example  is  the  use  of  nested  unlimited
1458         repeats. Internally, PCRE uses a function called match() which it calls         repeats.
1459         repeatedly (sometimes recursively). The limit is imposed on the  number  
1460         of  times  this function is called during a match, which has the effect         Internally,  PCRE uses a function called match() which it calls repeat-
1461         of limiting the amount of recursion  and  backtracking  that  can  take         edly (sometimes recursively). The limit is imposed  on  the  number  of
1462         place.  For  patterns that are not anchored, the count starts from zero         times  this  function is called during a match, which has the effect of
1463         for each position in the subject string.         limiting the amount of recursion and backtracking that can take  place.
1464           For patterns that are not anchored, the count starts from zero for each
1465           position in the subject string.
1466    
1467         The default limit for the library can be set when PCRE  is  built;  the         The default limit for the library can be set when PCRE  is  built;  the
1468         default  default  is 10 million, which handles all but the most extreme         default  default  is 10 million, which handles all but the most extreme
# Line 1019  MATCHING A PATTERN Line 1474  MATCHING A PATTERN
1474         The  callout_data  field is used in conjunction with the "callout" fea-         The  callout_data  field is used in conjunction with the "callout" fea-
1475         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1476    
1477         The PCRE_ANCHORED option can be passed in the options  argument,  whose         The tables field  is  used  to  pass  a  character  tables  pointer  to
1478         unused  bits  must  be zero. This limits pcre_exec() to matching at the         pcre_exec();  this overrides the value that is stored with the compiled
1479         first matching position.  However,  if  a  pattern  was  compiled  with         pattern. A non-NULL value is stored with the compiled pattern  only  if
1480         PCRE_ANCHORED,  or turned out to be anchored by virtue of its contents,         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1481         it cannot be made unanchored at matching time.         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1482           PCRE's  internal  tables  to be used. This facility is helpful when re-
1483         When PCRE_UTF8 was set at compile time, the validity of the subject  as         using patterns that have been saved after compiling  with  an  external
1484         a  UTF-8  string is automatically checked, and the value of startoffset         set  of  tables,  because  the  external tables might be at a different
1485         is also checked to ensure that it points to the start of a UTF-8  char-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
1486         acter.  If  an  invalid  UTF-8  sequence of bytes is found, pcre_exec()         tion for a discussion of saving compiled patterns for later use.
1487         returns  the  error  PCRE_ERROR_BADUTF8.  If  startoffset  contains  an  
1488         invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.     Option bits for pcre_exec()
1489    
1490         If  you  already  know that your subject is valid, and you want to skip         The  unused  bits of the options argument for pcre_exec() must be zero.
1491         these   checks   for   performance   reasons,   you   can    set    the         The  only  bits  that  may  be  set  are  PCRE_ANCHORED,   PCRE_NOTBOL,
1492         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to         PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.
1493         do this for the second and subsequent calls to pcre_exec() if  you  are  
1494         making  repeated  calls  to  find  all  the matches in a single subject           PCRE_ANCHORED
        string. However, you should be  sure  that  the  value  of  startoffset  
        points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is  
        set, the effect of passing an invalid UTF-8 string as a subject,  or  a  
        value  of startoffset that does not point to the start of a UTF-8 char-  
        acter, is undefined. Your program may crash.  
1495    
1496         There are also three further options that can be set only  at  matching         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first
1497         time:         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or
1498           turned  out to be anchored by virtue of its contents, it cannot be made
1499           unanchored at matching time.
1500    
1501           PCRE_NOTBOL           PCRE_NOTBOL
1502    
1503         The  first  character  of the string is not the beginning of a line, so         This option specifies that the first character of the subject string is not
1504         the circumflex metacharacter should not match before it.  Setting  this         the  beginning  of  a  line, so the circumflex metacharacter should not
1505         without  PCRE_MULTILINE  (at  compile  time) causes circumflex never to         match before it. Setting this without PCRE_MULTILINE (at compile  time)
1506         match.         causes  circumflex  never to match. This option affects only the behav-
1507           iour of the circumflex metacharacter. It does not affect \A.
1508    
1509           PCRE_NOTEOL           PCRE_NOTEOL
1510    
1511         The end of the string is not the end of a line, so the dollar metachar-         This option specifies that the end of the subject string is not the end
1512         acter  should  not  match  it  nor (except in multiline mode) a newline         of  a line, so the dollar metacharacter should not match it nor (except
1513         immediately before it. Setting this without PCRE_MULTILINE (at  compile         in multiline mode) a newline immediately before it. Setting this  with-
1514         time) causes dollar never to match.         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1515           option affects only the behaviour of the dollar metacharacter. It  does
1516           not affect \Z or \z.
1517    
1518           PCRE_NOTEMPTY           PCRE_NOTEMPTY
1519    
# Line 1078  MATCHING A PATTERN Line 1533  MATCHING A PATTERN
1533         cial case of a pattern match of the empty  string  within  its  split()         cial case of a pattern match of the empty  string  within  its  split()
1534         function,  and  when  using  the /g modifier. It is possible to emulate         function,  and  when  using  the /g modifier. It is possible to emulate
1535         Perl's behaviour after matching a null string by first trying the match         Perl's behaviour after matching a null string by first trying the match
1536         again at the same offset with PCRE_NOTEMPTY set, and then if that fails         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1537         by advancing the starting offset (see below)  and  trying  an  ordinary         if that fails by advancing the starting offset (see below)  and  trying
1538         match again.         an ordinary match again. There is some code that demonstrates how to do
1539           this in the pcredemo.c sample program.
1540         The  subject string is passed to pcre_exec() as a pointer in subject, a  
1541         length in length, and a starting byte offset in startoffset. Unlike the           PCRE_NO_UTF8_CHECK
1542         pattern  string,  the  subject  may contain binary zero bytes. When the  
1543         starting offset is zero, the search for a match starts at the beginning         When PCRE_UTF8 is set at compile time, the validity of the subject as a
1544         of the subject, and this is by far the most common case.         UTF-8  string is automatically checked when pcre_exec() is subsequently
1545           called.  The value of startoffset is also checked  to  ensure  that  it
1546         If the pattern was compiled with the PCRE_UTF8 option, the subject must         points  to the start of a UTF-8 character. If an invalid UTF-8 sequence
1547         be a sequence of bytes that is a valid UTF-8 string, and  the  starting         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1548         offset  must point to the beginning of a UTF-8 character. If an invalid         startoffset  contains  an  invalid  value, PCRE_ERROR_BADUTF8_OFFSET is
1549         UTF-8 string or offset is passed, an error  (either  PCRE_ERROR_BADUTF8         returned.
1550         or   PCRE_ERROR_BADUTF8_OFFSET)   is   returned,   unless   the  option  
1551         PCRE_NO_UTF8_CHECK is set,  in  which  case  PCRE's  behaviour  is  not         If you already know that your subject is valid, and you  want  to  skip
1552         defined.         these    checks    for   performance   reasons,   you   can   set   the
1553           PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
1554           do  this  for the second and subsequent calls to pcre_exec() if you are
1555           making repeated calls to find all  the  matches  in  a  single  subject
1556           string.  However,  you  should  be  sure  that the value of startoffset
1557           points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is
1558           set,  the  effect of passing an invalid UTF-8 string as a subject, or a
1559           value of startoffset that does not point to the start of a UTF-8  char-
1560           acter, is undefined. Your program may crash.
1561    
1562             PCRE_PARTIAL
1563    
1564           This  option  turns  on  the  partial  matching feature. If the subject
1565           string fails to match the pattern, but at some point during the  match-
1566           ing  process  the  end of the subject was reached (that is, the subject
1567           partially matches the pattern and the failure to  match  occurred  only
1568           because  there were not enough subject characters), pcre_exec() returns
1569           PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is
1570           used,  there  are restrictions on what may appear in the pattern. These
1571           are discussed in the pcrepartial documentation.
1572    
1573       The string to be matched by pcre_exec()
1574    
1575           The subject string is passed to pcre_exec() as a pointer in subject,  a
1576           length  in  length, and a starting byte offset in startoffset. In UTF-8
1577           mode, the byte offset must point to the start  of  a  UTF-8  character.
1578           Unlike  the  pattern string, the subject may contain binary zero bytes.
1579           When the starting offset is zero, the search for a match starts at  the
1580           beginning of the subject, and this is by far the most common case.
1581    
1582         A  non-zero  starting offset is useful when searching for another match         A  non-zero  starting offset is useful when searching for another match
1583         in the same subject by calling pcre_exec() again after a previous  suc-         in the same subject by calling pcre_exec() again after a previous  suc-
# Line 1111  MATCHING A PATTERN Line 1594  MATCHING A PATTERN
1594         the  remainder  of  the  subject,  namely  "issipi", it does not match,         the  remainder  of  the  subject,  namely  "issipi", it does not match,
1595         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
1596         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to  be  a  word  boundary. However, if pcre_exec() is passed the entire
1597         string again, but with startoffset  set  to  4,  it  finds  the  second         string again, but with startoffset set to 4, it finds the second occur-
1598         occurrence  of  "iss"  because  it  is able to look behind the starting         rence  of "iss" because it is able to look behind the starting point to
1599         point to discover that it is preceded by a letter.         discover that it is preceded by a letter.
1600    
1601         If a non-zero starting offset is passed when the pattern  is  anchored,         If a non-zero starting offset is passed when the pattern  is  anchored,
1602         one  attempt  to match at the given offset is tried. This can only suc-         one attempt to match at the given offset is made. This can only succeed
1603         ceed if the pattern does not require the match to be at  the  start  of         if the pattern does not require the match to be at  the  start  of  the
1604         the subject.         subject.
1605    
1606       How pcre_exec() returns captured substrings
1607    
1608         In  general, a pattern matches a certain portion of the subject, and in         In  general, a pattern matches a certain portion of the subject, and in
1609         addition, further substrings from the subject  may  be  picked  out  by         addition, further substrings from the subject  may  be  picked  out  by
# Line 1130  MATCHING A PATTERN Line 1615  MATCHING A PATTERN
1615    
1616         Captured  substrings are returned to the caller via a vector of integer         Captured  substrings are returned to the caller via a vector of integer
1617         offsets whose address is passed in ovector. The number of  elements  in         offsets whose address is passed in ovector. The number of  elements  in
1618         the vector is passed in ovecsize. The first two-thirds of the vector is         the  vector is passed in ovecsize, which must be a non-negative number.
1619         used to pass back captured substrings, each substring using a  pair  of         Note: this argument is NOT the size of ovector in bytes.
1620         integers.  The  remaining  third  of the vector is used as workspace by  
1621         pcre_exec() while matching capturing subpatterns, and is not  available         The first two-thirds of the vector is used to pass back  captured  sub-
1622         for  passing  back  information.  The  length passed in ovecsize should         strings,  each  substring using a pair of integers. The remaining third
1623         always be a multiple of three. If it is not, it is rounded down.         of the vector is used as workspace by pcre_exec() while  matching  cap-
1624           turing  subpatterns, and is not available for passing back information.
1625         When a match has been successful, information about captured substrings         The length passed in ovecsize should always be a multiple of three.  If
1626         is returned in pairs of integers, starting at the beginning of ovector,         it is not, it is rounded down.
1627         and continuing up to two-thirds of its length at the  most.  The  first  
1628           When  a  match  is successful, information about captured substrings is
1629           returned in pairs of integers, starting at the  beginning  of  ovector,
1630           and  continuing  up  to two-thirds of its length at the most. The first
1631         element of a pair is set to the offset of the first character in a sub-         element of a pair is set to the offset of the first character in a sub-
1632         string, and the second is set to the  offset  of  the  first  character         string,  and  the  second  is  set to the offset of the first character
1633         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-
1634         tor[1], identify the portion of  the  subject  string  matched  by  the         tor[1],  identify  the  portion  of  the  subject string matched by the
1635         entire  pattern.  The next pair is used for the first capturing subpat-         entire pattern. The next pair is used for the first  capturing  subpat-
1636         tern, and so on. The value returned by pcre_exec()  is  the  number  of         tern,  and  so  on.  The value returned by pcre_exec() is the number of
1637         pairs  that  have  been set. If there are no capturing subpatterns, the         pairs that have been set. If there are no  capturing  subpatterns,  the
1638         return value from a successful match is 1,  indicating  that  just  the         return  value  from  a  successful match is 1, indicating that just the
1639         first pair of offsets has been set.         first pair of offsets has been set.
1640    
1641         Some  convenience  functions  are  provided for extracting the captured         Some convenience functions are provided  for  extracting  the  captured
1642         substrings as separate strings. These are described  in  the  following         substrings  as  separate  strings. These are described in the following
1643         section.         section.
1644    
1645         It  is  possible  for  a capturing subpattern number n+1 to match some         It is possible for a capturing subpattern number  n+1  to  match  some
1646         part of the subject when subpattern n has not been  used  at  all.  For         part  of  the  subject  when subpattern n has not been used at all. For
1647         example, if the string "abc" is matched against the pattern (a|(z))(bc)         example, if the string "abc" is matched against the pattern (a|(z))(bc)
1648         subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both         subpatterns  1 and 3 are matched, but 2 is not. When this happens, both
1649         offset values corresponding to the unused subpattern are set to -1.         offset values corresponding to the unused subpattern are set to -1.
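
To make the offset pairs concrete, here is a minimal sketch that copies one captured substring out of the subject, treating a pair of -1 values as "not set". The function copy_pair() is a hypothetical helper written for this illustration; the library's own convenience functions should be preferred in real code. The argument rc is assumed to be the positive value returned by a successful pcre_exec() call.

```c
#include <stddef.h>
#include <string.h>

/* Copy the substring for pair n (0 is the whole match) into buf.
   Returns the substring length, -1 if pair n was not set by the match,
   or -2 if buf is too small. Hypothetical helper, not a PCRE function. */
static int copy_pair(const char *subject, const int *ovector, int rc,
                     int n, char *buf, size_t bufsize)
{
    int start, end;
    size_t len;

    if (n < 0 || n >= rc)
        return -1;                 /* beyond the pairs that were set */
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;                 /* subpattern took no part in the match */
    len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -2;                 /* no room for the string and its zero */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

For the example above, matching "abc" against (a|(z))(bc) would leave the offsets 0,3, 0,1, -1,-1, 1,3 in the vector with a return value of 4, so pairs 0, 1 and 3 give "abc", "a" and "bc", while pair 2 is reported as not set.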
1650    
1651         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1652         of the string that it matched that gets returned.         of the string that it matched that is returned.
1653    
1654         If the vector is too small to hold all the captured substrings,  it  is         If  the vector is too small to hold all the captured substring offsets,
1655         used as far as possible (up to two-thirds of its length), and the func-         it is used as far as possible (up to two-thirds of its length), and the
1656         tion returns a value of zero. In particular, if the  substring  offsets         function  returns a value of zero. In particular, if the substring off-
1657         are  not  of interest, pcre_exec() may be called with ovector passed as         sets are not of interest, pcre_exec() may be called with ovector passed
1658         NULL and ovecsize as zero. However, if the pattern contains back refer-         as  NULL  and  ovecsize  as zero. However, if the pattern contains back
1659         ences  and  the  ovector  isn't big enough to remember the related sub-         references and the ovector is not big enough to  remember  the  related
1660         strings, PCRE has to get additional memory  for  use  during  matching.         substrings,  PCRE has to get additional memory for use during matching.
1661         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1662    
1663         Note  that  pcre_info() can be used to find out how many capturing sub-         Note that pcre_info() can be used to find out how many  capturing  sub-
1664         patterns there are in a compiled pattern. The smallest size for ovector         patterns there are in a compiled pattern. The smallest size for ovector
1665         that  will  allow for n captured substrings, in addition to the offsets         that will allow for n captured substrings, in addition to  the  offsets
1666         of the substring matched by the whole pattern, is (n+1)*3.         of the substring matched by the whole pattern, is (n+1)*3.
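
The sizing rules can be written down as two small functions. This is a sketch of the arithmetic only; both names are hypothetical and neither function is part of PCRE.

```c
/* Smallest number of ovector elements that allows for capture_count
   captured substrings plus the whole match: (n+1)*3. */
static int ovector_elements_for(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that can be passed back in a vector of
   ovecsize elements. The size is rounded down to a multiple of three
   and only the first two-thirds hold pairs, which comes to ovecsize/3. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

With these rules, the 30-element vector used in the simple call shown earlier has room for the whole match plus nine captured substrings.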
1667    
1668         If pcre_exec() fails, it returns a negative number. The  following  are     Return values from pcre_exec()
1669    
1670           If  pcre_exec()  fails, it returns a negative number. The following are
1671         defined in the header file:         defined in the header file:
1672    
1673           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 1186  MATCHING A PATTERN Line 1676  MATCHING A PATTERN
1676    
1677           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
1678    
1679         Either  code  or  subject  was  passed as NULL, or ovector was NULL and         Either code or subject was passed as NULL,  or  ovector  was  NULL  and
1680         ovecsize was not zero.         ovecsize was not zero.
1681    
1682           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 1195  MATCHING A PATTERN Line 1685  MATCHING A PATTERN
1685    
1686           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1687    
1688         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE  stores a 4-byte "magic number" at the start of the compiled code,
1689         to  catch  the case when it is passed a junk pointer. This is the error         to catch the case when it is passed a junk pointer and to detect when a
1690         it gives when the magic number isn't present.         pattern that was compiled in an environment of one endianness is run in
1691           an environment with the other endianness. This is the error  that  PCRE
1692           gives when the magic number is not present.
1693    
1694           PCRE_ERROR_UNKNOWN_NODE   (-5)           PCRE_ERROR_UNKNOWN_NODE   (-5)
1695    
1696         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
1697         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled pattern. This error could be caused by a bug  in  PCRE  or  by
1698         overwriting of the compiled pattern.         overwriting of the compiled pattern.
1699    
1700           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1701    
1702         If a pattern contains back references, but the ovector that  is  passed         If  a  pattern contains back references, but the ovector that is passed
1703         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
1704         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE  gets  a  block of memory at the start of matching to use for this
1705         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose. If the call via pcre_malloc() fails, this error is given.  The
1706         memory is freed at the end of matching.         memory is automatically freed at the end of matching.
1707    
1708           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1709    
1710         This error is used by the pcre_copy_substring(),  pcre_get_substring(),         This  error is used by the pcre_copy_substring(), pcre_get_substring(),
1711         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
1712         returned by pcre_exec().         returned by pcre_exec().
1713    
1714           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
1715    
1716         The recursion and backtracking limit, as specified by  the  match_limit         The  recursion  and backtracking limit, as specified by the match_limit
1717         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field in a pcre_extra structure (or defaulted)  was  reached.  See  the
1718         description above.         description above.
1719    
1720           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
1721    
1722         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
1723         use  by  callout functions that want to yield a distinctive error code.         use by callout functions that want to yield a distinctive  error  code.
1724         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
1725    
1726           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
1727    
1728         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A  string  that contains an invalid UTF-8 byte sequence was passed as a
1729         subject.         subject.
1730    
1731           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
1732    
1733         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
1734         value of startoffset did not point to the beginning of a UTF-8  charac-         value  of startoffset did not point to the beginning of a UTF-8 charac-
1735         ter.         ter.
1736    
1737             PCRE_ERROR_PARTIAL        (-12)
1738    
1739           The subject string did not match, but it did match partially.  See  the
1740           pcrepartial documentation for details of partial matching.
1741    
1742             PCRE_ERROR_BADPARTIAL     (-13)
1743    
1744           The  PCRE_PARTIAL  option  was  used with a compiled pattern containing
1745           items that are not supported for partial matching. See the  pcrepartial
1746           documentation for details of partial matching.
1747    
1748             PCRE_ERROR_INTERNAL       (-14)
1749    
1750           An  unexpected  internal error has occurred. This error could be caused
1751           by a bug in PCRE or by overwriting of the compiled pattern.
1752    
1753             PCRE_ERROR_BADCOUNT       (-15)
1754    
1755           This error is given if the value of the ovecsize argument is  negative.
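
For diagnostics it can help to turn a negative return value into the name used in the list above. The following is a minimal sketch, assuming only the names and numeric values given in this section; demo_exec_error_name is a hypothetical helper, and PCRE itself does not provide such a function.

```c
#include <stddef.h>

/* Hypothetical helper: map a negative pcre_exec() return value to the
   macro name documented above. The table is indexed by (-rc - 1), so
   entry 0 corresponds to -1 and the last entry to -15. */
const char *demo_exec_error_name(int rc)
{
    static const char *const names[] = {
        "PCRE_ERROR_NOMATCH",         /* -1  */
        "PCRE_ERROR_NULL",            /* -2  */
        "PCRE_ERROR_BADOPTION",       /* -3  */
        "PCRE_ERROR_BADMAGIC",        /* -4  */
        "PCRE_ERROR_UNKNOWN_NODE",    /* -5  */
        "PCRE_ERROR_NOMEMORY",        /* -6  */
        "PCRE_ERROR_NOSUBSTRING",     /* -7  */
        "PCRE_ERROR_MATCHLIMIT",      /* -8  */
        "PCRE_ERROR_CALLOUT",         /* -9  */
        "PCRE_ERROR_BADUTF8",         /* -10 */
        "PCRE_ERROR_BADUTF8_OFFSET",  /* -11 */
        "PCRE_ERROR_PARTIAL",         /* -12 */
        "PCRE_ERROR_BADPARTIAL",      /* -13 */
        "PCRE_ERROR_INTERNAL",        /* -14 */
        "PCRE_ERROR_BADCOUNT"         /* -15 */
    };
    const int count = (int)(sizeof names / sizeof names[0]);

    if (rc >= 0 || rc < -count)
        return NULL;         /* not one of the codes listed above */
    return names[-rc - 1];
}
```

A real program would compare against the macros from the header file rather than literal numbers; the table is only a convenience for log messages.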
1756    
1757    
1758  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1759    
# Line 1267  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1779  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1779         not, of course, a C string.         not, of course, a C string.
1780    
1781         The  first  three  arguments  are the same for all three of these func-         The  first  three  arguments  are the same for all three of these func-
1782         tions: subject is the subject string which has just  been  successfully         tions: subject is the subject string that has  just  been  successfully
1783         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
1784         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
1785         were  captured  by  the match, including the substring that matched the         were  captured  by  the match, including the substring that matched the
1786         entire regular expression. This is the value returned by  pcre_exec  if         entire regular expression. This is the value returned by pcre_exec() if
1787         it  is greater than zero. If pcre_exec() returned zero, indicating that         it  is greater than zero. If pcre_exec() returned zero, indicating that
1788         it ran out of space in ovector, the value passed as stringcount  should         it ran out of space in ovector, the value passed as stringcount  should
1789         be the size of the vector divided by three.         be the number of elements in the vector divided by three.
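
The rule for choosing stringcount can be sketched as follows. The function name demo_stringcount is hypothetical; it only restates the paragraph above in code.

```c
/* Hypothetical helper: derive the stringcount argument for the
   substring-extraction functions from the value returned by
   pcre_exec() and the number of elements in ovector. */
int demo_stringcount(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;             /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;   /* ovector was too small: elements / 3 */
    return rc;                 /* negative: an error, nothing to extract */
}
```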
1790    
1791         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The  functions pcre_copy_substring() and pcre_get_substring() extract a
1792         single substring, whose number is given as  stringnumber.  A  value  of         single substring, whose number is given as  stringnumber.  A  value  of
1793         zero  extracts  the  substring  that  matched the entire pattern, while         zero  extracts  the  substring that matched the entire pattern, whereas
1794         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
1795         string(),  the  string  is  placed  in buffer, whose length is given by         string(),  the  string  is  placed  in buffer, whose length is given by
1796         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize, while for pcre_get_substring() a new  block  of  memory  is
# Line 1297  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1809  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1809    
1810         The pcre_get_substring_list()  function  extracts  all  available  sub-         The pcre_get_substring_list()  function  extracts  all  available  sub-
1811         strings  and  builds  a list of pointers to them. All this is done in a         strings  and  builds  a list of pointers to them. All this is done in a
1812         single block of memory which is obtained via pcre_malloc.  The  address         single block of memory that is obtained via pcre_malloc. The address of
1813         of the memory block is returned via listptr, which is also the start of         the  memory  block  is returned via listptr, which is also the start of
1814         the list of string pointers. The end of the list is marked  by  a  NULL         the list of string pointers. The end of the list is marked  by  a  NULL
1815         pointer. The yield of the function is zero if all went well, or         pointer. The yield of the function is zero if all went well, or
1816    
# Line 1313  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1825  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1825         string  by inspecting the appropriate offset in ovector, which is nega-         string  by inspecting the appropriate offset in ovector, which is nega-
1826         tive for unset substrings.         tive for unset substrings.
1827    
1828         The    two    convenience    functions    pcre_free_substring()     and         The two convenience functions pcre_free_substring() and  pcre_free_sub-
1829         pcre_free_substring_list() can be used to free the memory returned by a         string_list()  can  be  used  to free the memory returned by a previous
1830         previous call  of  pcre_get_substring()  or  pcre_get_substring_list(),         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
1831         respectively. They do nothing more than call the function pointed to by         tively.  They  do  nothing  more  than  call the function pointed to by
1832         pcre_free, which of course could be called directly from a  C  program.         pcre_free, which of course could be called directly from a  C  program.
1833         However,  PCRE is used in some situations where it is linked via a spe-         However,  PCRE is used in some situations where it is linked via a spe-
1834         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  which  cannot  use
# Line 1326  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1838  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1838    
1839  EXTRACTING CAPTURED SUBSTRINGS BY NAME  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1840    
1841           int pcre_get_stringnumber(const pcre *code,
1842                const char *name);
1843    
1844         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
1845              const char *subject, int *ovector,              const char *subject, int *ovector,
1846              int stringcount, const char *stringname,              int stringcount, const char *stringname,
1847              char *buffer, int buffersize);              char *buffer, int buffersize);
1848    
        int pcre_get_stringnumber(const pcre *code,  
             const char *name);  
   
1849         int pcre_get_named_substring(const pcre *code,         int pcre_get_named_substring(const pcre *code,
1850              const char *subject, int *ovector,              const char *subject, int *ovector,
1851              int stringcount, const char *stringname,              int stringcount, const char *stringname,
1852              const char **stringptr);              const char **stringptr);
1853    
1854         To extract a substring by name, you first have to find associated  num-         To extract a substring by name, you first have to find associated  num-
1855         ber.  This  can  be  done by calling pcre_get_stringnumber(). The first         ber.  For example, for this pattern
1856         argument is the compiled pattern, and the second is the name. For exam-  
1857         ple, for this pattern           (a+)b(?P<xxx>\d+)...
1858    
1859           ab(?<xxx>\d+)...         the number of the subpattern called "xxx" is 2. You can find the number
1860           from the name by calling pcre_get_stringnumber(). The first argument is
1861         the  number  of the subpattern called "xxx" is 1. Given the number, you         the  compiled  pattern,  and  the  second is the name. The yield of the
1862         can then extract the substring directly, or use one  of  the  functions         function is the subpattern number, or  PCRE_ERROR_NOSUBSTRING  (-7)  if
1863         described  in the previous section. For convenience, there are also two         there is no subpattern of that name.
1864         functions that do the whole job.  
1865           Given the number, you can extract the substring directly, or use one of
1866         Most   of   the   arguments    of    pcre_copy_named_substring()    and         the functions described in the previous section. For convenience, there
1867         pcre_get_named_substring() are the same as those for the functions that         are also two functions that do the whole job.
1868         extract by number, and so are not re-described here. There are just two  
1869         differences.         Most    of    the    arguments   of   pcre_copy_named_substring()   and
1870           pcre_get_named_substring() are the same  as  those  for  the  similarly
1871           named  functions  that extract by number. As these are described in the
1872           previous section, they are not re-described here. There  are  just  two
1873           differences:
1874    
1875         First,  instead  of a substring number, a substring name is given. Sec-         First,  instead  of a substring number, a substring name is given. Sec-
1876         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
# Line 1365  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 1881  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1881         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
1882         ate.         ate.
1883    
 Last updated: 09 December 2003  
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
1884    
1885  PCRE(3)                                                                PCRE(3)  FINDING ALL POSSIBLE MATCHES
1886    
1887           The traditional matching function uses a  similar  algorithm  to  Perl,
1888           which stops when it finds the first match, starting at a given point in
1889           the subject. If you want to find all possible matches, or  the  longest
1890           possible  match,  consider using the alternative matching function (see
1891           below) instead. If you cannot use the alternative function,  but  still
1892           need  to  find all possible matches, you can kludge it up by making use
1893           of the callout facility, which is described in the pcrecallout documen-
1894           tation.
1895    
1896           What you have to do is to insert a callout right at the end of the pat-
1897           tern.  When your callout function is called, extract and save the  cur-
1898           rent  matched  substring.  Then  return  1, which forces pcre_exec() to
1899           backtrack and try other alternatives. Ultimately, when it runs  out  of
1900           matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
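
The callout technique described above might look roughly like this. The sketch deliberately avoids pcre.h so that it stands alone: demo_callout_block is a hypothetical stand-in that mirrors only the subject, start_match, current_position, and callout_data fields of the real pcre_callout_block, and demo_match_list is an invented container for the saved matches. Real code would take a pcre_callout_block pointer and be installed through pcre_callout.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the pcre_callout_block fields used here (assumption:
   real code includes pcre.h and uses the genuine structure). */
typedef struct {
    const char *subject;
    int start_match;
    int current_position;
    void *callout_data;
} demo_callout_block;

#define DEMO_MAX_MATCHES 8
#define DEMO_MAX_LEN 64

/* Invented container, reached through callout_data. */
typedef struct {
    int count;
    char text[DEMO_MAX_MATCHES][DEMO_MAX_LEN];
} demo_match_list;

/* Save the currently matched text, then return 1 so that the matcher
   treats this path as a failure and backtracks to try alternatives. */
int demo_collect_callout(demo_callout_block *cb)
{
    demo_match_list *list = (demo_match_list *)cb->callout_data;
    int len = cb->current_position - cb->start_match;

    if (list != NULL && list->count < DEMO_MAX_MATCHES &&
        len >= 0 && len < DEMO_MAX_LEN) {
        memcpy(list->text[list->count],
               cb->subject + cb->start_match, (size_t)len);
        list->text[list->count][len] = '\0';
        list->count++;
    }
    return 1;
}
```

Matches that do not fit in the fixed-size list are silently dropped here; a production version would more likely grow the storage or report the overflow.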
1901    
1902    
1903    MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
1904    
1905           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1906                const char *subject, int length, int startoffset,
1907                int options, int *ovector, int ovecsize,
1908                int *workspace, int wscount);
1909    
1910           The  function  pcre_dfa_exec()  is  called  to  match  a subject string
1911           against a compiled pattern, using a "DFA" matching algorithm. This  has
1912           different  characteristics to the normal algorithm, and is not compati-
1913           ble with Perl. Some of the features of PCRE patterns are not supported.
1914           Nevertheless, there are times when this kind of matching can be useful.
1915           For a discussion of the two matching algorithms, see  the  pcrematching
1916           documentation.
1917    
1918           The  arguments  for  the  pcre_dfa_exec()  function are the same as for
1919           pcre_exec(), plus two extras. The ovector argument is used in a differ-
1920           ent  way,  and  this is described below. The other common arguments are
1921           used in the same way as for pcre_exec(), so their  description  is  not
1922           repeated here.
1923    
1924           The  two  additional  arguments provide workspace for the function. The
1925           workspace vector should contain at least 20 elements. It  is  used  for
1926           keeping  track  of  multiple  paths  through  the  pattern  tree.  More
1927           workspace will be needed for patterns and subjects where  there  are  a
1928           lot of possible matches.
1929    
1930           Here is an example of a simple call to pcre_dfa_exec():
1931    
1932             int rc;
1933             int ovector[10];
1934             int wspace[20];
1935             rc = pcre_dfa_exec(
1936               re,             /* result of pcre_compile() */
1937               NULL,           /* we didn't study the pattern */
1938               "some string",  /* the subject string */
1939               11,             /* the length of the subject string */
1940               0,              /* start at offset 0 in the subject */
1941               0,              /* default options */
1942               ovector,        /* vector of integers for substring information */
1943               10,             /* number of elements (NOT size in bytes) */
1944               wspace,         /* working space vector */
1945               20);            /* number of elements (NOT size in bytes) */
1946    
1947       Option bits for pcre_dfa_exec()
1948    
1949           The  unused  bits  of  the options argument for pcre_dfa_exec() must be
1950           zero. The only bits that may be  set  are  PCRE_ANCHORED,  PCRE_NOTBOL,
1951           PCRE_NOTEOL,     PCRE_NOTEMPTY,    PCRE_NO_UTF8_CHECK,    PCRE_PARTIAL,
1952           PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All  but  the  last  three  of
1953           these  are  the  same  as  for pcre_exec(), so their description is not
1954           repeated here.
1955    
1956             PCRE_PARTIAL
1957    
1958           This has the same general effect as it does for  pcre_exec(),  but  the
1959           details   are   slightly   different.  When  PCRE_PARTIAL  is  set  for
1960           pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is  converted  into
1961           PCRE_ERROR_PARTIAL  if  the  end  of the subject is reached, there have
1962           been no complete matches, but there is still at least one matching pos-
1963           sibility.  The portion of the string that provided the partial match is
1964           set as the first matching string.
1965    
1966             PCRE_DFA_SHORTEST
1967    
1968           Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
1969           stop  as  soon  as  it  has found one match. Because of the way the DFA
1970           algorithm works, this is necessarily the shortest possible match at the
1971           first possible matching point in the subject string.
1972    
1973             PCRE_DFA_RESTART
1974    
1975           When  pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option, and
1976           returns a partial match, it is possible to call it  again,  with  addi-
1977           tional  subject  characters,  and have it continue with the same match.
1978           The PCRE_DFA_RESTART option requests this action; when it is  set,  the
1979           workspace and wscount arguments must reference the same vector as before
1980           because data about the match so far is left in  them  after  a  partial
1981           match.  There  is  more  discussion of this facility in the pcrepartial
1982           documentation.
1983    
1984       Successful returns from pcre_dfa_exec()
1985    
1986           When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
1987           string in the subject. Note, however, that all the matches from one run
1988           of the function start at the same point in  the  subject.  The  shorter
1989           matches  are all initial substrings of the longer matches. For example,
1990           if the pattern
1991    
1992             <.*>
1993    
1994           is matched against the string
1995    
1996             This is <something> <something else> <something further> no more
1997    
1998           the three matched strings are
1999    
2000             <something>
2001             <something> <something else>
2002             <something> <something else> <something further>
2003    
2004           On success, the yield of the function is a number  greater  than  zero,
2005           which  is  the  number of matched substrings. The substrings themselves
2006           are returned in ovector. Each string uses two elements;  the  first  is
2007           the  offset  to the start, and the second is the offset to the end. All
2008           the strings have the same start offset. (Space could have been saved by
2009           giving  this only once, but it was decided to retain some compatibility
2010           with the way pcre_exec() returns data, even though the meaning  of  the
2011           strings is different.)
2012    
2013           The strings are returned in reverse order of length; that is, the long-
2014           est matching string is given first. If there were too many  matches  to
2015           fit  into ovector, the yield of the function is zero, and the vector is
2016           filled with the longest matches.
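
Reading the results back out of ovector can be sketched as below. Both helper names are hypothetical. The zero-return case assumes that every complete pair in the vector has been filled, which is how the paragraph above is read here.

```c
/* Hypothetical helper: number of (start, end) pairs in ovector that
   hold match data after a call to pcre_dfa_exec(). */
int demo_dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;             /* rc is the number of matched strings */
    if (rc == 0)
        return ovecsize / 2;   /* vector full: two elements per string */
    return 0;                  /* negative: error or no match */
}

/* Hypothetical helper: length of the i-th returned string. Index 0 is
   the longest match, because the strings are given longest first. */
int demo_dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

For the <.*> example above, the three matches all start at offset 8 and end at offsets 56, 36, and 19, so the pairs in ovector would be 8,56 then 8,36 then 8,19.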
2017    
2018       Error returns from pcre_dfa_exec()
2019    
2020           The pcre_dfa_exec() function returns a negative number when  it  fails.
2021           Many  of  the  errors  are  the  same as for pcre_exec(), and these are
2022           described above.  There are in addition the following errors  that  are
2023           specific to pcre_dfa_exec():
2024    
2025             PCRE_ERROR_DFA_UITEM      (-16)
2026    
2027           This  return is given if pcre_dfa_exec() encounters an item in the pat-
2028           tern that it does not support, for instance, the use of \C  or  a  back
2029           reference.
2030    
2031             PCRE_ERROR_DFA_UCOND      (-17)
2032    
2033           This  return is given if pcre_dfa_exec() encounters a condition item in
2034           a pattern that uses a back reference for the  condition.  This  is  not
2035           supported.
2036    
2037             PCRE_ERROR_DFA_UMLIMIT    (-18)
2038    
2039           This  return  is given if pcre_dfa_exec() is called with an extra block
2040           that contains a setting of the match_limit field. This is not supported
2041           (it is meaningless).
2042    
2043             PCRE_ERROR_DFA_WSSIZE     (-19)
2044    
2045           This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
2046           workspace vector.
2047    
2048             PCRE_ERROR_DFA_RECURSE    (-20)
2049    
2050           When a recursive subpattern is processed, the matching  function  calls
2051           itself  recursively,  using  private vectors for ovector and workspace.
2052           This error is given if the output vector  is  not  large  enough.  This
2053           should be extremely rare, as a vector of size 1000 is used.
2054    
2055    Last updated: 16 May 2005
2056    Copyright (c) 1997-2005 University of Cambridge.
2057    ------------------------------------------------------------------------------
2058    
2059    
2060    PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2061    
2062    
2063  NAME  NAME
2064         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2065    
2066    
2067  PCRE CALLOUTS  PCRE CALLOUTS
2068    
2069         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
# Line 1392  PCRE CALLOUTS Line 2080  PCRE CALLOUTS
2080         default value is zero.  For  example,  this  pattern  has  two  callout         default value is zero.  For  example,  this  pattern  has  two  callout
2081         points:         points:
2082    
2083           (?C1)abc(?C2)def           (?C1)eabc(?C2)def
2084    
2085         During matching, when PCRE reaches a callout point (and pcre_callout is         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
2086         set), the external function is called. Its only argument is  a  pointer         called, PCRE automatically  inserts  callouts,  all  with  number  255,
2087         to a pcre_callout block. This contains the following variables:         before  each  item in the pattern. For example, if PCRE_AUTO_CALLOUT is
2088           used with the pattern
2089    
2090             A(\d{2}|--)
2091    
2092           it is processed as if it were
2093    
2094           (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
2095    
2096           Notice that there is a callout before and after  each  parenthesis  and
2097           alternation  bar.  Automatic  callouts  can  be  used  for tracking the
2098           progress of pattern matching. The pcretest command has an  option  that
2099           sets  automatic callouts; when it is used, the output indicates how the
2100           pattern is matched. This is useful information when you are  trying  to
2101           optimize the performance of a particular pattern.
2102    
2103    
2104    MISSING CALLOUTS
2105    
2106           You  should  be  aware  that,  because of optimizations in the way PCRE
2107           matches patterns, callouts sometimes do not happen. For example, if the
2108           pattern is
2109    
2110             ab(?C4)cd
2111    
2112           PCRE knows that any matching string must contain the letter "d". If the
2113           subject string is "abyz", the lack of "d" means that  matching  doesn't
2114           ever  start,  and  the  callout is never reached. However, with "abyd",
2115           though the result is still no match, the callout is obeyed.
2116    
2117    
2118    THE CALLOUT INTERFACE
2119    
2120           During matching, when PCRE reaches a callout point, the external  func-
2121           tion  defined by pcre_callout is called (if it is set). This applies to
2122           both the pcre_exec() and the pcre_dfa_exec()  matching  functions.  The
2123           only  argument  to  the callout function is a pointer to a pcre_callout
2124           block. This structure contains the following fields:
2125    
2126           int          version;           int          version;
2127           int          callout_number;           int          callout_number;
# Line 1408  PCRE CALLOUTS Line 2133  PCRE CALLOUTS
2133           int          capture_top;           int          capture_top;
2134           int          capture_last;           int          capture_last;
2135           void        *callout_data;           void        *callout_data;
2136             int          pattern_position;
2137             int          next_item_length;
2138    
2139         The  version  field  is an integer containing the version number of the         The version field is an integer containing the version  number  of  the
2140         block format. The current version  is  zero.  The  version  number  may         block  format. The initial version was 0; the current version is 1. The
2141         change  in  future if additional fields are added, but the intention is         version number will change again in future  if  additional  fields  are
2142         never to remove any of the existing fields.         added, but the intention is never to remove any of the existing fields.
2143    
2144         The callout_number field contains the number of the  callout,  as  com-         The callout_number field contains the number of the  callout,  as  com-
2145         piled into the pattern (that is, the number after ?C).         piled  into  the pattern (that is, the number after ?C for manual call-
2146           outs, and 255 for automatically generated callouts).
2147    
2148         The  offset_vector field is a pointer to the vector of offsets that was         The offset_vector field is a pointer to the vector of offsets that  was
2149         passed by the caller to pcre_exec(). The contents can be  inspected  in         passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When
2150         order  to extract substrings that have been matched so far, in the same         pcre_exec() is used, the contents can be inspected in order to  extract
2151         way as for extracting substrings after a match has completed.         substrings  that  have  been  matched  so  far,  in the same way as for
2152           extracting substrings after a match has completed. For  pcre_dfa_exec()
2153           this field is not useful.
2154    
2155         The subject and subject_length fields contain copies  the  values  that         The subject and subject_length fields contain copies of the values that
2156         were passed to pcre_exec().         were passed to pcre_exec().
2157    
2158         The  start_match  field contains the offset within the subject at which         The start_match field contains the offset within the subject  at  which
2159         the current match attempt started. If the pattern is not anchored,  the         the  current match attempt started. If the pattern is not anchored, the
2160         callout  function  may  be  called several times for different starting         callout function may be called several times from the same point in the
2161         points.         pattern for different starting points in the subject.
2162    
2163         The current_position field contains the offset within  the  subject  of         The  current_position  field  contains the offset within the subject of
2164         the current match pointer.         the current match pointer.
2165    
2166         The  capture_top field contains one more than the number of the highest         When the pcre_exec() function is used, the capture_top  field  contains
2167         numbered  captured  substring  so  far.  If  no  substrings  have  been         one  more than the number of the highest numbered captured substring so
2168         captured, the value of capture_top is one.         far. If no substrings have been captured, the value of  capture_top  is
2169           one.  This  is always the case when pcre_dfa_exec() is used, because it
2170         The  capture_last  field  contains the number of the most recently cap-         does not support captured substrings.
2171         tured substring.  
2172           The capture_last field contains the number of the  most  recently  cap-
2173           tured  substring. If no substrings have been captured, its value is -1.
2174           This is always the case when pcre_dfa_exec() is used.
2175    
2176         The callout_data field contains a value that is passed  to  pcre_exec()         The callout_data field contains a value that is passed  to  pcre_exec()
2177         by  the  caller specifically so that it can be passed back in callouts.         or  pcre_dfa_exec() specifically so that it can be passed back in call-
2178         It is passed in the callout_data field of the  pcre_extra  data  struc-         outs. It is passed in the callout_data field  of  the  pcre_extra  data
2179         ture.  If  no  such  data  was  passed,  the value of callout_data in a         structure.  If  no such data was passed, the value of callout_data in a
2180         pcre_callout block is NULL. There is a description  of  the  pcre_extra         pcre_callout block is NULL. There is a description  of  the  pcre_extra
2181         structure in the pcreapi documentation.         structure in the pcreapi documentation.
2182    
2183           The  pattern_position field is present from version 1 of the pcre_call-
2184           out structure. It contains the offset to the next item to be matched in
2185           the pattern string.
2186    
2187           The  next_item_length field is present from version 1 of the pcre_call-
2188           out structure. It contains the length of the next item to be matched in
2189           the  pattern  string. When the callout immediately precedes an alterna-
2190           tion bar, a closing parenthesis, or the end of the pattern, the  length
2191           is  zero.  When the callout precedes an opening parenthesis, the length
2192           is that of the entire subpattern.
2193    
2194           The pattern_position and next_item_length fields are intended  to  help
2195           in  distinguishing between different automatic callouts, which all have
2196           the same callout number. However, they are set for all callouts.
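
       The fields above can be read from inside a callout function. The sketch
       below copies one substring that has been captured so far, using the
       offset pairs in offset_vector together with capture_top. The
       callout_view structure and the copy_capture() helper are hypothetical
       stand-ins written for this illustration; real code would include pcre.h
       and use the pcre_callout_block structure instead.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the pcre_callout_block fields used here.
   It is NOT the real declaration; real code includes <pcre.h>. */
typedef struct {
  int        *offset_vector;   /* pairs of start/end offsets per group */
  const char *subject;         /* subject string passed by the caller */
  int         subject_length;  /* its length */
  int         capture_top;     /* one more than highest group captured */
  int         capture_last;    /* most recently captured group, or -1 */
} callout_view;

/* Copy captured substring n (n >= 1) into buf as a C string.
   Returns its length, or -1 if group n has not been captured so far,
   its offsets are unset, or buf is too small. */
static int copy_capture(const callout_view *cb, int n, char *buf, size_t size)
{
  int start, end;

  if (n < 1 || n >= cb->capture_top) return -1;   /* not captured yet */
  start = cb->offset_vector[2 * n];
  end = cb->offset_vector[2 * n + 1];
  if (start < 0 || end < start || end > cb->subject_length) return -1;
  if ((size_t)(end - start) + 1 > size) return -1;

  memcpy(buf, cb->subject + start, (size_t)(end - start));
  buf[end - start] = '\0';
  return end - start;
}
```

       With pcre_dfa_exec() capture_top is always one, so a helper like this
       would always return -1.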
2197    
2198    
2199  RETURN VALUES  RETURN VALUES
2200    
2201         The callout function returns an integer. If the value is zero, matching         The external callout function returns an integer to PCRE. If the  value
2202         proceeds as normal. If the value is greater than zero,  matching  fails         is  zero,  matching  proceeds  as  normal. If the value is greater than
2203         at the current point, but backtracking to test other possibilities goes         zero, matching fails at the current point, but  the  testing  of  other
2204         ahead, just as if a lookahead assertion had failed.  If  the  value  is         matching possibilities goes ahead, just as if a lookahead assertion had
2205         less  than  zero,  the  match is abandoned, and pcre_exec() returns the         failed. If the value is less than zero, the  match  is  abandoned,  and
2206         value.         pcre_exec() (or pcre_dfa_exec()) returns the negative value.
2207    
2208         Negative  values  should  normally  be   chosen   from   the   set   of         Negative   values   should   normally   be   chosen  from  the  set  of
2209         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
2210         dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is         dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
2211         reserved  for  use  by callout functions; it will never be used by PCRE         reserved for use by callout functions; it will never be  used  by  PCRE
2212         itself.         itself.
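
       The three kinds of return value can be sketched as follows. The
       structure callout_ctl_view, the limits structure, and the constant
       SKETCH_ABANDON are assumptions made for this illustration only; real
       code would include pcre.h, receive a pcre_callout_block pointer, and
       return PCRE_ERROR_CALLOUT (or another PCRE_ERROR_xxx value) to abandon
       the match.

```c
#include <stddef.h>

/* Placeholder for PCRE_ERROR_CALLOUT; any negative value abandons the
   match. Real code takes the constant from <pcre.h>. */
#define SKETCH_ABANDON (-1)

/* Hypothetical stand-in for the pcre_callout_block fields used here. */
typedef struct {
  int   callout_number;
  int   start_match;
  int   current_position;
  void *callout_data;       /* value supplied by the caller */
} callout_ctl_view;

/* Caller-defined data handed back through callout_data (hypothetical). */
typedef struct {
  int max_position;   /* do not match beyond this subject offset */
  int calls;          /* callouts seen so far */
  int max_calls;      /* give up entirely after this many callouts */
} limits;

static int limit_callout(callout_ctl_view *cb)
{
  limits *lim = (limits *)cb->callout_data;

  if (lim == NULL) return 0;                  /* no data: carry on */
  if (++lim->calls > lim->max_calls)
    return SKETCH_ABANDON;                    /* < 0: abandon the match */
  if (cb->current_position > lim->max_position)
    return 1;                                 /* > 0: fail here, try other paths */
  return 0;                                   /* 0: matching proceeds */
}
```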
2213    
2214  Last updated: 21 January 2003  Last updated: 28 February 2005
2215  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2005 University of Cambridge.
2216  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
2217    
 PCRE(3)                                                                PCRE(3)  
2218    
2219    PCRECOMPAT(3)                                                    PCRECOMPAT(3)
2220    
2221    
2222  NAME  NAME
2223         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2224    
2225  DIFFERENCES FROM PERL  
2226    DIFFERENCES BETWEEN PCRE AND PERL
2227    
2228         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2229         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
# Line 1498  DIFFERENCES FROM PERL Line 2246  DIFFERENCES FROM PERL
2246    
2247         4. Though binary zero characters are supported in the  subject  string,         4. Though binary zero characters are supported in the  subject  string,
2248         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
2249         mal C string, terminated by zero. The escape sequence "\0" can be  used         mal C string, terminated by zero. The escape sequence \0 can be used in
2250         in the pattern to represent a binary zero.         the pattern to represent a binary zero.
2251    
2252         5.  The  following Perl escape sequences are not supported: \l, \u, \L,         5.  The  following Perl escape sequences are not supported: \l, \u, \L,
2253         \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general         \U, and \N. In fact these are implemented by Perl's general string-han-
2254         string-handling and are not part of its pattern matching engine. If any         dling  and are not part of its pattern matching engine. If any of these
2255         of these are encountered by PCRE, an error is generated.         are encountered by PCRE, an error is generated.
2256    
2257         6. PCRE does support the \Q...\E escape for quoting substrings. Charac-         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
2258         ters  in  between  are  treated as literals. This is slightly different         is  built  with Unicode character property support. The properties that
2259         from Perl in that $ and @ are  also  handled  as  literals  inside  the         can be tested with \p and \P are limited to the general category  prop-
2260         quotes.  In Perl, they cause variable interpolation (but of course PCRE         erties such as Lu and Nd.
2261    
2262           7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2263           ters in between are treated as literals.  This  is  slightly  different
2264           from  Perl  in  that  $  and  @ are also handled as literals inside the
2265           quotes. In Perl, they cause variable interpolation (but of course  PCRE
2266         does not have variables). Note the following examples:         does not have variables). Note the following examples:
2267    
2268             Pattern            PCRE matches      Perl matches             Pattern            PCRE matches      Perl matches
# Line 1519  DIFFERENCES FROM PERL Line 2272  DIFFERENCES FROM PERL
2272             \Qabc\$xyz\E       abc\$xyz          abc\$xyz             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
2273             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
2274    
2275         The \Q...\E sequence is recognized both inside  and  outside  character         The  \Q...\E  sequence  is recognized both inside and outside character
2276         classes.         classes.
2277    
2278         7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})         8. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
2279         constructions. However, there is some experimental support  for  recur-         constructions.  However,  there is support for recursive patterns using
2280         sive  patterns  using the non-Perl items (?R), (?number) and (?P>name).         the non-Perl items (?R),  (?number),  and  (?P>name).  Also,  the  PCRE
2281         Also, the PCRE "callout" feature allows  an  external  function  to  be         "callout"  feature allows an external function to be called during pat-
2282         called during pattern matching.         tern matching. See the pcrecallout documentation for details.
2283    
2284         8.  There  are some differences that are concerned with the settings of         9. There are some differences that are concerned with the  settings  of
2285         captured strings when part of  a  pattern  is  repeated.  For  example,         captured  strings  when  part  of  a  pattern is repeated. For example,
2286         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2         matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
2287         unset, but in PCRE it is set to "b".         unset, but in PCRE it is set to "b".
2288    
2289         9. PCRE  provides  some  extensions  to  the  Perl  regular  expression         10. PCRE provides some extensions to the Perl regular expression facil-
2290         facilities:         ities:
2291    
2292         (a)  Although  lookbehind  assertions  must match fixed length strings,         (a) Although lookbehind assertions must  match  fixed  length  strings,
2293         each alternative branch of a lookbehind assertion can match a different         each alternative branch of a lookbehind assertion can match a different
2294         length of string. Perl requires them all to have the same length.         length of string. Perl requires them all to have the same length.
2295    
2296         (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $         (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $
2297         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2298    
2299         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2300         cial meaning is faulted.         cial meaning is faulted.
2301    
2302         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
2303         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
2304         lowed by a question mark they are.         lowed by a question mark they are.
2305    
2306         (e)  PCRE_ANCHORED  can  be used to force a pattern to be tried only at         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2307         the first matching position in the subject string.         tried only at the first matching position in the subject string.
2308    
2309         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
2310         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
2311    
2312         (g)  The (?R), (?number), and (?P>name) constructs allow for recursive         (g) The (?R), (?number), and (?P>name) constructs allow for  recursive
2313         pattern matching (Perl can do  this  using  the  (?p{code})  construct,         pattern  matching  (Perl  can  do  this using the (?p{code}) construct,
2314         which PCRE cannot support.)         which PCRE cannot support.)
2315    
2316         (h)  PCRE supports named capturing substrings, using the Python syntax.         (h) PCRE supports named capturing substrings, using the Python  syntax.
2317    
2318         (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from         (i)  PCRE  supports  the  possessive quantifier "++" syntax, taken from
2319         Sun's Java package.         Sun's Java package.
2320    
2321         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j) The (R) condition, for testing recursion, is a PCRE extension.
2322    
2323         (k) The callout facility is PCRE-specific.         (k) The callout facility is PCRE-specific.
2324    
2325  Last updated: 09 December 2003         (l) The partial matching facility is PCRE-specific.
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
2326    
2327  PCRE(3)                                                                PCRE(3)         (m) Patterns compiled by PCRE can be saved and re-used at a later time,
2328           even on different hosts that have the other endianness.
2329    
2330           (n)  The  alternative  matching function (pcre_dfa_exec()) matches in a
2331           different way and is not Perl-compatible.
2332    
2333    Last updated: 28 February 2005
2334    Copyright (c) 1997-2005 University of Cambridge.
2335    ------------------------------------------------------------------------------
2336    
2337    
2338    PCREPATTERN(3)                                                  PCREPATTERN(3)
2339    
2340    
2341  NAME  NAME
2342         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2343    
2344    
2345  PCRE REGULAR EXPRESSION DETAILS  PCRE REGULAR EXPRESSION DETAILS
2346    
2347         The  syntax  and semantics of the regular expressions supported by PCRE         The  syntax  and semantics of the regular expressions supported by PCRE
2348         are described below. Regular expressions are also described in the Perl         are described below. Regular expressions are also described in the Perl
2349         documentation  and in a number of other books, some of which have copi-         documentation  and  in  a  number  of books, some of which have copious
2350         ous examples. Jeffrey Friedl's "Mastering  Regular  Expressions",  pub-         examples.  Jeffrey Friedl's "Mastering Regular Expressions",  published
2351         lished  by  O'Reilly, covers them in great detail. The description here         by  O'Reilly, covers regular expressions in great detail. This descrip-
2352         is intended as reference documentation.         tion of PCRE's regular expressions is intended as reference material.
2353    
2354         The basic operation of PCRE is on strings of bytes. However,  there  is         The original operation of PCRE was on strings of  one-byte  characters.
2355         also  support for UTF-8 character strings. To use this support you must         However,  there is now also support for UTF-8 character strings. To use
2356         build PCRE to include UTF-8 support, and then call pcre_compile()  with         this, you must build PCRE to  include  UTF-8  support,  and  then  call
2357         the  PCRE_UTF8  option.  How  this affects the pattern matching is men-         pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern
2358         tioned in several places below. There is also a summary of  UTF-8  fea-         matching is mentioned in several places below. There is also a  summary
2359         tures in the section on UTF-8 support in the main pcre page.         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
2360           page.
2361         A  regular  expression  is  a pattern that is matched against a subject  
2362         string from left to right. Most characters stand for  themselves  in  a         The remainder of this document discusses the  patterns  that  are  sup-
2363         pattern,  and  match  the corresponding characters in the subject. As a         ported  by  PCRE when its main matching function, pcre_exec(), is used.
2364           From  release  6.0,   PCRE   offers   a   second   matching   function,
2365           pcre_dfa_exec(),  which matches using a different algorithm that is not
2366           Perl-compatible. The advantages and disadvantages  of  the  alternative
2367           function, and how it differs from the normal function, are discussed in
2368           the pcrematching page.
2369    
2370           A regular expression is a pattern that is  matched  against  a  subject
2371           string  from  left  to right. Most characters stand for themselves in a
2372           pattern, and match the corresponding characters in the  subject.  As  a
2373         trivial example, the pattern         trivial example, the pattern
2374    
2375           The quick brown fox           The quick brown fox
2376    
2377         matches a portion of a subject string that is identical to itself.  The         matches a portion of a subject string that is identical to itself. When
2378         power of regular expressions comes from the ability to include alterna-         caseless matching is specified (the PCRE_CASELESS option), letters  are
2379         tives and repetitions in the pattern. These are encoded in the  pattern         matched  independently  of case. In UTF-8 mode, PCRE always understands
2380         by  the  use  of meta-characters, which do not stand for themselves but         the concept of case for characters whose values are less than  128,  so
2381         instead are interpreted in some special way.         caseless  matching  is always possible. For characters with higher val-
2382           ues, the concept of case is supported if PCRE is compiled with  Unicode
2383           property  support,  but  not  otherwise.   If  you want to use caseless
2384           matching for characters 128 and above, you must  ensure  that  PCRE  is
2385           compiled with Unicode property support as well as with UTF-8 support.
2386    
2387           The  power  of  regular  expressions  comes from the ability to include
2388           alternatives and repetitions in the pattern. These are encoded  in  the
2389           pattern by the use of metacharacters, which do not stand for themselves
2390           but instead are interpreted in some special way.
2391    
2392         There are two different sets of meta-characters: those that are  recog-         There are two different sets of metacharacters: those that  are  recog-
2393         nized  anywhere in the pattern except within square brackets, and those         nized  anywhere in the pattern except within square brackets, and those
2394         that are recognized in square brackets. Outside  square  brackets,  the         that are recognized in square brackets. Outside  square  brackets,  the
2395         meta-characters are as follows:         metacharacters are as follows:
2396    
2397           \      general escape character with several uses           \      general escape character with several uses
2398           ^      assert start of string (or line, in multiline mode)           ^      assert start of string (or line, in multiline mode)
# Line 1631  PCRE REGULAR EXPRESSION DETAILS Line 2411  PCRE REGULAR EXPRESSION DETAILS
2411           {      start min/max quantifier           {      start min/max quantifier
2412    
2413         Part  of  a  pattern  that is in square brackets is called a "character         Part  of  a  pattern  that is in square brackets is called a "character
2414         class". In a character class the only meta-characters are:         class". In a character class the only metacharacters are:
2415    
2416           \      general escape character           \      general escape character
2417           ^      negate the class, but only if the first character           ^      negate the class, but only if the first character
# Line 1640  PCRE REGULAR EXPRESSION DETAILS Line 2420  PCRE REGULAR EXPRESSION DETAILS
2420                    syntax)                    syntax)
2421           ]      terminates the character class           ]      terminates the character class
2422    
2423         The following sections describe the use of each of the meta-characters.         The following sections describe the use of each of the  metacharacters.
2424    
2425    
2426  BACKSLASH  BACKSLASH
2427    
2428         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2429         a non-alphameric character, it takes  away  any  special  meaning  that         a non-alphanumeric character, it takes away any  special  meaning  that
2430         character  may  have.  This  use  of  backslash  as an escape character         character  may  have.  This  use  of  backslash  as an escape character
2431         applies both inside and outside character classes.         applies both inside and outside character classes.
2432    
2433         For example, if you want to match a * character, you write  \*  in  the         For example, if you want to match a * character, you write  \*  in  the
2434         pattern.   This  escaping  action  applies whether or not the following         pattern.   This  escaping  action  applies whether or not the following
2435         character would otherwise be interpreted as a meta-character, so it  is         character would otherwise be interpreted as a metacharacter, so  it  is
2436         always  safe to precede a non-alphameric with backslash to specify that         always  safe  to  precede  a non-alphanumeric with backslash to specify
2437         it stands for itself. In particular, if you want to match a  backslash,         that it stands for itself. In particular, if you want to match a  back-
2438         you write \\.         slash, you write \\.
2439    
2440         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
2441         the pattern (other than in a character class) and characters between  a         the pattern (other than in a character class) and characters between  a
# Line 1679  BACKSLASH Line 2459  BACKSLASH
2459         The  \Q...\E  sequence  is recognized both inside and outside character         The  \Q...\E  sequence  is recognized both inside and outside character
2460         classes.         classes.
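
       One way a program might use \Q...\E is to quote arbitrary text before
       placing it in a pattern. The sketch below is a minimal illustration,
       not part of PCRE: quote_literal() and append() are hypothetical helper
       names. Because an embedded \E would end the quoting early, the helper
       closes the quote, writes an escaped backslash followed by E, and
       reopens the quote.

```c
#include <stddef.h>
#include <string.h>

/* Append s to out (current length *n, capacity size). Returns 0 on
   success, -1 if there is no room for s plus the terminating zero. */
static int append(char *out, size_t size, size_t *n, const char *s)
{
  size_t len = strlen(s);
  if (*n + len + 1 > size) return -1;
  memcpy(out + *n, s, len + 1);
  *n += len;
  return 0;
}

/* Wrap lit in \Q...\E so that every character is taken literally.
   An embedded "\E" is written as \E (close quote), \\E (escaped
   backslash, then E), \Q (reopen quote). Returns 0 or -1. */
static int quote_literal(const char *lit, char *out, size_t size)
{
  size_t n = 0, i;

  if (append(out, size, &n, "\\Q") != 0) return -1;
  for (i = 0; lit[i] != '\0'; i++) {
    if (lit[i] == '\\' && lit[i + 1] == 'E') {
      if (append(out, size, &n, "\\E\\\\E\\Q") != 0) return -1;
      i++;                            /* the E has been handled too */
    } else {
      char one[2];
      one[0] = lit[i];
      one[1] = '\0';
      if (append(out, size, &n, one) != 0) return -1;
    }
  }
  return append(out, size, &n, "\\E");
}
```

       For the text abc$xyz this produces \Qabc$xyz\E. Binary zeros cannot be
       quoted this way, because the pattern is passed as a C string.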
2461    
2462       Non-printing characters
2463    
2464         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
2465         acters  in patterns in a visible manner. There is no restriction on the         acters  in patterns in a visible manner. There is no restriction on the
2466         appearance of non-printing characters, apart from the binary zero  that         appearance of non-printing characters, apart from the binary zero  that
# Line 1708  BACKSLASH Line 2490  BACKSLASH
2490         must  be  less  than  2**31  (that is, the maximum hexadecimal value is         must  be  less  than  2**31  (that is, the maximum hexadecimal value is
2491         7FFFFFFF). If characters other than hexadecimal digits  appear  between         7FFFFFFF). If characters other than hexadecimal digits  appear  between
2492         \x{  and }, or if there is no terminating }, this form of escape is not         \x{  and }, or if there is no terminating }, this form of escape is not
2493         recognized. Instead, the initial \x will be interpreted as a basic hex-         recognized. Instead, the initial \x will  be  interpreted  as  a  basic
2494         adecimal escape, with no following digits, giving a byte whose value is         hexadecimal  escape, with no following digits, giving a character whose
2495         zero.         value is zero.
2496    
2497         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2498         two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference         two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
# Line 1721  BACKSLASH Line 2503  BACKSLASH
2503         there are fewer than two digits, just those that are present are  used.         there are fewer than two digits, just those that are present are  used.
2504         Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL         Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL
2505         character (code value 7). Make sure you supply  two  digits  after  the         character (code value 7). Make sure you supply  two  digits  after  the
2506         initial zero if the character that follows is itself an octal digit.         initial  zero  if the pattern character that follows is itself an octal
2507           digit.
2508    
2509         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2510         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
2511         its  as  a  decimal  number. If the number is less than 10, or if there         its as a decimal number. If the number is less than  10,  or  if  there
2512         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
2513         expression,  the  entire  sequence  is  taken  as  a  back reference. A         expression, the entire  sequence  is  taken  as  a  back  reference.  A
2514         description of how this works is given later, following the  discussion         description  of how this works is given later, following the discussion
2515         of parenthesized subpatterns.         of parenthesized subpatterns.
2516    
2517         Inside  a  character  class, or if the decimal number is greater than 9         Inside a character class, or if the decimal number is  greater  than  9
2518         and there have not been that many capturing subpatterns, PCRE  re-reads         and  there have not been that many capturing subpatterns, PCRE re-reads
2519         up  to three octal digits following the backslash, and generates a sin-         up to three octal digits following the backslash, and generates a  sin-
2520         gle byte from the least significant 8 bits of the value. Any subsequent         gle byte from the least significant 8 bits of the value. Any subsequent
2521         digits stand for themselves.  For example:         digits stand for themselves.  For example:
2522    
# Line 1752  BACKSLASH Line 2535  BACKSLASH
2535           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
2536                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
2537    
2538         Note  that  octal  values of 100 or greater must not be introduced by a         Note that octal values of 100 or greater must not be  introduced  by  a
2539         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
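
       The two readings of a backslash followed by a non-zero digit can be
       sketched in C. This is an illustration of the rules as described above,
       not PCRE's own code; the names classify_digit_escape, esc_kind, and
       esc_result are hypothetical, and the sketch covers only the non-UTF-8
       case in which a single byte is generated.

```c
#include <ctype.h>

typedef enum { ESC_BACKREF, ESC_BYTE, ESC_INVALID } esc_kind;

typedef struct {
  esc_kind kind;
  int      value;     /* group number, or byte value */
  int      consumed;  /* digits used; any that follow stand for themselves */
} esc_result;

/* digits points just past the backslash and must start with 1-9.
   captures_so_far counts the capturing left parentheses seen earlier. */
static esc_result classify_digit_escape(const char *digits,
                                        int captures_so_far, int in_class)
{
  esc_result r = { ESC_INVALID, 0, 0 };
  int n = 0, len = 0;

  if (digits[0] < '1' || digits[0] > '9') return r;

  /* First reading: all the digits as one decimal number. */
  while (isdigit((unsigned char)digits[len])) {
    if (n < 100000) n = n * 10 + (digits[len] - '0');  /* avoid overflow */
    len++;
  }
  if (!in_class && (n < 10 || n <= captures_so_far)) {
    r.kind = ESC_BACKREF;
    r.value = n;
    r.consumed = len;
    return r;
  }

  /* Second reading: up to three octal digits, least significant 8 bits. */
  n = 0;
  len = 0;
  while (len < 3 && digits[len] >= '0' && digits[len] <= '7') {
    n = n * 8 + (digits[len] - '0');
    len++;
  }
  r.kind = ESC_BYTE;
  r.value = n & 0xFF;
  r.consumed = len;
  return r;
}
```

       For \81 with fewer than 81 capturing groups, the second reading finds
       no octal digits, so the result is a zero byte with both digits left
       over as literals.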
2540    
2541         All the sequences that define a single byte value  or  a  single  UTF-8         All  the  sequences  that  define a single byte value or a single UTF-8
2542         character (in UTF-8 mode) can be used both inside and outside character         character (in UTF-8 mode) can be used both inside and outside character
2543         classes. In addition, inside a character  class,  the  sequence  \b  is         classes.  In  addition,  inside  a  character class, the sequence \b is
2544         interpreted  as  the  backspace character (hex 08). Outside a character         interpreted as the backspace character (hex 08), and the sequence \X is
2545         class it has a different meaning (see below).         interpreted  as  the  character  "X".  Outside a character class, these
2546           sequences have different meanings (see below).
2547    
2548         The third use of backslash is for specifying generic character types:     Generic character types
2549    
2550           The third use of backslash is for specifying generic  character  types.
2551           The following are always recognized:
2552    
2553           \d     any decimal digit           \d     any decimal digit
2554           \D     any character that is not a decimal digit           \D     any character that is not a decimal digit
# Line 1771  BACKSLASH Line 2558  BACKSLASH
2558           \W     any "non-word" character           \W     any "non-word" character
2559    
2560         Each pair of escape sequences partitions the complete set of characters         Each pair of escape sequences partitions the complete set of characters
2561         into  two disjoint sets. Any given character matches one, and only one,         into two disjoint sets. Any given character matches one, and only  one,
2562         of each pair.         of each pair.
2563    
        In UTF-8 mode, characters with values greater than 255 never match  \d,  
        \s, or \w, and always match \D, \S, and \W.  
   
        For  compatibility  with Perl, \s does not match the VT character (code  
        11).  This makes it different from the POSIX "space" class. The  \s
        characters are HT (9), LF (10), FF (12), CR (13), and space (32).  
   
        A  "word" character is any letter or digit or the underscore character,  
        that is, any character which can be part of a Perl "word". The  defini-  
        tion  of  letters  and digits is controlled by PCRE's character tables,  
        and may vary if locale-specific matching is taking place (see  "Locale
        support"  in  the  pcreapi  page).  For  example,  in the "fr" (French)  
        locale, some character codes greater than 128  are  used  for  accented  
        letters, and these are matched by \w.  
   
2564         These character type sequences can appear both inside and outside char-         These character type sequences can appear both inside and outside char-
2565         acter classes. They each match one character of the  appropriate  type.         acter classes. They each match one character of the  appropriate  type.
2566         If  the current matching point is at the end of the subject string, all         If  the current matching point is at the end of the subject string, all
2567         of them fail, since there is no character to match.         of them fail, since there is no character to match.
2568    
2569           For compatibility with Perl, \s does not match the VT  character  (code
2570           11).   This makes it different from the POSIX "space" class. The \s
2571           characters are HT (9), LF (10), FF (12), CR (13), and space (32).
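
       As a small illustration, the set of \s characters listed above can be
       written as a C predicate and compared with the C library's isspace(),
       which in the default "C" locale also accepts VT. The function names
       are hypothetical and are not part of PCRE.

```c
#include <ctype.h>

/* Non-zero for the characters that \s matches according to the list
   above: HT (9), LF (10), FF (12), CR (13), and space (32). */
static int is_perl_space(int c)
{
  return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Non-zero when \s and isspace() disagree about character c
   (c must be in the range 0 to 255). */
static int differs_from_isspace(int c)
{
  return is_perl_space(c) != (isspace(c) != 0);
}
```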
2572    
2573           A "word" character is an underscore or any character less than 256 that
2574           is  a  letter  or  digit.  The definition of letters and digits is con-
2575           trolled by PCRE's low-valued character tables, and may vary if  locale-
2576           specific  matching is taking place (see "Locale support" in the pcreapi
2577           page). For example, in the  "fr_FR"  (French)  locale,  some  character
2578           codes  greater  than  128  are used for accented letters, and these are
2579           matched by \w.
2580    
2581           In UTF-8 mode, characters with values greater than 128 never match  \d,
2582           \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2583           code character property support is available.
2584    
2585       Unicode character properties
2586    
2587           When PCRE is built with Unicode character property support, three addi-
2588           tional  escape sequences to match generic character types are available
2589           when UTF-8 mode is selected. They are:
2590    
2591            \p{xx}   a character with the xx property
2592            \P{xx}   a character without the xx property
2593            \X       an extended Unicode sequence
2594    
2595           The property names represented by xx above are limited to  the  Unicode
2596           general  category properties. Each character has exactly one such prop-
2597           erty, specified by a two-letter abbreviation.  For  compatibility  with
2598           Perl,  negation  can be specified by including a circumflex between the
2599           opening brace and the property name. For example, \p{^Lu} is  the  same
2600           as \P{Lu}.
2601    
2602           If  only  one  letter  is  specified with \p or \P, it includes all the
2603           properties that start with that letter. In this case, in the absence of
2604           negation, the curly brackets in the escape sequence are optional; these
2605           two examples have the same effect:
2606    
2607             \p{L}
2608             \pL
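
The one-letter and negated forms described above can be modelled as a prefix test on the two-letter general category. The sketch below is in Python and uses the standard unicodedata module, whose tables may come from a different Unicode version than PCRE's. The function name has_property is hypothetical.

```python
import unicodedata

def has_property(ch, name):
    # A leading circumflex negates the test, as in \p{^Lu}.
    negate = name.startswith("^")
    if negate:
        name = name[1:]
    # A one-letter name such as "L" covers every category that starts with it;
    # a two-letter name such as "Lu" must equal the category exactly.
    result = unicodedata.category(ch).startswith(name)
    return result != negate

print(has_property("A", "Lu"), has_property("a", "L"), has_property("A", "^Lu"))
```

Treating the one-letter form as a prefix match mirrors the statement that it "includes all the properties that start with that letter".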
2609    
2610           The following property codes are supported:
2611    
2612             C     Other
2613             Cc    Control
2614             Cf    Format
2615             Cn    Unassigned
2616             Co    Private use
2617             Cs    Surrogate
2618    
2619             L     Letter
2620             Ll    Lower case letter
2621             Lm    Modifier letter
2622             Lo    Other letter
2623             Lt    Title case letter
2624             Lu    Upper case letter
2625    
2626             M     Mark
2627             Mc    Spacing mark
2628             Me    Enclosing mark
2629             Mn    Non-spacing mark
2630    
2631             N     Number
2632             Nd    Decimal number
2633             Nl    Letter number
2634             No    Other number
2635    
2636             P     Punctuation
2637             Pc    Connector punctuation
2638             Pd    Dash punctuation
2639             Pe    Close punctuation
2640             Pf    Final punctuation
2641             Pi    Initial punctuation
2642             Po    Other punctuation
2643             Ps    Open punctuation
2644    
2645             S     Symbol
2646             Sc    Currency symbol
2647             Sk    Modifier symbol
2648             Sm    Mathematical symbol
2649             So    Other symbol
2650    
2651             Z     Separator
2652             Zl    Line separator
2653             Zp    Paragraph separator
2654             Zs    Space separator
2655    
2656           Extended properties such as "Greek" or "InMusicalSymbols" are not  sup-
2657           ported by PCRE.
2658    
2659           Specifying  caseless  matching  does not affect these escape sequences.
2660           For example, \p{Lu} always matches only upper case letters.
2661    
2662           The \X escape matches any number of Unicode  characters  that  form  an
2663           extended Unicode sequence. \X is equivalent to
2664    
2665             (?>\PM\pM*)
2666    
2667           That  is,  it matches a character without the "mark" property, followed
2668           by zero or more characters with the "mark"  property,  and  treats  the
2669           sequence  as  an  atomic group (see below).  Characters with the "mark"
2670           property are typically accents that affect the preceding character.
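
The (?>\PM\pM*) equivalence can be sketched procedurally: one character without the "mark" property, then as many mark characters as follow, with no giving back. The Python below is an illustration under those assumptions, not PCRE's implementation. The names is_mark and match_extended_sequence are hypothetical.

```python
import unicodedata

def is_mark(ch):
    # "Mark" is any general category beginning with M (Mc, Me, Mn).
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(subject, pos):
    # Return the end index of \PM\pM* starting at pos, or None on failure.
    if pos >= len(subject) or is_mark(subject[pos]):
        return None          # \PM needs one non-mark character here
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1             # \pM* takes every following mark, atomically
    return end

# "e" followed by U+0301 (combining acute accent), then "a".
subject = "e\u0301a"
print(match_extended_sequence(subject, 0), match_extended_sequence(subject, 2))
```

Because the group is atomic, the sketch never returns a shorter end index than the greedy one.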
2671    
2672           Matching characters by Unicode property is not fast, because  PCRE  has
2673           to  search  a  structure  that  contains data for over fifteen thousand
2674           characters. That is why the traditional escape sequences such as \d and
2675           \w do not use Unicode properties in PCRE.
2676    
2677       Simple assertions
2678    
2679         The fourth use of backslash is for certain simple assertions. An asser-         The fourth use of backslash is for certain simple assertions. An asser-
2680         tion  specifies a condition that has to be met at a particular point in         tion specifies a condition that has to be met at a particular point  in
2681         a match, without consuming any characters from the subject string.  The         a  match, without consuming any characters from the subject string. The
2682         use  of subpatterns for more complicated assertions is described below.         use of subpatterns for more complicated assertions is described  below.
2683         The backslashed assertions are         The backslashed assertions are:
2684    
2685           \b     matches at a word boundary           \b     matches at a word boundary
2686           \B     matches when not at a word boundary           \B     matches when not at a word boundary
# Line 1807  BACKSLASH Line 2689  BACKSLASH
2689           \z     matches at end of subject           \z     matches at end of subject
2690           \G     matches at first matching position in subject           \G     matches at first matching position in subject
2691    
2692         These assertions may not appear in character classes (but note that  \b         These  assertions may not appear in character classes (but note that \b
2693         has a different meaning, namely the backspace character, inside a char-         has a different meaning, namely the backspace character, inside a char-
2694         acter class).         acter class).
2695    
2696         A word boundary is a position in the subject string where  the  current         A  word  boundary is a position in the subject string where the current
2697         character  and  the previous character do not both match \w or \W (i.e.         character and the previous character do not both match \w or  \W  (i.e.
2698         one matches \w and the other matches \W), or the start or  end  of  the         one  matches  \w  and the other matches \W), or the start or end of the
2699         string if the first or last character matches \w, respectively.         string if the first or last character matches \w, respectively.
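
The word boundary definition above can be turned into a direct test on the characters either side of a position. The sketch is in Python, with ASCII-only \w standing in for PCRE's default character tables. The helper names are hypothetical.

```python
import re

WORD = re.compile(r"\w", re.ASCII)

def is_word(ch):
    return WORD.fullmatch(ch) is not None

def is_word_boundary(subject, pos):
    # Outside the string counts as "not a word character", which gives the
    # start and end rules for free.
    before = pos > 0 and is_word(subject[pos - 1])
    after = pos < len(subject) and is_word(subject[pos])
    return before != after

def boundaries(subject):
    return [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]

print(boundaries("ab cd"))
```

For the same subject, the positions agree with those where Python's own \b assertion matches.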
2700    
2701         The  \A,  \Z,  and \z assertions differ from the traditional circumflex         The \A, \Z, and \z assertions differ from  the  traditional  circumflex
2702         and dollar (described below) in that they only ever match at  the  very         and dollar (described in the next section) in that they only ever match
2703         start  and  end  of the subject string, whatever options are set. Thus,         at the very start and end of the subject string, whatever  options  are
2704         they are independent of multiline mode.         set.  Thus,  they are independent of multiline mode. These three asser-
2705           tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
2706         They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the         affect  only the behaviour of the circumflex and dollar metacharacters.
2707         startoffset argument of pcre_exec() is non-zero, indicating that match-         However, if the startoffset argument of pcre_exec() is non-zero,  indi-
2708         ing is to start at a point other than the beginning of the subject,  \A         cating that matching is to start at a point other than the beginning of
2709         can  never  match.  The difference between \Z and \z is that \Z matches         the subject, \A can never match. The difference between \Z  and  \z  is
2710         before a newline that is the last character of the string as well as at         that  \Z  matches  before  a  newline that is the last character of the
2711         the end of the string, whereas \z matches only at the end.         string as well as at the end of the string, whereas \z matches only  at
2712           the end.
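
The two end-of-subject behaviours can be seen in a short Python sketch, with one caveat about spelling. Python's \Z matches only at the very end, so it corresponds to PCRE's \z. Python's $ outside multiline mode also matches before a final newline, so it corresponds to PCRE's \Z. The sketch illustrates the semantics only and does not call PCRE.

```python
import re

subject = "abc\nabc\n"

# \A matches only at the very start, even when multiline mode is on.
starts_A = [m.start() for m in re.finditer(r"\Aabc", subject, re.MULTILINE)]
# ^ in multiline mode also matches after the internal newline.
starts_caret = [m.start() for m in re.finditer(r"^abc", subject, re.MULTILINE)]

# Very end only (PCRE \z, spelled \Z in Python): the final newline blocks it.
strict_end = re.search(r"abc\Z", subject)
# End, or just before a final newline (PCRE \Z, non-multiline $ in Python).
loose_end = re.search(r"abc$", subject)

print(starts_A, starts_caret, strict_end, loose_end.start())
```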
2713    
2714         The  \G assertion is true only when the current matching position is at         The  \G assertion is true only when the current matching position is at
2715         the start point of the match, as specified by the startoffset  argument         the start point of the match, as specified by the startoffset  argument
# Line 1849  BACKSLASH Line 2732  BACKSLASH
2732  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
2733    
2734         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
2735         character  is  an  assertion which is true only if the current matching         character  is  an  assertion  that is true only if the current matching
2736         point is at the start of the subject string. If the  startoffset  argu-         point is at the start of the subject string. If the  startoffset  argu-
2737         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
2738         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
# Line 1863  CIRCUMFLEX AND DOLLAR Line 2746  CIRCUMFLEX AND DOLLAR
2746         ject, it is said to be an "anchored" pattern.  (There  are  also  other         ject, it is said to be an "anchored" pattern.  (There  are  also  other
2747         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
2748    
2749         A  dollar  character  is an assertion which is true only if the current         A  dollar  character  is  an assertion that is true only if the current
2750         matching point is at the end of  the  subject  string,  or  immediately         matching point is at the end of  the  subject  string,  or  immediately
2751         before a newline character that is the last character in the string (by         before a newline character that is the last character in the string (by
2752         default). Dollar need not be the last character of  the  pattern  if  a         default). Dollar need not be the last character of  the  pattern  if  a
# Line 1880  CIRCUMFLEX AND DOLLAR Line 2763  CIRCUMFLEX AND DOLLAR
2763         ately  after  and  immediately  before  an  internal newline character,         ately  after  and  immediately  before  an  internal newline character,
2764         respectively, in addition to matching at the start and end of the  sub-         respectively, in addition to matching at the start and end of the  sub-
2765         ject  string.  For  example,  the  pattern  /^abc$/ matches the subject         ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
2766         string "def\nabc" in multiline mode, but not  otherwise.  Consequently,         string "def\nabc" (where \n represents a newline character)  in  multi-
2767         patterns  that  are  anchored  in single line mode because all branches         line mode, but not otherwise.  Consequently, patterns that are anchored
2768         start with ^ are not anchored in multiline mode, and a match  for  cir-         in single line mode because all branches start with ^ are not  anchored
2769         cumflex  is  possible  when  the startoffset argument of pcre_exec() is         in  multiline  mode,  and  a  match for circumflex is possible when the
2770         non-zero. The PCRE_DOLLAR_ENDONLY option is ignored  if  PCRE_MULTILINE         startoffset  argument  of  pcre_exec()  is  non-zero.   The   PCRE_DOL-
2771         is set.         LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
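
The /^abc$/ example and the startoffset remark can be reproduced with a sketch in Python. Its re.MULTILINE flag plays the role of PCRE_MULTILINE, and the pos argument of a compiled pattern's search() is assumed here to be a reasonable analogue of startoffset.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the start of the subject, so no match.
single_line = re.search(r"^abc$", subject)
# Multiline mode: ^ also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# Starting part-way through: ^ cannot match in default mode...
offset_single = re.compile(r"^abc").search("defabc", 3)
# ...but can in multiline mode when the start point follows a newline.
offset_multi = re.compile(r"^abc", re.MULTILINE).search(subject, 4)

print(single_line, multi_line.start(), offset_single, offset_multi.start())
```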
2772    
2773         Note  that  the sequences \A, \Z, and \z can be used to match the start         Note  that  the sequences \A, \Z, and \z can be used to match the start
2774         and end of the subject in both modes, and if all branches of a  pattern         and end of the subject in both modes, and if all branches of a  pattern
# Line 1898  FULL STOP (PERIOD, DOT) Line 2781  FULL STOP (PERIOD, DOT)
2781         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
2782         ter  in  the  subject,  including a non-printing character, but not (by         ter  in  the  subject,  including a non-printing character, but not (by
2783         default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,         default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
2784         which  might  be  more than one byte long, except (by default) for new-         which might be more than one byte long, except (by default) newline. If
2785         line. If the PCRE_DOTALL option is set, dots match  newlines  as  well.         the PCRE_DOTALL option is set, dots match newlines as  well.  The  han-
2786         The  handling of dot is entirely independent of the handling of circum-         dling  of dot is entirely independent of the handling of circumflex and
2787         flex and dollar, the only relationship being  that  they  both  involve         dollar, the only relationship being  that  they  both  involve  newline
2788         newline characters. Dot has no special meaning in a character class.         characters. Dot has no special meaning in a character class.
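
A brief sketch of the dot rules in Python, whose re.DOTALL flag is used here as a stand-in for PCRE_DOTALL:

```python
import re

# By default a dot does not match a newline.
no_dotall = re.search(r"a.c", "a\nc")
# With the dotall option it does.
with_dotall = re.search(r"a.c", "a\nc", re.DOTALL)
# Non-printing characters other than newline are matched in either mode.
nonprinting = re.search(r"a.c", "a\x00c")
# Inside a character class a dot is an ordinary character.
dot_in_class_x = re.fullmatch(r"[.]", "x")
dot_in_class_dot = re.fullmatch(r"[.]", ".")

print(no_dotall, with_dotall is not None, nonprinting is not None)
```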
2789    
2790    
2791  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
2792    
2793         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
2794         both in and out of UTF-8 mode. Unlike a dot, it always matches  a  new-         both in and out of UTF-8 mode. Unlike a dot, it can  match  a  newline.
2795         line.  The  feature  is  provided  in Perl in order to match individual         The  feature  is provided in Perl in order to match individual bytes in
2796         bytes in UTF-8 mode.  Because it breaks up UTF-8 characters into  indi-         UTF-8 mode. Because it  breaks  up  UTF-8  characters  into  individual
2797         vidual  bytes,  what  remains  in  the  string may be a malformed UTF-8         bytes,  what remains in the string may be a malformed UTF-8 string. For
2798         string. For this reason it is best avoided.         this reason, the \C escape sequence is best avoided.
2799    
2800         PCRE does not allow \C to appear in lookbehind assertions (see  below),         PCRE does not allow \C to appear in  lookbehind  assertions  (described
2801         because in UTF-8 mode it makes it impossible to calculate the length of         below),  because  in UTF-8 mode this would make it impossible to calcu-
2802         the lookbehind.         late the length of the lookbehind.
2803    
2804    
2805  SQUARE BRACKETS  SQUARE BRACKETS AND CHARACTER CLASSES
2806    
2807         An opening square bracket introduces a character class, terminated by a         An opening square bracket introduces a character class, terminated by a
2808         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
# Line 1938  SQUARE BRACKETS Line 2821  SQUARE BRACKETS
2821         For example, the character class [aeiou] matches any lower case  vowel,         For example, the character class [aeiou] matches any lower case  vowel,
2822         while  [^aeiou]  matches  any character that is not a lower case vowel.         while  [^aeiou]  matches  any character that is not a lower case vowel.
2823         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
2824         characters which are in the class by enumerating those that are not. It         characters  that  are in the class by enumerating those that are not. A
2825         is not an assertion: it still consumes a  character  from  the  subject         class that starts with a circumflex is not an assertion: it still  con-
2826         string, and fails if the current pointer is at the end of the string.         sumes  a  character  from the subject string, and therefore it fails if
2827           the current pointer is at the end of the string.
2828    
2829         In  UTF-8 mode, characters with values greater than 255 can be included         In UTF-8 mode, characters with values greater than 255 can be  included
2830         in a class as a literal string of bytes, or by using the  \x{  escaping         in  a  class as a literal string of bytes, or by using the \x{ escaping
2831         mechanism.         mechanism.
2832    
2833         When  caseless  matching  is set, any letters in a class represent both         When caseless matching is set, any letters in a  class  represent  both
2834         their upper case and lower case versions, so for  example,  a  caseless         their  upper  case  and lower case versions, so for example, a caseless
2835         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
2836         match "A", whereas a caseful version would. PCRE does not  support  the         match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
2837         concept of case for characters with values greater than 255.         understands the concept of case for characters whose  values  are  less
2838           than  128, so caseless matching is always possible. For characters with
2839           higher values, the concept of case is supported  if  PCRE  is  compiled
2840           with  Unicode  property support, but not otherwise.  If you want to use
2841           caseless matching for characters 128 and above, you  must  ensure  that
2842           PCRE  is  compiled  with Unicode property support as well as with UTF-8
2843           support.
2844    
2845         The  newline character is never treated in any special way in character         The newline character is never treated in any special way in  character
2846         classes, whatever the setting  of  the  PCRE_DOTALL  or  PCRE_MULTILINE         classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE
2847         options is. A class such as [^a] will always match a newline.         options is. A class such as [^a] will always match a newline.
2848    
2849         The  minus (hyphen) character can be used to specify a range of charac-         The minus (hyphen) character can be used to specify a range of  charac-
2850         ters in a character  class.  For  example,  [d-m]  matches  any  letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
2851         between  d  and  m,  inclusive.  If  a minus character is required in a         between d and m, inclusive. If a  minus  character  is  required  in  a
2852         class, it must be escaped with a backslash  or  appear  in  a  position         class,  it  must  be  escaped  with a backslash or appear in a position
2853         where  it cannot be interpreted as indicating a range, typically as the         where it cannot be interpreted as indicating a range, typically as  the
2854         first or last character in the class.         first or last character in the class.
2855    
2856         It is not possible to have the literal character "]" as the end charac-         It is not possible to have the literal character "]" as the end charac-
2857         ter  of a range. A pattern such as [W-]46] is interpreted as a class of         ter of a range. A pattern such as [W-]46] is interpreted as a class  of
2858         two characters ("W" and "-") followed by a literal string "46]", so  it         two  characters ("W" and "-") followed by a literal string "46]", so it
2859         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a         would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
2860         backslash it is interpreted as the end of range, so [W-\]46] is  inter-         backslash  it is interpreted as the end of range, so [W-\]46] is inter-
2861         preted  as  a  single class containing a range followed by two separate         preted as a class containing a range followed by two other  characters.
2862         characters. The octal or hexadecimal representation of "]" can also  be         The  octal or hexadecimal representation of "]" can also be used to end
2863         used to end a range.         a range.
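
The two readings of "]" after a hyphen can be compared with a sketch in Python. Its re module parses these two patterns the way the text describes, but it is used only as an illustration and is not PCRE.

```python
import re

# Unescaped: the class is [W-] (two characters), then the literal "46]".
unescaped = re.compile(r"[W-]46]")
# Escaped: one class holding the range W-] plus the characters 4 and 6.
escaped = re.compile(r"[W-\]46]")

unescaped_hits = [s for s in ("W46]", "-46]", "X46]", "4") if unescaped.fullmatch(s)]
escaped_hits = [c for c in "VWXYZ[\\]^-456" if escaped.fullmatch(c)]

print(unescaped_hits)
print(escaped_hits)
```

The range W-] covers the character codes 87 to 93, which is why "X", "Y", "Z", "[" and "\" match the escaped form.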
2864    
2865         Ranges  operate in the collating sequence of character values. They can         Ranges operate in the collating sequence of character values. They  can
2866         also  be  used  for  characters  specified  numerically,  for   example         also   be  used  for  characters  specified  numerically,  for  example
2867         [\000-\037].  In UTF-8 mode, ranges can include characters whose values         [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
2868         are greater than 255, for example [\x{100}-\x{2ff}].         are greater than 255, for example [\x{100}-\x{2ff}].
2869    
2870         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
2871         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
2872         to [][\^_`wxyzabc], matched caselessly, and if character tables for the         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
2873         "fr"  locale  are  in use, [\xc8-\xcb] matches accented E characters in         character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
2874         both cases.         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
2875           concept of case for characters with values greater than 128  only  when
2876         The character types \d, \D, \s, \S, \w, and \W may  also  appear  in  a         it is compiled with Unicode property support.
2877         character  class,  and add the characters that they match to the class.  
2878         For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can         The  character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
2879         conveniently  be  used with the upper case character types to specify a         in a character class, and add the characters that  they  match  to  the
2880         more restricted set of characters than the matching  lower  case  type.         class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
2881         For  example,  the  class  [^\W_]  matches any letter or digit, but not         flex can conveniently be used with the upper case  character  types  to
2882         underscore.         specify  a  more  restricted  set of characters than the matching lower
2883           case type. For example, the class [^\W_] matches any letter  or  digit,
2884         All non-alphameric characters other than \, -, ^ (at the start) and the         but not underscore.
2885         terminating ] are non-special in character classes, but it does no harm  
2886         if they are escaped.         The  only  metacharacters  that are recognized in character classes are
2887           backslash, hyphen (only where it can be  interpreted  as  specifying  a
2888           range),  circumflex  (only  at the start), opening square bracket (only
2889           when it can be interpreted as introducing a POSIX class name - see  the
2890           next  section),  and  the  terminating closing square bracket. However,
2891           escaping other non-alphanumeric characters does no harm.
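
The class examples above, [\dABCDEF] and [^\W_], can be tried with a sketch in Python. The re.ASCII flag keeps \d and \W to 7-bit characters, as an approximation of PCRE's default tables.

```python
import re

# Character types add their members to the class.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
# "Not a non-word character, and not underscore": letters and digits only.
alnum = re.compile(r"[^\W_]", re.ASCII)
# Escaping a non-alphanumeric character that is not special does no harm.
harmless = re.compile(r"[\%\+]")

hex_hits = [c for c in "7CcG" if hex_digit.fullmatch(c)]
alnum_hits = [c for c in "aZ0_-" if alnum.fullmatch(c)]

print(hex_hits, alnum_hits, harmless.fullmatch("%") is not None)
```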
2892    
2893    
2894  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
2895    
2896         Perl supports the POSIX notation  for  character  classes,  which  uses         Perl supports the POSIX notation for character classes. This uses names
2897         names  enclosed by [: and :] within the enclosing square brackets. PCRE         enclosed  by  [: and :] within the enclosing square brackets. PCRE also
2898         also supports this notation. For example,         supports this notation. For example,
2899    
2900           [01[:alpha:]%]           [01[:alpha:]%]
2901    
# Line 2037  POSIX CHARACTER CLASSES Line 2932  POSIX CHARACTER CLASSES
2932         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
2933         these are not supported, and an error is given if they are encountered.         these are not supported, and an error is given if they are encountered.
2934    
2935         In UTF-8 mode, characters with values greater than 255 do not match any         In UTF-8 mode, characters with values greater than 128 do not match any
2936         of the POSIX character classes.         of the POSIX character classes.
2937    
2938    
# Line 2104  INTERNAL OPTION SETTING Line 2999  INTERNAL OPTION SETTING
2999         in the same way as the Perl-compatible options by using the  characters         in the same way as the Perl-compatible options by using the  characters
3000         U  and X respectively. The (?X) flag setting is special in that it must         U  and X respectively. The (?X) flag setting is special in that it must
3001         always occur earlier in the pattern than any of the additional features         always occur earlier in the pattern than any of the additional features
3002         it turns on, even when it is at top level. It is best put at the start.         it  turns on, even when it is at top level. It is best to put it at the
3003           start.
3004    
3005    
3006  SUBPATTERNS  SUBPATTERNS
3007    
3008         Subpatterns are delimited by parentheses (round brackets), which can be         Subpatterns are delimited by parentheses (round brackets), which can be
3009         nested.  Marking part of a pattern as a subpattern does two things:         nested.  Turning part of a pattern into a subpattern does two things:
3010    
3011         1. It localizes a set of alternatives. For example, the pattern         1. It localizes a set of alternatives. For example, the pattern
3012    
# Line 2120  SUBPATTERNS Line 3016  SUBPATTERNS
3016         the parentheses, it would match "cataract",  "erpillar"  or  the  empty         the parentheses, it would match "cataract",  "erpillar"  or  the  empty
3017         string.         string.
3018    
3019         2.  It  sets  up  the  subpattern as a capturing subpattern (as defined         2.  It  sets  up  the  subpattern as a capturing subpattern. This means
3020         above).  When the whole pattern matches, that portion  of  the  subject         that, when the whole pattern  matches,  that  portion  of  the  subject
3021         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
3022         ovector argument of pcre_exec(). Opening parentheses are  counted  from         ovector argument of pcre_exec(). Opening parentheses are  counted  from
3023         left  to right (starting from 1) to obtain the numbers of the capturing         left  to  right  (starting  from 1) to obtain numbers for the capturing
3024         subpatterns.         subpatterns.
3025    
3026         For example, if the string "the red king" is matched against  the  pat-         For example, if the string "the red king" is matched against  the  pat-
# Line 2169  NAMED SUBPATTERNS Line 3065  NAMED SUBPATTERNS
3065         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying  capturing  parentheses  by number is simple, but it can be
3066         very hard to keep track of the numbers in complicated  regular  expres-         very hard to keep track of the numbers in complicated  regular  expres-
3067         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions.  Furthermore,  if  an  expression  is  modified, the numbers may
3068         change. To help with the difficulty, PCRE supports the naming  of  sub-         change. To help with this difficulty, PCRE supports the naming of  sub-
3069         patterns,  something  that  Perl  does  not  provide. The Python syntax         patterns,  something  that  Perl  does  not  provide. The Python syntax
3070         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
3071         underscores, and must be unique within a pattern.         underscores, and must be unique within a pattern.
3072    
3073         Named  capturing  parentheses  are  still  allocated numbers as well as         Named  capturing  parentheses  are  still  allocated numbers as well as
3074         names. The PCRE API provides function calls for extracting the name-to-         names. The PCRE API provides function calls for extracting the name-to-
3075         number  translation  table from a compiled pattern. For further details         number  translation table from a compiled pattern. There is also a con-
3076         see the pcreapi documentation.         venience function for extracting a captured substring by name. For fur-
3077           ther details see the pcreapi documentation.
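
Since the (?P<name>...) syntax comes from Python, the name-to-number relationship is easy to show there. The sketch uses Python's re module, not the PCRE API. The groupindex attribute plays a role similar to the name-to-number table mentioned above, and the pattern and group names are invented for the illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups are still numbered from 1 by opening parenthesis.
name_to_number = dict(date.groupindex)

m = date.match("2007-02")
by_name = m.group("year")
by_number = m.group(name_to_number["year"])

print(name_to_number, by_name, by_number)
```

Looking groups up by name keeps the calling code stable if the expression is later edited and the numbers shift.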
3078    
3079    
3080  REPETITION  REPETITION
3081    
3082         Repetition is specified by quantifiers, which can  follow  any  of  the         Repetition  is  specified  by  quantifiers, which can follow any of the
3083         following items:         following items:
3084    
3085           a literal data character           a literal data character
3086           the . metacharacter           the . metacharacter
3087           the \C escape sequence           the \C escape sequence
3088           escapes such as \d that match single characters           the \X escape sequence (in UTF-8 mode with Unicode properties)
3089             an escape such as \d that matches a single character
3090           a character class           a character class
3091           a back reference (see next section)           a back reference (see next section)
3092           a parenthesized subpattern (unless it is an assertion)           a parenthesized subpattern (unless it is an assertion)
3093    
3094         The  general repetition quantifier specifies a minimum and maximum num-         The general repetition quantifier specifies a minimum and maximum  num-
3095         ber of permitted matches, by giving the two numbers in  curly  brackets         ber  of  permitted matches, by giving the two numbers in curly brackets
3096         (braces),  separated  by  a comma. The numbers must be less than 65536,         (braces), separated by a comma. The numbers must be  less  than  65536,
3097         and the first must be less than or equal to the second. For example:         and the first must be less than or equal to the second. For example:
3098    
3099           z{2,4}           z{2,4}
3100    
3101         matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a         matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
3102         special  character.  If  the second number is omitted, but the comma is         special character. If the second number is omitted, but  the  comma  is
3103         present, there is no upper limit; if the second number  and  the  comma         present,  there  is  no upper limit; if the second number and the comma
3104         are  both omitted, the quantifier specifies an exact number of required         are both omitted, the quantifier specifies an exact number of  required
3105         matches. Thus         matches. Thus
3106    
3107           [aeiou]{3,}           [aeiou]{3,}
# Line 2212  REPETITION Line 3110  REPETITION
3110    
3111           \d{8}           \d{8}
3112    
3113         matches exactly 8 digits. An opening curly bracket that  appears  in  a         matches  exactly  8  digits. An opening curly bracket that appears in a
3114         position  where a quantifier is not allowed, or one that does not match         position where a quantifier is not allowed, or one that does not  match
3115         the syntax of a quantifier, is taken as a literal character. For  exam-         the  syntax of a quantifier, is taken as a literal character. For exam-
3116         ple, {,6} is not a quantifier, but a literal string of four characters.         ple, {,6} is not a quantifier, but a literal string of four characters.
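
The three quantifier forms described above can be tried out in a short sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption: the two engines agree on these forms, but Python treats {,6} as {0,6} rather than as a literal string, so that case is left out. The helper name full is made up for the illustration.

```python
import re


def full(pattern, subject):
    """Return True if the whole subject string matches the pattern."""
    return re.fullmatch(pattern, subject) is not None


# z{2,4}: at least two and at most four z's.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# [aeiou]{3,}: three or more vowels, with no upper limit.
open_ended = [s for s in ("ae", "aei", "aeiouaeiou") if full(r"[aeiou]{3,}", s)]

# \d{8}: exactly eight digits.
exact = [s for s in ("1234567", "12345678", "123456789") if full(r"\d{8}", s)]

print(bounded)
print(open_ended)
print(exact)
```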
3117    
3118         In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to         In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
3119         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
3120         acters, each of which is represented by a two-byte sequence.         acters, each of which is represented by a two-byte sequence. Similarly,
3121           when Unicode property support is available, \X{3} matches three Unicode
3122           extended  sequences,  each of which may be several bytes long (and they
3123           may be of different lengths).
3124    
3125         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
3126         the previous item and the quantifier were not present.         the previous item and the quantifier were not present.
# Line 2247  REPETITION Line 3148  REPETITION
3148         as  possible  (up  to  the  maximum number of permitted times), without         as  possible  (up  to  the  maximum number of permitted times), without
3149         causing the rest of the pattern to fail. The classic example  of  where         causing the rest of the pattern to fail. The classic example  of  where
3150         this gives problems is in trying to match comments in C programs. These         this gives problems is in trying to match comments in C programs. These
3151         appear between the sequences /* and */ and within the  sequence,  indi-         appear between /* and */ and within the comment,  individual  *  and  /
3152         vidual * and / characters may appear. An attempt to match C comments by         characters  may  appear. An attempt to match C comments by applying the
3153         applying the pattern         pattern
3154    
3155           /\*.*\*/           /\*.*\*/
3156    
3157         to the string         to the string
3158    
3159           /* first command */  not comment  /* second comment */           /* first comment */  not comment  /* second comment */
3160    
3161         fails, because it matches the entire string owing to the greediness  of         fails, because it matches the entire string owing to the greediness  of
3162         the .*  item.         the .*  item.
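
A minimal sketch of the greedy behaviour, using Python's re module as a stand-in for PCRE (an assumption; both engines treat .* as greedy by default). The second pattern adds a question mark after the quantifier to make it minimizing, which is the usual remedy for this problem.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the end of the subject and backs off only as far as
# the last */, so the single match covers the entire string.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Minimizing: .*? stops at the first */ it can, so each comment is found
# separately.
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy == subject)
print(lazy)
```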
# Line 2283  REPETITION Line 3184  REPETITION
3184         words, it inverts the default behaviour.         words, it inverts the default behaviour.
3185    
3186         When  a  parenthesized  subpattern  is quantified with a minimum repeat         When  a  parenthesized  subpattern  is quantified with a minimum repeat
3187         count that is greater than 1 or with a limited maximum, more  store  is         count that is greater than 1 or with a limited maximum, more memory  is
3188         required  for  the  compiled  pattern, in proportion to the size of the         required  for  the  compiled  pattern, in proportion to the size of the
3189         minimum or maximum.         minimum or maximum.
3190    
# Line 2374  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3275  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3275         consists  of  an  additional  + character following a quantifier. Using         consists  of  an  additional  + character following a quantifier. Using
3276         this notation, the previous example can be rewritten as         this notation, the previous example can be rewritten as
3277    
3278           \d++bar           \d++foo
3279    
3280         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Possessive  quantifiers  are  always  greedy;  the   setting   of   the
3281         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
# Line 2399  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3300  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3300           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3301    
3302         it takes a long time before reporting  failure.  This  is  because  the         it takes a long time before reporting  failure.  This  is  because  the
3303         string  can  be  divided  between  the two repeats in a large number of         string  can be divided between the internal \D+ repeat and the external
3304         ways, and all have to be tried. (The example used [!?]  rather  than  a         * repeat in a large number of ways, and all  have  to  be  tried.  (The
3305         single  character  at the end, because both PCRE and Perl have an opti-         example  uses  [!?]  rather than a single character at the end, because
3306         mization that allows for fast failure when a single character is  used.         both PCRE and Perl have an optimization that allows  for  fast  failure
3307         They  remember  the last single character that is required for a match,         when  a single character is used. They remember the last single charac-
3308         and fail early if it is not present in the string.)  If the pattern  is         ter that is required for a match, and fail early if it is  not  present
3309         changed to         in  the  string.)  If  the pattern is changed so that it uses an atomic
3310           group, like this:
3311    
3312           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
3313    
3314         sequences  of non-digits cannot be broken, and failure happens quickly.         sequences of non-digits cannot be broken, and failure happens  quickly.
3315    
3316    
3317  BACK REFERENCES  BACK REFERENCES
3318    
3319         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
3320         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
3321         pattern earlier (that is, to its left) in the pattern,  provided  there         pattern  earlier  (that is, to its left) in the pattern, provided there
3322         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
3323    
3324         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
3325         it is always taken as a back reference, and causes  an  error  only  if         it  is  always  taken  as a back reference, and causes an error only if
3326         there  are  not that many capturing left parentheses in the entire pat-         there are not that many capturing left parentheses in the  entire  pat-
3327         tern. In other words, the parentheses that are referenced need  not  be         tern.  In  other words, the parentheses that are referenced need not be
3328         to  the left of the reference for numbers less than 10. See the section         to the left of the reference for numbers less than 10. See the  subsec-
3329         entitled "Backslash" above for further details of the handling of  dig-         tion  entitled  "Non-printing  characters" above for further details of
3330         its following a backslash.         the handling of digits following a backslash.
3331    
3332         A  back  reference matches whatever actually matched the capturing sub-         A back reference matches whatever actually matched the  capturing  sub-
3333         pattern in the current subject string, rather  than  anything  matching         pattern  in  the  current subject string, rather than anything matching
3334         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
3335         of doing that). So the pattern         of doing that). So the pattern
3336    
3337           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
3338    
3339         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
3340         not  "sense and responsibility". If caseful matching is in force at the         not "sense and responsibility". If caseful matching is in force at  the
3341         time of the back reference, the case of letters is relevant. For  exam-         time  of the back reference, the case of letters is relevant. For exam-
3342         ple,         ple,
3343    
3344           ((?i)rah)\s+\1           ((?i)rah)\s+\1
3345    
3346         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
3347         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
3348    
3349         Back references to named subpatterns use the Python  syntax  (?P=name).         Back  references  to named subpatterns use the Python syntax (?P=name).
3350         We could rewrite the above example as follows:         We could rewrite the above example as follows:
3351    
3352           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
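
The back reference examples above can be checked with a short sketch. It uses Python's re module rather than PCRE, which is an assumption. Python spells the named group (?P<p1>...) and needs the scoped form (?i:rah) for a case setting that applies only inside the group. The variable names are made up for the illustration.

```python
import re

# A back reference repeats the text that the group matched, not the
# group's pattern.
sense = re.compile(r"(sens|respons)e and \1ibility")
subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
matched = [s for s in subjects if sense.fullmatch(s)]

# The group is matched caselessly, but the back reference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {s: rah.fullmatch(s) is not None
               for s in ("rah rah", "RAH RAH", "RAH rah")}

# The same pattern with a named group and a named back reference.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
named_results = {s: named.fullmatch(s) is not None
                 for s in ("rah rah", "RAH RAH", "RAH rah")}

print(matched)
print(rah_results)
print(named_results)
```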
3353    
3354         There  may be more than one back reference to the same subpattern. If a         There may be more than one back reference to the same subpattern. If  a
3355         subpattern has not actually been used in a particular match,  any  back         subpattern  has  not actually been used in a particular match, any back
3356         references to it always fail. For example, the pattern         references to it always fail. For example, the pattern
3357    
3358           (a|(bc))\2           (a|(bc))\2
3359    
3360         always  fails if it starts to match "a" rather than "bc". Because there         always fails if it starts to match "a" rather than "bc". Because  there
3361         may be many capturing parentheses in a pattern,  all  digits  following         may  be  many  capturing parentheses in a pattern, all digits following
3362         the  backslash  are taken as part of a potential back reference number.         the backslash are taken as part of a potential back  reference  number.
3363         If the pattern continues with a digit character, some delimiter must be         If the pattern continues with a digit character, some delimiter must be
3364         used  to  terminate  the back reference. If the PCRE_EXTENDED option is         used to terminate the back reference. If the  PCRE_EXTENDED  option  is
3365         set, this can be whitespace.  Otherwise an empty comment can be used.         set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
3366           ments" below) can be used.
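
A sketch of the two points in the preceding paragraph, using Python's re module as a stand-in for PCRE (an assumption): a reference to a group that did not take part in the match fails, and an empty comment separates a back reference from a following digit.

```python
import re

unset = re.compile(r"(a|(bc))\2")

# When the "bc" alternative is taken, group 2 is set and \2 can match.
took_bc = unset.fullmatch("bcbc") is not None

# When the "a" alternative is taken, group 2 is unset, so \2 fails and
# no match is found anywhere in the subject.
took_a = unset.search("abc") is not None

# The empty comment (?#) ends the back reference, so the 0 that follows
# is a literal digit and not part of the reference number.
delimited = re.fullmatch(r"(a)\1(?#)0", "aa0") is not None

print(took_bc, took_a, delimited)
```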
3367    
3368         A back reference that occurs inside the parentheses to which it  refers         A back reference that occurs inside the parentheses to which it  refers
3369         fails  when  the subpattern is first used, so, for example, (a\1) never         fails  when  the subpattern is first used, so, for example, (a\1) never
# Line 2482  ASSERTIONS Line 3385  ASSERTIONS
3385         An assertion is a test on the characters  following  or  preceding  the         An assertion is a test on the characters  following  or  preceding  the
3386         current  matching  point that does not actually consume any characters.         current  matching  point that does not actually consume any characters.
3387         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
3388         described above.  More complicated assertions are coded as subpatterns.         described above.
3389         There are two kinds: those that look ahead of the current  position  in  
3390         the subject string, and those that look behind it.         More  complicated  assertions  are  coded as subpatterns. There are two
3391           kinds: those that look ahead of the current  position  in  the  subject
3392         An  assertion  subpattern  is matched in the normal way, except that it         string,  and  those  that  look  behind  it. An assertion subpattern is
3393         does not cause the current matching position to be  changed.  Lookahead         matched in the normal way, except that it does not  cause  the  current
3394         assertions  start with (?= for positive assertions and (?! for negative         matching position to be changed.
3395         assertions. For example,  
3396           Assertion  subpatterns  are  not  capturing subpatterns, and may not be
3397           repeated, because it makes no sense to assert the  same  thing  several
3398           times.  If  any kind of assertion contains capturing subpatterns within
3399           it, these are counted for the purposes of numbering the capturing  sub-
3400           patterns in the whole pattern.  However, substring capturing is carried
3401           out only for positive assertions, because it does not  make  sense  for
3402           negative assertions.
3403    
3404       Lookahead assertions
3405    
3406           Lookahead assertions start with (?= for positive assertions and (?! for
3407           negative assertions. For example,
3408    
3409           \w+(?=;)           \w+(?=;)
3410    
# Line 2506  ASSERTIONS Line 3421  ASSERTIONS
3421         does not find an occurrence of "bar"  that  is  preceded  by  something         does not find an occurrence of "bar"  that  is  preceded  by  something
3422         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other  than "foo"; it finds any occurrence of "bar" whatsoever, because
3423         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
3424         "bar". A lookbehind assertion is needed to achieve this effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
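
The difference between the two kinds of assertion can be seen in a small sketch. Python's re module is used in place of PCRE, which is an assumption; the subject strings are made up for the illustration.

```python
import re

# A positive lookahead tests for the semicolon without consuming it.
word = re.search(r"\w+(?=;)", "int x;").group(0)

# (?!foo)bar looks ahead from the point where "bar" starts, so the
# assertion is always true there and "bar" is found even after "foo".
ahead = re.search(r"(?!foo)bar", "foobar")

# A negative lookbehind tests what precedes "bar" instead.
behind_after_foo = re.search(r"(?<!foo)bar", "foobar")
behind_after_baz = re.search(r"(?<!foo)bar", "bazbar")

print(word)
print(ahead is not None, behind_after_foo is not None,
      behind_after_baz is not None)
```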
3425    
3426         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
3427         most convenient way to do it is  with  (?!)  because  an  empty  string         most convenient way to do it is  with  (?!)  because  an  empty  string
3428         always  matches, so an assertion that requires there not to be an empty         always  matches, so an assertion that requires there not to be an empty
3429         string must always fail.         string must always fail.
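
A brief sketch of (?!) as a forced failure, again with Python's re module standing in for PCRE (an assumption):

```python
import re

# An empty negative lookahead can never succeed, whatever the subject.
never = re.compile(r"(?!)")
results = [never.search(s) for s in ("", "abc", "foo bar")]

# Placed in one alternative, it forces that alternative to fail, so only
# the other one can produce a match.
forced = re.search(r"a(?!)|b", "ab")

print(results)
print(forced.group(0), forced.span())
```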
3430    
3431       Lookbehind assertions
3432    
3433         Lookbehind assertions start with (?<= for positive assertions and  (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
3434         for negative assertions. For example,         for negative assertions. For example,
3435    
# Line 2551  ASSERTIONS Line 3468  ASSERTIONS
3468    
3469         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
3470         mode)  to appear in lookbehind assertions, because it makes it impossi-         mode)  to appear in lookbehind assertions, because it makes it impossi-
3471         ble to calculate the length of the lookbehind.         ble to calculate the length of the lookbehind. The \X escape, which can
3472           match different numbers of bytes, is also not permitted.
3473    
3474         Atomic groups can be used in conjunction with lookbehind assertions  to         Atomic  groups can be used in conjunction with lookbehind assertions to
3475         specify efficient matching at the end of the subject string. Consider a         specify efficient matching at the end of the subject string. Consider a
3476         simple pattern such as         simple pattern such as
3477    
3478           abcd$           abcd$
3479    
3480         when applied to a long string that does  not  match.  Because  matching         when  applied  to  a  long string that does not match. Because matching
3481         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
3482         and then see if what follows matches the rest of the  pattern.  If  the         and  then  see  if what follows matches the rest of the pattern. If the
3483         pattern is specified as         pattern is specified as
3484    
3485           ^.*abcd$           ^.*abcd$
3486    
3487         the  initial .* matches the entire string at first, but when this fails         the initial .* matches the entire string at first, but when this  fails
3488         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
3489         last  character,  then all but the last two characters, and so on. Once         last character, then all but the last two characters, and so  on.  Once
3490         again the search for "a" covers the entire string, from right to  left,         again  the search for "a" covers the entire string, from right to left,
3491         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
3492    
3493           ^(?>.*)(?<=abcd)           ^(?>.*)(?<=abcd)
3494    
3495         or, equivalently,         or, equivalently, using the possessive quantifier syntax,
3496    
3497           ^.*+(?<=abcd)           ^.*+(?<=abcd)
3498    
3499         there  can  be  no  backtracking for the .* item; it can match only the         there can be no backtracking for the .* item; it  can  match  only  the
3500         entire string. The subsequent lookbehind assertion does a  single  test         entire  string.  The subsequent lookbehind assertion does a single test
3501         on  the last four characters. If it fails, the match fails immediately.         on the last four characters. If it fails, the match fails  immediately.
3502         For long strings, this approach makes a significant difference  to  the         For  long  strings, this approach makes a significant difference to the
3503         processing time.         processing time.
3504    
3505       Using multiple assertions
3506    
3507         Several assertions (of any sort) may occur in succession. For example,         Several assertions (of any sort) may occur in succession. For example,
3508    
3509           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
3510    
3511         matches  "foo" preceded by three digits that are not "999". Notice that         matches "foo" preceded by three digits that are not "999". Notice  that
3512         each of the assertions is applied independently at the  same  point  in         each  of  the  assertions is applied independently at the same point in
3513         the  subject  string.  First  there  is a check that the previous three         the subject string. First there is a  check  that  the  previous  three
3514         characters are all digits, and then there is  a  check  that  the  same         characters  are  all  digits,  and  then there is a check that the same
3515         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
3516         ceded by six characters, the first three of which are digits and the last         ceded by six characters, the first three of which are digits and the last
3517         three  of  which  are not "999". For example, it doesn't match "123abc-         three of which are not "999". For example, it  doesn't  match  "123abc-
3518         foo". A pattern to do that is         foo". A pattern to do that is
3519    
3520           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
3521    
3522         This time the first assertion looks at the  preceding  six  characters,         This  time  the  first assertion looks at the preceding six characters,
3523         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
3524         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
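
The two patterns can be compared in a short sketch. It uses Python's re module in place of PCRE (an assumption; both lookbehinds have a fixed length, which Python requires). The variable names are made up for the illustration.

```python
import re

# Both assertions are tested at the same point, just before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first assertion looks six characters back, the second three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
three_results = {s: three_back.search(s) is not None for s in subjects}
six_results = {s: six_back.search(s) is not None for s in subjects}

print(three_results)
print(six_results)
```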
3525    
# Line 2607  ASSERTIONS Line 3527  ASSERTIONS
3527    
3528           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
3529    
3530         matches an occurrence of "baz" that is preceded by "bar" which in  turn         matches  an occurrence of "baz" that is preceded by "bar" which in turn
3531         is not preceded by "foo", while         is not preceded by "foo", while
3532    
3533           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
3534    
3535         is another pattern which matches "foo" preceded by three digits and any         is another pattern that matches "foo" preceded by three digits and  any
3536         three characters that are not "999".         three characters that are not "999".
3537    
        Assertion subpatterns are not capturing subpatterns,  and  may  not  be  
        repeated,  because  it  makes no sense to assert the same thing several  
        times. If any kind of assertion contains capturing  subpatterns  within  
        it,  these are counted for the purposes of numbering the capturing sub-  
        patterns in the whole pattern.  However, substring capturing is carried  
        out  only  for  positive assertions, because it does not make sense for  
        negative assertions.  
   
3538    
3539  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
3540    
3541         It is possible to cause the matching process to obey a subpattern  con-         It  is possible to cause the matching process to obey a subpattern con-
3542         ditionally  or to choose between two alternative subpatterns, depending         ditionally or to choose between two alternative subpatterns,  depending
3543         on the  result  of  an  assertion,  or  whether  a  previous  capturing         on  the result of an assertion, or whether a previous capturing subpat-
3544         subpattern  matched  or not. The two possible forms of conditional sub-         tern matched or not. The two possible forms of  conditional  subpattern
3545         pattern are         are
3546    
3547           (?(condition)yes-pattern)           (?(condition)yes-pattern)
3548           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
3549    
3550         If the condition is satisfied, the yes-pattern is used;  otherwise  the         If  the  condition is satisfied, the yes-pattern is used; otherwise the
3551         no-pattern  (if  present)  is used. If there are more than two alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
3552         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
3553    
3554         There are three kinds of condition. If the text between the parentheses         There are three kinds of condition. If the text between the parentheses
3555         consists  of  a  sequence  of digits, the condition is satisfied if the         consists of a sequence of digits, the condition  is  satisfied  if  the
3556         capturing subpattern of that number has previously matched. The  number         capturing  subpattern of that number has previously matched. The number
3557         must  be  greater than zero. Consider the following pattern, which con-         must be greater than zero. Consider the following pattern,  which  con-
3558         tains non-significant white space to make it more readable (assume  the         tains  non-significant white space to make it more readable (assume the
3559         PCRE_EXTENDED  option)  and  to  divide it into three parts for ease of         PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of
3560         discussion:         discussion:
3561    
3562           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
3563    
3564         The first part matches an optional opening  parenthesis,  and  if  that         The  first  part  matches  an optional opening parenthesis, and if that
3565         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
3566         ond part matches one or more characters that are not  parentheses.  The         ond  part  matches one or more characters that are not parentheses. The
3567         third part is a conditional subpattern that tests whether the first set         third part is a conditional subpattern that tests whether the first set
3568         of parentheses matched or not. If they did, that is, if the subject started         of parentheses matched or not. If they did, that is, if the subject started
3569         with an opening parenthesis, the condition is true, and so the yes-pat-         with an opening parenthesis, the condition is true, and so the yes-pat-
3570         tern is executed and a  closing  parenthesis  is  required.  Otherwise,         tern  is  executed  and  a  closing parenthesis is required. Otherwise,
3571         since  no-pattern  is  not  present, the subpattern matches nothing. In         since no-pattern is not present, the  subpattern  matches  nothing.  In
3572         other words,  this  pattern  matches  a  sequence  of  non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
3573         optionally enclosed in parentheses.         optionally enclosed in parentheses.
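
The three-part pattern can be exercised as follows. The sketch uses Python's re module as a stand-in for PCRE, with re.VERBOSE playing the part of PCRE_EXTENDED; both are assumptions, as are the names PAREN_OPTIONAL and accepts.

```python
import re

# The pattern from the text. re.VERBOSE makes the white space between the
# three parts non-significant.
PAREN_OPTIONAL = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the pattern."""
    return PAREN_OPTIONAL.fullmatch(subject) is not None


# A closing parenthesis is required only when group 1 matched the opening
# one, so unbalanced subjects are rejected.
results = {s: accepts(s) for s in ("abc", "(abc)", "(abc", "abc)")}

print(results)
```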
3574    
3575         If the condition is the string (R), it is satisfied if a recursive call         If the condition is the string (R), it is satisfied if a recursive call
3576         to the pattern or subpattern has been made. At "top level", the  condi-         to  the pattern or subpattern has been made. At "top level", the condi-
3577         tion  is  false.   This  is  a  PCRE  extension. Recursive patterns are         tion is false.  This  is  a  PCRE  extension.  Recursive  patterns  are
3578         described in the next section.         described in the next section.
3579    
3580