Diff of /code/trunk/doc/pcre.txt

revision 75 by nigel, Sat Feb 24 21:40:37 2007 UTC revision 91 by nigel, Sat Feb 24 21:41:34 2007 UTC
# Line 6  synopses of each function in the library Line 6  synopses of each function in the library
6  separate text files for the pcregrep and pcretest commands.  separate text files for the pcregrep and pcretest commands.
7  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
8    
 PCRE(3)                                                                PCRE(3)  
9    
10    PCRE(3)                                                                PCRE(3)
11    
12    
13  NAME  NAME
14         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
15    
16    
17  INTRODUCTION  INTRODUCTION
18    
19         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
20         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
21         just  a  few  differences.  The current implementation of PCRE (release         just  a  few  differences.  The current implementation of PCRE (release
22         5.x) corresponds approximately with Perl  5.8,  including  support  for         6.x) corresponds approximately with Perl  5.8,  including  support  for
23         UTF-8 encoded strings and Unicode general category properties. However,         UTF-8 encoded strings and Unicode general category properties. However,
24         this support has to be explicitly enabled; it is not the default.         this support has to be explicitly enabled; it is not the default.
25    
26           In addition to the Perl-compatible matching function,  PCRE  also  con-
27           tains  an  alternative matching function that matches the same compiled
28           patterns in a different way. In certain circumstances, the  alternative
29           function  has  some  advantages.  For  a discussion of the two matching
30           algorithms, see the pcrematching page.
31    
32         PCRE is written in C and released as a C library. A  number  of  people         PCRE is written in C and released as a C library. A  number  of  people
33         have  written  wrappers and interfaces of various kinds. A C++ class is         have  written  wrappers and interfaces of various kinds. In particular,
34         included in these contributions, which can  be  found  in  the  Contrib         Google Inc.  have provided a comprehensive C++  wrapper.  This  is  now
35         directory at the primary FTP site, which is:         included as part of the PCRE distribution. The pcrecpp page has details
36           of this interface. Other people's contributions can  be  found  in  the
37           Contrib directory at the primary FTP site, which is:
38    
39         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
40    
# Line 40  INTRODUCTION Line 49  INTRODUCTION
49         ing  PCRE for various operating systems can be found in the README file         ing  PCRE for various operating systems can be found in the README file
50         in the source distribution.         in the source distribution.
51    
52           The library contains a number of undocumented  internal  functions  and
53           data  tables  that  are  used by more than one of the exported external
54           functions, but which are not intended  for  use  by  external  callers.
55           Their  names  all begin with "_pcre_", which hopefully will not provoke
56           any name clashes. In some environments, it is possible to control which
57           external  symbols  are  exported when a shared library is built, and in
58           these cases the undocumented symbols are not exported.
59    
60    
61  USER DOCUMENTATION  USER DOCUMENTATION
62    
# Line 50  USER DOCUMENTATION Line 67  USER DOCUMENTATION
67         of searching. The sections are as follows:         of searching. The sections are as follows:
68    
69           pcre              this document           pcre              this document
70           pcreapi           details of PCRE's native API           pcreapi           details of PCRE's native C API
71           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
72           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
73           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
74             pcrecpp           details of the C++ wrapper
75           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
76             pcrematching      discussion of the two matching algorithms
77           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
78           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
79                               regular expressions                               regular expressions
80           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
81           pcreposix         the POSIX-compatible API           pcreposix         the POSIX-compatible C API
82           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
83           pcresample        discussion of the sample program           pcresample        discussion of the sample program
84             pcrestack         discussion of stack usage
85           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
86    
87         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
88         each library function, listing its arguments and results.         each C library function, listing its arguments and results.
89    
90    
91  LIMITATIONS  LIMITATIONS
# Line 81  LIMITATIONS Line 101  LIMITATIONS
101         In these cases the limit is substantially larger.  However,  the  speed         In these cases the limit is substantially larger.  However,  the  speed
102         of execution will be slower.         of execution will be slower.
103    
104         All values in repeating quantifiers must be less than 65536.  The maxi-         All  values in repeating quantifiers must be less than 65536. The maxi-
105         mum number of capturing subpatterns is 65535.         mum compiled length of a subpattern with an  explicit  repeat  count  is
106           30000 bytes. The maximum number of capturing subpatterns is 65535.
107    
108         There is no limit to the number of non-capturing subpatterns,  but  the         There  is  no limit to the number of non-capturing subpatterns, but the
109         maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,         maximum depth of nesting of  all  kinds  of  parenthesized  subpattern,
110         including capturing subpatterns, assertions, and other types of subpat-         including capturing subpatterns, assertions, and other types of subpat-
111         tern, is 200.         tern, is 200.
112    
113           The maximum length of a name for a named subpattern is 32, and the maxi-
114           mum number of named subpatterns is 10000.
115    
116         The  maximum  length of a subject string is the largest positive number         The  maximum  length of a subject string is the largest positive number
117         that an integer variable can hold. However, PCRE uses recursion to han-         that an integer variable can hold. However, when using the  traditional
118         dle  subpatterns  and indefinite repetition. This means that the avail-         matching function, PCRE uses recursion to handle subpatterns and indef-
119         able stack space may limit the size of a subject  string  that  can  be         inite repetition.  This means that the available stack space may  limit
120         processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
121           For a discussion of stack issues, see the pcrestack documentation.
122    
123    
124  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
125    
126         From  release  3.3,  PCRE  has  had  some support for character strings         From release 3.3, PCRE has  had  some  support  for  character  strings
127         encoded in the UTF-8 format. For release 4.0 this was greatly  extended         encoded  in the UTF-8 format. For release 4.0 this was greatly extended
128         to  cover  most common requirements, and in release 5.0 additional sup-         to cover most common requirements, and in release 5.0  additional  sup-
129         port for Unicode general category properties was added.         port for Unicode general category properties was added.
130    
131         In order to process UTF-8 strings, you must build PCRE to include  UTF-8         In order to process UTF-8 strings, you must build PCRE to include  UTF-8
132         support  in  the  code,  and, in addition, you must call pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
133         with the PCRE_UTF8 option flag. When you do this, both the pattern  and         with  the PCRE_UTF8 option flag. When you do this, both the pattern and
134         any  subject  strings  that are matched against it are treated as UTF-8         any subject strings that are matched against it are  treated  as  UTF-8
135         strings instead of just strings of bytes.         strings instead of just strings of bytes.
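To make the difference between "strings of bytes" and "UTF-8 strings" concrete, here is a minimal sketch in C. It does not call PCRE at all; utf8_char_count() is a hypothetical helper introduced only for illustration, and it assumes that its input is already valid UTF-8.

```c
#include <stddef.h>

/* Count the UTF-8 characters in a NUL-terminated string.
 * Assumption: the string is valid UTF-8. Continuation bytes have the
 * bit pattern 10xxxxxx, so every byte that is NOT a continuation byte
 * starts a new character. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the five-byte string "caf\xC3\xA9" this returns 4, whereas strlen() returns 5. In UTF-8 mode a pattern item such as the dot consumes one of those four characters, not one of the five bytes.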
136    
137         If you compile PCRE with UTF-8 support, but do not use it at run  time,         If  you compile PCRE with UTF-8 support, but do not use it at run time,
138         the  library will be a bit bigger, but the additional run time overhead         the library will be a bit bigger, but the additional run time  overhead
139         is limited to testing the PCRE_UTF8 flag in several places,  so  should         is  limited  to testing the PCRE_UTF8 flag in several places, so should
140         not be very large.         not be very large.
141    
142         If PCRE is built with Unicode character property support (which implies         If PCRE is built with Unicode character property support (which implies
143         UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
144         ported.  The available properties that can be tested are limited to the         ported.  The available properties that can be tested are limited to the
145         general category properties such as Lu for an upper case letter  or  Nd         general  category  properties such as Lu for an upper case letter or Nd
146         for  a decimal number. A full list is given in the pcrepattern documen-         for a decimal number, the Unicode script names such as Arabic  or  Han,
147         tation. The PCRE library is increased in size by about 90K when Unicode         and  the  derived  properties  Any  and L&. A full list is given in the
148         property support is included.         pcrepattern documentation. Only the short names for properties are sup-
149           ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
150           ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
151           optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
152           does not support this.
153    
154         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
155    
156         1.  When you set the PCRE_UTF8 flag, the strings passed as patterns and         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
157         subjects are checked for validity on entry to the  relevant  functions.         subjects  are  checked for validity on entry to the relevant functions.
158         If an invalid UTF-8 string is passed, an error return is given. In some         If an invalid UTF-8 string is passed, an error return is given. In some
159         situations, you may already know  that  your  strings  are  valid,  and         situations,  you  may  already  know  that  your strings are valid, and
160         therefore want to skip these checks in order to improve performance. If         therefore want to skip these checks in order to improve performance. If
161         you set the PCRE_NO_UTF8_CHECK flag at compile time  or  at  run  time,         you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
162         PCRE  assumes  that  the  pattern or subject it is given (respectively)         PCRE assumes that the pattern or subject  it  is  given  (respectively)
163         contains only valid UTF-8 codes. In this case, it does not diagnose  an         contains  only valid UTF-8 codes. In this case, it does not diagnose an
164         invalid  UTF-8 string. If you pass an invalid UTF-8 string to PCRE when         invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
165         PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program  may         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
166         crash.         crash.
167    
168         2. In a pattern, the escape sequence \x{...}, where the contents of the         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
169         braces is a string of hexadecimal digits, is  interpreted  as  a  UTF-8         two-byte UTF-8 character if the value is greater than 127.
        character  whose code number is the given hexadecimal number, for exam-  
        ple: \x{1234}. If a non-hexadecimal digit appears between  the  braces,  
        the item is not recognized.  This escape sequence can be used either as  
        a literal, or within a character class.  
170    
171         3. The original hexadecimal escape sequence, \xhh, matches  a  two-byte         3.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
172         UTF-8 character if the value is greater than 127.         characters for values greater than \177.
173    
174         4.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
175         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
176    
177         5. The dot metacharacter matches one UTF-8 character instead of a  sin-         5.  The dot metacharacter matches one UTF-8 character instead of a sin-
178         gle byte.         gle byte.
179    
180         6.  The  escape sequence \C can be used to match a single byte in UTF-8         6. The escape sequence \C can be used to match a single byte  in  UTF-8
181         mode, but its use can lead to some strange effects.         mode,  but  its  use can lead to some strange effects. This facility is
182           not available in the alternative matching function, pcre_dfa_exec().
183    
184         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
185         test  characters of any code value, but the characters that PCRE recog-         test  characters of any code value, but the characters that PCRE recog-
# Line 172  UTF-8 AND UNICODE PROPERTY SUPPORT Line 198  UTF-8 AND UNICODE PROPERTY SUPPORT
198         Even when Unicode property support is available, PCRE  still  uses  its         Even when Unicode property support is available, PCRE  still  uses  its
199         own  character  tables when checking the case of low-valued characters,         own  character  tables when checking the case of low-valued characters,
200         so as not to degrade performance.  The Unicode property information  is         so as not to degrade performance.  The Unicode property information  is
201         used only for characters with higher values.         used only for characters with higher values. Even when Unicode property
202           support is available, PCRE supports case-insensitive matching only when
203           there  is  a  one-to-one  mapping between a letter's cases. There are a
204           small number of many-to-one mappings in Unicode;  these  are  not  sup-
205           ported by PCRE.
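The "two-byte UTF-8 character" mentioned above for \xhh values greater than 127, and for octal values between \200 and \777, follows from the UTF-8 encoding rules. The sketch below is not PCRE code; utf8_encode_small() is a hypothetical helper that covers only code points below 0x800, which is all that those escapes can express.

```c
#include <stddef.h>

/* Encode a code point below 0x800 as UTF-8.
 * Returns the number of bytes written (1 or 2), or 0 if the code point
 * is outside the range this sketch handles. */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {                 /* 0xxxxxxx: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 110xxxxx 10xxxxxx: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}
```

With this helper, 0xb3 encodes as the two bytes C2 B3, octal 777 (0x1FF) encodes as C7 BF, and octal 177 (0x7F) stays a single byte.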
206    
207    
208  AUTHOR  AUTHOR
209    
210         Philip Hazel <ph10@cam.ac.uk>         Philip Hazel
211         University Computing Service,         University Computing Service,
212         Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
        Phone: +44 1223 334714  
213    
214  Last updated: 09 September 2004         Putting  an actual email address here seems to have been a spam magnet,
215  Copyright (c) 1997-2004 University of Cambridge.         so I've taken it away. If you want to email me, use my initial and sur-
216  -----------------------------------------------------------------------------         name, separated by a dot, at the domain ucs.cam.ac.uk.
217    
218  PCRE(3)                                                                PCRE(3)  Last updated: 05 June 2006
219    Copyright (c) 1997-2006 University of Cambridge.
220    ------------------------------------------------------------------------------
221    
222    
223    PCREBUILD(3)                                                      PCREBUILD(3)
224    
225    
226  NAME  NAME
227         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
228    
229    
230  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
231    
232         This  document  describes  the  optional  features  of PCRE that can be         This  document  describes  the  optional  features  of PCRE that can be
# Line 212  PCRE BUILD-TIME OPTIONS Line 246  PCRE BUILD-TIME OPTIONS
246         not described.         not described.
247    
248    
249    C++ SUPPORT
250    
251           By default, the configure script will search for a C++ compiler and C++
252           header files. If it finds them, it automatically builds the C++ wrapper
253           library for PCRE. You can disable this by adding
254    
255             --disable-cpp
256    
257           to the configure command.
258    
259    
260  UTF-8 SUPPORT  UTF-8 SUPPORT
261    
262         To build PCRE with support for UTF-8 character strings, add         To build PCRE with support for UTF-8 character strings, add
# Line 245  UNICODE CHARACTER PROPERTY SUPPORT Line 290  UNICODE CHARACTER PROPERTY SUPPORT
290    
291  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
292    
293         By default, PCRE treats character 10 (linefeed) as the newline  charac-         By default, PCRE interprets character 10 (linefeed, LF)  as  indicating
294         ter. This is the normal newline character on Unix-like systems. You can         the  end  of  a line. This is the normal newline character on Unix-like
295         compile PCRE to use character 13 (carriage return) instead by adding         systems. You can compile PCRE to use character 13 (carriage return, CR)
296           instead, by adding
297    
298           --enable-newline-is-cr           --enable-newline-is-cr
299    
300         to the configure command. For completeness there is  also  a  --enable-         to  the  configure  command.  There  is  also  a --enable-newline-is-lf
301         newline-is-lf  option,  which explicitly specifies linefeed as the new-         option, which explicitly specifies linefeed as the newline character.
302         line character.  
303           Alternatively, you can specify that line endings are to be indicated by
304           the two character sequence CRLF. If you want this, add
305    
306             --enable-newline-is-crlf
307    
308           to  the  configure command. Whatever line ending convention is selected
309           when PCRE is built can be overridden when  the  library  functions  are
310           called.  At  build time it is conventional to use the standard for your
311           operating system.
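The three line-ending conventions can be compared with a small C sketch. It is independent of PCRE: the enum and the function newline_length_at() are hypothetical names, and the character values 10 and 13 are taken from the text above (so an ASCII-based character set is assumed).

```c
#include <stddef.h>

/* Hypothetical names for the three build-time conventions. */
enum newline_kind { NL_LF, NL_CR, NL_CRLF };

/* Return the length of the line ending that starts at s[i] under the
 * given convention, or 0 if no line ending starts there. */
size_t newline_length_at(const char *s, size_t len, size_t i,
                         enum newline_kind kind)
{
    if (i >= len)
        return 0;
    switch (kind) {
    case NL_LF:
        return s[i] == 10 ? 1 : 0;
    case NL_CR:
        return s[i] == 13 ? 1 : 0;
    case NL_CRLF:
        /* Both characters must be present, in this order. */
        return (s[i] == 13 && i + 1 < len && s[i + 1] == 10) ? 2 : 0;
    }
    return 0;
}
```

For the subject "a\r\nb", offset 1 is a two-character line ending under NL_CRLF, a one-character line ending under NL_CR, and not a line ending at all under NL_LF.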
312    
313    
314  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
# Line 284  POSIX MALLOC USAGE Line 339  POSIX MALLOC USAGE
339         to the configure command.         to the configure command.
340    
341    
 LIMITING PCRE RESOURCE USAGE  
   
        Internally,  PCRE has a function called match(), which it calls repeat-  
        edly (possibly recursively) when matching a pattern. By controlling the  
        maximum  number  of  times  this function may be called during a single  
        matching operation, a limit can be placed on the resources  used  by  a  
        single  call  to  pcre_exec(). The limit can be changed at run time, as  
        described in the pcreapi documentation. The default is 10 million,  but  
        this can be changed by adding a setting such as  
   
          --with-match-limit=500000  
   
        to the configure command.  
   
   
342  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
343    
344         Within  a  compiled  pattern,  offset values are used to point from one         Within  a  compiled  pattern,  offset values are used to point from one
# Line 324  HANDLING VERY LARGE PATTERNS Line 364  HANDLING VERY LARGE PATTERNS
364    
365  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
366    
367         PCRE  implements  backtracking while matching by making recursive calls         When matching with the pcre_exec() function, PCRE implements backtrack-
368         to an internal function called match(). In environments where the  size         ing by making recursive calls to an internal function  called  match().
369         of the stack is limited, this can severely limit PCRE's operation. (The         In  environments  where  the size of the stack is limited, this can se-
370         Unix environment does not usually suffer from this problem.) An  alter-         verely limit PCRE's operation. (The Unix environment does  not  usually
371         native  approach  that  uses  memory  from  the  heap to remember data,         suffer from this problem, but it may sometimes be necessary to increase
372         instead of using recursive function calls, has been implemented to work         the maximum stack size.  There is a discussion in the  pcrestack  docu-
373         round  this  problem. If you want to build a version of PCRE that works         mentation.)  An alternative approach to recursion that uses memory from
374         this way, add         the heap to remember data, instead of using recursive  function  calls,
375           has  been  implemented to work round the problem of limited stack size.
376           If you want to build a version of PCRE that works this way, add
377    
378           --disable-stack-for-recursion           --disable-stack-for-recursion
379    
# Line 342  AVOIDING EXCESSIVE STACK USAGE Line 384  AVOIDING EXCESSIVE STACK USAGE
384         the blocks are always freed in reverse order. A calling  program  might         the blocks are always freed in reverse order. A calling  program  might
385         be  able  to implement optimized functions that perform better than the         be  able  to implement optimized functions that perform better than the
386         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more
387         slowly when built in this way.         slowly when built in this way. This option affects only the pcre_exec()
388           function; it is not relevant for the pcre_dfa_exec() function.
389    
390    
391    LIMITING PCRE RESOURCE USAGE
392    
393           Internally, PCRE has a function called match(), which it calls  repeat-
394           edly   (sometimes   recursively)  when  matching  a  pattern  with  the
395           pcre_exec() function. By controlling the maximum number of  times  this
396           function  may be called during a single matching operation, a limit can
397           be placed on the resources used by a single call  to  pcre_exec().  The
398           limit  can be changed at run time, as described in the pcreapi documen-
399           tation. The default is 10 million, but this can be changed by adding  a
400           setting such as
401    
402             --with-match-limit=500000
403    
404           to   the   configure  command.  This  setting  has  no  effect  on  the
405           pcre_dfa_exec() matching function.
406    
407           In some environments it is desirable to limit the  depth  of  recursive
408           calls of match() more strictly than the total number of calls, in order
409           to restrict the maximum amount of stack (or heap,  if  --disable-stack-
410           for-recursion is specified) that is used. A second limit controls this;
411           it defaults to the value that  is  set  for  --with-match-limit,  which
412           imposes  no  additional constraints. However, you can set a lower limit
413           by adding, for example,
414    
415             --with-match-limit-recursion=10000
416    
417           to the configure command. This value can  also  be  overridden  at  run
418           time.
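The relationship between the two limits can be illustrated with a toy recursive matcher. This is not PCRE's match() function: the pattern language (literal characters and "c*" only), the structure, and all the names below are assumptions made for this sketch. It shows one counter for the total number of calls and a separate check on the depth of recursion.

```c
/* Hypothetical result codes for the toy matcher. */
enum { TOY_NOMATCH = 0, TOY_MATCH = 1, TOY_ERR_CALLS = -1, TOY_ERR_DEPTH = -2 };

struct toy_budget {
    unsigned long calls;        /* calls made so far */
    unsigned long call_limit;   /* cap on the total number of calls */
    unsigned long depth_limit;  /* cap on the recursion depth */
};

/* Match pat against the start of sub. Supports literals and greedy "c*". */
int toy_match(const char *pat, const char *sub,
              unsigned long depth, struct toy_budget *b)
{
    int rc;

    if (++b->calls > b->call_limit)
        return TOY_ERR_CALLS;
    if (depth > b->depth_limit)
        return TOY_ERR_DEPTH;
    if (*pat == '\0')
        return TOY_MATCH;               /* pattern exhausted */

    if (pat[1] == '*') {
        /* Greedy: first try to consume one more character ... */
        if (*sub == *pat) {
            rc = toy_match(pat, sub + 1, depth + 1, b);
            if (rc != TOY_NOMATCH)
                return rc;              /* a match, or a limit error */
        }
        /* ... then fall back to leaving the repetition. */
        return toy_match(pat + 2, sub, depth + 1, b);
    }
    if (*sub == *pat)
        return toy_match(pat + 1, sub + 1, depth + 1, b);
    return TOY_NOMATCH;
}

/* Convenience wrapper: run one match with the given limits. */
int toy_run(const char *pat, const char *sub,
            unsigned long call_limit, unsigned long depth_limit)
{
    struct toy_budget b;

    b.calls = 0;
    b.call_limit = call_limit;
    b.depth_limit = depth_limit;
    return toy_match(pat, sub, 0, &b);
}
```

Matching "a*b" against "aaab" takes six calls and reaches depth 5, so toy_run("a*b", "aaab", 100, 100) succeeds, a depth limit of 3 gives TOY_ERR_DEPTH, and a call limit of 4 gives TOY_ERR_CALLS. Because the depth can never exceed the number of calls, setting the depth limit equal to the call limit adds no constraint, which is why that is a sensible default.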
419    
420    
421  USING EBCDIC CODE  USING EBCDIC CODE
# Line 356  USING EBCDIC CODE Line 429  USING EBCDIC CODE
429    
430         to the configure command.         to the configure command.
431    
432  Last updated: 09 September 2004  Last updated: 06 June 2006
433  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
434  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
435    
 PCRE(3)                                                                PCRE(3)  
436    
437    PCREMATCHING(3)                                                PCREMATCHING(3)
438    
439    
440  NAME  NAME
441         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
442    
443    
444    PCRE MATCHING ALGORITHMS
445    
446           This document describes the two different algorithms that are available
447           in PCRE for matching a compiled regular expression against a given sub-
448           ject  string.  The  "standard"  algorithm  is  the  one provided by the
449           pcre_exec() function.  This works in the same way  as  Perl's  matching
450           function, and provides a Perl-compatible matching operation.
451    
452           An  alternative  algorithm is provided by the pcre_dfa_exec() function;
453           this operates in a different way, and is not  Perl-compatible.  It  has
454           advantages  and disadvantages compared with the standard algorithm, and
455           these are described below.
456    
457           When there is only one possible way in which a given subject string can
458           match  a pattern, the two algorithms give the same answer. A difference
459           arises, however, when there are multiple possibilities. For example, if
460           the pattern
461    
462             ^<.*>
463    
464           is matched against the string
465    
466             <something> <something else> <something further>
467    
468           there are three possible answers. The standard algorithm finds only one
469           of them, whereas the DFA algorithm finds all three.
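The three possible answers can be enumerated by hand. The sketch below is specific to the pattern ^<.*> and does not use PCRE; all_prefix_matches() is a hypothetical helper, and it assumes the default behaviour in which the dot does not match a newline. It reports the matches in order of increasing length, which is a choice made for this sketch only.

```c
#include <stddef.h>
#include <string.h>

/* Store the length of every prefix of subject that matches ^<.*>
 * in lengths[] (at most max entries) and return how many were stored. */
size_t all_prefix_matches(const char *subject, size_t *lengths, size_t max)
{
    size_t n = 0, i, len = strlen(subject);

    if (len == 0 || subject[0] != '<')
        return 0;                       /* ^< must match at offset 0 */
    for (i = 1; i < len; i++) {
        if (subject[i] == '\n')
            break;                      /* '.' stops at a newline */
        if (subject[i] == '>' && n < max)
            lengths[n++] = i + 1;       /* every '>' ends one match */
    }
    return n;
}
```

For the subject shown above, this finds three matches, of lengths 11, 28, and 48.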
470    
471    
472    REGULAR EXPRESSIONS AS TREES
473    
474           The set of strings that are matched by a regular expression can be rep-
475           resented  as  a  tree structure. An unlimited repetition in the pattern
476           makes the tree of infinite size, but it is still a tree.  Matching  the
477           pattern  to a given subject string (from a given starting point) can be
478           thought of as a search of the tree.  There are two  ways  to  search  a
479           tree:  depth-first  and  breadth-first, and these correspond to the two
480           matching algorithms provided by PCRE.
481    
482    
483    THE STANDARD MATCHING ALGORITHM
484    
485           In the terminology of Jeffrey Friedl's book Mastering  Regular  Expres-
486           sions,  the  standard  algorithm  is  an "NFA algorithm". It conducts a
487           depth-first search of the pattern tree. That is, it  proceeds  along  a
488           single path through the tree, checking that the subject matches what is
489           required. When there is a mismatch, the algorithm  tries  any  alterna-
490           tives  at  the  current point, and if they all fail, it backs up to the
491           previous branch point in the  tree,  and  tries  the  next  alternative
492           branch  at  that  level.  This often involves backing up (moving to the
493           left) in the subject string as well.  The  order  in  which  repetition
494           branches  are  tried  is controlled by the greedy or ungreedy nature of
495           the quantifier.
496    
497           If a leaf node is reached, a matching string has  been  found,  and  at
498           that  point the algorithm stops. Thus, if there is more than one possi-
499           ble match, this algorithm returns the first one that it finds.  Whether
500           this  is the shortest, the longest, or some intermediate length depends
501           on the way the greedy and ungreedy repetition quantifiers are specified
502           in the pattern.
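The effect of greediness on which match is found first can be sketched for the single pattern <.*> (greedy) or <.*?> (ungreedy), anchored at the start of the subject. This is a hand-written illustration, not the PCRE implementation: first_match_length() is a hypothetical function, and it shows only the order in which the alternatives for the repetition are tried.

```c
#include <stddef.h>
#include <string.h>

/* Return the length of the first match found for ^<.*> (greedy != 0)
 * or ^<.*?> (greedy == 0), or 0 if there is no match. */
size_t first_match_length(const char *subject, int greedy)
{
    size_t len = strlen(subject), end, k;

    if (len == 0 || subject[0] != '<')
        return 0;

    /* 'end' is one past the furthest character that '.' can reach. */
    end = 1;
    while (end < len && subject[end] != '\n')
        end++;

    if (greedy) {
        /* Take as much as possible, then back up one character at a time. */
        for (k = end; k-- > 1; ) {
            if (subject[k] == '>')
                return k + 1;
        }
    } else {
        /* Take as little as possible, then extend one character at a time. */
        for (k = 1; k < end; k++) {
            if (subject[k] == '>')
                return k + 1;
        }
    }
    return 0;
}
```

For the subject "<something> <something else> <something further>", the greedy form stops at the match of length 48 and the ungreedy form stops at the match of length 11. In both cases the search ends as soon as one match has been found.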
503    
504           Because  it  ends  up  with a single path through the tree, it is rela-
505           tively straightforward for this algorithm to keep  track  of  the  sub-
506           strings  that  are  matched  by portions of the pattern in parentheses.
507           This provides support for capturing parentheses and back references.
508    
509    
510    THE DFA MATCHING ALGORITHM
511    
512           DFA stands for "deterministic finite automaton", but you do not need to
513           understand the origins of that name. This algorithm conducts a breadth-
514           first search of the tree. Starting from the first matching point in the
515           subject,  it scans the subject string from left to right, once, charac-
516           ter by character, and as it does  this,  it  remembers  all  the  paths
517           through the tree that represent valid matches.
518    
519           The  scan  continues until either the end of the subject is reached, or
520           there are no more unterminated paths. At this point,  terminated  paths
521           represent  the different matching possibilities (if there are none, the
522           match has failed).  Thus, if there is more  than  one  possible  match,
523           this algorithm finds all of them, and in particular, it finds the long-
524           est. In PCRE, there is an option to stop the algorithm after the  first
525           match (which is necessarily the shortest) has been found.
526    
527           Note that all the matches that are found start at the same point in the
528           subject. If the pattern
529    
530             cat(er(pillar)?)?
531    
532           is matched against the string "the caterpillar catchment",  the  result
533           will  be the three strings "cat", "cater", and "caterpillar" that start
534           at the fifth character of the subject. The algorithm does not
535           automatically move on to find matches that start at later positions.
536    
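The breadth-first scan described above can be illustrated with a toy scanner. This is a sketch only, not PCRE's implementation: it assumes the pattern has already been expanded by hand into the literal strings "cat", "cater", and "caterpillar", and the names scan_all and MAX_PATHS are hypothetical. It walks the subject once from a fixed start offset, keeps every path that still agrees with the characters seen so far, and records the length at which each path terminates.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PATHS 8   /* hypothetical limit for this sketch */

/* Scan `subject` once, left to right, from offset `start`. Each entry of
 * `paths` is one literal path through the tree. Returns the number of
 * terminated paths and stores their match lengths in `lengths`, in the
 * order in which the paths terminate (shortest first in this sketch). */
int scan_all(const char *subject, size_t start,
             const char *const *paths, int npaths, size_t *lengths)
{
    int alive[MAX_PATHS];
    int found = 0, active = 0;
    size_t sublen = strlen(subject);

    if (npaths > MAX_PATHS || start > sublen) return 0;
    for (int i = 0; i < npaths; i++) {
        alive[i] = 1;
        active++;
    }

    /* n is the number of subject characters consumed so far */
    for (size_t n = 0; active > 0; n++) {
        for (int i = 0; i < npaths; i++) {
            if (!alive[i]) continue;
            if (paths[i][n] == '\0') {
                lengths[found++] = n;      /* path terminated: a match */
                alive[i] = 0;
                active--;
            } else if (start + n >= sublen ||
                       subject[start + n] != paths[i][n]) {
                alive[i] = 0;              /* path died: no match */
                active--;
            }
        }
    }
    return found;
}
```

Called with the subject "the caterpillar catchment" and zero-based start offset 4, the sketch reports three terminated paths. It finds the later "cat" in "catchment" only if it is called again with that start offset, which mirrors the behaviour described in the text.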
537           There are a number of features of PCRE regular expressions that are not
538           supported by the DFA matching algorithm. They are as follows:
539    
540           1. Because the algorithm finds all  possible  matches,  the  greedy  or
541           ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
542           ungreedy quantifiers are treated in exactly the same way.
543    
544           2. When dealing with multiple paths through the tree simultaneously, it
545           is  not  straightforward  to  keep track of captured substrings for the
546           different matching possibilities, and  PCRE's  implementation  of  this
547           algorithm does not attempt to do this. This means that no captured sub-
548           strings are available.
549    
550           3. Because no substrings are captured, back references within the  pat-
551           tern are not supported, and cause errors if encountered.
552    
553           4.  For  the same reason, conditional expressions that use a backrefer-
554           ence as the condition are not supported.
555    
556           5. Callouts are supported, but the value of the  capture_top  field  is
557           always 1, and the value of the capture_last field is always -1.
558    
559           6.  The \C escape sequence, which (in the standard algorithm) matches a
560           single byte, even in UTF-8 mode, is not supported because the DFA algo-
561           rithm moves through the subject string one character at a time, for all
562           active paths through the tree.
563    
564    
565    ADVANTAGES OF THE DFA ALGORITHM
566    
567           Using the DFA matching algorithm provides the following advantages:
568    
569           1. All possible matches (at a single point in the subject) are automat-
570           ically  found,  and  in particular, the longest match is found. To find
571           more than one match using the standard algorithm, you have to do kludgy
572           things with callouts.
573    
574           2.  There is much better support for partial matching. The restrictions
575           on the content of the pattern that apply when using the standard  algo-
576           rithm  for partial matching do not apply to the DFA algorithm. For non-
577           anchored patterns, the starting position of a partial match  is  avail-
578           able.
579    
580           3.  Because  the  DFA algorithm scans the subject string just once, and
581           never needs to backtrack, it is possible  to  pass  very  long  subject
582           strings  to  the matching function in several pieces, checking for par-
583           tial matching each time.
584    
585    
586    DISADVANTAGES OF THE DFA ALGORITHM
587    
588           The DFA algorithm suffers from a number of disadvantages:
589    
590           1. It is substantially slower than  the  standard  algorithm.  This  is
591           partly  because  it has to search for all possible matches, but is also
592           because it is less susceptible to optimization.
593    
594           2. Capturing parentheses and back references are not supported.
595    
596           3. The "atomic group" feature of PCRE regular expressions is supported,
597           but  does not provide the advantage that it does for the standard algo-
598           rithm.
599    
600    Last updated: 06 June 2006
601    Copyright (c) 1997-2006 University of Cambridge.
602    ------------------------------------------------------------------------------
603    
604    
605    PCREAPI(3)                                                          PCREAPI(3)
606    
607    
608    NAME
609           PCRE - Perl-compatible regular expressions
610    
611    
612  PCRE NATIVE API  PCRE NATIVE API
613    
614         #include <pcre.h>         #include <pcre.h>
# Line 375  PCRE NATIVE API Line 617  PCRE NATIVE API
617              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
618              const unsigned char *tableptr);              const unsigned char *tableptr);
619    
620           pcre *pcre_compile2(const char *pattern, int options,
621                int *errorcodeptr,
622                const char **errptr, int *erroffset,
623                const unsigned char *tableptr);
624    
625         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
626              const char **errptr);              const char **errptr);
627    
# Line 382  PCRE NATIVE API Line 629  PCRE NATIVE API
629              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
630              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
631    
632           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
633                const char *subject, int length, int startoffset,
634                int options, int *ovector, int ovecsize,
635                int *workspace, int wscount);
636    
637         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
638              const char *subject, int *ovector,              const char *subject, int *ovector,
639              int stringcount, const char *stringname,              int stringcount, const char *stringname,
# Line 399  PCRE NATIVE API Line 651  PCRE NATIVE API
651         int pcre_get_stringnumber(const pcre *code,         int pcre_get_stringnumber(const pcre *code,
652              const char *name);              const char *name);
653    
654           int pcre_get_stringtable_entries(const pcre *code,
655                const char *name, char **first, char **last);
656    
657         int pcre_get_substring(const char *subject, int *ovector,         int pcre_get_substring(const char *subject, int *ovector,
658              int stringcount, int stringnumber,              int stringcount, int stringnumber,
659              const char **stringptr);              const char **stringptr);
# Line 417  PCRE NATIVE API Line 672  PCRE NATIVE API
672    
673         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
674    
675           int pcre_refcount(pcre *code, int adjust);
676    
677         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
678    
679         char *pcre_version(void);         char *pcre_version(void);
# Line 436  PCRE API OVERVIEW Line 693  PCRE API OVERVIEW
693    
694         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
695         is also a set of wrapper functions that correspond to the POSIX regular         is also a set of wrapper functions that correspond to the POSIX regular
696         expression API.  These are described in the pcreposix documentation.         expression  API.  These  are  described in the pcreposix documentation.
697           Both of these APIs define a set of C function calls. A C++  wrapper  is
698           distributed with PCRE. It is documented in the pcrecpp page.
699    
700         The  native  API  function  prototypes  are  defined in the header file         The  native  API  C  function prototypes are defined in the header file
701         pcre.h, and on Unix systems the library itself is  called  libpcre.  It         pcre.h, and on Unix systems the library itself is called  libpcre.   It
702         can normally be accessed by adding -lpcre to the command for linking an         can normally be accessed by adding -lpcre to the command for linking an
703         application  that  uses  PCRE.  The  header  file  defines  the  macros         application  that  uses  PCRE.  The  header  file  defines  the  macros
704         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
705         bers for the library.  Applications can use these  to  include  support         bers for the library.  Applications can use these  to  include  support
706         for different releases of PCRE.         for different releases of PCRE.
707    
708         The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used         The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
709         for compiling and matching regular expressions. A sample  program  that         pcre_exec() are used for compiling and matching regular expressions  in
710         demonstrates  the  simplest  way  of using them is provided in the file         a  Perl-compatible  manner. A sample program that demonstrates the sim-
711         called pcredemo.c in the source distribution. The pcresample documenta-         plest way of using them is provided in the file  called  pcredemo.c  in
712         tion describes how to run it.         the  source distribution. The pcresample documentation describes how to
713           run it.
714         In  addition  to  the  main compiling and matching functions, there are  
715         convenience functions for extracting captured substrings from a matched         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
716         subject string.  They are:         ble,  is  also provided. This uses a different algorithm for the match-
717           ing. The alternative algorithm finds all possible matches (at  a  given
718           point in the subject). However, this algorithm does not return captured
719           substrings. A description of the  two  matching  algorithms  and  their
720           advantages  and  disadvantages  is given in the pcrematching documenta-
721           tion.
722    
723           In addition to the main compiling and  matching  functions,  there  are
724           convenience functions for extracting captured substrings from a subject
725           string that is matched by pcre_exec(). They are:
726    
727           pcre_copy_substring()           pcre_copy_substring()
728           pcre_copy_named_substring()           pcre_copy_named_substring()
# Line 462  PCRE API OVERVIEW Line 730  PCRE API OVERVIEW
730           pcre_get_named_substring()           pcre_get_named_substring()
731           pcre_get_substring_list()           pcre_get_substring_list()
732           pcre_get_stringnumber()           pcre_get_stringnumber()
733             pcre_get_stringtable_entries()
734    
735         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
736         to free the memory used for extracted strings.         to free the memory used for extracted strings.
737    
738         The function pcre_maketables() is used to  build  a  set  of  character         The  function  pcre_maketables()  is  used  to build a set of character
739         tables   in  the  current  locale  for  passing  to  pcre_compile()  or         tables  in  the  current  locale   for   passing   to   pcre_compile(),
740         pcre_exec().  This is an optional facility that is  provided  for  spe-         pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
741         cialist use. Most commonly, no special tables are passed, in which case         provided for specialist use.  Most  commonly,  no  special  tables  are
742         internal tables that are generated when PCRE is built are used.         passed,  in  which case internal tables that are generated when PCRE is
743           built are used.
744    
745         The function pcre_fullinfo() is used to find out  information  about  a         The function pcre_fullinfo() is used to find out  information  about  a
746         compiled  pattern; pcre_info() is an obsolete version that returns only         compiled  pattern; pcre_info() is an obsolete version that returns only
# Line 478  PCRE API OVERVIEW Line 748  PCRE API OVERVIEW
748         patibility.   The function pcre_version() returns a pointer to a string         patibility.   The function pcre_version() returns a pointer to a string
749         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
750    
751           The function pcre_refcount() maintains a  reference  count  in  a  data
752           block  containing  a compiled pattern. This is provided for the benefit
753           of object-oriented applications.
754    
755         The global variables pcre_malloc and pcre_free  initially  contain  the         The global variables pcre_malloc and pcre_free  initially  contain  the
756         entry  points  of  the  standard malloc() and free() functions, respec-         entry  points  of  the  standard malloc() and free() functions, respec-
757         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
# Line 487  PCRE API OVERVIEW Line 761  PCRE API OVERVIEW
761         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
762         indirections  to  memory  management functions. These special functions         indirections  to  memory  management functions. These special functions
763         are used only when PCRE is compiled to use  the  heap  for  remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
764         data,  instead  of recursive function calls. This is a non-standard way         data, instead of recursive function calls, when running the pcre_exec()
765         of building PCRE, for use in environments  that  have  limited  stacks.         function. See the pcrebuild documentation for  details  of  how  to  do
766         Because  of  the greater use of memory management, it runs more slowly.         this.  It  is  a non-standard way of building PCRE, for use in environ-
767         Separate functions are provided so that special-purpose  external  code         ments that have limited stacks. Because of the greater  use  of  memory
768         can be used for this case. When used, these functions are always called         management,  it  runs  more  slowly. Separate functions are provided so
769         in a stack-like manner (last obtained, first  freed),  and  always  for         that special-purpose external code can be  used  for  this  case.  When
770         memory blocks of the same size.         used,  these  functions  are always called in a stack-like manner (last
771           obtained, first freed), and always for memory blocks of the same  size.
772           There  is  a discussion about PCRE's stack usage in the pcrestack docu-
773           mentation.
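Because the blocks are requested and released in a stack-like order and are all the same size, a replacement pair of functions can simply keep freed blocks for reuse. The following is a minimal sketch under that assumption; the names pool_stack_malloc, pool_stack_free, and POOL_BLOCKS are hypothetical, and the code is not taken from PCRE. The function signatures are assumed to match the types of the pcre_stack_malloc and pcre_stack_free variables in pcre.h.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_BLOCKS 64                /* hypothetical cache capacity */

/* Header stored in front of each block so that the free function knows
 * the block's size; the union keeps the user data suitably aligned. */
typedef union pool_header {
    size_t size;
    max_align_t align;
} pool_header;

static pool_header *pool_cache[POOL_BLOCKS];  /* freed blocks, last on top */
static int pool_cached = 0;

/* Return a block of `size` bytes, reusing the most recently freed block
 * when it has the same size; otherwise fall back to malloc(). */
void *pool_stack_malloc(size_t size)
{
    pool_header *h;
    if (pool_cached > 0 && pool_cache[pool_cached - 1]->size == size) {
        h = pool_cache[--pool_cached];
    } else {
        h = malloc(sizeof *h + size);
        if (h == NULL) return NULL;
        h->size = size;
    }
    return h + 1;                     /* user data follows the header */
}

/* Keep the block for reuse; release it only when the cache is full. */
void pool_stack_free(void *block)
{
    pool_header *h;
    if (block == NULL) return;
    h = (pool_header *)block - 1;
    if (pool_cached < POOL_BLOCKS)
        pool_cache[pool_cached++] = h;
    else
        free(h);
}

/* With <pcre.h> included, the hooks would be installed before matching:
 *     pcre_stack_malloc = pool_stack_malloc;
 *     pcre_stack_free   = pool_stack_free;
 */
```

The sketch is not thread-safe, because the cache is a single static array. An application that matches from several threads would need one cache per thread or a lock.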
774    
775         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
776         by the caller to a "callout" function, which PCRE  will  then  call  at         by  the  caller  to  a "callout" function, which PCRE will then call at
777         specified  points during a matching operation. Details are given in the         specified points during a matching operation. Details are given in  the
778         pcrecallout documentation.         pcrecallout documentation.
779    
780    
781    NEWLINES
782           PCRE supports three different conventions for indicating line breaks in
783           strings: a single CR character, a single LF character, or the two-char-
784           acter  sequence  CRLF.  All  three  are used as "standard" by different
785           operating systems.  When PCRE is built, a default can be specified; if
786           none is specified, LF is used, which is the Unix standard. When PCRE is
787           run, the default can be overridden, either when a pattern is compiled,
788           or when it is matched.
789    
790           In the PCRE documentation the word "newline" is used to mean "the char-
791           acter or pair of characters that indicate a line break".
792    
793    
794  MULTITHREADING  MULTITHREADING
795    
796         The PCRE functions can be used in  multi-threading  applications,  with         The PCRE functions can be used in  multi-threading  applications,  with
# Line 547  CHECKING BUILD-TIME OPTIONS Line 837  CHECKING BUILD-TIME OPTIONS
837    
838           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
839    
840         The  output  is an integer that is set to the value of the code that is         The  output  is  an integer whose value specifies the default character
841         used for the newline character. It is either linefeed (10) or  carriage         sequence that is recognized as meaning "newline". The three values that
842         return  (13),  and  should  normally be the standard character for your         are supported are: 10 for LF, 13 for CR, and 3338 for CRLF. The default
843         operating system.         should normally be the standard sequence for your operating system.
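The three documented values can be turned back into byte sequences with a small helper. This is a sketch; the function name newline_sequence is hypothetical. The value 3338 equals 13 * 256 + 10, which suggests that CR is held in the high byte and LF in the low byte, but that reading is an inference from the arithmetic rather than something the text states, so the helper simply maps the three values explicitly.

```c
/* Decode a PCRE_CONFIG_NEWLINE value, as obtained for example by
 *     int value; pcre_config(PCRE_CONFIG_NEWLINE, &value);
 * into the byte sequence it names. Writes up to two bytes to `seq` and
 * returns the sequence length, or 0 if the value is not one of the
 * three documented ones. */
int newline_sequence(int value, unsigned char seq[2])
{
    switch (value) {
    case 10:                       /* LF */
        seq[0] = 10;
        return 1;
    case 13:                       /* CR */
        seq[0] = 13;
        return 1;
    case 3338:                     /* CRLF */
        seq[0] = 13;
        seq[1] = 10;
        return 2;
    default:
        return 0;
    }
}
```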
844    
845           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
846    
# Line 573  CHECKING BUILD-TIME OPTIONS Line 863  CHECKING BUILD-TIME OPTIONS
863         internal matching function calls in a  pcre_exec()  execution.  Further         internal matching function calls in a  pcre_exec()  execution.  Further
864         details are given with pcre_exec() below.         details are given with pcre_exec() below.
865    
866             PCRE_CONFIG_MATCH_LIMIT_RECURSION
867    
868           The  output is an integer that gives the default limit for the depth of
869           recursion when calling the internal matching function in a  pcre_exec()
870           execution. Further details are given with pcre_exec() below.
871    
872           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
873    
874         The  output  is  an integer that is set to one if internal recursion is         The  output is an integer that is set to one if internal recursion when
875         implemented by recursive function calls that use the stack to  remember         running pcre_exec() is implemented by recursive function calls that use
876         their state. This is the usual way that PCRE is compiled. The output is         the  stack  to remember their state. This is the usual way that PCRE is
877         zero if PCRE was compiled to use blocks of data on the heap instead  of         compiled. The output is zero if PCRE was compiled to use blocks of data
878         recursive   function   calls.   In  this  case,  pcre_stack_malloc  and         on  the  heap  instead  of  recursive  function  calls.  In  this case,
879         pcre_stack_free are called to manage memory blocks on  the  heap,  thus         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
880         avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
881    
882    
883  COMPILING A PATTERN  COMPILING A PATTERN
# Line 590  COMPILING A PATTERN Line 886  COMPILING A PATTERN
886              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
887              const unsigned char *tableptr);              const unsigned char *tableptr);
888    
889         The  function  pcre_compile()  is  called  to compile a pattern into an         pcre *pcre_compile2(const char *pattern, int options,
890         internal form. The pattern is a C string terminated by a  binary  zero,              int *errorcodeptr,
891         and  is  passed in the pattern argument. A pointer to a single block of              const char **errptr, int *erroffset,
892         memory that is obtained via pcre_malloc is returned. This contains  the              const unsigned char *tableptr);
893         compiled  code  and  related  data.  The  pcre  type is defined for the  
894         returned block; this is a typedef for a structure  whose  contents  are         Either of the functions pcre_compile() or pcre_compile2() can be called
895         not  externally defined. It is up to the caller to free the memory when         to compile a pattern into an internal form. The only difference between
896         it is no longer required.         the  two interfaces is that pcre_compile2() has an additional argument,
897           errorcodeptr, via which a numerical error code can be returned.
898    
899           The pattern is a C string terminated by a binary zero, and is passed in
900           the  pattern  argument.  A  pointer to a single block of memory that is
901           obtained via pcre_malloc is returned. This contains the  compiled  code
902           and related data. The pcre type is defined for the returned block; this
903           is a typedef for a structure whose contents are not externally defined.
904           It is up to the caller to free the memory (via pcre_free) when it is no
905           longer required.
906    
907         Although the compiled code of a PCRE regex is relocatable, that is,  it         Although the compiled code of a PCRE regex is relocatable, that is,  it
908         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
# Line 611  COMPILING A PATTERN Line 916  COMPILING A PATTERN
916         pattern  (see  the  detailed  description in the pcrepattern documenta-         pattern  (see  the  detailed  description in the pcrepattern documenta-
917         tion). For these options, the contents of the options  argument  speci-         tion). For these options, the contents of the options  argument  speci-
918         fies  their initial settings at the start of compilation and execution.         fies  their initial settings at the start of compilation and execution.
919         The PCRE_ANCHORED option can be set at the time of matching as well  as         The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the  time
920         at compile time.         of matching as well as at compile time.
921    
922         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
923         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
924         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
925         sage. The offset from the start of the pattern to the  character  where         sage. This is a static string that is part of the library. You must not
926         the  error  was  discovered  is  placed  in  the variable pointed to by         try to free it. The offset from the start of the pattern to the charac-
927         erroffset, which must not be NULL. If it  is,  an  immediate  error  is         ter where the error was discovered is placed in the variable pointed to
928           by  erroffset,  which must not be NULL. If it is, an immediate error is
929         given.         given.
930    
931           If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
932           codeptr  argument is not NULL, a non-zero error code number is returned
933           via this argument in the event of an error. This is in addition to  the
934           textual error message. Error codes and messages are listed below.
935    
936         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
937         character tables that are  built  when  PCRE  is  compiled,  using  the         character tables that are  built  when  PCRE  is  compiled,  using  the
938         default  C  locale.  Otherwise, tableptr must be an address that is the         default  C  locale.  Otherwise, tableptr must be an address that is the
# Line 664  COMPILING A PATTERN Line 975  COMPILING A PATTERN
975    
976         If  this  bit is set, letters in the pattern match both upper and lower         If  this  bit is set, letters in the pattern match both upper and lower
977         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
978         changed  within  a  pattern  by  a (?i) option setting. When running in         changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
979         UTF-8 mode, case support for high-valued characters is  available  only         always understands the concept of case for characters whose values  are
980         when PCRE is built with Unicode character property support.         less  than 128, so caseless matching is always possible. For characters
981           with higher values, the concept of case is supported if  PCRE  is  com-
982           piled  with Unicode property support, but not otherwise. If you want to
983           use caseless matching for characters 128 and  above,  you  must  ensure
984           that  PCRE  is  compiled  with Unicode property support as well as with
985           UTF-8 support.
986    
987           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
988    
989         If  this bit is set, a dollar metacharacter in the pattern matches only         If this bit is set, a dollar metacharacter in the pattern matches  only
990         at the end of the subject string. Without this option,  a  dollar  also         at  the  end  of the subject string. Without this option, a dollar also
991         matches  immediately before the final character if it is a newline (but         matches immediately before a newline at the end of the string (but  not
992         not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is         before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
993         ignored if PCRE_MULTILINE is set. There is no equivalent to this option         if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
994         in Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
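The rule for where a dollar may match can be written as a small predicate. This is a sketch under stated assumptions: PCRE_MULTILINE is not set, the newline is the single LF character, and the function name dollar_can_match is hypothetical. It models the rule in the text and is not code from PCRE.

```c
#include <stddef.h>

/* Return 1 if a dollar metacharacter can match at offset `pos` in a
 * subject of length `len`, in non-multiline mode with LF as the newline.
 * `endonly` is non-zero when PCRE_DOLLAR_ENDONLY is set. */
int dollar_can_match(const char *subject, size_t len, size_t pos,
                     int endonly)
{
    if (pos == len) return 1;      /* very end of the subject */
    if (endonly) return 0;         /* nothing else is allowed */
    /* otherwise, immediately before a newline that ends the subject */
    return pos + 1 == len && subject[pos] == '\n';
}
```

For the subject "abc\n", the predicate accepts offsets 3 and 4 without the option, and only offset 4 with it.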
995    
996           PCRE_DOTALL           PCRE_DOTALL
997    
998         If this bit is set, a dot metacharacter in the pattern matches all         If this bit is set, a dot metacharacter in the pattern matches all
999         characters,  including  newlines.  Without  it, newlines are excluded. This         characters, including those that indicate newline. Without it, a  dot  does
1000         option is equivalent to Perl's /s option, and it can be changed  within         not  match  when  the  current position is at a newline. This option is
1001         a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]         equivalent to Perl's /s option, and it can be changed within a  pattern
1002         always matches a newline character, independent of the setting of  this         by  a (?s) option setting. A negative class such as [^a] always matches
1003         option.         newlines, independent of the setting of this option.
1004    
1005             PCRE_DUPNAMES
1006    
1007           If this bit is set, names used to identify capturing  subpatterns  need
1008           not be unique. This can be helpful for certain types of pattern when it
1009           is known that only one instance of the named  subpattern  can  ever  be
1010           matched.  There  are  more details of named subpatterns below; see also
1011           the pcrepattern documentation.
1012    
1013           PCRE_EXTENDED           PCRE_EXTENDED
1014    
1015         If  this  bit  is  set,  whitespace  data characters in the pattern are         If this bit is set, whitespace  data  characters  in  the  pattern  are
1016         totally ignored except  when  escaped  or  inside  a  character  class.         totally ignored except when escaped or inside a character class. White-
1017         Whitespace  does  not  include the VT character (code 11). In addition,         space does not include the VT character (code 11). In addition, charac-
1018         characters between an unescaped # outside a  character  class  and  the         ters between an unescaped # outside a character class and the next new-
1019         next newline character, inclusive, are also ignored. This is equivalent         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x
1020         to Perl's /x option, and it can be changed within a pattern by  a  (?x)         option,  and  it  can be changed within a pattern by a (?x) option set-
1021         option setting.         ting.
1022    
1023         This  option  makes  it possible to include comments inside complicated         This option makes it possible to include  comments  inside  complicated
1024         patterns.  Note, however, that this applies only  to  data  characters.         patterns.   Note,  however,  that this applies only to data characters.
1025         Whitespace   characters  may  never  appear  within  special  character         Whitespace  characters  may  never  appear  within  special   character
1026         sequences in a pattern, for  example  within  the  sequence  (?(  which         sequences  in  a  pattern,  for  example  within the sequence (?( which
1027         introduces a conditional subpattern.         introduces a conditional subpattern.
1028    
1029           PCRE_EXTRA           PCRE_EXTRA
1030    
1031         This  option  was invented in order to turn on additional functionality         This option was invented in order to turn on  additional  functionality
1032         of PCRE that is incompatible with Perl, but it  is  currently  of  very         of  PCRE  that  is  incompatible with Perl, but it is currently of very
1033         little  use. When set, any backslash in a pattern that is followed by a         little use. When set, any backslash in a pattern that is followed by  a
1034         letter that has no special meaning  causes  an  error,  thus  reserving         letter  that  has  no  special  meaning causes an error, thus reserving
1035         these  combinations  for  future  expansion.  By default, as in Perl, a         these combinations for future expansion. By  default,  as  in  Perl,  a
1036         backslash followed by a letter with no special meaning is treated as  a         backslash  followed by a letter with no special meaning is treated as a
1037         literal.  There  are  at  present  no other features controlled by this         literal. (Perl can, however, be persuaded to give a warning for  this.)
1038         option. It can also be set by a (?X) option setting within a pattern.         There  are  at  present no other features controlled by this option. It
1039           can also be set by a (?X) option setting within a pattern.
1040    
1041             PCRE_FIRSTLINE
1042    
1043           If this option is set, an  unanchored  pattern  is  required  to  match
1044           before  or  at  the  first  newline  in  the subject string, though the
1045           matched text may continue over the newline.
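
The PCRE_FIRSTLINE rule can be pictured as a constraint on where a match may begin. The following is only a sketch of that reading, not the library's implementation: the helper name demo_firstline_allows() is hypothetical, the subject is assumed to be zero-terminated, and the newline is assumed to be a single LF character.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if a match starting at match_start
   would satisfy the "before or at the first newline" rule, else 0.
   Assumes a zero-terminated subject and LF as the newline character. */
int demo_firstline_allows(const char *subject, size_t match_start)
{
    const char *nl = strchr(subject, '\n');

    if (nl == NULL)
        return 1;   /* no newline at all: every start position qualifies */

    /* The match may start anywhere up to and including the newline. */
    return match_start <= (size_t)(nl - subject);
}
```

Only the starting position is constrained here; as the text says, the matched text itself may run past the newline.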
1046    
1047           PCRE_MULTILINE           PCRE_MULTILINE
1048    
# Line 723  COMPILING A PATTERN Line 1054  COMPILING A PATTERN
1054         is set). This is the same as Perl.         is set). This is the same as Perl.
1055    
1056         When  PCRE_MULTILINE  is set, the "start of line" and "end of line"         When  PCRE_MULTILINE  is set, the "start of line" and "end of line"
1057         constructs match immediately following or immediately before  any  new-         constructs match immediately following or immediately  before  internal
1058         line  in the subject string, respectively, as well as at the very start         newlines  in  the  subject string, respectively, as well as at the very
1059         and end. This is equivalent to Perl's /m option, and it can be  changed         start and end. This is equivalent to Perl's /m option, and  it  can  be
1060         within a pattern by a (?m) option setting. If there are no "\n" charac-         changed within a pattern by a (?m) option setting. If there are no new-
1061         ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1062         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
1063    
1064             PCRE_NEWLINE_CR
1065             PCRE_NEWLINE_LF
1066             PCRE_NEWLINE_CRLF
1067    
1068           These  options  override the default newline definition that was chosen
1069           when PCRE was built. Setting the first or the second specifies  that  a
1070           newline  is  indicated  by a single character (CR or LF, respectively).
1071           Setting both of them specifies that a newline is indicated by the  two-
1072           character  CRLF sequence. For convenience, PCRE_NEWLINE_CRLF is defined
1073           to contain both bits. The only time that a line break is relevant  when
1074           compiling a pattern is if PCRE_EXTENDED is set, and an unescaped # out-
1075           side a character class is encountered. This indicates  a  comment  that
1076           lasts until after the next newline.
1077    
1078           The newline option set at compile time becomes the default that is used
1079           for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
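
Because PCRE_NEWLINE_CRLF is defined to contain both single-character bits, a caller can decode the selected convention by masking with it. The sketch below uses illustrative DEMO_ constants in place of the real ones; the actual bit values come from pcre.h and should not be hard-coded.

```c
#include <stddef.h>

/* Illustrative stand-ins for the pcre.h option bits (assumed values,
   chosen only so that CRLF is the union of CR and LF). */
#define DEMO_NEWLINE_CR   0x1u
#define DEMO_NEWLINE_LF   0x2u
#define DEMO_NEWLINE_CRLF (DEMO_NEWLINE_CR | DEMO_NEWLINE_LF)

/* Return the newline sequence selected by the option bits, or NULL
   when neither bit is set (meaning: use the build-time default). */
const char *demo_newline_sequence(unsigned int options)
{
    switch (options & DEMO_NEWLINE_CRLF) {
    case DEMO_NEWLINE_CR:   return "\r";
    case DEMO_NEWLINE_LF:   return "\n";
    case DEMO_NEWLINE_CRLF: return "\r\n";
    default:                return NULL;
    }
}
```

Setting the CR and LF bits separately therefore has the same effect as setting the CRLF option.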
1080    
1081           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1082    
1083         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
1084         theses in the pattern. Any opening parenthesis that is not followed  by         theses  in the pattern. Any opening parenthesis that is not followed by
1085         ?  behaves as if it were followed by ?: but named parentheses can still         ? behaves as if it were followed by ?: but named parentheses can  still
1086         be used for capturing (and they acquire  numbers  in  the  usual  way).         be  used  for  capturing  (and  they acquire numbers in the usual way).
1087         There is no equivalent of this option in Perl.         There is no equivalent of this option in Perl.
1088    
1089           PCRE_UNGREEDY           PCRE_UNGREEDY
1090    
1091         This  option  inverts  the "greediness" of the quantifiers so that they         This option inverts the "greediness" of the quantifiers  so  that  they
1092         are not greedy by default, but become greedy if followed by "?". It  is         are  not greedy by default, but become greedy if followed by "?". It is
1093         not  compatible  with Perl. It can also be set by a (?U) option setting         not compatible with Perl. It can also be set by a (?U)  option  setting
1094         within the pattern.         within the pattern.
1095    
1096           PCRE_UTF8           PCRE_UTF8
1097    
1098         This option causes PCRE to regard both the pattern and the  subject  as         This  option  causes PCRE to regard both the pattern and the subject as
1099         strings  of  UTF-8 characters instead of single-byte character strings.         strings of UTF-8 characters instead of single-byte  character  strings.
1100         However, it is available only when PCRE is built to include UTF-8  sup-         However,  it is available only when PCRE is built to include UTF-8 sup-
1101         port.  If not, the use of this option provokes an error. Details of how         port. If not, the use of this option provokes an error. Details of  how
1102         this option changes the behaviour of PCRE are given in the  section  on         this  option  changes the behaviour of PCRE are given in the section on
1103         UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
1104    
1105           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1106    
1107         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1108         automatically checked. If an invalid UTF-8 sequence of bytes is  found,         automatically  checked. If an invalid UTF-8 sequence of bytes is found,
1109         pcre_compile()  returns an error. If you already know that your pattern         pcre_compile() returns an error. If you already know that your  pattern
1110         is valid, and you want to skip this check for performance reasons,  you         is  valid, and you want to skip this check for performance reasons, you
1111         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of         can set the PCRE_NO_UTF8_CHECK option. When it is set,  the  effect  of
1112         passing an invalid UTF-8 string as a pattern is undefined. It may cause         passing an invalid UTF-8 string as a pattern is undefined. It may cause
1113         your  program  to  crash.   Note that this option can also be passed to         your program to crash.  Note that this option can  also  be  passed  to
1114         pcre_exec(),  to  suppress  the  UTF-8  validity  checking  of  subject         pcre_exec()  and pcre_dfa_exec(), to suppress the UTF-8 validity check-
1115         strings.         ing of subject strings.
1116    
1117    
1118    COMPILATION ERROR CODES
1119    
1120           The following table lists the error  codes  that  may  be  returned  by
1121           pcre_compile2(),  along with the error messages that may be returned by
1122           both compiling functions.
1123    
1124              0  no error
1125              1  \ at end of pattern
1126              2  \c at end of pattern
1127              3  unrecognized character follows \
1128              4  numbers out of order in {} quantifier
1129              5  number too big in {} quantifier
1130              6  missing terminating ] for character class
1131              7  invalid escape sequence in character class
1132              8  range out of order in character class
1133              9  nothing to repeat
1134             10  operand of unlimited repeat could match the empty string
1135             11  internal error: unexpected repeat
1136             12  unrecognized character after (?
1137             13  POSIX named classes are supported only within a class
1138             14  missing )
1139             15  reference to non-existent subpattern
1140             16  erroffset passed as NULL
1141             17  unknown option bit(s) set
1142             18  missing ) after comment
1143             19  parentheses nested too deeply
1144             20  regular expression too large
1145             21  failed to get memory
1146             22  unmatched parentheses
1147             23  internal error: code overflow
1148             24  unrecognized character after (?<
1149             25  lookbehind assertion is not fixed length
1150             26  malformed number or name after (?(
1151             27  conditional group contains more than two branches
1152             28  assertion expected after (?(
1153             29  (?R or (?digits must be followed by )
1154             30  unknown POSIX class name
1155             31  POSIX collating elements are not supported
1156             32  this version of PCRE is not compiled with PCRE_UTF8 support
1157             33  spare error
1158             34  character value in \x{...} sequence is too large
1159             35  invalid condition (?(0)
1160             36  \C not allowed in lookbehind assertion
1161             37  PCRE does not support \L, \l, \N, \U, or \u
1162             38  number after (?C is > 255
1163             39  closing ) for (?C expected
1164             40  recursive call could loop indefinitely
1165             41  unrecognized character after (?P
1166             42  syntax error after (?P
1167             43  two named subpatterns have the same name
1168             44  invalid UTF-8 string
1169             45  support for \P, \p, and \X has not been compiled
1170             46  malformed \P or \p sequence
1171             47  unknown property name after \P or \p
1172             48  subpattern name is too long (maximum 32 characters)
1173             49  too many named subpatterns (maximum 10,000)
1174             50  repeated subpattern is too long
1175             51  octal value is greater than \377 (not in UTF-8 mode)
1176    
1177    
1178  STUDYING A PATTERN  STUDYING A PATTERN
1179    
1180         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1181              const char **errptr);              const char **errptr);
1182    
1183         If  a  compiled  pattern is going to be used several times, it is worth         If a compiled pattern is going to be used several times,  it  is  worth
1184         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
1185         matching.  The function pcre_study() takes a pointer to a compiled pat-         matching. The function pcre_study() takes a pointer to a compiled  pat-
1186         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
1187         information  that  will  help speed up matching, pcre_study() returns a         information that will help speed up matching,  pcre_study()  returns  a
1188         pointer to a pcre_extra block, in which the study_data field points  to         pointer  to a pcre_extra block, in which the study_data field points to
1189         the results of the study.         the results of the study.
1190    
1191         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1192         pcre_exec(). However, a pcre_extra block  also  contains  other  fields         pcre_exec().  However,  a  pcre_extra  block also contains other fields
1193         that  can  be  set  by the caller before the block is passed; these are         that can be set by the caller before the block  is  passed;  these  are
1194         described below in the section on matching a pattern.         described below in the section on matching a pattern.
1195    
1196         If studying the pattern does not produce  any  additional  information,         If  studying  the  pattern  does not produce any additional information,
1197         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study() returns NULL. In that circumstance, if the calling program
1198         wants to pass any of the other fields to pcre_exec(), it  must  set  up         wants  to  pass  any of the other fields to pcre_exec(), it must set up
1199         its own pcre_extra block.         its own pcre_extra block.
1200    
1201         The  second  argument of pcre_study() contains option bits. At present,         The second argument of pcre_study() contains option bits.  At  present,
1202         no options are defined, and this argument should always be zero.         no options are defined, and this argument should always be zero.
1203    
1204         The third argument for pcre_study() is a pointer for an error  message.         The  third argument for pcre_study() is a pointer for an error message.
1205         If  studying  succeeds  (even  if no data is returned), the variable it         If studying succeeds (even if no data is  returned),  the  variable  it
1206         points to is set to NULL. Otherwise it points to a textual  error  mes-         points  to  is  set  to NULL. Otherwise it is set to point to a textual
1207         sage.  You should therefore test the error pointer for NULL after call-         error message. This is a static string that is part of the library. You
1208         ing pcre_study(), to be sure that it has run successfully.         must  not  try  to  free it. You should test the error pointer for NULL
1209           after calling pcre_study(), to be sure that it has run successfully.
1210    
1211         This is a typical call to pcre_study():         This is a typical call to pcre_study():
1212    
# Line 815  STUDYING A PATTERN Line 1224  STUDYING A PATTERN
1224  LOCALE SUPPORT  LOCALE SUPPORT
1225    
1226         PCRE handles caseless matching, and determines whether  characters  are         PCRE handles caseless matching, and determines whether  characters  are
1227         letters,  digits, or whatever, by reference to a set of tables, indexed         letters, digits,  or whatever, by reference to a set of tables, indexed
1228         by character value. (When running in UTF-8 mode, this applies  only  to         by character value. When running in UTF-8 mode, this  applies  only  to
1229         characters  with  codes  less than 128. Higher-valued codes never match         characters  with  codes  less than 128. Higher-valued codes never match
1230         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1231         with Unicode character property support.)         with  Unicode  character property support. The use of locales with Uni-
1232           code is discouraged.
1233    
1234         An  internal set of tables is created in the default C locale when PCRE         An internal set of tables is created in the default C locale when  PCRE
1235         is built. This is used when the final  argument  of  pcre_compile()  is         is  built.  This  is  used when the final argument of pcre_compile() is
1236         NULL,  and  is  sufficient for many applications. An alternative set of         NULL, and is sufficient for many applications. An  alternative  set  of
1237         tables can, however, be supplied. These may be created in  a  different         tables  can,  however, be supplied. These may be created in a different
1238         locale  from the default. As more and more applications change to using         locale from the default. As more and more applications change to  using
1239         Unicode, the need for this locale support is expected to die away.         Unicode, the need for this locale support is expected to die away.
1240    
1241         External tables are built by calling  the  pcre_maketables()  function,         External  tables  are  built by calling the pcre_maketables() function,
1242         which  has no arguments, in the relevant locale. The result can then be         which has no arguments, in the relevant locale. The result can then  be
1243         passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1244         example,  to  build  and use tables that are appropriate for the French         example, to build and use tables that are appropriate  for  the  French
1245         locale (where accented characters with  values  greater  than  128  are         locale  (where  accented  characters  with  values greater than 128 are
1246         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1247    
1248           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1249           tables = pcre_maketables();           tables = pcre_maketables();
1250           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1251    
1252         When  pcre_maketables()  runs,  the  tables are built in memory that is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1253         obtained via pcre_malloc. It is the caller's responsibility  to  ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1254         that  the memory containing the tables remains available for as long as         that the memory containing the tables remains available for as long  as
1255         it is needed.         it is needed.
1256    
1257         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
1258         pattern,  and the same tables are used via this pointer by pcre_study()         pattern, and the same tables are used via this pointer by  pcre_study()
1259         and normally also by pcre_exec(). Thus, by default, for any single pat-         and normally also by pcre_exec(). Thus, by default, for any single pat-
1260         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
1261         but different patterns can be compiled in different locales.         but different patterns can be compiled in different locales.
1262    
1263         It is possible to pass a table pointer or NULL (indicating the  use  of         It  is  possible to pass a table pointer or NULL (indicating the use of
1264         the  internal  tables)  to  pcre_exec(). Although not intended for this         the internal tables) to pcre_exec(). Although  not  intended  for  this
1265         purpose, this facility could be used to match a pattern in a  different         purpose,  this facility could be used to match a pattern in a different
1266         locale from the one in which it was compiled. Passing table pointers at         locale from the one in which it was compiled. Passing table pointers at
1267         run time is discussed below in the section on matching a pattern.         run time is discussed below in the section on matching a pattern.
1268    
# Line 862  INFORMATION ABOUT A PATTERN Line 1272  INFORMATION ABOUT A PATTERN
1272         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1273              int what, void *where);              int what, void *where);
1274    
1275         The pcre_fullinfo() function returns information about a compiled  pat-         The  pcre_fullinfo() function returns information about a compiled pat-
1276         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1277         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1278    
1279         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled
1280         pattern.  The second argument is the result of pcre_study(), or NULL if         pattern. The second argument is the result of pcre_study(), or NULL  if
1281         the pattern was not studied. The third argument specifies  which  piece         the  pattern  was not studied. The third argument specifies which piece
1282         of  information  is required, and the fourth argument is a pointer to a         of information is required, and the fourth argument is a pointer  to  a
1283         variable to receive the data. The yield of the  function  is  zero  for         variable  to  receive  the  data. The yield of the function is zero for
1284         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1285    
1286           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 878  INFORMATION ABOUT A PATTERN Line 1288  INFORMATION ABOUT A PATTERN
1288           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1289           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1290    
1291         The  "magic  number" is placed at the start of each compiled pattern as         The "magic number" is placed at the start of each compiled  pattern  as
1292         a simple check against passing an arbitrary memory pointer. Here is  a         a  simple check against passing an arbitrary memory pointer. Here is a
1293         typical  call  of pcre_fullinfo(), to obtain the length of the compiled         typical call of pcre_fullinfo(), to obtain the length of  the  compiled
1294         pattern:         pattern:
1295    
1296           int rc;           int rc;
1297           unsigned long int length;           size_t length;
1298           rc = pcre_fullinfo(           rc = pcre_fullinfo(
1299             re,               /* result of pcre_compile() */             re,               /* result of pcre_compile() */
1300             pe,               /* result of pcre_study(), or NULL */             pe,               /* result of pcre_study(), or NULL */
1301             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1302             &length);         /* where to put the data */             &length);         /* where to put the data */
1303    
1304         The possible values for the third argument are defined in  pcre.h,  and         The  possible  values for the third argument are defined in pcre.h, and
1305         are as follows:         are as follows:
1306    
1307           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1308    
1309         Return  the  number  of  the highest back reference in the pattern. The         Return the number of the highest back reference  in  the  pattern.  The
1310         fourth argument should point to an int variable. Zero  is  returned  if         fourth  argument  should  point to an int variable. Zero is returned if
1311         there are no back references.         there are no back references.
1312    
1313           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1314    
1315         Return  the  number of capturing subpatterns in the pattern. The fourth         Return the number of capturing subpatterns in the pattern.  The  fourth
1316         argument should point to an int variable.         argument should point to an int variable.
1317    
1318           PCRE_INFO_DEFAULTTABLES           PCRE_INFO_DEFAULT_TABLES
1319    
1320         Return a pointer to the internal default character tables within  PCRE.         Return  a pointer to the internal default character tables within PCRE.
1321         The  fourth  argument should point to an unsigned char * variable. This         The fourth argument should point to an unsigned char *  variable.  This
1322         information call is provided for internal use by the pcre_study() func-         information call is provided for internal use by the pcre_study() func-
1323         tion.  External  callers  can  cause PCRE to use its internal tables by         tion. External callers can cause PCRE to use  its  internal  tables  by
1324         passing a NULL table pointer.         passing a NULL table pointer.
1325    
1326           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1327    
1328         Return information about the first byte of any matched  string,  for  a         Return  information  about  the first byte of any matched string, for a
1329         non-anchored    pattern.    (This    option    used    to   be   called         non-anchored pattern. The fourth argument should point to an int  vari-
1330         PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
1331         compatibility.)         is still recognized for backwards compatibility.)
1332    
1333         If  there  is  a  fixed first byte, for example, from a pattern such as         If there is a fixed first byte, for example, from  a  pattern  such  as
1334         (cat|cow|coyote), it is returned in the integer pointed  to  by  where.         (cat|cow|coyote), it is returned in the integer pointed to by where. Otherwise, if either
        Otherwise, if either  
1335    
1336         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
1337         branch starts with "^", or         branch starts with "^", or
# Line 958  INFORMATION ABOUT A PATTERN Line 1367  INFORMATION ABOUT A PATTERN
1367    
1368         PCRE  supports the use of named as well as numbered capturing parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
1369         ses. The names are just an additional way of identifying the  parenthe-         ses. The names are just an additional way of identifying the  parenthe-
1370         ses,  which  still  acquire  numbers.  A  convenience  function  called         ses, which still acquire numbers. Several convenience functions such as
1371         pcre_get_named_substring() is provided  for  extracting  an  individual         pcre_get_named_substring() are provided for  extracting  captured  sub-
1372         captured  substring  by  name.  It is also possible to extract the data         strings  by  name. It is also possible to extract the data directly, by
1373         directly, by first converting the name to a number in order  to  access         first converting the name to a number in order to  access  the  correct
1374         the  correct  pointers in the output vector (described with pcre_exec()         pointers in the output vector (described with pcre_exec() below). To do
1375         below). To do the conversion, you need to use the  name-to-number  map,         the conversion, you need  to  use  the  name-to-number  map,  which  is
1376         which is described by these three values.         described by these three values.
1377    
1378         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1379         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
# Line 974  INFORMATION ABOUT A PATTERN Line 1383  INFORMATION ABOUT A PATTERN
1383         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1384         sis,  most  significant byte first. The rest of the entry is the corre-         sis,  most  significant byte first. The rest of the entry is the corre-
1385         sponding name, zero terminated. The names are  in  alphabetical  order.         sponding name, zero terminated. The names are  in  alphabetical  order.
1386         For  example,  consider  the following pattern (assume PCRE_EXTENDED is         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1387         set, so white space - including newlines - is ignored):         theses numbers. For example, consider  the  following  pattern  (assume
1388           PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1389           ignored):
1390    
1391           (?P<date> (?P<year>(\d\d)?\d\d) -           (?P<date> (?P<year>(\d\d)?\d\d) -
1392           (?P<month>\d\d) - (?P<day>\d\d) )           (?P<month>\d\d) - (?P<day>\d\d) )
# Line 991  INFORMATION ABOUT A PATTERN Line 1402  INFORMATION ABOUT A PATTERN
1402           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1403    
1404         When  writing  code  to  extract  data from named subpatterns using the         When  writing  code  to  extract  data from named subpatterns using the
1405         name-to-number map, remember that the length of each entry is likely to         name-to-number map, remember that the length of the entries  is  likely
1406         be different for each compiled pattern.         to be different for each compiled pattern.
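
The entry layout described above (two bytes of group number, most significant byte first, followed by a zero-terminated name) can be walked with a few lines of C. This is a minimal sketch under stated assumptions: demo_name_to_number() is a hypothetical helper, and the table pointer, entry count and entry size are assumed to have been obtained beforehand via pcre_fullinfo().

```c
#include <string.h>

/* Hypothetical helper: look up a subpattern name in a name-to-number
   map of count entries, each entry_size bytes long. Bytes 0-1 of an
   entry hold the group number (most significant byte first); the
   zero-terminated name follows. Returns the group number, or -1 if
   the name is not present. A linear scan is used for clarity. */
int demo_name_to_number(const unsigned char *table, int count,
                        int entry_size, const char *name)
{
    int i;

    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;

        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

Because the entry size is passed in rather than fixed, the same code works for patterns whose longest name differs.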
1407    
1408           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1409    
# Line 1051  OBSOLETE INFO FUNCTION Line 1462  OBSOLETE INFO FUNCTION
1462         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1463    
1464    
1465  MATCHING A PATTERN  REFERENCE COUNTS
1466    
1467           int pcre_refcount(pcre *code, int adjust);
1468    
1469           The  pcre_refcount()  function is used to maintain a reference count in
1470           the data block that contains a compiled pattern. It is provided for the
1471           benefit  of  applications  that  operate  in an object-oriented manner,
1472           where different parts of the application may be using the same compiled
1473           pattern, but you want to free the block when they are all done.
1474    
1475           When a pattern is compiled, the reference count field is initialized to
1476           zero.  It is changed only by calling this function, whose action is  to
1477           add  the  adjust  value  (which may be positive or negative) to it. The
1478           yield of the function is the new value. However, the value of the count
1479           is  constrained to lie between 0 and 65535, inclusive. If the new value
1480           is outside these limits, it is forced to the appropriate limit value.
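
The arithmetic just described (add the adjustment, then force the result into the range 0 to 65535) can be modelled as follows. This is a model of the documented behaviour only, not the library's source; demo_refcount_adjust() is a hypothetical name, and the real count lives inside the compiled pattern block.

```c
/* Hypothetical model: returns the new reference count after adding
   adjust to current, clamped to the documented limits. */
int demo_refcount_adjust(int current, int adjust)
{
    long long value = (long long)current + adjust;  /* avoid int overflow */

    if (value < 0)
        value = 0;
    if (value > 65535)
        value = 65535;
    return (int)value;
}
```

One consequence of the clamping is that a decrement at zero leaves the count at zero, so a caller cannot detect an unbalanced release from the returned value alone.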
1481    
1482           Except when it is zero, the reference count is not correctly  preserved
1483           if  a  pattern  is  compiled on one host and then transferred to a host
1484           whose byte-order is different. (This seems a highly unlikely scenario.)
1485    
1486    
1487    MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1488    
1489         int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
1490              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
# Line 1060  MATCHING A PATTERN Line 1493  MATCHING A PATTERN
1493         The  function pcre_exec() is called to match a subject string against a         The  function pcre_exec() is called to match a subject string against a
1494         compiled pattern, which is passed in the code argument. If the  pattern         compiled pattern, which is passed in the code argument. If the  pattern
1495         has been studied, the result of the study should be passed in the extra         has been studied, the result of the study should be passed in the extra
1496         argument.         argument. This function is the main matching facility of  the  library,
1497           and it operates in a Perl-like manner. For specialist use there is also
1498           an alternative matching function, which is described below in the  sec-
1499           tion about the pcre_dfa_exec() function.
1500    
1501         In most applications, the pattern will have been compiled (and  option-         In  most applications, the pattern will have been compiled (and option-
1502         ally  studied)  in the same process that calls pcre_exec(). However, it         ally studied) in the same process that calls pcre_exec().  However,  it
1503         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1504         later  in  different processes, possibly even on different hosts. For a         later in different processes, possibly even on different hosts.  For  a
1505         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1506    
1507         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1080  MATCHING A PATTERN Line 1516  MATCHING A PATTERN
1516             0,              /* start at offset 0 in the subject */             0,              /* start at offset 0 in the subject */
1517             0,              /* default options */             0,              /* default options */
1518             ovector,        /* vector of integers for substring information */             ovector,        /* vector of integers for substring information */
1519             30);            /* number of elements in the vector  (NOT  size  in             30);            /* number of elements (NOT size in bytes) */
        bytes) */  
1520    
1521     Extra data for pcre_exec()     Extra data for pcre_exec()
1522    
1523         If  the  extra argument is not NULL, it must point to a pcre_extra data         If  the  extra argument is not NULL, it must point to a pcre_extra data
1524         block. The pcre_study() function returns such a block (when it  doesn't         block. The pcre_study() function returns such a block (when it  doesn't
1525         return  NULL), but you can also create one for yourself, and pass addi-         return  NULL), but you can also create one for yourself, and pass addi-
1526         tional information in it. The fields in a pcre_extra block are as  fol-         tional information in it. The pcre_extra block contains  the  following
1527         lows:         fields (not necessarily in this order):
1528    
1529           unsigned long int flags;           unsigned long int flags;
1530           void *study_data;           void *study_data;
1531           unsigned long int match_limit;           unsigned long int match_limit;
1532             unsigned long int match_limit_recursion;
1533           void *callout_data;           void *callout_data;
1534           const unsigned char *tables;           const unsigned char *tables;
1535    
# Line 1102  MATCHING A PATTERN Line 1538  MATCHING A PATTERN
1538    
1539           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1540           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1541             PCRE_EXTRA_MATCH_LIMIT_RECURSION
1542           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1543           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1544    
# Line 1118  MATCHING A PATTERN Line 1555  MATCHING A PATTERN
1555         repeats.         repeats.
1556    
1557         Internally, PCRE uses a function called match() which it calls  repeat-         Internally, PCRE uses a function called match() which it calls  repeat-
1558         edly  (sometimes  recursively).  The  limit is imposed on the number of         edly  (sometimes  recursively). The limit set by match_limit is imposed
1559         times this function is called during a match, which has the  effect  of         on the number of times this function is called during  a  match,  which
1560         limiting  the amount of recursion and backtracking that can take place.         has  the  effect  of  limiting the amount of backtracking that can take
1561         For patterns that are not anchored, the count starts from zero for each         place. For patterns that are not anchored, the count restarts from zero
1562         position in the subject string.         for each position in the subject string.
1563    
1564         The  default  limit  for the library can be set when PCRE is built; the         The  default  value  for  the  limit can be set when PCRE is built; the
1565         default default is 10 million, which handles all but the  most  extreme         default default is 10 million, which handles all but the  most  extreme
1566         cases.  You  can  reduce  the  default  by supplying pcre_exec() with a         cases.  You  can  override  the default by supplying pcre_exec() with a
1567         pcre_extra block in which match_limit is set to a  smaller  value,  and         pcre_extra    block    in    which    match_limit    is    set,     and
1568         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
1569         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1570    
1571           The match_limit_recursion field is similar to match_limit, but  instead
1572           of limiting the total number of times that match() is called, it limits
1573           the depth of recursion. The recursion depth is a  smaller  number  than
1574           the  total number of calls, because not all calls to match() are recur-
1575           sive.  This limit is of use only if it is set smaller than match_limit.
1576    
1577           Limiting  the  recursion  depth  limits the amount of stack that can be
1578           used, or, when PCRE has been compiled to use memory on the heap instead
1579           of the stack, the amount of heap memory that can be used.
1580    
1581           The  default  value  for  match_limit_recursion can be set when PCRE is
1582           built; the default default  is  the  same  value  as  the  default  for
1583           match_limit. You can override the default by supplying pcre_exec() with
1584           a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and
1585           PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the
1586           limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
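
The difference between the two limits can be seen in a toy backtracking matcher. The sketch below is not PCRE's match() function. It handles only literal characters and "x*" items, and all the names in it (toy_match, toy_limits, the TOY_ERROR_ macros) are invented for illustration. The numeric codes reuse the values given for PCRE_ERROR_MATCHLIMIT and PCRE_ERROR_RECURSIONLIMIT in this document. It counts every call, as match_limit does, and separately checks the nesting depth, as match_limit_recursion does.

```c
#include <stddef.h>

#define TOY_ERROR_NOMATCH        (-1)
#define TOY_ERROR_MATCHLIMIT     (-8)
#define TOY_ERROR_RECURSIONLIMIT (-21)

typedef struct {
    unsigned long calls;        /* total calls made so far */
    unsigned long call_limit;   /* plays the role of match_limit */
    unsigned long depth_limit;  /* plays the role of match_limit_recursion */
} toy_limits;

/* Toy anchored matcher: pat may contain literal characters and "x*".
   Returns 1 if pat matches the whole of sub, otherwise a negative code. */
int toy_match(const char *pat, const char *sub,
              unsigned long depth, toy_limits *lim)
{
    if (++lim->calls > lim->call_limit) return TOY_ERROR_MATCHLIMIT;
    if (depth > lim->depth_limit) return TOY_ERROR_RECURSIONLIMIT;

    if (*pat == '\0') return (*sub == '\0') ? 1 : TOY_ERROR_NOMATCH;

    if (pat[1] == '*') {
        /* Greedy repeat: take as many characters as possible, then give
           them back one at a time (backtracking). */
        size_t n = 0;
        while (sub[n] == pat[0]) n++;
        for (;;) {
            int rc = toy_match(pat + 2, sub + n, depth + 1, lim);
            if (rc != TOY_ERROR_NOMATCH) return rc;  /* match or limit hit */
            if (n == 0) return TOY_ERROR_NOMATCH;
            n--;
        }
    }

    if (*sub == *pat) return toy_match(pat + 1, sub + 1, depth + 1, lim);
    return TOY_ERROR_NOMATCH;
}
```

Matching "a*a*a*b" against twenty "a" characters makes a few thousand calls in this toy, yet the nesting never goes deeper than 3. A small call limit therefore stops the search long before a depth limit of the same size would.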
1587    
1588         The callout_data field is used in conjunction with the  "callout"  fea-         The callout_data field is used in conjunction with the  "callout"  fea-
1589         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1590    
# Line 1148  MATCHING A PATTERN Line 1602  MATCHING A PATTERN
1602     Option bits for pcre_exec()     Option bits for pcre_exec()
1603    
1604         The unused bits of the options argument for pcre_exec() must  be  zero.         The unused bits of the options argument for pcre_exec() must  be  zero.
1605         The   only  bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NOTBOL,         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
1606         PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1607           PCRE_PARTIAL.
1608    
1609           PCRE_ANCHORED           PCRE_ANCHORED
1610    
1611         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first
1612         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or
1613         turned out to be anchored by virtue of its contents, it cannot be  made         turned  out to be anchored by virtue of its contents, it cannot be made
1614         unanchored at matching time.         unanchored at matching time.
1615    
1616             PCRE_NEWLINE_CR
1617             PCRE_NEWLINE_LF
1618             PCRE_NEWLINE_CRLF
1619    
1620           These options override  the  newline  definition  that  was  chosen  or
1621           defaulted  when the pattern was compiled. For details, see the descrip-
1622           tion of pcre_compile() above. During matching, the newline choice affects
1623           the behaviour of the dot, circumflex, and dollar metacharacters.
1624    
1625           PCRE_NOTBOL           PCRE_NOTBOL
1626    
1627         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
1628         the beginning of a line, so the  circumflex  metacharacter  should  not         the beginning of a line, so the  circumflex  metacharacter  should  not
1629         match  before it. Setting this without PCRE_MULTILINE (at compile time)         match  before it. Setting this without PCRE_MULTILINE (at compile time)
1630         causes  circumflex  never  to  match.  This  option  affects  only  the         causes circumflex never to match. This option affects only  the  behav-
1631         behaviour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
1632    
1633           PCRE_NOTEOL           PCRE_NOTEOL
1634    
# Line 1293  MATCHING A PATTERN Line 1757  MATCHING A PATTERN
1757         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
1758         tor[1], identify the portion of  the  subject  string  matched  by  the         tor[1], identify the portion of  the  subject  string  matched  by  the
1759         entire  pattern.  The next pair is used for the first capturing subpat-         entire  pattern.  The next pair is used for the first capturing subpat-
1760         tern, and so on. The value returned by pcre_exec()  is  the  number  of         tern, and so on. The value returned by pcre_exec() is one more than the
1761         pairs  that  have  been set. If there are no capturing subpatterns, the         highest numbered pair that has been set. For example, if two substrings
1762         return value from a successful match is 1,  indicating  that  just  the         have been captured, the returned value is 3. If there are no  capturing
1763         first pair of offsets has been set.         subpatterns,  the return value from a successful match is 1, indicating
1764           that just the first pair of offsets has been set.
        Some  convenience  functions  are  provided for extracting the captured  
        substrings as separate strings. These are described  in  the  following  
        section.  
   
        It  is  possible  for  an capturing subpattern number n+1 to match some  
        part of the subject when subpattern n has not been  used  at  all.  For  
        example, if the string "abc" is matched against the pattern (a|(z))(bc)  
        subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both  
        offset values corresponding to the unused subpattern are set to -1.  
1765    
1766         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1767         of the string that it matched that is returned.         of the string that it matched that is returned.
1768    
1769         If the vector is too small to hold all the captured substring  offsets,         If  the vector is too small to hold all the captured substring offsets,
1770         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
1771         function returns a value of zero. In particular, if the substring  off-         function  returns a value of zero. In particular, if the substring off-
1772         sets are not of interest, pcre_exec() may be called with ovector passed         sets are not of interest, pcre_exec() may be called with ovector passed
1773         as NULL and ovecsize as zero. However, if  the  pattern  contains  back         as  NULL  and  ovecsize  as zero. However, if the pattern contains back
1774         references  and  the  ovector is not big enough to remember the related         references and the ovector is not big enough to  remember  the  related
1775         substrings, PCRE has to get additional memory for use during  matching.         substrings,  PCRE has to get additional memory for use during matching.
1776         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1777    
1778         Note  that  pcre_info() can be used to find out how many capturing sub-         The pcre_info() function can be used to find  out  how  many  capturing
1779         patterns there are in a compiled pattern. The smallest size for ovector         subpatterns  there  are  in  a  compiled pattern. The smallest size for
1780         that  will  allow for n captured substrings, in addition to the offsets         ovector that will allow for n captured substrings, in addition  to  the
1781         of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
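
The sizing arithmetic can be written down as two small helpers. Both names are invented for illustration and are not PCRE functions. The second helper assumes that a vector whose size is not a multiple of three is rounded down.

```c
/* Hypothetical helper: smallest number of ovector elements that leaves
   room for the whole-pattern pair plus n captured substrings. */
int sketch_min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Hypothetical helper: number of offset pairs that can be returned from a
   vector of ovecsize elements. Only two-thirds of the vector holds
   offsets, so each pair effectively costs three elements. */
int sketch_max_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

A vector of 30 elements, as in the example call shown earlier, therefore has room for the whole match and 9 captured substrings.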
1782    
1783           It  is  possible for capturing subpattern number n+1 to match some part
1784           of the subject when subpattern n has not been used at all. For example,
1785           if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
1786           return from the function is 4, and subpatterns 1 and 3 are matched, but
1787           2  is  not.  When  this happens, both values in the offset pairs corre-
1788           sponding to unused subpatterns are set to -1.
1789    
1790           Offset values that correspond to unused subpatterns at the end  of  the
1791           expression  are  also  set  to  -1. For example, if the string "abc" is
1792           matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
1793           matched.  The  return  from the function is 2, because the highest used
1794           capturing subpattern number is 1. However, you can refer to the offsets
1795           for  the  second  and third capturing subpatterns if you wish (assuming
1796           the vector is large enough, of course).
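
The offsets for the two examples above can be read back as in the following sketch. The helper sketch_capture() and its return codes are invented for illustration and are not part of PCRE. The vectors mentioned in the usage note are filled in by hand with the values the text describes, not produced by a real call to pcre_exec().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the text of offset pair n into buf.
   Returns the length copied, -1 if the pair is unset (both offsets are
   -1), or -2 if buf is too small. These codes are local to this sketch. */
int sketch_capture(const char *subject, const int *ovector, int n,
                   char *buf, size_t bufsize)
{
    int start = ovector[2 * n];      /* offset of first character */
    int end = ovector[2 * n + 1];    /* offset just past the last one */
    size_t len;

    if (start < 0) return -1;        /* subpattern was not used */
    len = (size_t)(end - start);
    if (len + 1 > bufsize) return -2;

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

With the subject "abc" and the pairs {0,3, 0,1, -1,-1, 1,3} from the (a|(z))(bc) example, the helper yields "a" for pair 1, reports pair 2 as unset, and yields "bc" for pair 3. With the pairs {0,3, 0,3, -1,-1, -1,-1} from the (abc)(x(yz)?)? example, pairs 2 and 3 are both reported as unset.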
1797    
1798           Some convenience functions are provided  for  extracting  the  captured
1799           substrings as separate strings. These are described below.
1800    
1801     Return values from pcre_exec()     Error return values from pcre_exec()
1802    
1803         If pcre_exec() fails, it returns a negative number. The  following  are         If  pcre_exec()  fails, it returns a negative number. The following are
1804         defined in the header file:         defined in the header file:
1805    
1806           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 1336  MATCHING A PATTERN Line 1809  MATCHING A PATTERN
1809    
1810           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
1811    
1812         Either  code  or  subject  was  passed as NULL, or ovector was NULL and         Either code or subject was passed as NULL,  or  ovector  was  NULL  and
1813         ovecsize was not zero.         ovecsize was not zero.
1814    
1815           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 1345  MATCHING A PATTERN Line 1818  MATCHING A PATTERN
1818    
1819           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1820    
1821         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE  stores a 4-byte "magic number" at the start of the compiled code,
1822         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
1823         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
1824         an  environment  with the other endianness. This is the error that PCRE         an environment with the other endianness. This is the error  that  PCRE
1825         gives when the magic number is not present.         gives when the magic number is not present.
1826    
1827           PCRE_ERROR_UNKNOWN_NODE   (-5)           PCRE_ERROR_UNKNOWN_NODE   (-5)
1828    
1829         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
1830         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled pattern. This error could be caused by a bug  in  PCRE  or  by
1831         overwriting of the compiled pattern.         overwriting of the compiled pattern.
1832    
1833           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1834    
1835         If a pattern contains back references, but the ovector that  is  passed         If  a  pattern contains back references, but the ovector that is passed
1836         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
1837         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE  gets  a  block of memory at the start of matching to use for this
1838         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose. If the call via pcre_malloc() fails, this error is given.  The
1839         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
1840    
1841           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1842    
1843         This error is used by the pcre_copy_substring(),  pcre_get_substring(),         This  error is used by the pcre_copy_substring(), pcre_get_substring(),
1844         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
1845         returned by pcre_exec().         returned by pcre_exec().
1846    
1847           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
1848    
1849         The recursion and backtracking limit, as specified by  the  match_limit         The  backtracking  limit,  as  specified  by the match_limit field in a
1850         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         pcre_extra structure (or defaulted) was reached.  See  the  description
1851           above.
1852    
1853             PCRE_ERROR_RECURSIONLIMIT (-21)
1854    
1855           The internal recursion limit, as specified by the match_limit_recursion
1856           field in a pcre_extra structure (or defaulted)  was  reached.  See  the
1857         description above.         description above.
1858    
1859           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
1860    
1861         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
1862         use  by  callout functions that want to yield a distinctive error code.         use by callout functions that want to yield a distinctive  error  code.
1863         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
1864    
1865           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
1866    
1867         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A  string  that contains an invalid UTF-8 byte sequence was passed as a
1868         subject.         subject.
1869    
1870           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
1871    
1872         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
1873         value of startoffset did not point to the beginning of a UTF-8  charac-         value  of startoffset did not point to the beginning of a UTF-8 charac-
1874         ter.         ter.
1875    
1876           PCRE_ERROR_PARTIAL (-12)           PCRE_ERROR_PARTIAL        (-12)
1877    
1878         The  subject  string did not match, but it did match partially. See the         The subject string did not match, but it did match partially.  See  the
1879         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
1880    
1881           PCRE_ERROR_BAD_PARTIAL (-13)           PCRE_ERROR_BADPARTIAL     (-13)
1882    
1883         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing         The  PCRE_PARTIAL  option  was  used with a compiled pattern containing
1884         items  that are not supported for partial matching. See the pcrepartial         items that are not supported for partial matching. See the  pcrepartial
1885         documentation for details of partial matching.         documentation for details of partial matching.
1886    
1887           PCRE_ERROR_INTERNAL (-14)           PCRE_ERROR_INTERNAL       (-14)
1888    
1889         An unexpected internal error has occurred. This error could  be  caused         An  unexpected  internal error has occurred. This error could be caused
1890         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
1891    
1892           PCRE_ERROR_BADCOUNT (-15)           PCRE_ERROR_BADCOUNT       (-15)
1893    
1894         This  error is given if the value of the ovecsize argument is negative.         This error is given if the value of the ovecsize argument is  negative.
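
When logging a failed match it can help to turn the code into its name. The sketch below uses only the numeric values listed in this section. The function names are invented for illustration, and a real program would use the symbolic names from the header file instead of bare numbers.

```c
/* Hypothetical helper: maps a negative return from pcre_exec() to the
   name given in this section. The numbers are those listed above. */
const char *sketch_exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_NODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    default:  return (rc >= 0) ? "not an error" : "unknown error";
    }
}

/* Hypothetical helper: true for the two resource-limit errors, which a
   caller may want to report differently from an ordinary failure. */
int sketch_is_limit_error(int rc)
{
    return rc == -8 || rc == -21;
}
```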
1895    
1896    
1897  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
# Line 1428  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1907  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1907         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
1908              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
1909    
1910         Captured substrings can be  accessed  directly  by  using  the  offsets         Captured  substrings  can  be  accessed  directly  by using the offsets
1911         returned  by  pcre_exec()  in  ovector.  For convenience, the functions         returned by pcre_exec() in  ovector.  For  convenience,  the  functions
1912         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
1913         string_list()  are  provided for extracting captured substrings as new,         string_list() are provided for extracting captured substrings  as  new,
1914         separate, zero-terminated strings. These functions identify  substrings         separate,  zero-terminated strings. These functions identify substrings
1915         by  number.  The  next section describes functions for extracting named         by number. The next section describes functions  for  extracting  named
1916         substrings. A substring  that  contains  a  binary  zero  is  correctly         substrings.
1917         extracted  and  has  a further zero added on the end, but the result is  
1918         not, of course, a C string.         A  substring that contains a binary zero is correctly extracted and has
1919           a further zero added on the end, but the result is not, of course, a  C
1920           string.   However,  you  can  process such a string by referring to the
1921           length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
1922           string().  Unfortunately, the interface to pcre_get_substring_list() is
1923           not adequate for handling strings containing binary zeros, because  the
1924           end of the final string is not independently indicated.
1925    
1926         The first three arguments are the same for all  three  of  these  func-         The  first  three  arguments  are the same for all three of these func-
1927         tions:  subject  is  the subject string that has just been successfully         tions: subject is the subject string that has  just  been  successfully
1928         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
1929         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
1930         were captured by the match, including the substring  that  matched  the         were  captured  by  the match, including the substring that matched the
1931         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
1932         it is greater than zero. If pcre_exec() returned zero, indicating  that         it  is greater than zero. If pcre_exec() returned zero, indicating that
1933         it  ran out of space in ovector, the value passed as stringcount should         it ran out of space in ovector, the value passed as stringcount  should
1934         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
1935    
1936         The functions pcre_copy_substring() and pcre_get_substring() extract  a         The  functions pcre_copy_substring() and pcre_get_substring() extract a
1937         single  substring,  whose  number  is given as stringnumber. A value of         single substring, whose number is given as  stringnumber.  A  value  of
1938         zero extracts the substring that matched the  entire  pattern,  whereas         zero  extracts  the  substring that matched the entire pattern, whereas
1939         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
1940         string(), the string is placed in buffer,  whose  length  is  given  by         string(),  the  string  is  placed  in buffer, whose length is given by
1941         buffersize,  while  for  pcre_get_substring()  a new block of memory is         buffersize, while for pcre_get_substring() a new  block  of  memory  is
1942         obtained via pcre_malloc, and its address is  returned  via  stringptr.         obtained  via  pcre_malloc,  and its address is returned via stringptr.
1943         The  yield  of  the function is the length of the string, not including         The yield of the function is the length of the  string,  not  including
1944         the terminating zero, or one of         the terminating zero, or one of
1945    
1946           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1947    
1948         The buffer was too small for pcre_copy_substring(), or the  attempt  to         The  buffer  was too small for pcre_copy_substring(), or the attempt to
1949         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
1950    
1951           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1952    
1953         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
1954    
1955         The  pcre_get_substring_list()  function  extracts  all  available sub-         The pcre_get_substring_list()  function  extracts  all  available  sub-
1956         strings and builds a list of pointers to them. All this is  done  in  a         strings  and  builds  a list of pointers to them. All this is done in a
1957         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
1958         the memory block is returned via listptr, which is also  the  start  of         the  memory  block  is returned via listptr, which is also the start of
1959         the  list  of  string pointers. The end of the list is marked by a NULL         the list of string pointers. The end of the list is marked  by  a  NULL
1960         pointer. The yield of the function is zero if all went well, or         pointer. The yield of the function is zero if all went well, or
1961    
1962           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1963    
1964         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
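
The single-block layout described above can be sketched as follows. This is an illustration of the idea only, not the code of pcre_get_substring_list(). The function name is invented, and malloc() is used directly where PCRE would call through pcre_malloc.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: builds a NULL-terminated list of copies of the
   first count substrings described by ovector, with the pointers and the
   string data held in one malloc() block. Unset substrings become empty
   strings. Returns 0 on success, or -6 (the NOMEMORY value listed in this
   document) if the block cannot be obtained. */
int sketch_substring_list(const char *subject, const int *ovector,
                          int count, const char ***listptr)
{
    size_t bytes = ((size_t)count + 1) * sizeof(char *);  /* pointers + NULL */
    char **list;
    char *p;
    int i;

    for (i = 0; i < count; i++) {
        int start = ovector[2 * i];
        int end = ovector[2 * i + 1];
        bytes += (size_t)(start < 0 ? 0 : end - start) + 1;  /* text + zero */
    }

    list = malloc(bytes);
    if (list == NULL) return -6;

    p = (char *)(list + count + 1);       /* string data follows pointers */
    for (i = 0; i < count; i++) {
        int start = ovector[2 * i];
        size_t len = (start < 0) ? 0 : (size_t)(ovector[2 * i + 1] - start);
        if (len > 0) memcpy(p, subject + start, len);
        p[len] = '\0';
        list[i] = p;
        p += len + 1;
    }
    list[count] = NULL;                   /* end-of-list marker */

    *listptr = (const char **)list;
    return 0;
}
```

Because everything lives in one block, a single call to free() on the returned list pointer releases the pointers and the strings together.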
1965    
1966         When any of these functions encounter a substring that is unset,  which         When  any of these functions encounter a substring that is unset, which
1967         can  happen  when  capturing subpattern number n+1 matches some part of         can happen when capturing subpattern number n+1 matches  some  part  of
1968         the subject, but subpattern n has not been used at all, they return  an         the  subject, but subpattern n has not been used at all, they return an
1969         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
1970         string by inspecting the appropriate offset in ovector, which is  nega-         string  by inspecting the appropriate offset in ovector, which is nega-
1971         tive for unset substrings.         tive for unset substrings.
1972    
1973         The  two convenience functions pcre_free_substring() and pcre_free_sub-         The two convenience functions pcre_free_substring() and  pcre_free_sub-
1974         string_list() can be used to free the memory  returned  by  a  previous         string_list()  can  be  used  to free the memory returned by a previous
1975         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
1976         tively. They do nothing more than  call  the  function  pointed  to  by         tively.  They  do  nothing  more  than  call the function pointed to by
1977         pcre_free,  which  of course could be called directly from a C program.         pcre_free, which of course could be called directly from a  C  program.
1978         However, PCRE is used in some situations where it is linked via a  spe-         However,  PCRE is used in some situations where it is linked via a spe-
1979         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
1980         pcre_free directly; it is  for  these  cases  that  the  functions  are         pcre_free  directly;  it is for these cases that the functions are pro-
1981         provided.         vided.
1982    
1983    
1984  EXTRACTING CAPTURED SUBSTRINGS BY NAME  EXTRACTING CAPTURED SUBSTRINGS BY NAME
# Line 1511  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 1996  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1996              int stringcount, const char *stringname,              int stringcount, const char *stringname,
1997              const char **stringptr);              const char **stringptr);
1998    
1999         To extract a substring by name, you first have to find the  associated         To extract a substring by name, you first have to find the  associated
2000         number. For example, for this pattern         number. For example, for this pattern
2001    
2002           (a+)b(?<xxx>\d+)...           (a+)b(?P<xxx>\d+)...
2003    
2004         the number of the subpattern called "xxx" is 2. You can find the number         the number of the subpattern called "xxx" is 2. If the name is known to
2005         from the name by calling pcre_get_stringnumber(). The first argument is         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2006         the compiled pattern, and the second is the  name.  The  yield  of  the         name by calling pcre_get_stringnumber(). The first argument is the com-
2007         function  is  the  subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if         piled pattern, and the second is the name. The yield of the function is
2008         there is no subpattern of that name.         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2009           subpattern of that name.
2010    
2011         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2012         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
# Line 1541  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2027  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2027         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2028         ate.         ate.
2029    
 Last updated: 09 September 2004  
 Copyright (c) 1997-2004 University of Cambridge.  
 -----------------------------------------------------------------------------  
2030    
2031  PCRE(3)                                                                PCRE(3)  DUPLICATE SUBPATTERN NAMES
2032    
2033           int pcre_get_stringtable_entries(const pcre *code,
2034                const char *name, char **first, char **last);
2035    
2036           When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
2037           subpatterns are not required to  be  unique.  Normally,  patterns  with
2038           duplicate  names  are such that in any one match, only one of the named
2039           subpatterns participates. An example is shown in the pcrepattern  docu-
2040           mentation. When duplicates are present, pcre_copy_named_substring() and
2041           pcre_get_named_substring() return the first substring corresponding  to
2042           the  given  name  that  is  set.  If  none  are set, an empty string is
2043           returned.  The pcre_get_stringnumber() function returns one of the num-
2044           bers  that are associated with the name, but it is not defined which it
2045           is.
2046    
2047           If you want to get full details of all captured substrings for a  given
2048           name,  you  must  use  the pcre_get_stringtable_entries() function. The
2049           first argument is the compiled pattern, and the second is the name. The
2050           third  and  fourth  are  pointers to variables which are updated by the
2051           function. After it has run, they point to the first and last entries in
2052           the  name-to-number  table  for  the  given  name.  The function itself
2053           returns the length of each entry, or  PCRE_ERROR_NOSUBSTRING  if  there
2054           are  none.  The  format  of the table is described above in the section
2055           entitled Information about a pattern. Given all  the  relevant  entries
2056           for the name, you can extract each of their numbers, and hence the cap-
2057           tured data, if any.
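
As a rough sketch of the last step, the helper below walks the entries between first and last and collects their subpattern numbers. It assumes the name-table layout of the 8-bit library, in which each entry begins with the subpattern number in two bytes, most significant byte first, followed by the zero-terminated name. The function name collect_group_numbers and its output parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: decode the subpattern numbers of all name-table
   entries from first to last inclusive. entry_size is the value
   returned by pcre_get_stringtable_entries(). Returns the number of
   values written to numbers[], at most max_numbers. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max_numbers)
{
    int count = 0;
    const char *entry;

    assert(entry_size > 2 && first <= last);
    for (entry = first; entry <= last && count < max_numbers;
         entry += entry_size) {
        const unsigned char *p = (const unsigned char *)entry;
        numbers[count++] = (p[0] << 8) | p[1];   /* big-endian number */
    }
    return count;
}
```

Each number obtained this way can then be checked against ovector to see whether that particular subpattern was set in the match.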
2058    
2059    
2060    FINDING ALL POSSIBLE MATCHES
2061    
2062           The traditional matching function uses a  similar  algorithm  to  Perl,
2063           which stops when it finds the first match, starting at a given point in
2064           the subject. If you want to find all possible matches, or  the  longest
2065           possible  match,  consider using the alternative matching function (see
2066           below) instead. If you cannot use the alternative function,  but  still
2067           need  to  find all possible matches, you can kludge it up by making use
2068           of the callout facility, which is described in the pcrecallout documen-
2069           tation.
2070    
2071           What you have to do is to insert a callout right at the end of the pat-
2072           tern.  When your callout function is called, extract and save the  cur-
2073           rent  matched  substring.  Then  return  1, which forces pcre_exec() to
2074           backtrack and try other alternatives. Ultimately, when it runs  out  of
2075           matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
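
The callout approach might look roughly like the sketch below. To keep it compilable without pcre.h, it uses a stand-in structure, demo_callout_block, holding only the three pcre_callout_block fields that the sketch reads; in a real program the function would take a pcre_callout_block pointer and be assigned to pcre_callout. The names save_match_callout, saved, and saved_count are hypothetical, as are the fixed storage limits.

```c
#include <assert.h>
#include <string.h>

/* Stand-in (an assumption for this sketch) for the fields of
   pcre_callout_block that are used below. */
typedef struct {
    const char *subject;          /* the subject string */
    int         start_match;      /* offset where this attempt started */
    int         current_position; /* offset of the current match pointer */
} demo_callout_block;

#define MAX_SAVED 16
#define MAX_LEN   128

char saved[MAX_SAVED][MAX_LEN];   /* matches recorded so far */
int  saved_count = 0;

/* Record the substring matched so far, then return 1 so that the
   matcher treats this point as a failure and backtracks. */
int save_match_callout(demo_callout_block *block)
{
    int length = block->current_position - block->start_match;

    assert(length >= 0);
    if (saved_count < MAX_SAVED && length < MAX_LEN) {
        memcpy(saved[saved_count], block->subject + block->start_match,
               (size_t)length);
        saved[saved_count][length] = '\0';
        saved_count++;
    }
    return 1;
}
```

With the callout placed at the very end of the pattern, every complete match reaches this function once before pcre_exec() finally yields PCRE_ERROR_NOMATCH.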
2076    
2077    
2078    MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
2079    
2080           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
2081                const char *subject, int length, int startoffset,
2082                int options, int *ovector, int ovecsize,
2083                int *workspace, int wscount);
2084    
2085           The  function  pcre_dfa_exec()  is  called  to  match  a subject string
2086           against a compiled pattern, using a "DFA" matching algorithm. This  has
2087           different  characteristics to the normal algorithm, and is not compati-
2088           ble with Perl. Some of the features of PCRE patterns are not supported.
2089           Nevertheless, there are times when this kind of matching can be useful.
2090           For a discussion of the two matching algorithms, see  the  pcrematching
2091           documentation.
2092    
2093           The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2094           pcre_exec(), plus two extras. The ovector argument is used in a differ-
2095           ent  way,  and  this is described below. The other common arguments are
2096           used in the same way as for pcre_exec(), so their  description  is  not
2097           repeated here.
2098    
2099           The  two  additional  arguments provide workspace for the function. The
2100           workspace vector should contain at least 20 elements. It  is  used  for
2101           keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2102           workspace will be needed for patterns and subjects where  there  are  a
2103           lot of potential matches.
2104    
2105           Here is an example of a simple call to pcre_dfa_exec():
2106    
2107             int rc;
2108             int ovector[10];
2109             int wspace[20];
2110             rc = pcre_dfa_exec(
2111               re,             /* result of pcre_compile() */
2112               NULL,           /* we didn't study the pattern */
2113               "some string",  /* the subject string */
2114               11,             /* the length of the subject string */
2115               0,              /* start at offset 0 in the subject */
2116               0,              /* default options */
2117               ovector,        /* vector of integers for substring information */
2118               10,             /* number of elements (NOT size in bytes) */
2119               wspace,         /* working space vector */
2120               20);            /* number of elements (NOT size in bytes) */
2121    
2122       Option bits for pcre_dfa_exec()
2123    
2124           The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2125           zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2126           LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
2127           PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2128           three of these are the same as for pcre_exec(), so their description is
2129           not repeated here.
2130    
2131             PCRE_PARTIAL
2132    
2133           This has the same general effect as it does for  pcre_exec(),  but  the
2134           details   are   slightly   different.  When  PCRE_PARTIAL  is  set  for
2135           pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is  converted  into
2136           PCRE_ERROR_PARTIAL  if  the  end  of the subject is reached, there have
2137           been no complete matches, but there is still at least one matching pos-
2138           sibility.  The portion of the string that provided the partial match is
2139           set as the first matching string.
2140    
2141             PCRE_DFA_SHORTEST
2142    
2143           Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2144           stop  as  soon  as  it  has found one match. Because of the way the DFA
2145           algorithm works, this is necessarily the shortest possible match at the
2146           first possible matching point in the subject string.
2147    
2148             PCRE_DFA_RESTART
2149    
2150           When  pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option, and
2151           returns a partial match, it is possible to call it  again,  with  addi-
2152           tional  subject  characters,  and have it continue with the same match.
2153           The PCRE_DFA_RESTART option requests this action; when it is  set,  the
2154           workspace and wscount arguments must reference the same vector as before
2155           because data about the match so far is left in  them  after  a  partial
2156           match.  There  is  more  discussion of this facility in the pcrepartial
2157           documentation.
2158    
2159       Successful returns from pcre_dfa_exec()
2160    
2161           When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
2162           string in the subject. Note, however, that all the matches from one run
2163           of the function start at the same point in  the  subject.  The  shorter
2164           matches  are all initial substrings of the longer matches. For example,
2165           if the pattern
2166    
2167             <.*>
2168    
2169           is matched against the string
2170    
2171             This is <something> <something else> <something further> no more
2172    
2173           the three matched strings are
2174    
2175             <something>
2176             <something> <something else>
2177             <something> <something else> <something further>
2178    
2179           On success, the yield of the function is a number  greater  than  zero,
2180           which  is  the  number of matched substrings. The substrings themselves
2181           are returned in ovector. Each string uses two elements;  the  first  is
2182           the  offset  to the start, and the second is the offset to the end. All
2183           the strings have the same start offset. (Space could have been saved by
2184           giving  this only once, but it was decided to retain some compatibility
2185           with the way pcre_exec() returns data, even though the meaning  of  the
2186           strings is different.)
2187    
2188           The strings are returned in reverse order of length; that is, the long-
2189           est matching string is given first. If there were too many  matches  to
2190           fit  into ovector, the yield of the function is zero, and the vector is
2191           filled with the longest matches.
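
The layout of ovector after a successful call can be read back as in the following sketch. The helper names dfa_match_count and dfa_copy_match are hypothetical; the code assumes only what is stated above, namely that each match uses a pair of offsets, that the longest match comes first, and that a yield of zero means the vector was filled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: number of offset pairs holding valid matches.
   rc is the yield of pcre_dfa_exec(); ovecsize is the element count
   passed to it. */
int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0) return rc;             /* rc matches were returned */
    if (rc == 0) return ovecsize / 2;  /* vector filled with longest ones */
    return 0;                          /* negative: error or no match */
}

/* Hypothetical helper: copy match number index (0 is the longest) into
   buffer as a zero-terminated string. Returns its length, or -1 if
   buffer is too small. */
int dfa_copy_match(const char *subject, const int *ovector, int index,
                   char *buffer, int size)
{
    int start  = ovector[2 * index];
    int length = ovector[2 * index + 1] - start;

    assert(length >= 0);
    if (length >= size) return -1;
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

For the <.*> example above, index 0 would give the longest of the three strings and index 2 the shortest, "<something>".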
2192    
2193       Error returns from pcre_dfa_exec()
2194    
2195           The pcre_dfa_exec() function returns a negative number when  it  fails.
2196           Many  of  the  errors  are  the  same as for pcre_exec(), and these are
2197           described above.  There are in addition the following errors  that  are
2198           specific to pcre_dfa_exec():
2199    
2200             PCRE_ERROR_DFA_UITEM      (-16)
2201    
2202           This  return is given if pcre_dfa_exec() encounters an item in the pat-
2203           tern that it does not support, for instance, the use of \C  or  a  back
2204           reference.
2205    
2206             PCRE_ERROR_DFA_UCOND      (-17)
2207    
2208           This  return is given if pcre_dfa_exec() encounters a condition item in
2209           a pattern that uses a back reference for the  condition.  This  is  not
2210           supported.
2211    
2212             PCRE_ERROR_DFA_UMLIMIT    (-18)
2213    
2214           This  return  is given if pcre_dfa_exec() is called with an extra block
2215           that contains a setting of the match_limit field. This is not supported
2216           (it is meaningless).
2217    
2218             PCRE_ERROR_DFA_WSSIZE     (-19)
2219    
2220           This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
2221           workspace vector.
2222    
2223             PCRE_ERROR_DFA_RECURSE    (-20)
2224    
2225           When a recursive subpattern is processed, the matching  function  calls
2226           itself  recursively,  using  private vectors for ovector and workspace.
2227           This error is given if the output vector  is  not  large  enough.  This
2228           should be extremely rare, as a vector of size 1000 is used.
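
A caller might translate these codes into messages along the following lines. This is only a sketch: dfa_error_text is a hypothetical helper, and it uses the numeric values listed above directly so that it does not depend on pcre.h. With the header available, the PCRE_ERROR_DFA_xxx names would be used in the case labels instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: describe the error codes that are specific to
   pcre_dfa_exec(). Returns NULL for any other negative code, which
   the caller should handle as for pcre_exec(). */
const char *dfa_error_text(int rc)
{
    assert(rc < 0);
    switch (rc) {
    case -16: return "unsupported item in pattern (PCRE_ERROR_DFA_UITEM)";
    case -17: return "unsupported back reference condition (PCRE_ERROR_DFA_UCOND)";
    case -18: return "match_limit set in extra block (PCRE_ERROR_DFA_UMLIMIT)";
    case -19: return "workspace vector too small (PCRE_ERROR_DFA_WSSIZE)";
    case -20: return "recursion output vector too small (PCRE_ERROR_DFA_RECURSE)";
    default:  return NULL;
    }
}
```

Of these, only PCRE_ERROR_DFA_WSSIZE is normally recoverable at run time, by calling pcre_dfa_exec() again with a larger workspace vector.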
2229    
2230    Last updated: 08 June 2006
2231    Copyright (c) 1997-2006 University of Cambridge.
2232    ------------------------------------------------------------------------------
2233    
2234    
2235    PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2236    
2237    
2238  NAME  NAME
2239         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2240    
2241    
2242  PCRE CALLOUTS  PCRE CALLOUTS
2243    
2244         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
# Line 1606  MISSING CALLOUTS Line 2293  MISSING CALLOUTS
2293  THE CALLOUT INTERFACE  THE CALLOUT INTERFACE
2294    
2295         During matching, when PCRE reaches a callout point, the external  func-         During matching, when PCRE reaches a callout point, the external  func-
2296         tion  defined  by pcre_callout is called (if it is set). The only argu-         tion  defined by pcre_callout is called (if it is set). This applies to
2297         ment is a pointer to a pcre_callout block. This structure contains  the         both the pcre_exec() and the pcre_dfa_exec()  matching  functions.  The
2298         following fields:         only  argument  to  the callout function is a pointer to a pcre_callout
2299           block. This structure contains the following fields:
2300    
2301           int          version;           int          version;
2302           int          callout_number;           int          callout_number;
# Line 1623  THE CALLOUT INTERFACE Line 2311  THE CALLOUT INTERFACE
2311           int          pattern_position;           int          pattern_position;
2312           int          next_item_length;           int          next_item_length;
2313    
2314         The  version  field  is an integer containing the version number of the         The version field is an integer containing the version  number  of  the
2315         block format. The initial version was 0; the current version is 1.  The         block  format. The initial version was 0; the current version is 1. The
2316         version  number  will  change  again in future if additional fields are         version number will change again in future  if  additional  fields  are
2317         added, but the intention is never to remove any of the existing fields.         added, but the intention is never to remove any of the existing fields.
2318    
2319         The  callout_number  field  contains the number of the callout, as com-         The callout_number field contains the number of the  callout,  as  com-
2320         piled into the pattern (that is, the number after ?C for  manual  call-         piled  into  the pattern (that is, the number after ?C for manual call-
2321         outs, and 255 for automatically generated callouts).         outs, and 255 for automatically generated callouts).
2322    
2323         The  offset_vector field is a pointer to the vector of offsets that was         The offset_vector field is a pointer to the vector of offsets that  was
2324         passed by the caller to pcre_exec(). The contents can be  inspected  in         passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When
2325         order  to extract substrings that have been matched so far, in the same         pcre_exec() is used, the contents can be inspected in order to  extract
2326         way as for extracting substrings after a match has completed.         substrings  that  have  been  matched  so  far,  in the same way as for
2327           extracting substrings after a match has completed. For  pcre_dfa_exec()
2328           this field is not useful.
2329    
2330         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
2331         were passed to pcre_exec().         were passed to pcre_exec().
2332    
2333         The  start_match  field contains the offset within the subject at which         The start_match field contains the offset within the subject  at  which
2334         the current match attempt started. If the pattern is not anchored,  the         the  current match attempt started. If the pattern is not anchored, the
2335         callout function may be called several times from the same point in the         callout function may be called several times from the same point in the
2336         pattern for different starting points in the subject.         pattern for different starting points in the subject.
2337    
2338         The current_position field contains the offset within  the  subject  of         The  current_position  field  contains the offset within the subject of
2339         the current match pointer.         the current match pointer.
2340    
2341         The  capture_top field contains one more than the number of the highest         When the pcre_exec() function is used, the capture_top  field  contains
2342         numbered captured substring so far. If no  substrings  have  been  cap-         one  more than the number of the highest numbered captured substring so
2343         tured, the value of capture_top is one.         far. If no substrings have been captured, the value of  capture_top  is
2344           one.  This  is always the case when pcre_dfa_exec() is used, because it
2345         The  capture_last  field  contains the number of the most recently cap-         does not support captured substrings.
2346         tured substring. If no substrings have been captured, its value is  -1.  
2347           The capture_last field contains the number of the  most  recently  cap-
2348         The  callout_data  field contains a value that is passed to pcre_exec()         tured  substring. If no substrings have been captured, its value is -1.
2349         by the caller specifically so that it can be passed back  in  callouts.         This is always the case when pcre_dfa_exec() is used.
2350         It  is  passed  in the pcre_callout field of the pcre_extra data struc-  
2351         ture. If no such data was  passed,  the  value  of  callout_data  in  a         The callout_data field contains a value that is passed  to  pcre_exec()
2352         pcre_callout  block  is  NULL. There is a description of the pcre_extra         or  pcre_dfa_exec() specifically so that it can be passed back in call-
2353           outs. It is passed in the callout_data field  of  the  pcre_extra  data
2354           structure.  If  no such data was passed, the value of callout_data in a
2355           pcre_callout block is NULL. There is a description  of  the  pcre_extra
2356         structure in the pcreapi documentation.         structure in the pcreapi documentation.
2357    
2358         The pattern_position field is present from version 1 of the  pcre_call-         The  pattern_position field is present from version 1 of the pcre_call-
2359         out structure. It contains the offset to the next item to be matched in         out structure. It contains the offset to the next item to be matched in
2360         the pattern string.         the pattern string.
2361    
2362         The next_item_length field is present from version 1 of the  pcre_call-         The  next_item_length field is present from version 1 of the pcre_call-
2363         out structure. It contains the length of the next item to be matched in         out structure. It contains the length of the next item to be matched in
2364         the pattern string. When the callout immediately precedes  an  alterna-         the  pattern  string. When the callout immediately precedes an alterna-
2365         tion  bar, a closing parenthesis, or the end of the pattern, the length         tion bar, a closing parenthesis, or the end of the pattern, the  length
2366         is zero. When the callout precedes an opening parenthesis,  the  length         is  zero.  When the callout precedes an opening parenthesis, the length
2367         is that of the entire subpattern.         is that of the entire subpattern.
2368    
2369         The  pattern_position  and next_item_length fields are intended to help         The pattern_position and next_item_length fields are intended  to  help
2370         in distinguishing between different automatic callouts, which all  have         in  distinguishing between different automatic callouts, which all have
2371         the same callout number. However, they are set for all callouts.         the same callout number. However, they are set for all callouts.
2372    
2373    
2374  RETURN VALUES  RETURN VALUES
2375    
2376         The  external callout function returns an integer to PCRE. If the value         The external callout function returns an integer to PCRE. If the  value
2377         is zero, matching proceeds as normal. If  the  value  is  greater  than         is  zero,  matching  proceeds  as  normal. If the value is greater than
2378         zero,  matching  fails  at  the current point, but backtracking to test         zero, matching fails at the current point, but  the  testing  of  other
2379         other matching possibilities goes ahead, just as if a lookahead  asser-         matching possibilities goes ahead, just as if a lookahead assertion had
2380         tion  had  failed.  If  the value is less than zero, the match is aban-         failed. If the value is less than zero, the  match  is  abandoned,  and
2381         doned, and pcre_exec() returns the negative value.         pcre_exec() (or pcre_dfa_exec()) returns the negative value.
2382    
2383         Negative  values  should  normally  be   chosen   from   the   set   of         Negative   values   should   normally   be   chosen  from  the  set  of
2384         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
2385         dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is         dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
2386         reserved  for  use  by callout functions; it will never be used by PCRE         reserved for use by callout functions; it will never be  used  by  PCRE
2387         itself.         itself.
2388    
2389  Last updated: 09 September 2004  Last updated: 28 February 2005
2390  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2005 University of Cambridge.
2391  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
2392    
 PCRE(3)                                                                PCRE(3)  
2393    
2394    PCRECOMPAT(3)                                                    PCRECOMPAT(3)
2395    
2396    
2397  NAME  NAME
2398         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2399    
2400    
2401  DIFFERENCES BETWEEN PCRE AND PERL  DIFFERENCES BETWEEN PCRE AND PERL
2402    
2403         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2404         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
2405         respect to Perl 5.8.         respect to Perl 5.8.
2406    
2407         1.  PCRE does not have full UTF-8 support. Details of what it does have         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
2408         are given in the section on UTF-8 support in the main pcre page.         of what it does have are given in the section on UTF-8 support  in  the
2409           main pcre page.
2410    
2411         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2412         permits  them,  but they do not mean what you might think. For example,         permits them, but they do not mean what you might think.  For  example,
2413         (?!a){3} does not assert that the next three characters are not "a". It         (?!a){3} does not assert that the next three characters are not "a". It
2414         just asserts that the next character is not "a" three times.         just asserts that the next character is not "a" three times.
2415    
2416         3.  Capturing  subpatterns  that occur inside negative lookahead asser-         3. Capturing subpatterns that occur inside  negative  lookahead  asser-
2417         tions are counted, but their entries in the offsets  vector  are  never         tions  are  counted,  but their entries in the offsets vector are never
2418         set.  Perl sets its numerical variables from any such patterns that are         set. Perl sets its numerical variables from any such patterns that  are
2419         matched before the assertion fails to match something (thereby succeed-         matched before the assertion fails to match something (thereby succeed-
2420         ing),  but  only  if the negative lookahead assertion contains just one         ing), but only if the negative lookahead assertion  contains  just  one
2421         branch.         branch.
2422    
2423         4. Though binary zero characters are supported in the  subject  string,         4.  Though  binary zero characters are supported in the subject string,
2424         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
2425         mal C string, terminated by zero. The escape sequence \0 can be used in         mal C string, terminated by zero. The escape sequence \0 can be used in
2426         the pattern to represent a binary zero.         the pattern to represent a binary zero.
2427    
2428         5.  The  following Perl escape sequences are not supported: \l, \u, \L,         5. The following Perl escape sequences are not supported: \l,  \u,  \L,
2429         \U, and \N. In fact these are implemented by Perl's general string-han-         \U, and \N. In fact these are implemented by Perl's general string-han-
2430         dling  and are not part of its pattern matching engine. If any of these         dling and are not part of its pattern matching engine. If any of  these
2431         are encountered by PCRE, an error is generated.         are encountered by PCRE, an error is generated.
2432    
2433         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE         6.  The Perl escape sequences \p, \P, and \X are supported only if PCRE
2434         is  built  with Unicode character property support. The properties that         is built with Unicode character property support. The  properties  that
2435         can be tested with \p and \P are limited to the general category  prop-         can  be tested with \p and \P are limited to the general category prop-
2436         erties such as Lu and Nd.         erties such as Lu and Nd, script names such as Greek or  Han,  and  the
2437           derived properties Any and L&.
2438    
2439         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2440         ters in between are treated as literals.  This  is  slightly  different         ters in between are treated as literals.  This  is  slightly  different
# Line 1778  DIFFERENCES BETWEEN PCRE AND PERL Line 2474  DIFFERENCES BETWEEN PCRE AND PERL
2474         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2475    
2476         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2477         cial meaning is faulted.         cial meaning  is  faulted.  Otherwise,  like  Perl,  the  backslash  is
2478           ignored. (Perl can be made to issue a warning.)
2479    
2480         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
2481         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
2482         lowed by a question mark they are.         lowed by a question mark they are.
2483    
2484         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2485         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
2486    
2487         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-
2488         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
2489    
2490         (g) The (?R), (?number), and (?P>name) constructs allow  for  recursive         (g)  The (?R), (?number), and (?P>name) constructs allow for recursive
2491         pattern  matching  (Perl  can  do  this using the (?p{code}) construct,         pattern matching (Perl can do  this  using  the  (?p{code})  construct,
2492         which PCRE cannot support.)         which PCRE cannot support.)
2493    
2494         (h) PCRE supports named capturing substrings, using the Python  syntax.         (h)  PCRE supports named capturing substrings, using the Python syntax.
2495    
2496         (i)  PCRE  supports  the  possessive quantifier "++" syntax, taken from         (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from
2497         Sun's Java package.         Sun's Java package.
2498    
2499         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j) The (R) condition, for testing recursion, is a PCRE extension.
# Line 1808  DIFFERENCES BETWEEN PCRE AND PERL Line 2505  DIFFERENCES BETWEEN PCRE AND PERL
2505         (m) Patterns compiled by PCRE can be saved and re-used at a later time,         (m) Patterns compiled by PCRE can be saved and re-used at a later time,
2506         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
2507    
2508  Last updated: 09 September 2004         (n) The alternative matching function (pcre_dfa_exec())  matches  in  a
2509  Copyright (c) 1997-2004 University of Cambridge.         different way and is not Perl-compatible.
2510  -----------------------------------------------------------------------------  
2511    Last updated: 06 June 2006
2512    Copyright (c) 1997-2006 University of Cambridge.
2513    ------------------------------------------------------------------------------
2514    
 PCRE(3)                                                                PCRE(3)  
2515    
2516    PCREPATTERN(3)                                                  PCREPATTERN(3)
2517    
2518    
2519  NAME  NAME
2520         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2521    
2522    
2523  PCRE REGULAR EXPRESSION DETAILS  PCRE REGULAR EXPRESSION DETAILS
2524    
2525         The  syntax  and semantics of the regular expressions supported by PCRE         The  syntax  and semantics of the regular expressions supported by PCRE
# Line 1836  PCRE REGULAR EXPRESSION DETAILS Line 2537  PCRE REGULAR EXPRESSION DETAILS
2537         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
2538         page.         page.
2539    
2540           The remainder of this document discusses the  patterns  that  are  sup-
2541           ported  by  PCRE when its main matching function, pcre_exec(), is used.
2542           From  release  6.0,   PCRE   offers   a   second   matching   function,
2543           pcre_dfa_exec(),  which matches using a different algorithm that is not
2544           Perl-compatible. The advantages and disadvantages  of  the  alternative
2545           function, and how it differs from the normal function, are discussed in
2546           the pcrematching page.
2547    
2548         A regular expression is a pattern that is  matched  against  a  subject         A regular expression is a pattern that is  matched  against  a  subject
2549         string  from  left  to right. Most characters stand for themselves in a         string  from  left  to right. Most characters stand for themselves in a
2550         pattern, and match the corresponding characters in the  subject.  As  a         pattern, and match the corresponding characters in the  subject.  As  a
# Line 1843  PCRE REGULAR EXPRESSION DETAILS Line 2552  PCRE REGULAR EXPRESSION DETAILS
2552    
2553           The quick brown fox           The quick brown fox
2554    
2555         matches  a portion of a subject string that is identical to itself. The         matches a portion of a subject string that is identical to itself. When
2556         power of regular expressions comes from the ability to include alterna-         caseless matching is specified (the PCRE_CASELESS option), letters  are
2557         tives  and repetitions in the pattern. These are encoded in the pattern         matched  independently  of case. In UTF-8 mode, PCRE always understands
2558         by the use of metacharacters, which do not  stand  for  themselves  but         the concept of case for characters whose values are less than  128,  so
2559         instead are interpreted in some special way.         caseless  matching  is always possible. For characters with higher val-
2560           ues, the concept of case is supported if PCRE is compiled with  Unicode
2561         There  are  two different sets of metacharacters: those that are recog-         property  support,  but  not  otherwise.   If  you want to use caseless
2562         nized anywhere in the pattern except within square brackets, and  those         matching for characters 128 and above, you must  ensure  that  PCRE  is
2563         that  are  recognized  in square brackets. Outside square brackets, the         compiled with Unicode property support as well as with UTF-8 support.
2564    
2565           The  power  of  regular  expressions  comes from the ability to include
2566           alternatives and repetitions in the pattern. These are encoded  in  the
2567           pattern by the use of metacharacters, which do not stand for themselves
2568           but instead are interpreted in some special way.
2569    
2570           There are two different sets of metacharacters: those that  are  recog-
2571           nized  anywhere in the pattern except within square brackets, and those
2572           that are recognized in square brackets. Outside  square  brackets,  the
2573         metacharacters are as follows:         metacharacters are as follows:
2574    
2575           \      general escape character with several uses           \      general escape character with several uses
# Line 1870  PCRE REGULAR EXPRESSION DETAILS Line 2588  PCRE REGULAR EXPRESSION DETAILS
2588                  also "possessive quantifier"                  also "possessive quantifier"
2589           {      start min/max quantifier           {      start min/max quantifier
2590    
2591         Part of a pattern that is in square brackets  is  called  a  "character         Part  of  a  pattern  that is in square brackets is called a "character
2592         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
2593    
2594           \      general escape character           \      general escape character
# Line 1880  PCRE REGULAR EXPRESSION DETAILS Line 2598  PCRE REGULAR EXPRESSION DETAILS
2598                    syntax)                    syntax)
2599           ]      terminates the character class           ]      terminates the character class
2600    
2601         The  following sections describe the use of each of the metacharacters.         The following sections describe the use of each of the  metacharacters.
2602    
2603    
2604  BACKSLASH  BACKSLASH
2605    
2606         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2607         a  non-alphanumeric  character,  it takes away any special meaning that         a non-alphanumeric character, it takes away any  special  meaning  that
2608         character may have. This  use  of  backslash  as  an  escape  character         character  may  have.  This  use  of  backslash  as an escape character
2609         applies both inside and outside character classes.         applies both inside and outside character classes.
2610    
2611         For  example,  if  you want to match a * character, you write \* in the         For example, if you want to match a * character, you write  \*  in  the
2612         pattern.  This escaping action applies whether  or  not  the  following         pattern.   This  escaping  action  applies whether or not the following
2613         character  would  otherwise be interpreted as a metacharacter, so it is         character would otherwise be interpreted as a metacharacter, so  it  is
2614         always safe to precede a non-alphanumeric  with  backslash  to  specify         always  safe  to  precede  a non-alphanumeric with backslash to specify
2615         that  it stands for itself. In particular, if you want to match a back-         that it stands for itself. In particular, if you want to match a  back-
2616         slash, you write \\.         slash, you write \\.
2617    
2618         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
2619         the  pattern (other than in a character class) and characters between a         the pattern (other than in a character class) and characters between  a
2620         # outside a character class and the next newline character are ignored.         # outside a character class and the next newline are ignored. An escap-
2621         An  escaping backslash can be used to include a whitespace or # charac-         ing backslash can be used to include a whitespace  or  #  character  as
2622         ter as part of the pattern.         part of the pattern.
2623    
2624         If you want to remove the special meaning from a  sequence  of  charac-         If  you  want  to remove the special meaning from a sequence of charac-
2625         ters,  you can do so by putting them between \Q and \E. This is differ-         ters, you can do so by putting them between \Q and \E. This is  differ-
2626         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
2627         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
2628         tion. Note the following examples:         tion. Note the following examples:
2629    
2630           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 1916  BACKSLASH Line 2634  BACKSLASH
2634           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
2635           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
2636    
2637         The \Q...\E sequence is recognized both inside  and  outside  character         The  \Q...\E  sequence  is recognized both inside and outside character
2638         classes.         classes.
2639    
2640     Non-printing characters     Non-printing characters
2641    
2642         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
2643         acters in patterns in a visible manner. There is no restriction on  the         acters  in patterns in a visible manner. There is no restriction on the
2644         appearance  of non-printing characters, apart from the binary zero that         appearance of non-printing characters, apart from the binary zero  that
2645         terminates a pattern, but when a pattern  is  being  prepared  by  text         terminates  a  pattern,  but  when  a pattern is being prepared by text
2646         editing,  it  is  usually  easier  to  use  one of the following escape         editing, it is usually easier  to  use  one  of  the  following  escape
2647         sequences than the binary character it represents:         sequences than the binary character it represents:
2648    
2649           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 1937  BACKSLASH Line 2655  BACKSLASH
2655           \t        tab (hex 09)           \t        tab (hex 09)
2656           \ddd      character with octal code ddd, or backreference           \ddd      character with octal code ddd, or backreference
2657           \xhh      character with hex code hh           \xhh      character with hex code hh
2658           \x{hhh..} character with hex code hhh... (UTF-8 mode only)           \x{hhh..} character with hex code hhh..
2659    
2660         The precise effect of \cx is as follows: if x is a lower  case  letter,         The  precise  effect of \cx is as follows: if x is a lower case letter,
2661         it  is converted to upper case. Then bit 6 of the character (hex 40) is         it is converted to upper case. Then bit 6 of the character (hex 40)  is
2662         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
2663         becomes hex 7B.         becomes hex 7B.
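
The \cx rule above is simple arithmetic, and the three quoted results can be checked with a few lines of C. This is only a sketch of the rule as stated, assuming ASCII character codes; control_escape_value() is a hypothetical helper written for this illustration and is not part of the PCRE API.

```c
#include <ctype.h>

/* Hypothetical helper (not a PCRE function): returns the character code
   that \cx stands for, following the rule described in the text. */
int control_escape_value(int x)
{
    /* A lower case letter is first converted to upper case. */
    if (islower((unsigned char)x))
        x = toupper((unsigned char)x);

    /* Then bit 6 of the character (hex 40) is inverted. */
    return x ^ 0x40;
}
```

With ASCII codes, control_escape_value('z') gives hex 1A, control_escape_value('{') gives hex 3B, and control_escape_value(';') gives hex 7B, in agreement with the values quoted above.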
2664    
2665         After  \x, from zero to two hexadecimal digits are read (letters can be         After \x, from zero to two hexadecimal digits are read (letters can  be
2666         in upper or lower case). In UTF-8 mode, any number of hexadecimal  dig-         in  upper  or  lower case). Any number of hexadecimal digits may appear
2667         its  may  appear between \x{ and }, but the value of the character code         between \x{ and }, but the value of the character  code  must  be  less
2668         must be less than 2**31 (that is,  the  maximum  hexadecimal  value  is         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2669         7FFFFFFF).  If  characters other than hexadecimal digits appear between         the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than
2670         \x{ and }, or if there is no terminating }, this form of escape is  not         hexadecimal  digits  appear between \x{ and }, or if there is no termi-
2671         recognized. Instead, the initial \x will be interpreted as a basic hex-         nating }, this form of escape is not recognized.  Instead, the  initial
2672         adecimal escape, with no following digits,  giving  a  character  whose         \x will be interpreted as a basic hexadecimal escape, with no following
2673         value is zero.         digits, giving a character whose value is zero.
2674    
2675         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2676         two syntaxes for \x when PCRE is in UTF-8 mode. There is no  difference         two  syntaxes  for  \x. There is no difference in the way they are han-
2677         in  the  way they are handled. For example, \xdc is exactly the same as         dled. For example, \xdc is exactly the same as \x{dc}.
2678         \x{dc}.  
2679           After \0 up to two further octal digits are read. If  there  are  fewer
2680         After \0 up to two further octal digits are read.  In  both  cases,  if         than  two  digits,  just  those  that  are  present  are used. Thus the
2681         there  are fewer than two digits, just those that are present are used.         sequence \0\x\07 specifies two binary zeros followed by a BEL character
2682         Thus the sequence \0\x\07 specifies two binary zeros followed by a  BEL         (code  value 7). Make sure you supply two digits after the initial zero
2683         character  (code  value  7).  Make sure you supply two digits after the         if the pattern character that follows is itself an octal digit.
        initial zero if the pattern character that follows is itself  an  octal  
        digit.  
2684    
2685         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2686         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
2687         its  as  a  decimal  number. If the number is less than 10, or if there         its as a decimal number. If the number is less than  10,  or  if  there
2688         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
2689         expression,  the  entire  sequence  is  taken  as  a  back reference. A         expression, the entire  sequence  is  taken  as  a  back  reference.  A
2690         description of how this works is given later, following the  discussion         description  of how this works is given later, following the discussion
2691         of parenthesized subpatterns.         of parenthesized subpatterns.
2692    
2693         Inside  a  character  class, or if the decimal number is greater than 9         Inside a character class, or if the decimal number is  greater  than  9
2694         and there have not been that many capturing subpatterns, PCRE  re-reads         and  there have not been that many capturing subpatterns, PCRE re-reads
2695         up  to three octal digits following the backslash, and generates a sin-         up to three octal digits following the backslash, and uses them to gen-
2696         gle byte from the least significant 8 bits of the value. Any subsequent         erate  a data character. Any subsequent digits stand for themselves. In
2697         digits stand for themselves.  For example:         non-UTF-8 mode, the value of a character specified  in  octal  must  be
2698           less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For
2699           example:
2700    
2701           \040   is another way of writing a space           \040   is another way of writing a space
2702           \40    is the same, provided there are fewer than 40           \40    is the same, provided there are fewer than 40
# Line 1995  BACKSLASH Line 2713  BACKSLASH
2713           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
2714                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
2715    
2716         Note  that  octal  values of 100 or greater must not be introduced by a         Note that octal values of 100 or greater must not be  introduced  by  a
2717         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
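
The two-stage reading of a backslash followed by a non-zero digit can be sketched in C as follows. This is an illustration of the rule as described in the text, not PCRE's own code: the digit_escape type, the classify_digit_escape() function, and its parameters are all hypothetical names introduced here.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch; not a PCRE structure. */
typedef struct {
    int    is_backref;  /* 1: back reference, 0: data character      */
    long   value;       /* group number, or character code           */
    size_t consumed;    /* number of digits used after the backslash */
} digit_escape;

/* digits points just past the backslash; its first character is 1 to 9.
   groups_so_far is the number of capturing left parentheses seen so far.
   in_class is nonzero when the escape appears inside a character class. */
digit_escape classify_digit_escape(const char *digits, int groups_so_far,
                                   int in_class)
{
    digit_escape r = {0, 0, 0};
    size_t i = 0;

    if (!in_class) {
        /* First reading: all following digits form one decimal number. */
        long n = 0;
        while (digits[i] >= '0' && digits[i] <= '9') {
            if (n < 100000)            /* saturate to avoid overflow */
                n = n * 10 + (digits[i] - '0');
            i++;
        }
        if (n < 10 || n <= groups_so_far) {
            r.is_backref = 1;
            r.value = n;
            r.consumed = i;
            return r;
        }
    }

    /* Second reading: up to three octal digits give a data character.
       "8" and "9" are not octal digits, so \81 yields a binary zero
       and leaves "8" and "1" to stand for themselves. */
    for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
        r.value = r.value * 8 + (digits[i] - '0');
    r.consumed = i;
    return r;
}
```

For the digits "40" with no capturing parentheses so far, the sketch reports a data character with value 32 (a space); with 40 or more it reports back reference 40. The sketch does not enforce the upper limit on octal values, so a caller would still have to reject values of \400 or more in non-UTF-8 mode.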
2718    
2719         All the sequences that define a single byte value  or  a  single  UTF-8         All the sequences that define a single character value can be used both
2720         character (in UTF-8 mode) can be used both inside and outside character         inside and outside character classes. In addition, inside  a  character
2721         classes. In addition, inside a character  class,  the  sequence  \b  is         class,  the  sequence \b is interpreted as the backspace character (hex
2722         interpreted as the backspace character (hex 08), and the sequence \X is         08), and the sequence \X is interpreted as the character "X". Outside a
2723         interpreted as the character "X".  Outside  a  character  class,  these         character class, these sequences have different meanings (see below).
        sequences have different meanings (see below).  
2724    
2725     Generic character types     Generic character types
2726    
# Line 2028  BACKSLASH Line 2745  BACKSLASH
2745    
2746         For  compatibility  with Perl, \s does not match the VT character (code         For  compatibility  with Perl, \s does not match the VT character (code
2747         11).  This makes it different from the POSIX "space" class. The  \s         11).  This makes it different from the POSIX "space" class. The  \s
2748         characters are HT (9), LF (10), FF (12), CR (13), and space (32).         characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If
2749           "use locale;" is included in a Perl script, \s may match the VT charac-
2750           ter. In PCRE, it never does.)
2751    
2752         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
2753         is a letter or digit. The definition of  letters  and  digits  is  con-         is a letter or digit. The definition of  letters  and  digits  is  con-
# Line 2040  BACKSLASH Line 2759  BACKSLASH
2759    
2760         In  UTF-8 mode, characters with values greater than 128 never match \d,         In  UTF-8 mode, characters with values greater than 128 never match \d,
2761         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2762         code character property support is available.         code  character  property support is available. The use of locales with
2763           Unicode is discouraged.
2764    
2765     Unicode character properties     Unicode character properties
2766    
2767         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
2768         tional escape sequences to match generic character types are  available         tional  escape  sequences  to  match character properties are available
2769         when UTF-8 mode is selected. They are:         when UTF-8 mode is selected. They are:
2770    
2771          \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
2772          \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
2773          \X       an extended Unicode sequence           \X       an extended Unicode sequence
2774    
2775         The  property  names represented by xx above are limited to the Unicode         The property names represented by xx above are limited to  the  Unicode
2776         general category properties. Each character has exactly one such  prop-         script names, the general category properties, and "Any", which matches
2777         erty,  specified  by  a two-letter abbreviation. For compatibility with         any character (including newline). Other properties such as "InMusical-
2778         Perl, negation can be specified by including a circumflex  between  the         Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does
2779         opening  brace  and the property name. For example, \p{^Lu} is the same         not match any characters, so always causes a match failure.
2780         as \P{Lu}.  
2781           Sets of Unicode characters are defined as belonging to certain scripts.
2782         If only one letter is specified with \p or  \P,  it  includes  all  the         A  character from one of these sets can be matched using a script name.
2783         properties that start with that letter. In this case, in the absence of         For example:
2784         negation, the curly brackets in the escape sequence are optional; these  
2785         two examples have the same effect:           \p{Greek}
2786             \P{Han}
2787    
2788           Those that are not part of an identified script are lumped together  as
2789           "Common". The current list of scripts is:
2790    
2791           Arabic,  Armenian,  Bengali,  Bopomofo, Braille, Buginese, Buhid, Cana-
2792           dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic,  Deseret,
2793           Devanagari,  Ethiopic,  Georgian,  Glagolitic, Gothic, Greek, Gujarati,
2794           Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,  Inherited,  Kannada,
2795           Katakana,  Kharoshthi,  Khmer,  Lao, Latin, Limbu, Linear_B, Malayalam,
2796           Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
2797           Osmanya,  Runic,  Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-
2798           banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
2799           Ugaritic, Yi.
2800    
2801           Each  character has exactly one general category property, specified by
2802           a two-letter abbreviation. For compatibility with Perl, negation can be
2803           specified  by  including a circumflex between the opening brace and the
2804           property name. For example, \p{^Lu} is the same as \P{Lu}.
2805    
2806           If only one letter is specified with \p or \P, it includes all the gen-
2807           eral  category properties that start with that letter. In this case, in
2808           the absence of negation, the curly brackets in the escape sequence  are
2809           optional; these two examples have the same effect:
2810    
2811           \p{L}           \p{L}
2812           \pL           \pL
2813    
2814         The following property codes are supported:         The following general category property codes are supported:
2815    
2816           C     Other           C     Other
2817           Cc    Control           Cc    Control
# Line 2113  BACKSLASH Line 2857  BACKSLASH
2857           Zp    Paragraph separator           Zp    Paragraph separator
2858           Zs    Space separator           Zs    Space separator
2859    
2860         Extended  properties such as "Greek" or "InMusicalSymbols" are not sup-         The  special property L& is also supported: it matches a character that
2861         ported by PCRE.         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
2862           classified as a modifier or "other".
2863    
2864           The  long  synonyms  for  these  properties that Perl supports (such as
2865           \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
2866           any of these properties with "Is".
2867    
2868           No character that is in the Unicode table has the Cn (unassigned) prop-
2869           erty.  Instead, this property is assumed for any code point that is not
2870           in the Unicode table.
2871    
2872         Specifying caseless matching does not affect  these  escape  sequences.         Specifying  caseless  matching  does not affect these escape sequences.
2873         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
2874    
2875         The  \X  escape  matches  any number of Unicode characters that form an         The \X escape matches any number of Unicode  characters  that  form  an
2876         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
2877    
2878           (?>\PM\pM*)           (?>\PM\pM*)
2879    
2880         That is, it matches a character without the "mark"  property,  followed         That  is,  it matches a character without the "mark" property, followed
2881         by  zero  or  more  characters with the "mark" property, and treats the         by zero or more characters with the "mark"  property,  and  treats  the
2882         sequence as an atomic group (see below).  Characters  with  the  "mark"         sequence  as  an  atomic group (see below).  Characters with the "mark"
2883         property are typically accents that affect the preceding character.         property are typically accents that affect the preceding character.
2884    
2885         Matching  characters  by Unicode property is not fast, because PCRE has         Matching characters by Unicode property is not fast, because  PCRE  has
2886         to search a structure that contains  data  for  over  fifteen  thousand         to  search  a  structure  that  contains data for over fifteen thousand
2887         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
2888         \w do not use Unicode properties in PCRE.         \w do not use Unicode properties in PCRE.
2889    
2890     Simple assertions     Simple assertions
2891    
2892         The fourth use of backslash is for certain simple assertions. An asser-         The fourth use of backslash is for certain simple assertions. An asser-
2893         tion  specifies a condition that has to be met at a particular point in         tion specifies a condition that has to be met at a particular point  in
2894         a match, without consuming any characters from the subject string.  The         a  match, without consuming any characters from the subject string. The
2895         use  of subpatterns for more complicated assertions is described below.         use of subpatterns for more complicated assertions is described  below.
2896         The backslashed assertions are:         The backslashed assertions are:
2897    
2898           \b     matches at a word boundary           \b     matches at a word boundary
# Line 2149  BACKSLASH Line 2902  BACKSLASH
2902           \z     matches at end of subject           \z     matches at end of subject
2903           \G     matches at first matching position in subject           \G     matches at first matching position in subject
2904    
2905         These assertions may not appear in character classes (but note that  \b         These  assertions may not appear in character classes (but note that \b
2906         has a different meaning, namely the backspace character, inside a char-         has a different meaning, namely the backspace character, inside a char-
2907         acter class).         acter class).
2908    
2909         A word boundary is a position in the subject string where  the  current         A  word  boundary is a position in the subject string where the current
2910         character  and  the previous character do not both match \w or \W (i.e.         character and the previous character do not both match \w or  \W  (i.e.
2911         one matches \w and the other matches \W), or the start or  end  of  the         one  matches  \w  and the other matches \W), or the start or end of the
2912         string if the first or last character matches \w, respectively.         string if the first or last character matches \w, respectively.
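
The word boundary definition above reduces to comparing two flags: whether the previous character is a word character, and whether the current one is. The following C sketch illustrates this, assuming the default "C" locale and single-byte characters; at_word_boundary() and is_word_char() are hypothetical helpers, not PCRE functions.

```c
#include <ctype.h>
#include <stddef.h>

/* A "word" character: an underscore, or a letter or digit. */
static int is_word_char(char c)
{
    return c == '_' || isalnum((unsigned char)c);
}

/* Hypothetical helper: nonzero if \b would be true at offset pos
   (0 <= pos <= len) in the subject string. */
int at_word_boundary(const char *subject, size_t len, size_t pos)
{
    /* Before the start and after the end there is no character,
       which is treated here as "not a word character". */
    int prev = pos > 0   && is_word_char(subject[pos - 1]);
    int cur  = pos < len && is_word_char(subject[pos]);

    /* A boundary exists exactly when the two flags differ. */
    return prev != cur;
}
```

Treating the positions outside the string as non-word characters covers the start and end cases: the start of the subject is a boundary only if the first character matches \w, and likewise for the end and the last character.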
2913    
2914         The  \A,  \Z,  and \z assertions differ from the traditional circumflex         The \A, \Z, and \z assertions differ from  the  traditional  circumflex
2915         and dollar (described in the next section) in that they only ever match         and dollar (described in the next section) in that they only ever match
2916         at  the  very start and end of the subject string, whatever options are         at the very start and end of the subject string, whatever  options  are
2917         set. Thus, they are independent of multiline mode. These  three  asser-         set.  Thus,  they are independent of multiline mode. These three asser-
2918         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
2919         affect only the behaviour of the circumflex and dollar  metacharacters.         affect  only the behaviour of the circumflex and dollar metacharacters.
2920         However,  if the startoffset argument of pcre_exec() is non-zero, indi-         However, if the startoffset argument of pcre_exec() is non-zero,  indi-
2921         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
2922         the  subject,  \A  can never match. The difference between \Z and \z is         the subject, \A can never match. The difference between \Z  and  \z  is
2923         that \Z matches before a newline that is  the  last  character  of  the         that \Z matches before a newline at the end of the string as well as at
2924         string  as well as at the end of the string, whereas \z matches only at         the very end, whereas \z matches only at the end.
        the end.  
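
The difference between \Z and \z can be expressed as two position tests. This is a minimal sketch that assumes newline is the single character "\n"; at_end_z() and at_end_Z() are hypothetical helpers written for this illustration and do not exist in the PCRE API.

```c
#include <stddef.h>

/* \z: true only at the very end of the subject. */
int at_end_z(size_t len, size_t pos)
{
    return pos == len;
}

/* \Z: true at the very end, and also immediately before a newline
   that is the last character of the subject. */
int at_end_Z(const char *subject, size_t len, size_t pos)
{
    return pos == len || (pos + 1 == len && subject[pos] == '\n');
}
```

For the four-character subject "abc\n", at_end_Z() is true at offsets 3 and 4, whereas at_end_z() is true only at offset 4.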
2925    
2926         The \G assertion is true only when the current matching position is  at         The \G assertion is true only when the current matching position is  at
2927         the  start point of the match, as specified by the startoffset argument         the  start point of the match, as specified by the startoffset argument
# Line 2208  CIRCUMFLEX AND DOLLAR Line 2960  CIRCUMFLEX AND DOLLAR
2960    
2961         A dollar character is an assertion that is true  only  if  the  current         A dollar character is an assertion that is true  only  if  the  current
2962         matching  point  is  at  the  end of the subject string, or immediately         matching  point  is  at  the  end of the subject string, or immediately
2963         before a newline character that is the last character in the string (by         before a newline at the end of the string (by default). Dollar need not
2964         default).  Dollar  need  not  be the last character of the pattern if a         be  the  last  character of the pattern if a number of alternatives are
2965         number of alternatives are involved, but it should be the last item  in         involved, but it should be the last item in  any  branch  in  which  it
2966         any  branch  in  which  it appears.  Dollar has no special meaning in a         appears. Dollar has no special meaning in a character class.
        character class.  
2967    
2968         The meaning of dollar can be changed so that it  matches  only  at  the         The  meaning  of  dollar  can be changed so that it matches only at the
2969         very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
2970         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
2971    
2972         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
2973         PCRE_MULTILINE option is set. When this is the case, they match immedi-         PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
2974         ately after and  immediately  before  an  internal  newline  character,         matches  immediately after internal newlines as well as at the start of
2975         respectively,  in addition to matching at the start and end of the sub-         the subject string. It does not match after a  newline  that  ends  the
2976         ject string. For example,  the  pattern  /^abc$/  matches  the  subject         string.  A dollar matches before any newlines in the string, as well as
2977         string  "def\nabc"  (where \n represents a newline character) in multi-         at the very end, when PCRE_MULTILINE is set. When newline is  specified
2978         line mode, but not otherwise.  Consequently, patterns that are anchored         as  the  two-character  sequence CRLF, isolated CR and LF characters do
2979         in  single line mode because all branches start with ^ are not anchored         not indicate newlines.
2980         in multiline mode, and a match for  circumflex  is  possible  when  the  
2981         startoffset   argument   of  pcre_exec()  is  non-zero.  The  PCRE_DOL-         For example, the pattern /^abc$/ matches the subject string  "def\nabc"
2982         LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.         (where  \n  represents a newline) in multiline mode, but not otherwise.
2983           Consequently, patterns that are anchored in single  line  mode  because
2984           all  branches  start  with  ^ are not anchored in multiline mode, and a
2985           match for circumflex is  possible  when  the  startoffset  argument  of
2986           pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
2987           PCRE_MULTILINE is set.
2988    
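The multiline behaviour described above can be tried out quickly. The sketch below is only an illustration: it uses Python's re module rather than PCRE itself, on the assumption that re.MULTILINE plays the role of PCRE_MULTILINE for this particular case (a subject that uses a single LF as its newline). The variable names are made up for the example.

```python
import re

subject = "def\nabc"   # \n is a newline, as in the example in the text
pattern = r"^abc$"

# Default (single line) mode: ^ matches only at the very start of the subject.
single = re.search(pattern, subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi = re.search(pattern, subject, re.MULTILINE)

# \A remains tied to the start of the subject, even in multiline mode.
anchored = re.search(r"\Aabc", subject, re.MULTILINE)

print(single, multi, anchored)
```

Only the multiline search finds "abc"; the other two fail because "abc" does not start the subject.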
2989         Note that the sequences \A, \Z, and \z can be used to match  the  start         Note that the sequences \A, \Z, and \z can be used to match  the  start
2990         and  end of the subject in both modes, and if all branches of a pattern         and  end of the subject in both modes, and if all branches of a pattern
2991         start with \A it is always anchored, whether PCRE_MULTILINE is  set  or         start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
2992         not.         set.
2993    
2994    
2995  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
2996    
2997         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
2998         ter in the subject, including a non-printing  character,  but  not  (by         ter in the subject string except (by default) a character  that  signi-
2999         default)  newline.   In  UTF-8 mode, a dot matches any UTF-8 character,         fies  the  end  of  a line. In UTF-8 mode, the matched character may be
3000         which might be more than one byte long, except (by default) newline. If         more than one byte long. When a line ending  is  defined  as  a  single
3001         the  PCRE_DOTALL  option  is set, dots match newlines as well. The han-         character  (CR  or LF), dot never matches that character; when the two-
3002         dling of dot is entirely independent of the handling of circumflex  and         character sequence CRLF is used, dot does not match CR if it is immedi-
3003         dollar,  the  only  relationship  being  that they both involve newline         ately  followed by LF, but otherwise it matches all characters (includ-
3004         characters. Dot has no special meaning in a character class.         ing isolated CRs and LFs).
3005    
3006           The behaviour of dot with regard to newlines can  be  changed.  If  the
3007           PCRE_DOTALL  option  is  set,  a dot matches any one character, without
3008           exception. If newline is defined as the two-character sequence CRLF, it
3009           takes two dots to match it.
3010    
3011           The  handling of dot is entirely independent of the handling of circum-
3012           flex and dollar, the only relationship being  that  they  both  involve
3013           newlines. Dot has no special meaning in a character class.
3014    
3015    
3016  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
3017    
3018         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
3019         both  in  and  out of UTF-8 mode. Unlike a dot, it can match a newline.         both in and out of UTF-8 mode. Unlike a dot, it always matches  CR  and
3020         The feature is provided in Perl in order to match individual  bytes  in         LF.  The feature is provided in Perl in order to match individual bytes
3021         UTF-8  mode.  Because  it  breaks  up  UTF-8 characters into individual         in UTF-8 mode.  Because it breaks up UTF-8 characters  into  individual
3022         bytes, what remains in the string may be a malformed UTF-8 string.  For         bytes,  what remains in the string may be a malformed UTF-8 string. For
3023         this reason, the \C escape sequence is best avoided.         this reason, the \C escape sequence is best avoided.
3024    
3025         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE does not allow \C to appear in  lookbehind  assertions  (described
3026         below), because in UTF-8 mode this would make it impossible  to  calcu-         below),  because  in UTF-8 mode this would make it impossible to calcu-
3027         late the length of the lookbehind.         late the length of the lookbehind.
3028    
3029    
# Line 2267  SQUARE BRACKETS AND CHARACTER CLASSES Line 3032  SQUARE BRACKETS AND CHARACTER CLASSES
3032         An opening square bracket introduces a character class, terminated by a         An opening square bracket introduces a character class, terminated by a
3033         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
3034         cial. If a closing square bracket is required as a member of the class,         cial. If a closing square bracket is required as a member of the class,
3035         it should be the first data character in the class  (after  an  initial         it  should  be  the first data character in the class (after an initial
3036         circumflex, if present) or escaped with a backslash.         circumflex, if present) or escaped with a backslash.
3037    
3038         A  character  class matches a single character in the subject. In UTF-8         A character class matches a single character in the subject.  In  UTF-8
3039         mode, the character may occupy more than one byte. A matched  character         mode,  the character may occupy more than one byte. A matched character
3040         must be in the set of characters defined by the class, unless the first         must be in the set of characters defined by the class, unless the first
3041         character in the class definition is a circumflex, in  which  case  the         character  in  the  class definition is a circumflex, in which case the
3042         subject  character  must  not  be in the set defined by the class. If a         subject character must not be in the set defined by  the  class.  If  a
3043         circumflex is actually required as a member of the class, ensure it  is         circumflex  is actually required as a member of the class, ensure it is
3044         not the first character, or escape it with a backslash.         not the first character, or escape it with a backslash.
3045    
3046         For  example, the character class [aeiou] matches any lower case vowel,         For example, the character class [aeiou] matches any lower case  vowel,
3047         while [^aeiou] matches any character that is not a  lower  case  vowel.         while  [^aeiou]  matches  any character that is not a lower case vowel.
3048         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
3049         characters that are in the class by enumerating those that are  not.  A         characters  that  are in the class by enumerating those that are not. A
3050         class  that starts with a circumflex is not an assertion: it still con-         class that starts with a circumflex is not an assertion: it still  con-
3051         sumes a character from the subject string, and therefore  it  fails  if         sumes  a  character  from the subject string, and therefore it fails if
3052         the current pointer is at the end of the string.         the current pointer is at the end of the string.
3053    
3054         In  UTF-8 mode, characters with values greater than 255 can be included         In UTF-8 mode, characters with values greater than 255 can be  included
3055         in a class as a literal string of bytes, or by using the  \x{  escaping         in  a  class as a literal string of bytes, or by using the \x{ escaping
3056         mechanism.         mechanism.
3057    
3058         When  caseless  matching  is set, any letters in a class represent both         When caseless matching is set, any letters in a  class  represent  both
3059         their upper case and lower case versions, so for  example,  a  caseless         their  upper  case  and lower case versions, so for example, a caseless
3060         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
3061         match "A", whereas a caseful version would. When running in UTF-8 mode,         match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
3062         PCRE  supports  the  concept of case for characters with values greater         understands the concept of case for characters whose  values  are  less
3063         than 128 only when it is compiled with Unicode property support.         than  128, so caseless matching is always possible. For characters with
3064           higher values, the concept of case is supported  if  PCRE  is  compiled
3065         The newline character is never treated in any special way in  character         with  Unicode  property support, but not otherwise.  If you want to use
3066         classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE         caseless matching for characters 128 and above, you  must  ensure  that
3067         options is. A class such as [^a] will always match a newline.         PCRE  is  compiled  with Unicode property support as well as with UTF-8
3068           support.
3069    
3070           Characters that might indicate  line  breaks  (CR  and  LF)  are  never
3071           treated  in  any  special way when matching character classes, whatever
3072           line-ending sequence is in use, and whatever setting of the PCRE_DOTALL
3073           and PCRE_MULTILINE options is used. A class such as [^a] always matches
3074           one of these characters.
3075    
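The difference between dot and a negated class with respect to newlines can be sketched as follows. This is an illustration in Python's re module, not PCRE: re.DOTALL is assumed to stand in for PCRE_DOTALL, and Python recognizes only LF as a line ending, so the CR and CRLF cases described above are not covered.

```python
import re

newline = "\n"

# By default a dot does not match the newline character.
dot_default = re.fullmatch(r".", newline)

# With the "dot matches all" option the exception is lifted.
dot_all = re.fullmatch(r".", newline, re.DOTALL)

# A negated class is unaffected by either option: [^a] matches a newline.
negated_plain = re.fullmatch(r"[^a]", newline)
negated_multiline = re.fullmatch(r"[^a]", newline, re.MULTILINE)

print(dot_default, dot_all, negated_plain, negated_multiline)
```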
3076         The minus (hyphen) character can be used to specify a range of  charac-         The minus (hyphen) character can be used to specify a range of  charac-
3077         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
# Line 2400  VERTICAL BAR Line 3172  VERTICAL BAR
3172    
3173         matches  either "gilbert" or "sullivan". Any number of alternatives may         matches  either "gilbert" or "sullivan". Any number of alternatives may
3174         appear, and an empty  alternative  is  permitted  (matching  the  empty         appear, and an empty  alternative  is  permitted  (matching  the  empty
3175         string).   The  matching  process  tries each alternative in turn, from         string). The matching process tries each alternative in turn, from left
3176         left to right, and the first one that succeeds is used. If the alterna-         to right, and the first one that succeeds is used. If the  alternatives
3177         tives  are within a subpattern (defined below), "succeeds" means match-         are  within a subpattern (defined below), "succeeds" means matching the
3178         ing the rest of the main pattern as well as the alternative in the sub-         rest of the main pattern as well as the alternative in the  subpattern.
        pattern.  
3179    
3180    
3181  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
# Line 2450  INTERNAL OPTION SETTING Line 3221  INTERNAL OPTION SETTING
3221         the effects of option settings happen at compile time. There  would  be         the effects of option settings happen at compile time. There  would  be
3222         some very weird behaviour otherwise.         some very weird behaviour otherwise.
3223    
3224         The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
3225         in the same way as the Perl-compatible options by using the  characters         can be changed in the same way as the Perl-compatible options by  using
3226         U  and X respectively. The (?X) flag setting is special in that it must         the characters J, U and X respectively.
        always occur earlier in the pattern than any of the additional features  
        it  turns on, even when it is at top level. It is best to put it at the  
        start.  
3227    
3228    
3229  SUBPATTERNS  SUBPATTERNS
# Line 2467  SUBPATTERNS Line 3235  SUBPATTERNS
3235    
3236           cat(aract|erpillar|)           cat(aract|erpillar|)
3237    
3238         matches  one  of the words "cat", "cataract", or "caterpillar". Without         matches one of the words "cat", "cataract", or  "caterpillar".  Without
3239         the parentheses, it would match "cataract",  "erpillar"  or  the  empty         the  parentheses,  it  would  match "cataract", "erpillar" or the empty
3240         string.         string.
3241    
3242         2.  It  sets  up  the  subpattern as a capturing subpattern. This means         2. It sets up the subpattern as  a  capturing  subpattern.  This  means
3243         that, when the whole pattern  matches,  that  portion  of  the  subject         that,  when  the  whole  pattern  matches,  that portion of the subject
3244         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
3245         ovector argument of pcre_exec(). Opening parentheses are  counted  from         ovector  argument  of pcre_exec(). Opening parentheses are counted from
3246         left  to  right  (starting  from 1) to obtain numbers for the capturing         left to right (starting from 1) to obtain  numbers  for  the  capturing
3247         subpatterns.         subpatterns.
3248    
3249         For example, if the string "the red king" is matched against  the  pat-         For  example,  if the string "the red king" is matched against the pat-
3250         tern         tern
3251    
3252           the ((red|white) (king|queen))           the ((red|white) (king|queen))
# Line 2486  SUBPATTERNS Line 3254  SUBPATTERNS
3254         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
3255         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
3256    
3257         The fact that plain parentheses fulfil  two  functions  is  not  always         The  fact  that  plain  parentheses  fulfil two functions is not always
3258         helpful.   There are often times when a grouping subpattern is required         helpful.  There are often times when a grouping subpattern is  required
3259         without a capturing requirement. If an opening parenthesis is  followed         without  a capturing requirement. If an opening parenthesis is followed
3260         by  a question mark and a colon, the subpattern does not do any captur-         by a question mark and a colon, the subpattern does not do any  captur-
3261         ing, and is not counted when computing the  number  of  any  subsequent         ing,  and  is  not  counted when computing the number of any subsequent
3262         capturing  subpatterns. For example, if the string "the white queen" is         capturing subpatterns. For example, if the string "the white queen"  is
3263         matched against the pattern         matched against the pattern
3264    
3265           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
3266    
3267         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
3268         1  and 2. The maximum number of capturing subpatterns is 65535, and the         1 and 2. The maximum number of capturing subpatterns is 65535, and  the
3269         maximum depth of nesting of all subpatterns, both  capturing  and  non-         maximum  depth  of  nesting of all subpatterns, both capturing and non-
3270         capturing, is 200.         capturing, is 200.
3271    
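The numbering of capturing and non-capturing subpatterns in the two examples above can be checked with a short sketch. It uses Python's re module as a stand-in, on the assumption that it numbers groups the same way as PCRE for these patterns; with PCRE itself the substrings would be read from the ovector filled in by pcre_exec().

```python
import re

# Plain parentheses: every opening parenthesis gets a number, left to right.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")

print(capturing.groups())       # substrings numbered 1, 2, 3
print(non_capturing.groups())   # substrings numbered 1, 2
```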
3272         As  a  convenient shorthand, if any option settings are required at the         As a convenient shorthand, if any option settings are required  at  the
3273         start of a non-capturing subpattern,  the  option  letters  may  appear         start  of  a  non-capturing  subpattern,  the option letters may appear
3274         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
3275    
3276           (?i:saturday|sunday)           (?i:saturday|sunday)
3277           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
3278    
3279         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
3280         tried from left to right, and options are not reset until  the  end  of         tried  from  left  to right, and options are not reset until the end of
3281         the  subpattern is reached, an option setting in one branch does affect         the subpattern is reached, an option setting in one branch does  affect
3282         subsequent branches, so the above patterns match "SUNDAY"  as  well  as         subsequent  branches,  so  the above patterns match "SUNDAY" as well as
3283         "Saturday".         "Saturday".
3284    
3285    
3286  NAMED SUBPATTERNS  NAMED SUBPATTERNS
3287    
3288         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying capturing parentheses by number is simple, but  it  can  be
3289         very hard to keep track of the numbers in complicated  regular  expres-         very  hard  to keep track of the numbers in complicated regular expres-
3290         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions. Furthermore, if an  expression  is  modified,  the  numbers  may
3291         change. To help with this difficulty, PCRE supports the naming of  sub-         change.  To help with this difficulty, PCRE supports the naming of sub-
3292         patterns,  something  that  Perl  does  not  provide. The Python syntax         patterns, something that Perl  does  not  provide.  The  Python  syntax
3293         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and         (?P<name>...)  is  used. References to capturing parentheses from other
3294         underscores, and must be unique within a pattern.         parts of the pattern, such as  backreferences,  recursion,  and  condi-
3295           tions, can be made by name as well as by number.
3296    
3297         Named  capturing  parentheses  are  still  allocated numbers as well as         Names  consist  of  up  to  32 alphanumeric characters and underscores.
3298           Named capturing parentheses are still  allocated  numbers  as  well  as
3299         names. The PCRE API provides function calls for extracting the name-to-         names. The PCRE API provides function calls for extracting the name-to-
3300         number  translation table from a compiled pattern. There is also a con-         number translation table from a compiled pattern. There is also a  con-
3301         venience function for extracting a captured substring by name. For fur-         venience function for extracting a captured substring by name.
3302         ther details see the pcreapi documentation.  
3303           By  default, a name must be unique within a pattern, but it is possible
3304           to relax this constraint by setting the PCRE_DUPNAMES option at compile
3305           time.  This  can  be useful for patterns where only one instance of the
3306           named parentheses can match. Suppose you want to match the  name  of  a
3307           weekday,  either as a 3-letter abbreviation or as the full name, and in
3308           both cases you want to extract the abbreviation. This pattern (ignoring
3309           the line breaks) does the job:
3310    
3311             (?P<DN>Mon|Fri|Sun)(?:day)?|
3312             (?P<DN>Tue)(?:sday)?|
3313             (?P<DN>Wed)(?:nesday)?|
3314             (?P<DN>Thu)(?:rsday)?|
3315             (?P<DN>Sat)(?:urday)?
3316    
3317           There  are  five capturing substrings, but only one is ever set after a
3318           match.  The convenience  function  for  extracting  the  data  by  name
3319           returns  the  substring  for  the first, and in this example, the only,
3320           subpattern of that name that matched.  This  saves  searching  to  find
3321           which  numbered  subpattern  it  was. If you make a reference to a non-
3322           unique named subpattern from elsewhere in the  pattern,  the  one  that
3323           corresponds  to  the  lowest number is used. For further details of the
3324           interfaces for handling named subpatterns, see the  pcreapi  documenta-
3325           tion.
3326    
3327    
3328  REPETITION  REPETITION
# Line 2738  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3531  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3531         meaning  or  processing  of  a possessive quantifier and the equivalent         meaning  or  processing  of  a possessive quantifier and the equivalent
3532         atomic group.         atomic group.
3533    
3534         The possessive quantifier syntax is an extension to the Perl syntax. It         The possessive quantifier syntax is an extension to  the  Perl  syntax.
3535         originates in Sun's Java package.         Jeffrey  Friedl originated the idea (and the name) in the first edition
3536           of his book.  Mike McCloskey liked it, so implemented it when he  built
3537           Sun's Java package, and PCRE copied it from there.
3538    
3539         When  a  pattern  contains an unlimited repeat inside a subpattern that         When  a  pattern  contains an unlimited repeat inside a subpattern that
3540         can itself be repeated an unlimited number of  times,  the  use  of  an         can itself be repeated an unlimited number of  times,  the  use  of  an
# Line 2780  BACK REFERENCES Line 3575  BACK REFERENCES
3575         it  is  always  taken  as a back reference, and causes an error only if         it  is  always  taken  as a back reference, and causes an error only if
3576         there are not that many capturing left parentheses in the  entire  pat-         there are not that many capturing left parentheses in the  entire  pat-
3577         tern.  In  other words, the parentheses that are referenced need not be         tern.  In  other words, the parentheses that are referenced need not be
3578         to the left of the reference for numbers less than 10. See the  subsec-         to the left of the reference for numbers less than 10. A "forward  back
3579         tion  entitled  "Non-printing  characters" above for further details of         reference"  of  this  type can make sense when a repetition is involved
3580         the handling of digits following a backslash.         and the subpattern to the right has participated in an  earlier  itera-
3581           tion.
3582    
3583         A back reference matches whatever actually matched the  capturing  sub-         It is not possible to have a numerical "forward back reference" to a sub-
3584         pattern  in  the  current subject string, rather than anything matching         pattern whose number is 10 or more. However, a back  reference  to  any
3585           subpattern  is  possible  using named parentheses (see below). See also
3586           the subsection entitled "Non-printing  characters"  above  for  further
3587           details of the handling of digits following a backslash.
3588    
3589           A  back  reference matches whatever actually matched the capturing sub-
3590           pattern in the current subject string, rather  than  anything  matching
3591         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
3592         of doing that). So the pattern         of doing that). So the pattern
3593    
3594           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
3595    
3596         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
3597         not "sense and responsibility". If caseful matching is in force at  the         not  "sense and responsibility". If caseful matching is in force at the
3598         time  of the back reference, the case of letters is relevant. For exam-         time of the back reference, the case of letters is relevant. For  exam-
3599         ple,         ple,
3600    
3601           ((?i)rah)\s+\1           ((?i)rah)\s+\1
3602    
3603         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
3604         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
3605    
3606         Back  references  to named subpatterns use the Python syntax (?P=name).         Back references to named subpatterns use the Python  syntax  (?P=name).
3607         We could rewrite the above example as follows:         We could rewrite the above example as follows:
3608    
3609           (?<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
3610    
3611           A  subpattern  that  is  referenced  by  name may appear in the pattern
3612           before or after the reference.
3613    
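As a rough illustration of named back references, the "sense and sensibility" pattern can be rewritten with the (?P<name>...) and (?P=name) syntax. The sketch uses Python's re module, from which this syntax was borrowed, rather than PCRE; the group name "stem" is invented for the example.

```python
import re

# (?P=stem) must repeat exactly what the group named "stem" matched.
pattern = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

same_a = pattern.fullmatch("sense and sensibility")
same_b = pattern.fullmatch("response and responsibility")
mixed = pattern.fullmatch("sense and responsibility")

print(same_a.group("stem"), same_b.group("stem"), mixed)
```

The third subject fails because the back reference matches the text that was captured, not anything else that the subpattern could have matched.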
3614         There may be more than one back reference to the same subpattern. If  a         There may be more than one back reference to the same subpattern. If  a
3615         subpattern  has  not actually been used in a particular match, any back         subpattern  has  not actually been used in a particular match, any back
# Line 2893  ASSERTIONS Line 3698  ASSERTIONS
3698         does  find  an  occurrence  of "bar" that is not preceded by "foo". The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
3699         contents of a lookbehind assertion are restricted  such  that  all  the         contents of a lookbehind assertion are restricted  such  that  all  the
3700         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
3701         eral alternatives, they do not all have to have the same fixed  length.         eral top-level alternatives, they do not all  have  to  have  the  same
3702         Thus         fixed length. Thus
3703    
3704           (?<=bullock|donkey)           (?<=bullock|donkey)
3705    
# Line 3007  CONDITIONAL SUBPATTERNS Line 3812  CONDITIONAL SUBPATTERNS
3812         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
3813    
3814         There are three kinds of condition. If the text between the parentheses         There are three kinds of condition. If the text between the parentheses
3815         consists of a sequence of digits, the condition  is  satisfied  if  the         consists of a sequence of digits, or a sequence of alphanumeric charac-
3816         capturing  subpattern of that number has previously matched. The number         ters  and underscores, the condition is satisfied if the capturing sub-
3817         must be greater than zero. Consider the following pattern,  which  con-         pattern of that number or name has previously matched. There is a  pos-
3818         tains  non-significant white space to make it more readable (assume the         sible  ambiguity here, because subpattern names may consist entirely of
3819         PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of         digits. PCRE looks first for a named subpattern; if it cannot find  one
3820         discussion:         and  the text consists entirely of digits, it looks for a subpattern of
3821           that number, which must be greater than zero.  Using  subpattern  names
3822           that consist entirely of digits is not recommended.
3823    
3824           Consider  the  following  pattern, which contains non-significant white
3825           space to make it more readable (assume the PCRE_EXTENDED option) and to
3826           divide it into three parts for ease of discussion:
3827    
3828           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
3829    
# Line 3025  CONDITIONAL SUBPATTERNS Line 3836  CONDITIONAL SUBPATTERNS
3836         tern  is  executed  and  a  closing parenthesis is required. Otherwise,         tern  is  executed  and  a  closing parenthesis is required. Otherwise,
3837         since no-pattern is not present, the  subpattern  matches  nothing.  In         since no-pattern is not present, the  subpattern  matches  nothing.  In
3838         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
3839         optionally enclosed in parentheses.         optionally enclosed in parentheses. Rewriting it to use a named subpat-
3840           tern gives this:
3841    
3842         If the condition is the string (R), it is satisfied if a recursive call           (?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )
3843         to  the pattern or subpattern has been made. At "top level", the condi-  
3844         tion is false.  This  is  a  PCRE  extension.  Recursive  patterns  are         If the condition is the string (R), and there is no subpattern with the
3845         described in the next section.         name R, the condition is satisfied if a recursive call to  the  pattern
3846           or  subpattern  has  been made. At "top level", the condition is false.
3847           This is a PCRE extension.  Recursive patterns are described in the next
3848           section.
3849    
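As a rough illustration only: Python's re module is a different engine from PCRE, but it happens to accept the same conditional-subpattern syntax, for both numbered and named groups. The two parenthesis patterns shown above can therefore be tried there. The helper name matches_whole is hypothetical, and re.VERBOSE stands in for PCRE_EXTENDED.

```python
import re

# Numbered form: group 1 is the optional opening parenthesis, and the
# condition (?(1) ...) demands a closing parenthesis only if it matched.
NUMBERED = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

# Named form: the condition tests the group called OPEN instead.
NAMED = re.compile(r"(?P<OPEN> \( )?  [^()]+  (?(OPEN) \) )", re.VERBOSE)


def matches_whole(pattern, subject):
    """Return True if the pattern matches the entire subject string."""
    return pattern.fullmatch(subject) is not None
```

Both forms accept "abc" and "(abc)" and reject a subject with only one of the two parentheses, which is the behaviour the text describes.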
3850         If  the  condition  is  not  a sequence of digits or (R), it must be an         If  the  condition  is  not  a sequence of digits or (R), it must be an
3851         assertion.  This may be a positive or negative lookahead or  lookbehind         assertion.  This may be a positive or negative lookahead or  lookbehind
# Line 3057  COMMENTS Line 3872  COMMENTS
3872         at all.         at all.
3873    
3874         If  the PCRE_EXTENDED option is set, an unescaped # character outside a         If  the PCRE_EXTENDED option is set, an unescaped # character outside a
3875         character class introduces a comment that continues up to the next new-         character class introduces a  comment  that  continues  to  immediately
3876         line character in the pattern.         after the next newline in the pattern.
3877    
3878    
3879  RECURSIVE PATTERNS  RECURSIVE PATTERNS
# Line 3088  RECURSIVE PATTERNS Line 3903  RECURSIVE PATTERNS
3903         tion.)  The special item (?R) is a recursive call of the entire regular         tion.)  The special item (?R) is a recursive call of the entire regular
3904         expression.         expression.
3905    
3906         For example, this PCRE pattern solves the  nested  parentheses  problem         A recursive subpattern call is always treated as an atomic group.  That
3907         (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is         is,  once  it  has  matched some of the subject string, it is never re-
3908         ignored):         entered, even if it contains untried alternatives and there is a subse-
3909           quent matching failure.
3910    
3911           This  PCRE  pattern  solves  the nested parentheses problem (assume the
3912           PCRE_EXTENDED option is set so that white space is ignored):
3913    
3914           \( ( (?>[^()]+) | (?R) )* \)           \( ( (?>[^()]+) | (?R) )* \)
3915    
3916         First it matches an opening parenthesis. Then it matches any number  of         First it matches an opening parenthesis. Then it matches any number  of
3917         substrings  which  can  either  be  a sequence of non-parentheses, or a         substrings  which  can  either  be  a sequence of non-parentheses, or a
3918         recursive match of the pattern itself (that is  a  correctly  parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
3919         sized substring).  Finally there is a closing parenthesis.         sized substring).  Finally there is a closing parenthesis.
3920    
3921         If  this  were  part of a larger pattern, you would not want to recurse         If  this  were  part of a larger pattern, you would not want to recurse
# Line 3177  SUBPATTERNS AS SUBROUTINES Line 3996  SUBPATTERNS AS SUBROUTINES
3996           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
3997    
3998         is  used, it does match "sense and responsibility" as well as the other         is  used, it does match "sense and responsibility" as well as the other
3999         two strings. Such references must, however, follow  the  subpattern  to         two strings. Such references, if given  numerically,  must  follow  the
4000         which they refer.         subpattern  to which they refer. However, named references can refer to
4001           later subpatterns.
4002    
4003           Like recursive subpatterns, a "subroutine" call is always treated as an
4004           atomic  group. That is, once it has matched some of the subject string,
4005           it is never re-entered, even if it contains  untried  alternatives  and
4006           there is a subsequent matching failure.
4007    
4008    
4009  CALLOUTS  CALLOUTS
# Line 3215  CALLOUTS Line 4040  CALLOUTS
4040         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
4041         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
4042    
4043  Last updated: 09 September 2004  Last updated: 06 June 2006
4044  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
4045  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
4046    
 PCRE(3)                                                                PCRE(3)  
4047    
4048    PCREPARTIAL(3)                                                  PCREPARTIAL(3)
4049    
4050    
4051  NAME  NAME
4052         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
4053    
4054    
4055  PARTIAL MATCHING IN PCRE  PARTIAL MATCHING IN PCRE
4056    
4057         In  normal  use  of  PCRE,  if  the  subject  string  that is passed to         In  normal  use  of  PCRE,  if  the  subject  string  that is passed to
4058         pcre_exec() matches as far as it goes, but is too short  to  match  the         pcre_exec() or pcre_dfa_exec() matches as far as it goes,  but  is  too
4059         entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances         short  to  match  the  entire  pattern, PCRE_ERROR_NOMATCH is returned.
4060         where it might be helpful to distinguish this case from other cases  in         There are circumstances where it might be helpful to  distinguish  this
4061         which there is no match.         case from other cases in which there is no match.
4062    
4063         Consider, for example, an application where a human is required to type         Consider, for example, an application where a human is required to type
4064         in data for a field with specific formatting requirements.  An  example         in data for a field with specific formatting requirements.  An  example
# Line 3248  PARTIAL MATCHING IN PCRE Line 4074  PARTIAL MATCHING IN PCRE
4074         until the entire string has been entered.         until the entire string has been entered.
4075    
4076         PCRE supports the concept of partial matching by means of the PCRE_PAR-         PCRE supports the concept of partial matching by means of the PCRE_PAR-
4077         TIAL  option,  which  can be set when calling pcre_exec(). When this is         TIAL   option,   which   can   be   set  when  calling  pcre_exec()  or
4078         done,  the   return   code   PCRE_ERROR_NOMATCH   is   converted   into         pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code
4079         PCRE_ERROR_PARTIAL  if  at  any  time  during  the matching process the         PCRE_ERROR_NOMATCH  is converted into PCRE_ERROR_PARTIAL if at any time
4080         entire subject string matched part of the pattern. No captured data  is         during the matching process the last part of the subject string matched
4081         set when this occurs.         part  of  the  pattern. Unfortunately, for non-anchored matching, it is
4082           not possible to obtain the position of the start of the partial  match.
4083           No captured data is set when PCRE_ERROR_PARTIAL is returned.
4084    
4085           When   PCRE_PARTIAL   is  set  for  pcre_dfa_exec(),  the  return  code
4086           PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the  end  of
4087           the  subject is reached, there have been no complete matches, but there
4088           is still at least one matching possibility. The portion of  the  string
4089           that provided the partial match is set as the first matching string.
4090    
4091         Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers         Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
4092         the last literal byte in a pattern, and abandons  matching  immediately         the last literal byte in a pattern, and abandons  matching  immediately
# Line 3263  PARTIAL MATCHING IN PCRE Line 4097  PARTIAL MATCHING IN PCRE
4097  RESTRICTED PATTERNS FOR PCRE_PARTIAL  RESTRICTED PATTERNS FOR PCRE_PARTIAL
4098    
4099         Because of the way certain internal optimizations  are  implemented  in         Because of the way certain internal optimizations  are  implemented  in
4100         PCRE,  the  PCRE_PARTIAL  option  cannot  be  used  with  all patterns.         the  pcre_exec()  function, the PCRE_PARTIAL option cannot be used with
4101         Repeated single characters such as         all patterns. These restrictions do not apply when  pcre_dfa_exec()  is
4102           used.  For pcre_exec(), repeated single characters such as
4103    
4104           a{2,4}           a{2,4}
4105    
# Line 3272  RESTRICTED PATTERNS FOR PCRE_PARTIAL Line 4107  RESTRICTED PATTERNS FOR PCRE_PARTIAL
4107    
4108           \d+           \d+
4109    
4110         are not permitted if the maximum number of occurrences is greater  than         are  not permitted if the maximum number of occurrences is greater than
4111         one.  Optional items such as \d? (where the maximum is one) are permit-         one.  Optional items such as \d? (where the maximum is one) are permit-
4112         ted.  Quantifiers with any values are permitted after  parentheses,  so         ted.   Quantifiers  with any values are permitted after parentheses, so
4113         the invalid examples above can be coded thus:         the invalid examples above can be coded thus:
4114    
4115           (a){2,4}           (a){2,4}
4116           (\d)+           (\d)+
4117    
4118         These  constructions  run more slowly, but for the kinds of application         These constructions run more slowly, but for the kinds  of  application
4119         that are envisaged for this facility, this is not felt to  be  a  major         that  are  envisaged  for this facility, this is not felt to be a major
4120         restriction.         restriction.
4121    
4122         If  PCRE_PARTIAL  is  set  for  a  pattern that does not conform to the         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the
4123         restrictions, pcre_exec() returns the error code  PCRE_ERROR_BADPARTIAL         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
4124         (-13).         (-13).
4125    
4126    
4127  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
4128    
4129         If  the  escape  sequence  \P  is  present in a pcretest data line, the         If the escape sequence \P is present  in  a  pcretest  data  line,  the
4130         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
4131         uses the date example quoted above:         uses the date example quoted above:
4132    
4133             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4134           data> 25jun04P           data> 25jun04\P
4135            0: 25jun04            0: 25jun04
4136            1: jun            1: jun
4137           data> 25dec3P           data> 25dec3\P
4138           Partial match           Partial match
4139           data> 3juP           data> 3ju\P
4140           Partial match           Partial match
4141           data> 3jujP           data> 3juj\P
4142           No match           No match
4143           data> jP           data> j\P
4144           No match           No match
4145    
4146         The  first  data  string  is  matched completely, so pcretest shows the         The first data string is matched  completely,  so  pcretest  shows  the
4147         matched substrings. The remaining four strings do not  match  the  com-         matched  substrings.  The  remaining four strings do not match the com-
4148         plete pattern, but the first two are partial matches.         plete pattern, but the first two are partial matches.  The  same  test,
4149           using  DFA  matching (by means of the \D escape sequence), produces the
4150           following output:
4151    
4152  Last updated: 08 September 2004             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4153  Copyright (c) 1997-2004 University of Cambridge.           data> 25jun04\P\D
4154  -----------------------------------------------------------------------------            0: 25jun04
4155             data> 23dec3\P\D
4156             Partial match: 23dec3
4157             data> 3ju\P\D
4158             Partial match: 3ju
4159             data> 3juj\P\D
4160             No match
4161             data> j\P\D
4162             No match
4163    
4164  PCRE(3)                                                                PCRE(3)         Notice that in this case the portion of the string that was matched  is
4165           made available.
4166    
4167    
4168    MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
4169    
4170           When a partial match has been found using pcre_dfa_exec(), it is possi-
4171           ble to continue the match by  providing  additional  subject  data  and
4172           calling  pcre_dfa_exec() again with the PCRE_DFA_RESTART option and the
4173           same working space (where details of the  previous  partial  match  are
4174           stored).  Here  is  an  example  using  pcretest,  where  the \R escape
4175           sequence sets the PCRE_DFA_RESTART option and the  \D  escape  sequence
4176           requests the use of pcre_dfa_exec():
4177    
4178               re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4179             data> 23ja\P\D
4180             Partial match: 23ja
4181             data> n05\R\D
4182              0: n05
4183    
4184           The  first  call has "23ja" as the subject, and requests partial match-
4185           ing; the second call  has  "n05"  as  the  subject  for  the  continued
4186           (restarted)  match.   Notice  that when the match is complete, only the
4187           last part is shown; PCRE does  not  retain  the  previously  partially-
4188           matched  string. It is up to the calling program to do that if it needs
4189           to.
4190    
4191           This facility can  be  used  to  pass  very  long  subject  strings  to
4192           pcre_dfa_exec(). However, some care is needed for certain types of pat-
4193           tern.
4194    
4195           1. If the pattern contains tests for the beginning or end  of  a  line,
4196           you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
4197           ate, when the subject string for any call does not contain  the  begin-
4198           ning or end of a line.
4199    
4200           2.  If  the  pattern contains backward assertions (including \b or \B),
4201           you need to arrange for some overlap in the subject  strings  to  allow
4202           for  this.  For example, you could pass the subject in chunks that were
4203           500 bytes long, but in a buffer of 700 bytes, with the starting  offset
4204           set to 200 and the previous 200 bytes at the start of the buffer.
4205    
4206           3.  Matching a subject string that is split into multiple segments does
4207           not always produce exactly the same result as matching over one  single
4208           long  string.   The  difference arises when there are multiple matching
4209           possibilities, because a partial match result is given only when  there
4210           are  no  completed  matches  in  a  call to pcre_dfa_exec(). This means
4211           that as soon as the shortest match has been found,  continuation  to  a
4212           new  subject  segment  is  no  longer possible.  Consider this pcretest
4213           example:
4214    
4215               re> /dog(sbody)?/
4216             data> do\P\D
4217             Partial match: do
4218             data> gsb\R\P\D
4219              0: g
4220             data> dogsbody\D
4221              0: dogsbody
4222              1: dog
4223    
4224           The pattern matches the words "dog" or "dogsbody". When the subject  is
4225           presented  in  several  parts  ("do" and "gsb" being the first two) the
4226           match stops when "dog" has been found, and it is not possible  to  con-
4227           tinue.  On  the  other  hand,  if  "dogsbody"  is presented as a single
4228           string, both matches are found.
4229    
4230           Because of this phenomenon, it does not usually make  sense  to  end  a
4231           pattern that is going to be matched in this way with a variable repeat.
4232    
4233           4. Patterns that contain alternatives at the top level which do not all
4234           start with the same pattern item may not work as expected. For example,
4235           consider this pattern:
4236    
4237             1234|3789
4238    
4239           If the first part of the subject is "ABC123", a partial  match  of  the
4240           first  alternative  is found at offset 3. There is no partial match for
4241           the second alternative, because such a match does not start at the same
4242           point  in  the  subject  string. Attempting to continue with the string
4243           "789" does not yield a match because only those alternatives that match
4244           at  one point in the subject are remembered. The problem arises because
4245           the start of the second alternative matches within the  first  alterna-
4246           tive. There is no problem with anchored patterns or patterns such as:
4247    
4248             1234|ABCD
4249    
4250           where no string can be a partial match for both alternatives.
4251    
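The buffer arrangement suggested in point 2 above (500 new bytes per call, preceded by the previous 200 bytes, in a 700-byte buffer) can be sketched as follows. This is only an outline of the bookkeeping, written in Python for brevity; it does not call PCRE at all, and the function name and parameters are hypothetical. In a real program each yielded buffer and offset would be handed to pcre_dfa_exec() as the subject and starting offset.

```python
def overlapping_windows(subject, chunk_size=500, overlap=200):
    """Yield (buffer, start_offset) pairs for segment-wise matching.

    Each buffer holds up to `overlap` bytes of already-seen context
    followed by up to `chunk_size` new bytes. start_offset is the index
    in the buffer where the new bytes begin, that is, where matching of
    this segment should start.
    """
    for pos in range(0, len(subject), chunk_size):
        context_start = max(0, pos - overlap)
        buffer = subject[context_start:pos + chunk_size]
        yield buffer, pos - context_start
```

The first window has no earlier data, so its starting offset is 0; every later window starts at offset 200, leaving the preceding bytes visible to backward assertions such as \b.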
4252    Last updated: 16 January 2006
4253    Copyright (c) 1997-2006 University of Cambridge.
4254    ------------------------------------------------------------------------------
4255    
4256    
4257    PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
4258    
4259    
4260  NAME  NAME
4261         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
4262    
4263    
4264  SAVING AND RE-USING PRECOMPILED PCRE PATTERNS  SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
4265    
4266         If  you  are running an application that uses a large number of regular         If  you  are running an application that uses a large number of regular
# Line 3391  SAVING A COMPILED PATTERN Line 4329  SAVING A COMPILED PATTERN
4329  RE-USING A PRECOMPILED PATTERN  RE-USING A PRECOMPILED PATTERN
4330    
4331         Re-using a precompiled pattern is straightforward. Having  reloaded  it         Re-using a precompiled pattern is straightforward. Having  reloaded  it
4332         into main memory, you pass its pointer to pcre_exec() in the usual way.         into   main   memory,   you   pass   its   pointer  to  pcre_exec()  or
4333         This should work even on another host, and even if that  host  has  the         pcre_dfa_exec() in the usual way. This  should  work  even  on  another
4334         opposite endianness to the one where the pattern was compiled.         host,  and  even  if  that  host has the opposite endianness to the one
4335           where the pattern was compiled.
4336         However,  if  you  passed a pointer to custom character tables when the  
4337         pattern was compiled (the tableptr  argument  of  pcre_compile()),  you         However, if you passed a pointer to custom character  tables  when  the
4338         must now pass a similar pointer to pcre_exec(), because the value saved         pattern  was  compiled  (the  tableptr argument of pcre_compile()), you
4339         with the compiled pattern will obviously be  nonsense.  A  field  in  a         must now pass a similar  pointer  to  pcre_exec()  or  pcre_dfa_exec(),
4340         pcre_extra()  block is used to pass this data, as described in the sec-         because  the  value  saved  with the compiled pattern will obviously be
4341         tion on matching a pattern in the pcreapi documentation.         nonsense. A field in a pcre_extra() block is used to pass this data, as
4342           described  in the section on matching a pattern in the pcreapi documen-
4343           tation.
4344    
4345         If you did not provide custom character tables  when  the  pattern  was         If you did not provide custom character tables  when  the  pattern  was
4346         compiled,  the  pointer  in  the compiled pattern is NULL, which causes         compiled,  the  pointer  in  the compiled pattern is NULL, which causes
# Line 3411  RE-USING A PRECOMPILED PATTERN Line 4351  RE-USING A PRECOMPILED PATTERN
4351         your own pcre_extra data block and set the study_data field to point to         your own pcre_extra data block and set the study_data field to point to
4352         the  reloaded  study  data. You must also set the PCRE_EXTRA_STUDY_DATA         the  reloaded  study  data. You must also set the PCRE_EXTRA_STUDY_DATA
4353         bit in the flags field to indicate that study  data  is  present.  Then         bit in the flags field to indicate that study  data  is  present.  Then
4354         pass the pcre_extra block to pcre_exec() in the usual way.         pass  the  pcre_extra  block  to  pcre_exec() or pcre_dfa_exec() in the
4355           usual way.
4356    
4357    
4358  COMPATIBILITY WITH DIFFERENT PCRE RELEASES  COMPATIBILITY WITH DIFFERENT PCRE RELEASES
4359    
4360         The  layout  of the control block that is at the start of the data that         The layout of the control block that is at the start of the  data  that
4361         makes up a compiled pattern was changed for release 5.0.  If  you  have         makes  up  a  compiled pattern was changed for release 5.0. If you have
4362         any  saved  patterns  that  were compiled with previous releases (not a         any saved patterns that were compiled with  previous  releases  (not  a
4363         facility that was previously advertised), you will  have  to  recompile         facility  that  was  previously advertised), you will have to recompile
4364         them  for  release  5.0. However, from now on, it should be possible to         them for release 5.0. However, from now on, it should  be  possible  to
4365         make changes in a compabible manner.         make changes in a compatible manner.
4366    
4367           Notwithstanding the above, if you have any saved patterns in UTF-8 mode
4368           that use \p or \P that were compiled with any release up to and includ-
4369           ing 6.4, you will have to recompile them for release 6.5 and above.
4370    
4371    Last updated: 01 February 2006
4372    Copyright (c) 1997-2006 University of Cambridge.
4373    ------------------------------------------------------------------------------
4374    
 Last updated: 10 September 2004  
 Copyright (c) 1997-2004 University of Cambridge.  
 -----------------------------------------------------------------------------  
   
 PCRE(3)                                                                PCRE(3)  
4375    
4376    PCREPERFORM(3)                                                  PCREPERFORM(3)
4377    
4378    
4379  NAME  NAME
4380         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
4381    
4382    
4383  PCRE PERFORMANCE  PCRE PERFORMANCE
4384    
4385         Certain  items  that may appear in regular expression patterns are more         Certain  items  that may appear in regular expression patterns are more
# Line 3469  PCRE PERFORMANCE Line 4415  PCRE PERFORMANCE
4415    
4416         If you are using such a pattern with subject strings that do  not  con-         If you are using such a pattern with subject strings that do  not  con-
4417         tain newlines, the best performance is obtained by setting PCRE_DOTALL,         tain newlines, the best performance is obtained by setting PCRE_DOTALL,
4418         or starting the pattern with ^.* to indicate explicit  anchoring.  That         or starting the pattern with ^.* or ^.*? to indicate  explicit  anchor-
4419         saves  PCRE from having to scan along the subject looking for a newline         ing.  That saves PCRE from having to scan along the subject looking for
4420         to restart at.         a newline to restart at.
4421    
4422         Beware of patterns that contain nested indefinite  repeats.  These  can         Beware of patterns that contain nested indefinite  repeats.  These  can
4423         take  a  long time to run when applied to a string that does not match.         take  a  long time to run when applied to a string that does not match.
# Line 3492  PCRE PERFORMANCE Line 4438  PCRE PERFORMANCE
4438           (a+)*b           (a+)*b
4439    
4440         where a literal character follows. Before  embarking  on  the  standard         where a literal character follows. Before  embarking  on  the  standard
4441         matching  procedure,  PCRE  checks  that  there  is  a "b" later in the         matching  procedure,  PCRE checks that there is a "b" later in the sub-
4442         subject string, and if there is not, it fails  the  match  immediately.         ject string, and if there is not, it fails the match immediately.  How-
4443         However, when there is no following literal this optimization cannot be         ever,  when  there  is no following literal this optimization cannot be
4444         used. You can see the difference by comparing the behaviour of         used. You can see the difference by comparing the behaviour of
4445    
4446           (a+)*\d           (a+)*\d
# Line 3506  PCRE PERFORMANCE Line 4452  PCRE PERFORMANCE
4452         In many cases, the solution to this kind of performance issue is to use         In many cases, the solution to this kind of performance issue is to use
4453         an atomic group or a possessive quantifier.         an atomic group or a possessive quantifier.
4454    
4455  Last updated: 09 September 2004  Last updated: 28 February 2005
4456  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2005 University of Cambridge.
4457  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
4458    
 PCRE(3)                                                                PCRE(3)  
4459    
4460    PCREPOSIX(3)                                                      PCREPOSIX(3)
4461    
4462    
4463  NAME  NAME
4464         PCRE - Perl-compatible regular expressions.         PCRE - Perl-compatible regular expressions.
4465    
4466    
4467  SYNOPSIS OF POSIX API  SYNOPSIS OF POSIX API
4468    
4469         #include <pcreposix.h>         #include <pcreposix.h>
# Line 3537  DESCRIPTION Line 4484  DESCRIPTION
4484    
4485         This  set  of  functions provides a POSIX-style API to the PCRE regular         This  set  of  functions provides a POSIX-style API to the PCRE regular
4486         expression package. See the pcreapi documentation for a description  of         expression package. See the pcreapi documentation for a description  of
4487         PCRE's native API, which contains additional functionality.         PCRE's native API, which contains much additional functionality.
4488    
4489         The functions described here are just wrapper functions that ultimately         The functions described here are just wrapper functions that ultimately
4490         call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the         call  the  PCRE  native  API.  Their  prototypes  are  defined