Diff of /code/trunk/doc/pcre.txt

revision 75 by nigel, Sat Feb 24 21:40:37 2007 UTC revision 87 by nigel, Sat Feb 24 21:41:21 2007 UTC
# Line 6  synopses of each function in the library Line 6  synopses of each function in the library
6  separate text files for the pcregrep and pcretest commands.  separate text files for the pcregrep and pcretest commands.
7  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
8    
 PCRE(3)                                                                PCRE(3)  
9    
10    PCRE(3)                                                                PCRE(3)
11    
12    
13  NAME  NAME
14         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
15    
16    
17  INTRODUCTION  INTRODUCTION
18    
19         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
20         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
21         just  a  few  differences.  The current implementation of PCRE (release         just  a  few  differences.  The current implementation of PCRE (release
22         5.x) corresponds approximately with Perl  5.8,  including  support  for         6.x) corresponds approximately with Perl  5.8,  including  support  for
23         UTF-8 encoded strings and Unicode general category properties. However,         UTF-8 encoded strings and Unicode general category properties. However,
24         this support has to be explicitly enabled; it is not the default.         this support has to be explicitly enabled; it is not the default.
25    
26           In addition to the Perl-compatible matching function,  PCRE  also  con-
27           tains  an  alternative matching function that matches the same compiled
28           patterns in a different way. In certain circumstances, the  alternative
29           function  has  some  advantages.  For  a discussion of the two matching
30           algorithms, see the pcrematching page.
31    
32         PCRE is written in C and released as a C library. A  number  of  people         PCRE is written in C and released as a C library. A  number  of  people
33         have  written  wrappers and interfaces of various kinds. A C++ class is         have  written  wrappers and interfaces of various kinds. In particular,
34         included in these contributions, which can  be  found  in  the  Contrib         Google Inc.  have provided a comprehensive C++  wrapper.  This  is  now
35         directory at the primary FTP site, which is:         included as part of the PCRE distribution. The pcrecpp page has details
36           of this interface. Other people's contributions can  be  found  in  the
37           Contrib directory at the primary FTP site, which is:
38    
39         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
40    
# Line 40  INTRODUCTION Line 49  INTRODUCTION
49         ing  PCRE for various operating systems can be found in the README file         ing  PCRE for various operating systems can be found in the README file
50         in the source distribution.         in the source distribution.
51    
52           The library contains a number of undocumented  internal  functions  and
53           data  tables  that  are  used by more than one of the exported external
54           functions, but which are not intended  for  use  by  external  callers.
55           Their  names  all begin with "_pcre_", which hopefully will not provoke
56           any name clashes. In some environments, it is possible to control which
57           external  symbols  are  exported when a shared library is built, and in
58           these cases the undocumented symbols are not exported.
59    
60    
61  USER DOCUMENTATION  USER DOCUMENTATION
62    
# Line 50  USER DOCUMENTATION Line 67  USER DOCUMENTATION
67         of searching. The sections are as follows:         of searching. The sections are as follows:
68    
69           pcre              this document           pcre              this document
70           pcreapi           details of PCRE's native API           pcreapi           details of PCRE's native C API
71           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
72           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
73           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
74             pcrecpp           details of the C++ wrapper
75           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
76             pcrematching      discussion of the two matching algorithms
77           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
78           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
79                               regular expressions                               regular expressions
80           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
81           pcreposix         the POSIX-compatible API           pcreposix         the POSIX-compatible C API
82           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
83           pcresample        discussion of the sample program           pcresample        discussion of the sample program
84           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
85    
86         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
87         each library function, listing its arguments and results.         each C library function, listing its arguments and results.
88    
89    
90  LIMITATIONS  LIMITATIONS
# Line 90  LIMITATIONS Line 109  LIMITATIONS
109         tern, is 200.         tern, is 200.
110    
111         The  maximum  length of a subject string is the largest positive number         The  maximum  length of a subject string is the largest positive number
112         that an integer variable can hold. However, PCRE uses recursion to han-         that an integer variable can hold. However, when using the  traditional
113         dle  subpatterns  and indefinite repetition. This means that the avail-         matching function, PCRE uses recursion to handle subpatterns and indef-
114         able stack space may limit the size of a subject  string  that  can  be         inite repetition.  This means that the available stack space may  limit
115         processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
116    
117    
118  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
119    
120         From  release  3.3,  PCRE  has  had  some support for character strings         From release 3.3, PCRE has  had  some  support  for  character  strings
121         encoded in the UTF-8 format. For release 4.0 this was greatly  extended         encoded  in the UTF-8 format. For release 4.0 this was greatly extended
122         to  cover  most common requirements, and in release 5.0 additional sup-         to cover most common requirements, and in release 5.0  additional  sup-
123         port for Unicode general category properties was added.         port for Unicode general category properties was added.
124    
125         In order to process UTF-8 strings, you must build PCRE to include UTF-8         In order to process UTF-8 strings, you must build PCRE to include UTF-8
126         support  in  the  code,  and, in addition, you must call pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
127         with the PCRE_UTF8 option flag. When you do this, both the pattern  and         with  the PCRE_UTF8 option flag. When you do this, both the pattern and
128         any  subject  strings  that are matched against it are treated as UTF-8         any subject strings that are matched against it are  treated  as  UTF-8
129         strings instead of just strings of bytes.         strings instead of just strings of bytes.
130    
131         If you compile PCRE with UTF-8 support, but do not use it at run  time,         If  you compile PCRE with UTF-8 support, but do not use it at run time,
132         the  library will be a bit bigger, but the additional run time overhead         the library will be a bit bigger, but the additional run time  overhead
133         is limited to testing the PCRE_UTF8 flag in several places,  so  should         is  limited  to testing the PCRE_UTF8 flag in several places, so should
134         not be very large.         not be very large.
135    
136         If PCRE is built with Unicode character property support (which implies         If PCRE is built with Unicode character property support (which implies
137         UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
138         ported.  The available properties that can be tested are limited to the         ported.  The available properties that can be tested are limited to the
139         general category properties such as Lu for an upper case letter  or  Nd         general  category  properties such as Lu for an upper case letter or Nd
140         for  a decimal number. A full list is given in the pcrepattern documen-         for a decimal number, the Unicode script names such as Arabic  or  Han,
141         tation. The PCRE library is increased in size by about 90K when Unicode         and  the  derived  properties  Any  and L&. A full list is given in the
142         property support is included.         pcrepattern documentation. Only the short names for properties are sup-
143           ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
144           ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
145           optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
146           does not support this.
147    
148         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
149    
150         1.  When you set the PCRE_UTF8 flag, the strings passed as patterns and         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
151         subjects are checked for validity on entry to the  relevant  functions.         subjects  are  checked for validity on entry to the relevant functions.
152         If an invalid UTF-8 string is passed, an error return is given. In some         If an invalid UTF-8 string is passed, an error return is given. In some
153         situations, you may already know  that  your  strings  are  valid,  and         situations,  you  may  already  know  that  your strings are valid, and
154         therefore want to skip these checks in order to improve performance. If         therefore want to skip these checks in order to improve performance. If
155         you set the PCRE_NO_UTF8_CHECK flag at compile time  or  at  run  time,         you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
156         PCRE  assumes  that  the  pattern or subject it is given (respectively)         PCRE assumes that the pattern or subject  it  is  given  (respectively)
157         contains only valid UTF-8 codes. In this case, it does not diagnose  an         contains  only valid UTF-8 codes. In this case, it does not diagnose an
158         invalid  UTF-8 string. If you pass an invalid UTF-8 string to PCRE when         invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
159         PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program  may         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
160         crash.         crash.
161    
162         2. In a pattern, the escape sequence \x{...}, where the contents of the         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
163         braces is a string of hexadecimal digits, is  interpreted  as  a  UTF-8         two-byte UTF-8 character if the value is greater than 127.
        character  whose code number is the given hexadecimal number, for exam-  
        ple: \x{1234}. If a non-hexadecimal digit appears between  the  braces,  
        the item is not recognized.  This escape sequence can be used either as  
        a literal, or within a character class.  
164    
165         3. The original hexadecimal escape sequence, \xhh, matches  a  two-byte         3.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-
        UTF-8 character if the value is greater than 127.  
   
        4.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-  
166         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
167    
168         5. The dot metacharacter matches one UTF-8 character instead of a  sin-         4. The dot metacharacter matches one UTF-8 character instead of a  sin-
169         gle byte.         gle byte.
170    
171         6.  The  escape sequence \C can be used to match a single byte in UTF-8         5.  The  escape sequence \C can be used to match a single byte in UTF-8
172         mode, but its use can lead to some strange effects.         mode, but its use can lead to some strange effects.  This  facility  is
173           not available in the alternative matching function, pcre_dfa_exec().
174         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly  
175         test  characters of any code value, but the characters that PCRE recog-         6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
176         nizes as digits, spaces, or word characters  remain  the  same  set  as         test characters of any code value, but the characters that PCRE  recog-
177           nizes  as  digits,  spaces,  or  word characters remain the same set as
178         before, all with values less than 256. This remains true even when PCRE         before, all with values less than 256. This remains true even when PCRE
179         includes Unicode property support, because to do otherwise  would  slow         includes  Unicode  property support, because to do otherwise would slow
180         down  PCRE in many common cases. If you really want to test for a wider         down PCRE in many common cases. If you really want to test for a  wider
181         sense of, say, "digit", you must use Unicode  property  tests  such  as         sense  of,  say,  "digit",  you must use Unicode property tests such as
182         \p{Nd}.         \p{Nd}.
183    
184         8.  Similarly,  characters that match the POSIX named character classes         7. Similarly, characters that match the POSIX named  character  classes
185         are all low-valued characters.         are all low-valued characters.
186    
187         9. Case-insensitive matching applies only to  characters  whose  values         8.  Case-insensitive  matching  applies only to characters whose values
188         are  less than 128, unless PCRE is built with Unicode property support.         are less than 128, unless PCRE is built with Unicode property  support.
189         Even when Unicode property support is available, PCRE  still  uses  its         Even  when  Unicode  property support is available, PCRE still uses its
190         own  character  tables when checking the case of low-valued characters,         own character tables when checking the case of  low-valued  characters,
191         so as not to degrade performance.  The Unicode property information  is         so  as not to degrade performance.  The Unicode property information is
192         used only for characters with higher values.         used only for characters with higher values. Even when Unicode property
193           support is available, PCRE supports case-insensitive matching only when
194           there is a one-to-one mapping between a letter's  cases.  There  are  a
195           small  number  of  many-to-one  mappings in Unicode; these are not sup-
196           ported by PCRE.
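
The comments above distinguish characters from bytes and mention a validity check on entry. The following is a minimal sketch of that idea in C, not PCRE's own checking code. The function name utf8_count_chars is hypothetical, and the rules it applies follow RFC 3629 (at most four bytes per character, no overlong forms, no surrogates), which is an assumption and may be stricter than what this release of PCRE accepts.

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only (not part of PCRE).
 * Returns the number of UTF-8 characters in s[0..len), or -1 if the
 * bytes are not well-formed UTF-8 according to RFC 3629.
 */
long utf8_count_chars(const unsigned char *s, size_t len)
{
    size_t i = 0;
    long count = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t extra, k;

        /* The lead byte determines how many continuation bytes follow. */
        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return -1;              /* stray continuation or bad lead byte */

        if (extra > len - i - 1)
            return -1;               /* character is truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return -1;           /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and out-of-range values. */
        if (extra == 1 && cp < 0x80)    return -1;
        if (extra == 2 && cp < 0x800)   return -1;
        if (extra == 3 && cp < 0x10000) return -1;
        if (cp > 0x10FFFF)              return -1;
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;

        i += extra + 1;
        count++;
    }
    return count;
}
```

With this sketch, the six bytes that encode three copies of the character with code 0x100 (C4 80 C4 80 C4 80) count as three characters, which is the unit that \x{100}{3} and the dot metacharacter work in when UTF-8 mode is on.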
197    
198    
199  AUTHOR  AUTHOR
200    
201         Philip Hazel <ph10@cam.ac.uk>         Philip Hazel
202         University Computing Service,         University Computing Service,
203         Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
        Phone: +44 1223 334714  
204    
205  Last updated: 09 September 2004         Putting an actual email address here seems to have been a spam  magnet,
206  Copyright (c) 1997-2004 University of Cambridge.         so I've taken it away. If you want to email me, use my initial and sur-
207  -----------------------------------------------------------------------------         name, separated by a dot, at the domain ucs.cam.ac.uk.
208    
209    Last updated: 24 January 2006
210    Copyright (c) 1997-2006 University of Cambridge.
211    ------------------------------------------------------------------------------
212    
 PCRE(3)                                                                PCRE(3)  
213    
214    PCREBUILD(3)                                                      PCREBUILD(3)
215    
216    
217  NAME  NAME
218         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
219    
220    
221  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
222    
223         This  document  describes  the  optional  features  of PCRE that can be         This  document  describes  the  optional  features  of PCRE that can be
# Line 212  PCRE BUILD-TIME OPTIONS Line 237  PCRE BUILD-TIME OPTIONS
237         not described.         not described.
238    
239    
240    C++ SUPPORT
241    
242           By default, the configure script will search for a C++ compiler and C++
243           header files. If it finds them, it automatically builds the C++ wrapper
244           library for PCRE. You can disable this by adding
245    
246             --disable-cpp
247    
248           to the configure command.
249    
250    
251  UTF-8 SUPPORT  UTF-8 SUPPORT
252    
253         To build PCRE with support for UTF-8 character strings, add         To build PCRE with support for UTF-8 character strings, add
# Line 287  POSIX MALLOC USAGE Line 323  POSIX MALLOC USAGE
323  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
324    
325         Internally,  PCRE has a function called match(), which it calls repeat-         Internally,  PCRE has a function called match(), which it calls repeat-
326         edly (possibly recursively) when matching a pattern. By controlling the         edly  (possibly  recursively)  when  matching  a   pattern   with   the
327         maximum  number  of  times  this function may be called during a single         pcre_exec()  function.  By controlling the maximum number of times this
328         matching operation, a limit can be placed on the resources  used  by  a         function may be called during a single matching operation, a limit  can
329         single  call  to  pcre_exec(). The limit can be changed at run time, as         be  placed  on  the resources used by a single call to pcre_exec(). The
330         described in the pcreapi documentation. The default is 10 million,  but         limit can be changed at run time, as described in the pcreapi  documen-
331         this can be changed by adding a setting such as         tation.  The default is 10 million, but this can be changed by adding a
332           setting such as
333    
334           --with-match-limit=500000           --with-match-limit=500000
335    
336         to the configure command.         to  the  configure  command.  This  setting  has  no  effect   on   the
337           pcre_dfa_exec() matching function.
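
To illustrate how counting calls to a recursive matching function bounds the work done by one match, here is a toy backtracking matcher in C. It is a sketch only and is not PCRE's match() function. All names (toy_match, toy_here, toy_star, toy_budget) are hypothetical, and the pattern language is an assumption made for brevity: literal characters, '.', and 'x*', matched from the start of the subject only.

```c
#include <stddef.h>

#define TOY_NOMATCH 0
#define TOY_MATCH   1
#define TOY_LIMIT   (-1)   /* call budget exhausted before an answer */

/* Hypothetical bookkeeping: calls made so far and the permitted maximum. */
typedef struct {
    unsigned long calls;
    unsigned long limit;
} toy_budget;

static int toy_here(const char *re, const char *s, toy_budget *b);

/* Match c* followed by re: try zero repetitions first, then one more each time. */
static int toy_star(int c, const char *re, const char *s, toy_budget *b)
{
    for (;;) {
        int rc = toy_here(re, s, b);
        if (rc != TOY_NOMATCH)
            return rc;                  /* a match, or the limit was hit */
        if (*s == '\0' || (c != '.' && *s != c))
            return TOY_NOMATCH;         /* cannot consume another character */
        s++;
    }
}

/* Match re at the current position of s; every call is counted. */
static int toy_here(const char *re, const char *s, toy_budget *b)
{
    if (++b->calls > b->limit)
        return TOY_LIMIT;
    if (re[0] == '\0')
        return TOY_MATCH;
    if (re[1] == '*')
        return toy_star(re[0], re + 2, s, b);
    if (*s != '\0' && (re[0] == '.' || re[0] == *s))
        return toy_here(re + 1, s + 1, b);
    return TOY_NOMATCH;
}

/* Entry point: returns TOY_MATCH, TOY_NOMATCH, or TOY_LIMIT. */
int toy_match(const char *re, const char *s, unsigned long limit,
              unsigned long *calls_out)
{
    toy_budget b;
    int rc;

    b.calls = 0;
    b.limit = limit;
    rc = toy_here(re, s, &b);
    if (calls_out != NULL)
        *calls_out = b.calls;
    return rc;
}
```

In this sketch, a pattern such as a*a*a*c applied to twenty "a" characters fails only after a few thousand calls, so a limit of 1000 stops it early with TOY_LIMIT, while a generous limit lets it run to an ordinary "no match" result. The real limit plays the same role on a much larger scale.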
338    
339    
340  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
# Line 324  HANDLING VERY LARGE PATTERNS Line 362  HANDLING VERY LARGE PATTERNS
362    
363  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
364    
365         PCRE  implements  backtracking while matching by making recursive calls         When matching with the pcre_exec() function, PCRE implements backtrack-
366         to an internal function called match(). In environments where the  size         ing by making recursive calls to an internal function  called  match().
367         of the stack is limited, this can severely limit PCRE's operation. (The         In  environments  where  the size of the stack is limited, this can se-
368         Unix environment does not usually suffer from this problem.) An  alter-         verely limit PCRE's operation. (The Unix environment does  not  usually
369         native  approach  that  uses  memory  from  the  heap to remember data,         suffer  from  this  problem.)  An alternative approach that uses memory
370         instead of using recursive function calls, has been implemented to work         from the heap to remember data, instead  of  using  recursive  function
371         round  this  problem. If you want to build a version of PCRE that works         calls,  has been implemented to work round this problem. If you want to
372         this way, add         build a version of PCRE that works this way, add
373    
374           --disable-stack-for-recursion           --disable-stack-for-recursion
375    
# Line 342  AVOIDING EXCESSIVE STACK USAGE Line 380  AVOIDING EXCESSIVE STACK USAGE
380         the blocks are always freed in reverse order. A calling  program  might         the blocks are always freed in reverse order. A calling  program  might
381         be  able  to implement optimized functions that perform better than the         be  able  to implement optimized functions that perform better than the
382         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more
383         slowly when built in this way.         slowly when built in this way. This option affects only the pcre_exec()
384           function; it is not relevant for the pcre_dfa_exec() function.
385    
386    
387  USING EBCDIC CODE  USING EBCDIC CODE
388    
389         PCRE  assumes  by  default that it will run in an environment where the         PCRE assumes by default that it will run in an  environment  where  the
390         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).
391         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by         PCRE can, however, be compiled to  run  in  an  EBCDIC  environment  by
392         adding         adding
393    
394           --enable-ebcdic           --enable-ebcdic
395    
396         to the configure command.         to the configure command.
397    
398  Last updated: 09 September 2004  Last updated: 15 August 2005
399  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2005 University of Cambridge.
400  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
401    
402    
403    PCREMATCHING(3)                                                PCREMATCHING(3)
404    
405    
406    NAME
407           PCRE - Perl-compatible regular expressions
408    
409    
410    PCRE MATCHING ALGORITHMS
411    
412           This document describes the two different algorithms that are available
413           in PCRE for matching a compiled regular expression against a given sub-
414           ject  string.  The  "standard"  algorithm  is  the  one provided by the
415           pcre_exec() function.  This works in the same way  as  Perl's  matching
416           function, and provides a Perl-compatible matching operation.
417    
418           An  alternative  algorithm is provided by the pcre_dfa_exec() function;
419           this operates in a different way, and is not  Perl-compatible.  It  has
420           advantages  and disadvantages compared with the standard algorithm, and
421           these are described below.
422    
423           When there is only one possible way in which a given subject string can
424           match  a pattern, the two algorithms give the same answer. A difference
425           arises, however, when there are multiple possibilities. For example, if
426           the pattern
427    
428             ^<.*>
429    
430           is matched against the string
431    
432             <something> <something else> <something further>
433    
434           there are three possible answers. The standard algorithm finds only one
435           of them, whereas the DFA algorithm finds all three.
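
The three possible answers can be listed with a few lines of C. This is a sketch written for this one pattern, ^<.*>, and does not use PCRE at all. The function name angle_match_lengths is hypothetical. It assumes the default behaviour in which dot does not match a newline.

```c
#include <stddef.h>

/*
 * Illustration only: enumerate every way the pattern ^<.*> can match at
 * the start of subject. Each '>' that follows an initial '<', with no
 * newline in between, ends one possible match.
 *
 * Stores up to max match lengths in lengths[], shortest first, and
 * returns the total number of possible matches.
 */
size_t angle_match_lengths(const char *subject, size_t *lengths, size_t max)
{
    size_t count = 0;
    size_t i;

    if (subject[0] != '<')
        return 0;                       /* ^< must match at offset 0 */

    for (i = 1; subject[i] != '\0' && subject[i] != '\n'; i++) {
        if (subject[i] == '>') {
            if (count < max)
                lengths[count] = i + 1; /* match covers subject[0..i] */
            count++;
        }
    }
    return count;
}
```

For the subject string shown above, the sketch reports matches of length 11, 28, and 48. With a greedy quantifier, as written in the pattern, the standard algorithm reports only the longest of these. The DFA algorithm reports all three.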
436    
437    
438    REGULAR EXPRESSIONS AS TREES
439    
440           The set of strings that are matched by a regular expression can be rep-
441           resented  as  a  tree structure. An unlimited repetition in the pattern
442           makes the tree of infinite size, but it is still a tree.  Matching  the
443           pattern  to a given subject string (from a given starting point) can be
444           thought of as a search of the tree.  There are  two  standard  ways  to
445           search  a  tree: depth-first and breadth-first, and these correspond to
446           the two matching algorithms provided by PCRE.
447    
448    
449    THE STANDARD MATCHING ALGORITHM
450    
451           In the terminology of Jeffrey Friedl's book Mastering  Regular  Expres-
452           sions,  the  standard  algorithm  is  an "NFA algorithm". It conducts a
453           depth-first search of the pattern tree. That is, it  proceeds  along  a
454           single path through the tree, checking that the subject matches what is
455           required. When there is a mismatch, the algorithm  tries  any  alterna-
456           tives  at  the  current point, and if they all fail, it backs up to the
457           previous branch point in the  tree,  and  tries  the  next  alternative
458           branch  at  that  level.  This often involves backing up (moving to the
459           left) in the subject string as well.  The  order  in  which  repetition
460           branches  are  tried  is controlled by the greedy or ungreedy nature of
461           the quantifier.
462    
463           If a leaf node is reached, a matching string has  been  found,  and  at
464           that  point the algorithm stops. Thus, if there is more than one possi-
465           ble match, this algorithm returns the first one that it finds.  Whether
466           this  is the shortest, the longest, or some intermediate length depends
467           on the way the greedy and ungreedy repetition quantifiers are specified
468           in the pattern.
469    
470           Because  it  ends  up  with a single path through the tree, it is rela-
471           tively straightforward for this algorithm to keep  track  of  the  sub-
472           strings  that  are  matched  by portions of the pattern in parentheses.
473           This provides support for capturing parentheses and back references.
474    
475    
476    THE DFA MATCHING ALGORITHM
477    
478           DFA stands for "deterministic finite automaton", but you do not need to
479           understand the origins of that name. This algorithm conducts a breadth-
480           first search of the tree. Starting from the first matching point in the
481           subject,  it scans the subject string from left to right, once, charac-
482           ter by character, and as it does  this,  it  remembers  all  the  paths
483           through the tree that represent valid matches.
484    
485           The  scan  continues until either the end of the subject is reached, or
486           there are no more unterminated paths. At this point,  terminated  paths
487           represent  the different matching possibilities (if there are none, the
488           match has failed).  Thus, if there is more  than  one  possible  match,
489           this algorithm finds all of them, and in particular, it finds the long-
490           est. In PCRE, there is an option to stop the algorithm after the  first
491           match (which is necessarily the shortest) has been found.
492    
493           Note that all the matches that are found start at the same point in the
494           subject. If the pattern
495    
496             cat(er(pillar)?)?
497    
498           is matched against the string "the caterpillar catchment",  the  result
499           will  be the three strings "cat", "cater", and "caterpillar" that start
500           at the fifth character of the subject. The algorithm does not
501           automatically move on to find matches that start at later positions.
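
The single left-to-right scan can be sketched in C as follows. This is not how pcre_dfa_exec() is implemented. It is a simplified model in which the paths through the tree are given as literal strings (here "cat", "cater", and "caterpillar", the three strings named as results above). The function name bfs_match_ends and the MAX_PATHS bound are hypothetical.

```c
#include <stddef.h>

#define MAX_PATHS 8   /* arbitrary bound for this sketch */

/*
 * Illustration only: scan subject once from offset start, keeping every
 * path that still agrees with the subject. A path that reaches its end
 * is a match. Its end offset is stored in ends[] (up to max_ends
 * entries, shortest match first). Returns the number of matches found.
 *
 * start must not exceed the length of subject. Paths beyond MAX_PATHS
 * are ignored.
 */
size_t bfs_match_ends(const char *const *paths, size_t npaths,
                      const char *subject, size_t start,
                      size_t *ends, size_t max_ends)
{
    int alive[MAX_PATHS];
    size_t nalive, found = 0, pos, p;

    if (npaths > MAX_PATHS)
        npaths = MAX_PATHS;
    for (p = 0; p < npaths; p++)
        alive[p] = 1;
    nalive = npaths;

    /* One pass over the subject; all active paths advance together. */
    for (pos = 0; nalive > 0; pos++) {
        char c = subject[start + pos];

        for (p = 0; p < npaths; p++) {
            if (!alive[p])
                continue;
            if (paths[p][pos] == '\0') {
                /* Path terminated: record a match ending here. */
                if (found < max_ends)
                    ends[found] = start + pos;
                found++;
                alive[p] = 0;
                nalive--;
            } else if (c == '\0' || paths[p][pos] != c) {
                /* Path disagrees with the subject: drop it. */
                alive[p] = 0;
                nalive--;
            }
        }
    }
    return found;
}
```

Called on "the caterpillar catchment" with a zero-based start offset of 4, the sketch finds three matches, ending at offsets 7, 9, and 15. Finding the single match in "catchment" takes a second call with a start offset of 16, because the scan itself never moves the starting point.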
502    
503           There are a number of features of PCRE regular expressions that are not
504           supported by the DFA matching algorithm. They are as follows:
505    
506           1. Because the algorithm finds all  possible  matches,  the  greedy  or
507           ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
508           ungreedy quantifiers are treated in exactly the same way.
509    
510           2. When dealing with multiple paths through the tree simultaneously, it
511           is  not  straightforward  to  keep track of captured substrings for the
512           different matching possibilities, and  PCRE's  implementation  of  this
513           algorithm does not attempt to do this. This means that no captured sub-
514           strings are available.
515    
516           3. Because no substrings are captured, back references within the  pat-
517           tern are not supported, and cause errors if encountered.
518    
519           4.  For  the same reason, conditional expressions that use a backrefer-
520           ence as the condition are not supported.
521    
522           5. Callouts are supported, but the value of the  capture_top  field  is
523           always 1, and the value of the capture_last field is always -1.
524    
525           6.  The \C escape sequence, which (in the standard algorithm) matches a
526           single byte, even in UTF-8 mode, is not supported because the DFA algo-
527           rithm moves through the subject string one character at a time, for all
528           active paths through the tree.
529    
530    
531    ADVANTAGES OF THE DFA ALGORITHM
532    
533           Using the DFA matching algorithm provides the following advantages:
534    
535           1. All possible matches (at a single point in the subject) are automat-
536           ically  found,  and  in particular, the longest match is found. To find
537           more than one match using the standard algorithm, you have to do kludgy
538           things with callouts.
539    
540           2.  There is much better support for partial matching. The restrictions
541           on the content of the pattern that apply when using the standard  algo-
542           rithm  for partial matching do not apply to the DFA algorithm. For non-
543           anchored patterns, the starting position of a partial match  is  avail-
544           able.
545    
546           3.  Because  the  DFA algorithm scans the subject string just once, and
547           never needs to backtrack, it is possible  to  pass  very  long  subject
548           strings  to  the matching function in several pieces, checking for par-
549           tial matching each time.
550    
551    
552    DISADVANTAGES OF THE DFA ALGORITHM
553    
554           The DFA algorithm suffers from a number of disadvantages:
555    
556           1. It is substantially slower than  the  standard  algorithm.  This  is
557           partly  because  it has to search for all possible matches, but is also
558           because it is less susceptible to optimization.
559    
560           2. Capturing parentheses and back references are not supported.
561    
562           3. The "atomic group" feature of PCRE regular expressions is supported,
563           but  does not provide the advantage that it does for the standard algo-
564           rithm.
565    
566    Last updated: 28 February 2005
567    Copyright (c) 1997-2005 University of Cambridge.
568    ------------------------------------------------------------------------------
569    
 PCRE(3)                                                                PCRE(3)  
570    
571    PCREAPI(3)                                                          PCREAPI(3)
572    
573    
574  NAME  NAME
575         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
576    
577    
578  PCRE NATIVE API  PCRE NATIVE API
579    
580         #include <pcre.h>         #include <pcre.h>
# Line 375  PCRE NATIVE API Line 583  PCRE NATIVE API
583              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
584              const unsigned char *tableptr);              const unsigned char *tableptr);
585    
586           pcre *pcre_compile2(const char *pattern, int options,
587                int *errorcodeptr,
588                const char **errptr, int *erroffset,
589                const unsigned char *tableptr);
590    
591         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
592              const char **errptr);              const char **errptr);
593    
# Line 382  PCRE NATIVE API Line 595  PCRE NATIVE API
595              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
596              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
597    
598           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
599                const char *subject, int length, int startoffset,
600                int options, int *ovector, int ovecsize,
601                int *workspace, int wscount);
602    
603         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
604              const char *subject, int *ovector,              const char *subject, int *ovector,
605              int stringcount, const char *stringname,              int stringcount, const char *stringname,
# Line 417  PCRE NATIVE API Line 635  PCRE NATIVE API
635    
636         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
637    
638           int pcre_refcount(pcre *code, int adjust);
639    
640         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
641    
642         char *pcre_version(void);         char *pcre_version(void);
# Line 436  PCRE API OVERVIEW Line 656  PCRE API OVERVIEW
656    
657         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
658         is also a set of wrapper functions that correspond to the POSIX regular         is also a set of wrapper functions that correspond to the POSIX regular
659         expression API.  These are described in the pcreposix documentation.         expression  API.  These  are  described in the pcreposix documentation.
660           Both of these APIs define a set of C function calls. A C++  wrapper  is
661           distributed with PCRE. It is documented in the pcrecpp page.
662    
663         The  native  API  function  prototypes  are  defined in the header file         The  native  API  C  function prototypes are defined in the header file
664         pcre.h, and on Unix systems the library itself is  called  libpcre.  It         pcre.h, and on Unix systems the library itself is called  libpcre.   It
665         can normally be accessed by adding -lpcre to the command for linking an         can normally be accessed by adding -lpcre to the command for linking an
666         application  that  uses  PCRE.  The  header  file  defines  the  macros         application  that  uses  PCRE.  The  header  file  defines  the  macros
667         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
668         bers for the library.  Applications can use these  to  include  support         bers for the library.  Applications can use these  to  include  support
669         for different releases of PCRE.         for different releases of PCRE.
670    
671         The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used         The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
672         for compiling and matching regular expressions. A sample  program  that         pcre_exec() are used for compiling and matching regular expressions  in
673         demonstrates  the  simplest  way  of using them is provided in the file         a  Perl-compatible  manner. A sample program that demonstrates the sim-
674         called pcredemo.c in the source distribution. The pcresample documenta-         plest way of using them is provided in the file  called  pcredemo.c  in
675         tion describes how to run it.         the  source distribution. The pcresample documentation describes how to
676           run it.
677         In  addition  to  the  main compiling and matching functions, there are  
678         convenience functions for extracting captured substrings from a matched         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
679         subject string.  They are:         ble,  is  also provided. It uses a different matching algorithm, which
680           allows it to find all possible matches (at a given  point  in
681           the  subject),  not  just  one. However, this algorithm does not return
682           captured substrings. A description of the two matching  algorithms  and
683           their  advantages  and disadvantages is given in the pcrematching docu-
684           mentation.
685    
686           In addition to the main compiling and  matching  functions,  there  are
687           convenience functions for extracting captured substrings from a subject
688           string that is matched by pcre_exec(). They are:
689    
690           pcre_copy_substring()           pcre_copy_substring()
691           pcre_copy_named_substring()           pcre_copy_named_substring()
# Line 466  PCRE API OVERVIEW Line 697  PCRE API OVERVIEW
697         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
698         to free the memory used for extracted strings.         to free the memory used for extracted strings.
699    
700         The function pcre_maketables() is used to  build  a  set  of  character         The  function  pcre_maketables()  is  used  to build a set of character
701         tables   in  the  current  locale  for  passing  to  pcre_compile()  or         tables  in  the  current  locale   for   passing   to   pcre_compile(),
702         pcre_exec().  This is an optional facility that is  provided  for  spe-         pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
703         cialist use. Most commonly, no special tables are passed, in which case         provided for specialist use.  Most  commonly,  no  special  tables  are
704         internal tables that are generated when PCRE is built are used.         passed,  in  which case internal tables that are generated when PCRE is
705           built are used.
706    
707         The function pcre_fullinfo() is used to find out  information  about  a         The function pcre_fullinfo() is used to find out  information  about  a
708         compiled  pattern; pcre_info() is an obsolete version that returns only         compiled  pattern; pcre_info() is an obsolete version that returns only
# Line 478  PCRE API OVERVIEW Line 710  PCRE API OVERVIEW
710         patibility.   The function pcre_version() returns a pointer to a string         patibility.   The function pcre_version() returns a pointer to a string
711         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
712    
713           The function pcre_refcount() maintains a  reference  count  in  a  data
714           block  containing  a compiled pattern. This is provided for the benefit
715           of object-oriented applications.
716    
717         The global variables pcre_malloc and pcre_free  initially  contain  the         The global variables pcre_malloc and pcre_free  initially  contain  the
718         entry  points  of  the  standard malloc() and free() functions, respec-         entry  points  of  the  standard malloc() and free() functions, respec-
719         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
# Line 487  PCRE API OVERVIEW Line 723  PCRE API OVERVIEW
723         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
724         indirections  to  memory  management functions. These special functions         indirections  to  memory  management functions. These special functions
725         are used only when PCRE is compiled to use  the  heap  for  remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
726         data,  instead  of recursive function calls. This is a non-standard way         data, instead of recursive function calls, when running the pcre_exec()
727         of building PCRE, for use in environments  that  have  limited  stacks.         function. This is a non-standard way of building PCRE, for use in envi-
728         Because  of  the greater use of memory management, it runs more slowly.         ronments that have limited stacks. Because of the greater use of memory
729         Separate functions are provided so that special-purpose  external  code         management, it runs more slowly.  Separate functions  are  provided  so
730         can be used for this case. When used, these functions are always called         that  special-purpose  external  code  can  be used for this case. When
731         in a stack-like manner (last obtained, first  freed),  and  always  for         used, these functions are always called in a  stack-like  manner  (last
732         memory blocks of the same size.         obtained,  first freed), and always for memory blocks of the same size.
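
Because the blocks are all the same size and are released in last-obtained, first-freed order, a replacement allocator can keep freed blocks on a simple list and hand them straight back. The following is a minimal sketch of that idea, not code from PCRE: the names frame_alloc, frame_free, and frame_cache_release are hypothetical, and only the function shapes (size_t in, pointer out; pointer in, nothing out) are meant to match the pcre_stack_malloc and pcre_stack_free variables.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every block. The extra union members are an
   assumption used to keep the caller's data suitably aligned. */
typedef union block_header {
    struct {
        union block_header *next;  /* link while the block sits in the cache */
        size_t size;               /* size the caller asked for */
    } info;
    long double align_ld;
    long long align_ll;
    void *align_ptr;
} block_header;

static block_header *free_list = NULL;  /* most recently freed block first */

/* Hypothetical replacement for the function behind pcre_stack_malloc. */
void *frame_alloc(size_t size)
{
    block_header *h;

    /* Reuse the most recently freed block when its size matches. */
    if (free_list != NULL && free_list->info.size == size) {
        h = free_list;
        free_list = h->info.next;
        return h + 1;
    }

    h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->info.size = size;
    h->info.next = NULL;
    return h + 1;
}

/* Hypothetical replacement for the function behind pcre_stack_free:
   the block is cached for reuse instead of being returned to the system. */
void frame_free(void *p)
{
    block_header *h;

    if (p == NULL)
        return;
    h = (block_header *)p - 1;
    h->info.next = free_list;
    free_list = h;
}

/* Return every cached block to the system, for example between matches. */
void frame_cache_release(void)
{
    while (free_list != NULL) {
        block_header *h = free_list;
        free_list = h->info.next;
        free(h);
    }
}
```

In an application linked against a PCRE build that uses the heap for remembering data, these would be installed with assignments such as pcre_stack_malloc = frame_alloc and pcre_stack_free = frame_free before matching. The sketch uses one unprotected global list, so a multi-threaded program would need its own locking or a per-thread cache.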
733    
734         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
735         by the caller to a "callout" function, which PCRE  will  then  call  at         by  the  caller  to  a "callout" function, which PCRE will then call at
736         specified  points during a matching operation. Details are given in the         specified points during a matching operation. Details are given in  the
737         pcrecallout documentation.         pcrecallout documentation.
738    
739    
740  MULTITHREADING  MULTITHREADING
741    
742         The PCRE functions can be used in  multi-threading  applications,  with         The  PCRE  functions  can be used in multi-threading applications, with
743         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
744         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
745         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
746    
747         The compiled form of a regular expression is not altered during  match-         The  compiled form of a regular expression is not altered during match-
748         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
749         at once.         at once.
750    
# Line 516  MULTITHREADING Line 752  MULTITHREADING
752  SAVING PRECOMPILED PATTERNS FOR LATER USE  SAVING PRECOMPILED PATTERNS FOR LATER USE
753    
754         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
755         later  time,  possibly by a different program, and even on a host other         later time, possibly by a different program, and even on a  host  other
756         than the one on which  it  was  compiled.  Details  are  given  in  the         than  the  one  on  which  it  was  compiled.  Details are given in the
757         pcreprecompile documentation.         pcreprecompile documentation.
758    
759    
# Line 525  CHECKING BUILD-TIME OPTIONS Line 761  CHECKING BUILD-TIME OPTIONS
761    
762         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
763    
764         The  function pcre_config() makes it possible for a PCRE client to dis-         The function pcre_config() makes it possible for a PCRE client to  dis-
765         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
766         The  pcrebuild documentation has more details about these optional fea-         The pcrebuild documentation has more details about these optional  fea-
767         tures.         tures.
768    
769         The first argument for pcre_config() is an  integer,  specifying  which         The  first  argument  for pcre_config() is an integer, specifying which
770         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
771         into which the information is  placed.  The  following  information  is         into  which  the  information  is  placed. The following information is
772         available:         available:
773    
774           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
775    
776         The  output is an integer that is set to one if UTF-8 support is avail-         The output is an integer that is set to one if UTF-8 support is  avail-
777         able; otherwise it is set to zero.         able; otherwise it is set to zero.
778    
779           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
780    
781         The output is an integer that is set to  one  if  support  for  Unicode         The  output  is  an  integer  that is set to one if support for Unicode
782         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
783    
784           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
785    
786         The  output  is an integer that is set to the value of the code that is         The output is an integer that is set to the value of the code  that  is
787         used for the newline character. It is either linefeed (10) or  carriage         used  for the newline character. It is either linefeed (10) or carriage
788         return  (13),  and  should  normally be the standard character for your         return (13), and should normally be the  standard  character  for  your
789         operating system.         operating system.
790    
791           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
792    
793         The output is an integer that contains the number  of  bytes  used  for         The  output  is  an  integer that contains the number of bytes used for
794         internal linkage in compiled regular expressions. The value is 2, 3, or         internal linkage in compiled regular expressions. The value is 2, 3, or
795         4. Larger values allow larger regular expressions to  be  compiled,  at         4.  Larger  values  allow larger regular expressions to be compiled, at
796         the  expense  of  slower matching. The default value of 2 is sufficient         the expense of slower matching. The default value of  2  is  sufficient
797         for all but the most massive patterns, since  it  allows  the  compiled         for  all  but  the  most massive patterns, since it allows the compiled
798         pattern to be up to 64K in size.         pattern to be up to 64K in size.
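
The 64K figure follows from the width of the offset: an offset stored in n bytes can address 2^(8n) bytes. As a worked check, the sketch below computes that bound for each permitted link size. The function name link_size_capacity is hypothetical, and the result is only the range an offset of that width can express; the real limit on a compiled pattern may be lower for other reasons.

```c
#include <assert.h>

/* Number of bytes addressable by an offset that is link_size bytes wide.
   Only the values 2, 3, and 4 are meaningful for PCRE_CONFIG_LINK_SIZE. */
unsigned long long link_size_capacity(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return 1ULL << (8 * link_size);
}
```

For the default link size of 2 this gives 65536 bytes, which is the 64K quoted above.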
799    
800           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
801    
802         The  output  is  an integer that contains the threshold above which the         The output is an integer that contains the threshold  above  which  the
803         POSIX interface uses malloc() for output vectors. Further  details  are         POSIX  interface  uses malloc() for output vectors. Further details are
804         given in the pcreposix documentation.         given in the pcreposix documentation.
805    
806           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
807    
808         The output is an integer that gives the default limit for the number of         The output is an integer that gives the default limit for the number of
809         internal matching function calls in a  pcre_exec()  execution.  Further         internal  matching  function  calls in a pcre_exec() execution. Further
810         details are given with pcre_exec() below.         details are given with pcre_exec() below.
811    
812             PCRE_CONFIG_MATCH_LIMIT_RECURSION
813    
814           The output is an integer that gives the default limit for the depth  of
815           recursion  when calling the internal matching function in a pcre_exec()
816           execution. Further details are given with pcre_exec() below.
817    
818           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
819    
820         The  output  is  an integer that is set to one if internal recursion is         The output is an integer that is set to one if internal recursion  when
821         implemented by recursive function calls that use the stack to  remember         running pcre_exec() is implemented by recursive function calls that use
822         their state. This is the usual way that PCRE is compiled. The output is         the stack to remember their state. This is the usual way that  PCRE  is
823         zero if PCRE was compiled to use blocks of data on the heap instead  of         compiled. The output is zero if PCRE was compiled to use blocks of data
824         recursive   function   calls.   In  this  case,  pcre_stack_malloc  and         on the  heap  instead  of  recursive  function  calls.  In  this  case,
825         pcre_stack_free are called to manage memory blocks on  the  heap,  thus         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory
826         avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
827    
828    
829  COMPILING A PATTERN  COMPILING A PATTERN
# Line 590  COMPILING A PATTERN Line 832  COMPILING A PATTERN
832              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
833              const unsigned char *tableptr);              const unsigned char *tableptr);
834    
835         The  function  pcre_compile()  is  called  to compile a pattern into an         pcre *pcre_compile2(const char *pattern, int options,
836         internal form. The pattern is a C string terminated by a  binary  zero,              int *errorcodeptr,
837         and  is  passed in the pattern argument. A pointer to a single block of              const char **errptr, int *erroffset,
838         memory that is obtained via pcre_malloc is returned. This contains  the              const unsigned char *tableptr);
839         compiled  code  and  related  data.  The  pcre  type is defined for the  
840         returned block; this is a typedef for a structure  whose  contents  are         Either of the functions pcre_compile() or pcre_compile2() can be called
841         not  externally defined. It is up to the caller to free the memory when         to compile a pattern into an internal form. The only difference between
842         it is no longer required.         the two interfaces is that pcre_compile2() has an additional  argument,
843           errorcodeptr, via which a numerical error code can be returned.
844    
845           The pattern is a C string terminated by a binary zero, and is passed in
846           the pattern argument. A pointer to a single block  of  memory  that  is
847           obtained  via  pcre_malloc is returned. This contains the compiled code
848           and related data. The pcre type is defined for the returned block; this
849           is a typedef for a structure whose contents are not externally defined.
850           It is up to the caller  to  free  the  memory  when  it  is  no  longer
851           required.
852    
853         Although the compiled code of a PCRE regex is relocatable, that is,  it         Although  the compiled code of a PCRE regex is relocatable, that is, it
854         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
855         fully relocatable, because it may contain a copy of the tableptr  argu-         fully  relocatable, because it may contain a copy of the tableptr argu-
856         ment, which is an address (see below).         ment, which is an address (see below).
857    
858         The options argument contains independent bits that affect the compila-         The options argument contains independent bits that affect the compila-
859         tion. It should be zero if  no  options  are  required.  The  available         tion.  It  should  be  zero  if  no options are required. The available
860         options  are  described  below. Some of them, in particular, those that         options are described below. Some of them, in  particular,  those  that
861         are compatible with Perl, can also be set and  unset  from  within  the         are  compatible  with  Perl,  can also be set and unset from within the
862         pattern  (see  the  detailed  description in the pcrepattern documenta-         pattern (see the detailed description  in  the  pcrepattern  documenta-
863         tion). For these options, the contents of the options  argument  speci-         tion).  For  these options, the contents of the options argument speci-
864         fies  their initial settings at the start of compilation and execution.         fies their initial settings at the start of compilation and  execution.
865         The PCRE_ANCHORED option can be set at the time of matching as well  as         The  PCRE_ANCHORED option can be set at the time of matching as well as
866         at compile time.         at compile time.
867    
868         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
869         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
870         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
871         sage. The offset from the start of the pattern to the  character  where         sage. This is a static string that is part of the library. You must not
872         the  error  was  discovered  is  placed  in  the variable pointed to by         try to free it. The offset from the start of the pattern to the charac-
873         erroffset, which must not be NULL. If it  is,  an  immediate  error  is         ter where the error was discovered is placed in the variable pointed to
874           by erroffset, which must not be NULL. If it is, an immediate  error  is
875         given.         given.
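
One common use of the offset is to print a marker underneath the pattern at the point where the error was found. The helper below is a small sketch of that, under the assumption that the caller has already obtained the offset from a failed compile; make_error_marker is a hypothetical name and is not part of the PCRE API.

```c
#include <stddef.h>
#include <string.h>

/* Fill buf with erroffset spaces followed by '^' and a terminating zero,
   so that printing the pattern and then buf on the next line points at
   the failing position. Returns 0 on success, or -1 if the offset is
   negative or the buffer is too small. */
int make_error_marker(char *buf, size_t bufsize, int erroffset)
{
    if (buf == NULL || erroffset < 0 || (size_t)erroffset + 2 > bufsize)
        return -1;
    memset(buf, ' ', (size_t)erroffset);
    buf[erroffset] = '^';
    buf[erroffset + 1] = '\0';
    return 0;
}
```

The offset counts bytes, so for a UTF-8 pattern containing multibyte characters the marker may not line up visually with the character in question.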
876    
877         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
878         character tables that are  built  when  PCRE  is  compiled,  using  the         codeptr argument is not NULL, a non-zero error code number is  returned
879         default  C  locale.  Otherwise, tableptr must be an address that is the         via  this argument in the event of an error. This is in addition to the
880         result of a call to pcre_maketables(). This value is  stored  with  the         textual error message. Error codes and messages are listed below.
881         compiled  pattern,  and used again by pcre_exec(), unless another table  
882           If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
883           character  tables  that  are  built  when  PCRE  is compiled, using the
884           default C locale. Otherwise, tableptr must be an address  that  is  the
885           result  of  a  call to pcre_maketables(). This value is stored with the
886           compiled pattern, and used again by pcre_exec(), unless  another  table
887         pointer is passed to it. For more discussion, see the section on locale         pointer is passed to it. For more discussion, see the section on locale
888         support below.         support below.
889    
890         This  code  fragment  shows a typical straightforward call to pcre_com-         This code fragment shows a typical straightforward  call  to  pcre_com-
891         pile():         pile():
892    
893           pcre *re;           pcre *re;
# Line 643  COMPILING A PATTERN Line 900  COMPILING A PATTERN
900             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
901             NULL);            /* use default character tables */             NULL);            /* use default character tables */
902    
903         The following names for option bits are defined in  the  pcre.h  header         The  following  names  for option bits are defined in the pcre.h header
904         file:         file:
905    
906           PCRE_ANCHORED           PCRE_ANCHORED
907    
908         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
909         is constrained to match only at the first matching point in the  string         is  constrained to match only at the first matching point in the string
910         that  is being searched (the "subject string"). This effect can also be         that is being searched (the "subject string"). This effect can also  be
911         achieved by appropriate constructs in the pattern itself, which is  the         achieved  by appropriate constructs in the pattern itself, which is the
912         only way to do it in Perl.         only way to do it in Perl.
913    
914           PCRE_AUTO_CALLOUT           PCRE_AUTO_CALLOUT
915    
916         If this bit is set, pcre_compile() automatically inserts callout items,         If this bit is set, pcre_compile() automatically inserts callout items,
917         all with number 255, before each pattern item. For  discussion  of  the         all  with  number  255, before each pattern item. For discussion of the
918         callout facility, see the pcrecallout documentation.         callout facility, see the pcrecallout documentation.
919    
920           PCRE_CASELESS           PCRE_CASELESS
921    
922         If  this  bit is set, letters in the pattern match both upper and lower         If this bit is set, letters in the pattern match both upper  and  lower
923         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be         case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
924         changed  within  a  pattern  by  a (?i) option setting. When running in         changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
925         UTF-8 mode, case support for high-valued characters is  available  only         always  understands the concept of case for characters whose values are
926         when PCRE is built with Unicode character property support.         less than 128, so caseless matching is always possible. For  characters
927           with  higher  values,  the concept of case is supported if PCRE is com-
928           piled with Unicode property support, but not otherwise. If you want  to
929           use  caseless  matching  for  characters 128 and above, you must ensure
930           that PCRE is compiled with Unicode property support  as  well  as  with
931           UTF-8 support.
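
The rule can be pictured with a small model of caseless comparison as it behaves in UTF-8 mode when Unicode property support is absent. This is an illustration only, not PCRE's implementation: the function names are hypothetical, and the code assumes an ASCII-compatible host character set.

```c
/* Fold a code point to lower case only when its value is below 128,
   which here means the letters 'A' to 'Z'. Higher values are returned
   unchanged, modelling a build with no Unicode property support. */
unsigned int fold_below_128(unsigned int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    return c;
}

/* Caseless comparison of two code points under that model. */
int caseless_equal_without_ucp(unsigned int a, unsigned int b)
{
    return fold_below_128(a) == fold_below_128(b);
}
```

Under this model 'K' and 'k' compare equal, whereas U+00C9 and U+00E9 (upper and lower case e-acute) do not, which is why Unicode property support is needed for caseless matching of characters 128 and above.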
932    
933           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
934    
# Line 689  COMPILING A PATTERN Line 951  COMPILING A PATTERN
951           PCRE_EXTENDED           PCRE_EXTENDED
952    
953         If  this  bit  is  set,  whitespace  data characters in the pattern are         If  this  bit  is  set,  whitespace  data characters in the pattern are
954         totally ignored except  when  escaped  or  inside  a  character  class.         totally ignored except when escaped or inside a character class. White-
955         Whitespace  does  not  include the VT character (code 11). In addition,         space does not include the VT character (code 11). In addition, charac-
956         characters between an unescaped # outside a  character  class  and  the         ters between an unescaped # outside a character class and the next new-
957         next newline character, inclusive, are also ignored. This is equivalent         line  character,  inclusive,  are  also  ignored. This is equivalent to
958         to Perl's /x option, and it can be changed within a pattern by  a  (?x)         Perl's /x option, and it can be changed within  a  pattern  by  a  (?x)
959         option setting.         option setting.
960    
961         This  option  makes  it possible to include comments inside complicated         This  option  makes  it possible to include comments inside complicated
# Line 713  COMPILING A PATTERN Line 975  COMPILING A PATTERN
975         literal.  There  are  at  present  no other features controlled by this         literal.  There  are  at  present  no other features controlled by this
976         option. It can also be set by a (?X) option setting within a pattern.         option. It can also be set by a (?X) option setting within a pattern.
977    
978             PCRE_FIRSTLINE
979    
980           If this option is set, an  unanchored  pattern  is  required  to  match
981           before  or at the first newline character in the subject string, though
982           the matched text may continue over the newline.
983    
984           PCRE_MULTILINE           PCRE_MULTILINE
985    
986         By default, PCRE treats the subject string as consisting  of  a  single         By default, PCRE treats the subject string as consisting  of  a  single
# Line 763  COMPILING A PATTERN Line 1031  COMPILING A PATTERN
1031         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of
1032         passing an invalid UTF-8 string as a pattern is undefined. It may cause         passing an invalid UTF-8 string as a pattern is undefined. It may cause
1033         your  program  to  crash.   Note that this option can also be passed to         your  program  to  crash.   Note that this option can also be passed to
1034         pcre_exec(),  to  suppress  the  UTF-8  validity  checking  of  subject         pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity  check-
1035         strings.         ing of subject strings.
1036    
1037    
1038    COMPILATION ERROR CODES
1039    
1040           The  following  table  lists  the  error  codes that may be returned by
1041           pcre_compile2(), along with the error messages that may be returned  by
1042           both compiling functions.
1043    
1044              0  no error
1045              1  \ at end of pattern
1046              2  \c at end of pattern
1047              3  unrecognized character follows \
1048              4  numbers out of order in {} quantifier
1049              5  number too big in {} quantifier
1050              6  missing terminating ] for character class
1051              7  invalid escape sequence in character class
1052              8  range out of order in character class
1053              9  nothing to repeat
1054             10  operand of unlimited repeat could match the empty string
1055             11  internal error: unexpected repeat
1056             12  unrecognized character after (?
1057             13  POSIX named classes are supported only within a class
1058             14  missing )
1059             15  reference to non-existent subpattern
1060             16  erroffset passed as NULL
1061             17  unknown option bit(s) set
1062             18  missing ) after comment
1063             19  parentheses nested too deeply
1064             20  regular expression too large
1065             21  failed to get memory
1066             22  unmatched parentheses
1067             23  internal error: code overflow
1068             24  unrecognized character after (?<
1069             25  lookbehind assertion is not fixed length
1070             26  malformed number after (?(
1071             27  conditional group contains more than two branches
1072             28  assertion expected after (?(
1073             29  (?R or (?digits must be followed by )
1074             30  unknown POSIX class name
1075             31  POSIX collating elements are not supported
1076             32  this version of PCRE is not compiled with PCRE_UTF8 support
1077             33  spare error
1078             34  character value in \x{...} sequence is too large
1079             35  invalid condition (?(0)
1080             36  \C not allowed in lookbehind assertion
1081             37  PCRE does not support \L, \l, \N, \U, or \u
1082             38  number after (?C is > 255
1083             39  closing ) for (?C expected
1084             40  recursive call could loop indefinitely
1085             41  unrecognized character after (?P
1086             42  syntax error after (?P
1087             43  two named groups have the same name
1088             44  invalid UTF-8 string
1089             45  support for \P, \p, and \X has not been compiled
1090             46  malformed \P or \p sequence
1091             47  unknown property name after \P or \p
1092    
1093    
1094  STUDYING A PATTERN  STUDYING A PATTERN
1095    
1096         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1097              const char **errptr);              const char **errptr);
1098    
1099         If  a  compiled  pattern is going to be used several times, it is worth         If  a  compiled  pattern is going to be used several times, it is worth
# Line 785  STUDYING A PATTERN Line 1109  STUDYING A PATTERN
1109         that  can  be  set  by the caller before the block is passed; these are         that  can  be  set  by the caller before the block is passed; these are
1110         described below in the section on matching a pattern.         described below in the section on matching a pattern.
1111    
1112         If studying the pattern does not produce  any  additional  information,         If studying the pattern does not  produce  any  additional  information,
1113         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study() returns NULL. In that circumstance, if the calling program
1114         wants to pass any of the other fields to pcre_exec(), it  must  set  up         wants to pass any of the other fields to pcre_exec(), it  must  set  up
1115         its own pcre_extra block.         its own pcre_extra block.
# Line 795  STUDYING A PATTERN Line 1119  STUDYING A PATTERN
1119    
1120         The third argument for pcre_study() is a pointer for an error  message.         The third argument for pcre_study() is a pointer for an error  message.
1121         If  studying  succeeds  (even  if no data is returned), the variable it         If  studying  succeeds  (even  if no data is returned), the variable it
1122         points to is set to NULL. Otherwise it points to a textual  error  mes-         points to is set to NULL. Otherwise it is set to  point  to  a  textual
1123         sage.  You should therefore test the error pointer for NULL after call-         error message. This is a static string that is part of the library. You
1124         ing pcre_study(), to be sure that it has run successfully.         must not try to free it. You should test the  error  pointer  for  NULL
1125           after calling pcre_study(), to be sure that it has run successfully.
1126    
1127         This is a typical call to pcre_study():         This is a typical call to pcre_study():
1128    
# Line 808  STUDYING A PATTERN Line 1133  STUDYING A PATTERN
1133             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1134    
1135         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1136         that  do not have a single fixed starting character. A bitmap of possi-         that do not have a single fixed starting character. A bitmap of  possi-
1137         ble starting bytes is created.         ble starting bytes is created.
1138    
1139    
1140  LOCALE SUPPORT  LOCALE SUPPORT
1141    
1142         PCRE handles caseless matching, and determines whether  characters  are         PCRE  handles  caseless matching, and determines whether characters are
1143         letters,  digits, or whatever, by reference to a set of tables, indexed         letters, digits, or whatever, by reference to a set of tables,  indexed
1144         by character value. (When running in UTF-8 mode, this applies  only  to         by  character  value.  When running in UTF-8 mode, this applies only to
1145         characters  with  codes  less than 128. Higher-valued codes never match         characters with codes less than 128. Higher-valued  codes  never  match
1146         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built
1147         with Unicode character property support.)         with Unicode character property support. The use of locales  with  Uni-
1148           code is discouraged.
1149    
1150         An  internal set of tables is created in the default C locale when PCRE         An  internal set of tables is created in the default C locale when PCRE
1151         is built. This is used when the final  argument  of  pcre_compile()  is         is built. This is used when the final  argument  of  pcre_compile()  is
# Line 905  INFORMATION ABOUT A PATTERN Line 1231  INFORMATION ABOUT A PATTERN
1231         Return  the  number of capturing subpatterns in the pattern. The fourth         Return  the  number of capturing subpatterns in the pattern. The fourth
1232         argument should point to an int variable.         argument should point to an int variable.
1233    
1234           PCRE_INFO_DEFAULTTABLES           PCRE_INFO_DEFAULTTABLES
1235    
1236         Return a pointer to the internal default character tables within  PCRE.         Return a pointer to the internal default character tables within  PCRE.
1237         The  fourth  argument should point to an unsigned char * variable. This         The  fourth  argument should point to an unsigned char * variable. This
# Line 1051  OBSOLETE INFO FUNCTION Line 1377  OBSOLETE INFO FUNCTION
1377         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1378    
1379    
1380  MATCHING A PATTERN  REFERENCE COUNTS
1381    
1382           int pcre_refcount(pcre *code, int adjust);
1383    
1384           The  pcre_refcount()  function is used to maintain a reference count in
1385           the data block that contains a compiled pattern. It is provided for the
1386           benefit  of  applications  that  operate  in an object-oriented manner,
1387           where different parts of the application may be using the same compiled
1388           pattern, but you want to free the block when they are all done.
1389    
1390           When a pattern is compiled, the reference count field is initialized to
1391           zero.  It is changed only by calling this function, whose action is  to
1392           add  the  adjust  value  (which may be positive or negative) to it. The
1393           yield of the function is the new value. However, the value of the count
1394           is  constrained to lie between 0 and 65535, inclusive. If the new value
1395           is outside these limits, it is forced to the appropriate limit value.
1396    
1397           Except when it is zero, the reference count is not correctly  preserved
1398           if  a  pattern  is  compiled on one host and then transferred to a host
1399           whose byte-order is different. (This seems a highly unlikely scenario.)
1400    
1401    
1402    MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1403    
1404         int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
1405              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
# Line 1060  MATCHING A PATTERN Line 1408  MATCHING A PATTERN
1408         The  function pcre_exec() is called to match a subject string against a         The  function pcre_exec() is called to match a subject string against a
1409         compiled pattern, which is passed in the code argument. If the  pattern         compiled pattern, which is passed in the code argument. If the  pattern
1410         has been studied, the result of the study should be passed in the extra         has been studied, the result of the study should be passed in the extra
1411         argument.         argument. This function is the main matching facility of  the  library,
1412           and it operates in a Perl-like manner. For specialist use there is also
1413           an alternative matching function, which is described below in the  sec-
1414           tion about the pcre_dfa_exec() function.
1415    
1416         In most applications, the pattern will have been compiled (and  option-         In  most applications, the pattern will have been compiled (and option-
1417         ally  studied)  in the same process that calls pcre_exec(). However, it         ally studied) in the same process that calls pcre_exec().  However,  it
1418         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1419         later  in  different processes, possibly even on different hosts. For a         later in different processes, possibly even on different hosts.  For  a
1420         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1421    
1422         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1080  MATCHING A PATTERN Line 1431  MATCHING A PATTERN
1431             0,              /* start at offset 0 in the subject */             0,              /* start at offset 0 in the subject */
1432             0,              /* default options */             0,              /* default options */
1433             ovector,        /* vector of integers for substring information */             ovector,        /* vector of integers for substring information */
1434             30);            /* number of elements in the vector  (NOT  size  in             30);            /* number of elements (NOT size in bytes) */
        bytes) */  
1435    
1436     Extra data for pcre_exec()     Extra data for pcre_exec()
1437    
1438         If  the  extra argument is not NULL, it must point to a pcre_extra data         If  the  extra argument is not NULL, it must point to a pcre_extra data
1439         block. The pcre_study() function returns such a block (when it  doesn't         block. The pcre_study() function returns such a block (when it  doesn't
1440         return  NULL), but you can also create one for yourself, and pass addi-         return  NULL), but you can also create one for yourself, and pass addi-
1441         tional information in it. The fields in a pcre_extra block are as  fol-         tional information in it. The pcre_extra block contains  the  following
1442         lows:         fields (not necessarily in this order):
1443    
1444           unsigned long int flags;           unsigned long int flags;
1445           void *study_data;           void *study_data;
1446           unsigned long int match_limit;           unsigned long int match_limit;
1447             unsigned long int match_limit_recursion;
1448           void *callout_data;           void *callout_data;
1449           const unsigned char *tables;           const unsigned char *tables;
1450    
# Line 1102  MATCHING A PATTERN Line 1453  MATCHING A PATTERN
1453    
1454           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1455           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1456             PCRE_EXTRA_MATCH_LIMIT_RECURSION
1457           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1458           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1459    
# Line 1118  MATCHING A PATTERN Line 1470  MATCHING A PATTERN
1470         repeats.         repeats.
1471    
1472         Internally, PCRE uses a function called match() which it calls  repeat-         Internally, PCRE uses a function called match() which it calls  repeat-
1473         edly  (sometimes  recursively).  The  limit is imposed on the number of         edly  (sometimes  recursively). The limit set by match_limit is imposed
1474         times this function is called during a match, which has the  effect  of         on the number of times this function is called during  a  match,  which
1475         limiting  the amount of recursion and backtracking that can take place.         has  the  effect  of  limiting the amount of backtracking that can take
1476         For patterns that are not anchored, the count starts from zero for each         place. For patterns that are not anchored, the count restarts from zero
1477         position in the subject string.         for each position in the subject string.
1478    
1479         The  default  limit  for the library can be set when PCRE is built; the         The  default  value  for  the  limit can be set when PCRE is built; the
1480         default default is 10 million, which handles all but the  most  extreme         default default is 10 million, which handles all but the  most  extreme
1481         cases.  You  can  reduce  the  default  by supplying pcre_exec() with a         cases.  You  can  override the default by supplying  pcre_exec() with a
1482         pcre_extra block in which match_limit is set to a  smaller  value,  and         pcre_extra    block    in    which    match_limit    is    set,     and
1483         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
1484         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1485    
1486           The match_limit_recursion field is similar to match_limit, but  instead
1487           of limiting the total number of times that match() is called, it limits
1488           the depth of recursion. The recursion depth is a  smaller  number  than
1489           the  total number of calls, because not all calls to match() are recur-
1490           sive.  This limit is of use only if it is set smaller than match_limit.
1491    
1492           Limiting  the  recursion  depth  limits the amount of stack that can be
1493           used, or, when PCRE has been compiled to use memory on the heap instead
1494           of the stack, the amount of heap memory that can be used.
1495    
1496           The  default  value  for  match_limit_recursion can be set when PCRE is
1497           built; the default default  is  the  same  value  as  the  default  for
1498           match_limit. You can override the default by supplying pcre_exec() with
1499           a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and
1500           PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the
1501           limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1502    
1503         The pcre_callout field is used in conjunction with the  "callout"  fea-         The pcre_callout field is used in conjunction with the  "callout"  fea-
1504         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1505    
# Line 1163  MATCHING A PATTERN Line 1532  MATCHING A PATTERN
1532         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
1533         the beginning of a line, so the  circumflex  metacharacter  should  not         the beginning of a line, so the  circumflex  metacharacter  should  not
1534         match  before it. Setting this without PCRE_MULTILINE (at compile time)         match  before it. Setting this without PCRE_MULTILINE (at compile time)
1535         causes  circumflex  never  to  match.  This  option  affects  only  the         causes circumflex never to match. This option affects only  the  behav-
1536         behaviour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
1537    
1538           PCRE_NOTEOL           PCRE_NOTEOL
1539    
# Line 1373  MATCHING A PATTERN Line 1742  MATCHING A PATTERN
1742    
1743           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
1744    
1745         The recursion and backtracking limit, as specified by  the  match_limit         The backtracking limit, as specified by  the  match_limit  field  in  a
1746           pcre_extra  structure  (or  defaulted) was reached. See the description
1747           above.
1748    
1749             PCRE_ERROR_RECURSIONLIMIT (-21)
1750    
1751           The internal recursion limit, as specified by the match_limit_recursion
1752         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field  in  a  pcre_extra  structure (or defaulted) was reached. See the
1753         description above.         description above.
1754    
# Line 1394  MATCHING A PATTERN Line 1769  MATCHING A PATTERN
1769         value of startoffset did not point to the beginning of a UTF-8  charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
1770         ter.         ter.
1771    
1772           PCRE_ERROR_PARTIAL (-12)           PCRE_ERROR_PARTIAL        (-12)
1773    
1774         The  subject  string did not match, but it did match partially. See the         The  subject  string did not match, but it did match partially. See the
1775         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
1776    
1777           PCRE_ERROR_BADPARTIAL (-13)           PCRE_ERROR_BADPARTIAL     (-13)
1778    
1779         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing
1780         items  that are not supported for partial matching. See the pcrepartial         items  that are not supported for partial matching. See the pcrepartial
1781         documentation for details of partial matching.         documentation for details of partial matching.
1782    
1783           PCRE_ERROR_INTERNAL (-14)           PCRE_ERROR_INTERNAL       (-14)
1784    
1785         An unexpected internal error has occurred. This error could  be  caused         An unexpected internal error has occurred. This error could  be  caused
1786         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
1787    
1788           PCRE_ERROR_BADCOUNT (-15)           PCRE_ERROR_BADCOUNT       (-15)
1789    
1790         This  error is given if the value of the ovecsize argument is negative.         This  error is given if the value of the ovecsize argument is negative.
1791    
# Line 1492  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1867  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1867         pcre_free,  which  of course could be called directly from a C program.         pcre_free,  which  of course could be called directly from a C program.
1868         However, PCRE is used in some situations where it is linked via a  spe-         However, PCRE is used in some situations where it is linked via a  spe-
1869         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  which  cannot  use
1870         pcre_free directly; it is  for  these  cases  that  the  functions  are         pcre_free directly; it is for these cases that the functions  are  pro-
1871         provided.         vided.
1872    
1873    
1874  EXTRACTING CAPTURED SUBSTRINGS BY NAME  EXTRACTING CAPTURED SUBSTRINGS BY NAME
# Line 1514  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 1889  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1889         To extract a substring by name, you first have to find  the  associated         To extract a substring by name, you first have to find  the  associated
1890         number. For example, for this pattern         number. For example, for this pattern
1891    
1892           (a+)b(?<xxx>\d+)...           (a+)b(?P<xxx>\d+)...
1893    
1894         the number of the subpattern called "xxx" is 2. You can find the number         the number of the subpattern called "xxx" is 2. You can find the number
1895         from the name by calling pcre_get_stringnumber(). The first argument is         from the name by calling pcre_get_stringnumber(). The first argument is
# Line 1541  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 1916  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1916         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
1917         ate.         ate.
1918    
 Last updated: 09 September 2004  
 Copyright (c) 1997-2004 University of Cambridge.  
 -----------------------------------------------------------------------------  
1919    
1920  PCRE(3)                                                                PCRE(3)  FINDING ALL POSSIBLE MATCHES
1921    
1922           The  traditional  matching  function  uses a similar algorithm to Perl,
1923           which stops when it finds the first match, starting at a given point in
1924           the  subject.  If you want to find all possible matches, or the longest
1925           possible match, consider using the alternative matching  function  (see
1926           below)  instead.  If you cannot use the alternative function, but still
1927           need to find all possible matches, you can kludge it up by  making  use
1928           of the callout facility, which is described in the pcrecallout documen-
1929           tation.
1930    
1931           What you have to do is to insert a callout right at the end of the pat-
1932           tern.   When your callout function is called, extract and save the cur-
1933           rent matched substring. Then return  1,  which  forces  pcre_exec()  to
1934           backtrack  and  try other alternatives. Ultimately, when it runs out of
1935           matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
1936    
1937    
1938    MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
1939    
1940           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1941                const char *subject, int length, int startoffset,
1942                int options, int *ovector, int ovecsize,
1943                int *workspace, int wscount);
1944    
1945           The function pcre_dfa_exec()  is  called  to  match  a  subject  string
1946           against  a compiled pattern, using a "DFA" matching algorithm. This has
1947           different characteristics to the normal algorithm, and is not  compati-
1948           ble with Perl. Some of the features of PCRE patterns are not supported.
1949           Nevertheless, there are times when this kind of matching can be useful.
1950           For  a  discussion of the two matching algorithms, see the pcrematching
1951           documentation.
1952    
1953           The arguments for the pcre_dfa_exec() function  are  the  same  as  for
1954           pcre_exec(), plus two extras. The ovector argument is used in a differ-
1955           ent way, and this is described below. The other  common  arguments  are
1956           used  in  the  same way as for pcre_exec(), so their description is not
1957           repeated here.
1958    
1959           The two additional arguments provide workspace for  the  function.  The
1960           workspace  vector  should  contain at least 20 elements. It is used for
1961           keeping  track  of  multiple  paths  through  the  pattern  tree.  More
1962           workspace  will  be  needed for patterns and subjects where there are a
1963           lot of possible matches.
1964    
1965           Here is an example of a simple call to pcre_dfa_exec():
1966    
1967             int rc;
1968             int ovector[10];
1969             int wspace[20];
1970             rc = pcre_dfa_exec(
1971               re,             /* result of pcre_compile() */
1972               NULL,           /* we didn't study the pattern */
1973               "some string",  /* the subject string */
1974               11,             /* the length of the subject string */
1975               0,              /* start at offset 0 in the subject */
1976               0,              /* default options */
1977               ovector,        /* vector of integers for substring information */
1978               10,             /* number of elements (NOT size in bytes) */
1979               wspace,         /* working space vector */
1980               20);            /* number of elements (NOT size in bytes) */
1981    
1982       Option bits for pcre_dfa_exec()
1983    
1984           The unused bits of the options argument  for  pcre_dfa_exec()  must  be
1985           zero.  The  only  bits  that may be set are PCRE_ANCHORED, PCRE_NOTBOL,
1986           PCRE_NOTEOL,    PCRE_NOTEMPTY,    PCRE_NO_UTF8_CHECK,     PCRE_PARTIAL,
1987           PCRE_DFA_SHORTEST,  and  PCRE_DFA_RESTART.  All  but  the last three of
1988           these are the same as for pcre_exec(),  so  their  description  is  not
1989           repeated here.
1990    
1991             PCRE_PARTIAL
1992    
1993           This  has  the  same general effect as it does for pcre_exec(), but the
1994           details  are  slightly  different.  When  PCRE_PARTIAL   is   set   for
1995           pcre_dfa_exec(),  the  return code PCRE_ERROR_NOMATCH is converted into
1996           PCRE_ERROR_PARTIAL if the end of the subject  is  reached,  there  have
1997           been no complete matches, but there is still at least one matching pos-
1998           sibility. The portion of the string that provided the partial match  is
1999           set as the first matching string.
2000    
2001             PCRE_DFA_SHORTEST
2002    
2003           Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
2004           stop as soon as it has found one match. Because  of  the  way  the  DFA
2005           algorithm works, this is necessarily the shortest possible match at the
2006           first possible matching point in the subject string.
2007    
2008             PCRE_DFA_RESTART
2009    
2010           When pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option,  and
2011           returns  a  partial  match, it is possible to call it again, with addi-
2012           tional subject characters, and have it continue with  the  same  match.
2013           The  PCRE_DFA_RESTART  option requests this action; when it is set, the
2014           workspace and wscount options must reference the same vector as  before
2015           because  data  about  the  match so far is left in them after a partial
2016           match. There is more discussion of this  facility  in  the  pcrepartial
2017           documentation.
2018    
2019       Successful returns from pcre_dfa_exec()
2020    
2021           When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
2022           string in the subject. Note, however, that all the matches from one run
2023           of  the  function  start  at the same point in the subject. The shorter
2024           matches are all initial substrings of the longer matches. For  example,
2025           if the pattern
2026    
2027             <.*>
2028    
2029           is matched against the string
2030    
2031             This is <something> <something else> <something further> no more
2032    
2033           the three matched strings are
2034    
2035             <something>
2036             <something> <something else>
2037             <something> <something else> <something further>
2038    
2039           On  success,  the  yield of the function is a number greater than zero,
2040           which is the number of matched substrings.  The  substrings  themselves
2041           are  returned  in  ovector. Each string uses two elements; the first is
2042           the offset to the start, and the second is the offset to the  end.  All
2043           the strings have the same start offset. (Space could have been saved by
2044           giving this only once, but it was decided to retain some  compatibility
2045           with  the  way pcre_exec() returns data, even though the meaning of the
2046           strings is different.)
2047    
2048           The strings are returned in reverse order of length; that is, the long-
2049           est  matching  string is given first. If there were too many matches to
2050           fit into ovector, the yield of the function is zero, and the vector  is
2051           filled with the longest matches.
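
The layout of ovector described above can be unpacked with a few lines of C. What follows is a minimal sketch, not part of PCRE: the helper names dfa_match_count() and dfa_copy_match() are hypothetical, and the code works only on an rc value and an ovector that pcre_dfa_exec() is assumed to have filled in already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: given the yield rc of
   pcre_dfa_exec() and the size of the ovector passed to it, report how
   many (start, end) pairs hold valid data. A yield of zero means there
   were too many matches, so every complete pair holds one of the
   longest matches. */
static int dfa_match_count(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0) return rc;             /* number of matched substrings */
    if (rc == 0) return ovecsize / 2;  /* vector full: longest matches */
    return 0;                          /* negative: failure */
}

/* Copy match number n (0 is the longest) into buf as a C string.
   Returns the match length, or -1 if buf is too small. */
static int dfa_copy_match(const char *subject, const int *ovector, int n,
                          char *buf, size_t bufsize)
{
    int start = ovector[2 * n];        /* same for every match */
    int end = ovector[2 * n + 1];
    size_t len = (size_t)(end - start);

    if (len + 1 > bufsize) return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

For the <.*> example, the three pairs would be (8,56), (8,36) and (8,19), so dfa_copy_match() with n set to 2 yields <something>.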
2052    
2053       Error returns from pcre_dfa_exec()
2054    
2055           The  pcre_dfa_exec()  function returns a negative number when it fails.
2056           Many of the errors are the same  as  for  pcre_exec(),  and  these  are
2057           described  above.   There are in addition the following errors that are
2058           specific to pcre_dfa_exec():
2059    
2060             PCRE_ERROR_DFA_UITEM      (-16)
2061    
2062           This return is given if pcre_dfa_exec() encounters an item in the  pat-
2063           tern  that  it  does not support, for instance, the use of \C or a back
2064           reference.
2065    
2066             PCRE_ERROR_DFA_UCOND      (-17)
2067    
2068           This return is given if pcre_dfa_exec() encounters a condition item  in
2069           a  pattern  that  uses  a back reference for the condition. This is not
2070           supported.
2071    
2072             PCRE_ERROR_DFA_UMLIMIT    (-18)
2073    
2074           This return is given if pcre_dfa_exec() is called with an  extra  block
2075           that contains a setting of the match_limit field. This is not supported
2076           (it is meaningless).
2077    
2078             PCRE_ERROR_DFA_WSSIZE     (-19)
2079    
2080           This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
2081           workspace vector.
2082    
2083             PCRE_ERROR_DFA_RECURSE    (-20)
2084    
2085           When  a  recursive subpattern is processed, the matching function calls
2086           itself recursively, using private vectors for  ovector  and  workspace.
2087           This  error  is  given  if  the output vector is not large enough. This
2088           should be extremely rare, as a vector of size 1000 is used.
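
A caller may want to turn the five codes listed above into readable messages. The following is a sketch under stated assumptions: dfa_error_text() is a hypothetical helper, the message wording is invented for illustration, and the #ifndef guard only supplies the values when pcre.h has not been included.

```c
#include <stddef.h>

/* Values as listed above; the guard lets pcre.h supply them instead. */
#ifndef PCRE_ERROR_DFA_UITEM
#define PCRE_ERROR_DFA_UITEM    (-16)
#define PCRE_ERROR_DFA_UCOND    (-17)
#define PCRE_ERROR_DFA_UMLIMIT  (-18)
#define PCRE_ERROR_DFA_WSSIZE   (-19)
#define PCRE_ERROR_DFA_RECURSE  (-20)
#endif

/* Hypothetical helper: describe an error specific to pcre_dfa_exec().
   Returns NULL for any other value, including success counts and the
   errors shared with pcre_exec(). */
static const char *dfa_error_text(int rc)
{
    switch (rc) {
    case PCRE_ERROR_DFA_UITEM:   return "unsupported item in pattern";
    case PCRE_ERROR_DFA_UCOND:   return "unsupported back reference condition";
    case PCRE_ERROR_DFA_UMLIMIT: return "match_limit is not supported";
    case PCRE_ERROR_DFA_WSSIZE:  return "workspace vector too small";
    case PCRE_ERROR_DFA_RECURSE: return "recursion output vector too small";
    default:                     return NULL;
    }
}
```

Only PCRE_ERROR_DFA_WSSIZE is likely to be cured by retrying with a larger workspace; the others reflect the pattern or the extra block.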
2089    
2090    Last updated: 18 January 2006
2091    Copyright (c) 1997-2006 University of Cambridge.
2092    ------------------------------------------------------------------------------
2093    
2094    
2095    PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2096    
2097    
2098  NAME  NAME
2099         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2100    
2101    
2102  PCRE CALLOUTS  PCRE CALLOUTS
2103    
2104         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
# Line 1606  MISSING CALLOUTS Line 2153  MISSING CALLOUTS
2153  THE CALLOUT INTERFACE  THE CALLOUT INTERFACE
2154    
2155         During matching, when PCRE reaches a callout point, the external  func-         During matching, when PCRE reaches a callout point, the external  func-
2156         tion  defined  by pcre_callout is called (if it is set). The only argu-         tion  defined by pcre_callout is called (if it is set). This applies to
2157         ment is a pointer to a pcre_callout block. This structure contains  the         both the pcre_exec() and the pcre_dfa_exec()  matching  functions.  The
2158         following fields:         only  argument  to  the callout function is a pointer to a pcre_callout
2159           block. This structure contains the following fields:
2160    
2161           int          version;           int          version;
2162           int          callout_number;           int          callout_number;
# Line 1623  THE CALLOUT INTERFACE Line 2171  THE CALLOUT INTERFACE
2171           int          pattern_position;           int          pattern_position;
2172           int          next_item_length;           int          next_item_length;
2173    
2174         The  version  field  is an integer containing the version number of the         The version field is an integer containing the version  number  of  the
2175         block format. The initial version was 0; the current version is 1.  The         block  format. The initial version was 0; the current version is 1. The
2176         version  number  will  change  again in future if additional fields are         version number will change again in future  if  additional  fields  are
2177         added, but the intention is never to remove any of the existing fields.         added, but the intention is never to remove any of the existing fields.
2178    
2179         The  callout_number  field  contains the number of the callout, as com-         The callout_number field contains the number of the  callout,  as  com-
2180         piled into the pattern (that is, the number after ?C for  manual  call-         piled  into  the pattern (that is, the number after ?C for manual call-
2181         outs, and 255 for automatically generated callouts).         outs, and 255 for automatically generated callouts).
2182    
2183         The  offset_vector field is a pointer to the vector of offsets that was         The offset_vector field is a pointer to the vector of offsets that  was
2184         passed by the caller to pcre_exec(). The contents can be  inspected  in         passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When
2185         order  to extract substrings that have been matched so far, in the same         pcre_exec() is used, the contents can be inspected in order to  extract
2186         way as for extracting substrings after a match has completed.         substrings  that  have  been  matched  so  far,  in the same way as for
2187           extracting substrings after a match has completed. For  pcre_dfa_exec()
2188           this field is not useful.
2189    
2190         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
2191         were passed to pcre_exec().         were passed to pcre_exec().
2192    
2193         The  start_match  field contains the offset within the subject at which         The start_match field contains the offset within the subject  at  which
2194         the current match attempt started. If the pattern is not anchored,  the         the  current match attempt started. If the pattern is not anchored, the
2195         callout function may be called several times from the same point in the         callout function may be called several times from the same point in the
2196         pattern for different starting points in the subject.         pattern for different starting points in the subject.
2197    
2198         The current_position field contains the offset within  the  subject  of         The  current_position  field  contains the offset within the subject of
2199         the current match pointer.         the current match pointer.
2200    
2201         The  capture_top field contains one more than the number of the highest         When the pcre_exec() function is used, the capture_top  field  contains
2202         numbered captured substring so far. If no  substrings  have  been  cap-         one  more than the number of the highest numbered captured substring so
2203         tured, the value of capture_top is one.         far. If no substrings have been captured, the value of  capture_top  is
2204           one.  This  is always the case when pcre_dfa_exec() is used, because it
2205         The  capture_last  field  contains the number of the most recently cap-         does not support captured substrings.
2206         tured substring. If no substrings have been captured, its value is  -1.  
2207           The capture_last field contains the number of the  most  recently  cap-
2208         The  callout_data  field contains a value that is passed to pcre_exec()         tured  substring. If no substrings have been captured, its value is -1.
2209         by the caller specifically so that it can be passed back  in  callouts.         This is always the case when pcre_dfa_exec() is used.
2210         It  is  passed  in the callout_data field of the pcre_extra data struc-  
2211         ture. If no such data was  passed,  the  value  of  callout_data  in  a         The callout_data field contains a value that is passed  to  pcre_exec()
2212         pcre_callout  block  is  NULL. There is a description of the pcre_extra         or  pcre_dfa_exec() specifically so that it can be passed back in call-
2213           outs. It is passed in the callout_data field  of  the  pcre_extra  data
2214           structure.  If  no such data was passed, the value of callout_data in a
2215           pcre_callout block is NULL. There is a description  of  the  pcre_extra
2216         structure in the pcreapi documentation.         structure in the pcreapi documentation.
2217    
2218         The pattern_position field is present from version 1 of the  pcre_call-         The  pattern_position field is present from version 1 of the pcre_call-
2219         out structure. It contains the offset to the next item to be matched in         out structure. It contains the offset to the next item to be matched in
2220         the pattern string.         the pattern string.
2221    
2222         The next_item_length field is present from version 1 of the  pcre_call-         The  next_item_length field is present from version 1 of the pcre_call-
2223         out structure. It contains the length of the next item to be matched in         out structure. It contains the length of the next item to be matched in
2224         the pattern string. When the callout immediately precedes  an  alterna-         the  pattern  string. When the callout immediately precedes an alterna-
2225         tion  bar, a closing parenthesis, or the end of the pattern, the length         tion bar, a closing parenthesis, or the end of the pattern, the  length
2226         is zero. When the callout precedes an opening parenthesis,  the  length         is  zero.  When the callout precedes an opening parenthesis, the length
2227         is that of the entire subpattern.         is that of the entire subpattern.
2228    
2229         The  pattern_position  and next_item_length fields are intended to help         The pattern_position and next_item_length fields are intended  to  help
2230         in distinguishing between different automatic callouts, which all  have         in  distinguishing between different automatic callouts, which all have
2231         the same callout number. However, they are set for all callouts.         the same callout number. However, they are set for all callouts.
2232    
2233    
2234  RETURN VALUES  RETURN VALUES
2235    
2236         The  external callout function returns an integer to PCRE. If the value         The external callout function returns an integer to PCRE. If the  value
2237         is zero, matching proceeds as normal. If  the  value  is  greater  than         is  zero,  matching  proceeds  as  normal. If the value is greater than
2238         zero,  matching  fails  at  the current point, but backtracking to test         zero, matching fails at the current point, but  the  testing  of  other
2239         other matching possibilities goes ahead, just as if a lookahead  asser-         matching possibilities goes ahead, just as if a lookahead assertion had
2240         tion  had  failed.  If  the value is less than zero, the match is aban-         failed. If the value is less than zero, the  match  is  abandoned,  and
2241         doned, and pcre_exec() returns the negative value.         pcre_exec() (or pcre_dfa_exec()) returns the negative value.
2242    
2243         Negative  values  should  normally  be   chosen   from   the   set   of         Negative   values   should   normally   be   chosen  from  the  set  of
2244         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
2245         dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is         dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
2246         reserved  for  use  by callout functions; it will never be used by PCRE         reserved for use by callout functions; it will never be  used  by  PCRE
2247         itself.         itself.
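
The three kinds of return value can be illustrated with a small callout function. This is a sketch only: demo_callout_block is a cut-down stand-in for pcre_callout_block holding just the fields that are read here, demo_limits and demo_callout() are hypothetical names, and the value -9 for PCRE_ERROR_CALLOUT is assumed from pcre.h. Real code would include pcre.h and assign its function to pcre_callout.

```c
#include <stddef.h>

#ifndef PCRE_ERROR_CALLOUT
#define PCRE_ERROR_CALLOUT (-9)  /* assumed value; pcre.h is authoritative */
#endif

/* Stand-in for pcre_callout_block with only the fields used below. */
typedef struct demo_callout_block {
    int   callout_number;
    int   start_match;
    int   current_position;
    void *callout_data;
} demo_callout_block;

/* Hypothetical per-match state, handed over through callout_data. */
typedef struct demo_limits {
    int calls_left;  /* abandon the match once this reaches zero */
    int max_span;    /* reject paths that have consumed more than this */
} demo_limits;

static int demo_callout(demo_callout_block *cb)
{
    demo_limits *lim = (demo_limits *)cb->callout_data;

    if (lim == NULL) return 0;        /* no data passed: proceed as normal */
    if (lim->calls_left <= 0)
        return PCRE_ERROR_CALLOUT;    /* negative: abandon the whole match */
    lim->calls_left--;
    if (cb->current_position - cb->start_match > lim->max_span)
        return 1;                     /* positive: fail at this point only */
    return 0;                         /* zero: matching proceeds */
}
```

Keeping the state in a structure reached through callout_data, rather than in globals, lets several matches run with independent limits.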
2248    
2249  Last updated: 09 September 2004  Last updated: 28 February 2005
2250  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2005 University of Cambridge.
2251  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
2252    
 PCRE(3)                                                                PCRE(3)  
2253    
2254    PCRECOMPAT(3)                                                    PCRECOMPAT(3)
2255    
2256    
2257  NAME  NAME
2258         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2259    
2260    
2261  DIFFERENCES BETWEEN PCRE AND PERL  DIFFERENCES BETWEEN PCRE AND PERL
2262    
2263         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2264         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
2265         respect to Perl 5.8.         respect to Perl 5.8.
2266    
2267         1.  PCRE does not have full UTF-8 support. Details of what it does have         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
2268         are given in the section on UTF-8 support in the main pcre page.         of what it does have are given in the section on UTF-8 support  in  the
2269           main pcre page.
2270    
2271         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2272         permits  them,  but they do not mean what you might think. For example,         permits them, but they do not mean what you might think.  For  example,
2273         (?!a){3} does not assert that the next three characters are not "a". It         (?!a){3} does not assert that the next three characters are not "a". It
2274         just asserts that the next character is not "a" three times.         just asserts that the next character is not "a" three times.
2275    
2276         3.  Capturing  subpatterns  that occur inside negative lookahead asser-         3. Capturing subpatterns that occur inside  negative  lookahead  asser-
2277         tions are counted, but their entries in the offsets  vector  are  never         tions  are  counted,  but their entries in the offsets vector are never
2278         set.  Perl sets its numerical variables from any such patterns that are         set. Perl sets its numerical variables from any such patterns that  are
2279         matched before the assertion fails to match something (thereby succeed-         matched before the assertion fails to match something (thereby succeed-
2280         ing),  but  only  if the negative lookahead assertion contains just one         ing), but only if the negative lookahead assertion  contains  just  one
2281         branch.         branch.
2282    
2283         4. Though binary zero characters are supported in the  subject  string,         4.  Though  binary zero characters are supported in the subject string,
2284         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
2285         mal C string, terminated by zero. The escape sequence \0 can be used in         mal C string, terminated by zero. The escape sequence \0 can be used in
2286         the pattern to represent a binary zero.         the pattern to represent a binary zero.
2287    
2288         5.  The  following Perl escape sequences are not supported: \l, \u, \L,         5. The following Perl escape sequences are not supported: \l,  \u,  \L,
2289         \U, and \N. In fact these are implemented by Perl's general string-han-         \U, and \N. In fact these are implemented by Perl's general string-han-
2290         dling  and are not part of its pattern matching engine. If any of these         dling and are not part of its pattern matching engine. If any of  these
2291         are encountered by PCRE, an error is generated.         are encountered by PCRE, an error is generated.
2292    
2293         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE         6.  The Perl escape sequences \p, \P, and \X are supported only if PCRE
2294         is  built  with Unicode character property support. The properties that         is built with Unicode character property support. The  properties  that
2295         can be tested with \p and \P are limited to the general category  prop-         can  be tested with \p and \P are limited to the general category prop-
2296         erties such as Lu and Nd.         erties such as Lu and Nd, script names such as Greek or  Han,  and  the
2297           derived properties Any and L&.
2298    
2299         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2300         ters in between are treated as literals.  This  is  slightly  different         ters in between are treated as literals.  This  is  slightly  different
# Line 1808  DIFFERENCES BETWEEN PCRE AND PERL Line 2364  DIFFERENCES BETWEEN PCRE AND PERL
2364         (m) Patterns compiled by PCRE can be saved and re-used at a later time,         (m) Patterns compiled by PCRE can be saved and re-used at a later time,
2365         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
2366    
2367  Last updated: 09 September 2004         (n)  The  alternative  matching function (pcre_dfa_exec()) matches in a
2368  Copyright (c) 1997-2004 University of Cambridge.         different way and is not Perl-compatible.
2369  -----------------------------------------------------------------------------  
2370    Last updated: 24 January 2006
2371    Copyright (c) 1997-2006 University of Cambridge.
2372    ------------------------------------------------------------------------------
2373    
 PCRE(3)                                                                PCRE(3)  
2374    
2375    PCREPATTERN(3)                                                  PCREPATTERN(3)
2376    
2377    
2378  NAME  NAME
2379         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2380    
2381    
2382  PCRE REGULAR EXPRESSION DETAILS  PCRE REGULAR EXPRESSION DETAILS
2383    
2384         The  syntax  and semantics of the regular expressions supported by PCRE         The  syntax  and semantics of the regular expressions supported by PCRE
# Line 1836  PCRE REGULAR EXPRESSION DETAILS Line 2396  PCRE REGULAR EXPRESSION DETAILS
2396         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
2397         page.         page.
2398    
2399           The remainder of this document discusses the  patterns  that  are  sup-
2400           ported  by  PCRE when its main matching function, pcre_exec(), is used.
2401           From  release  6.0,   PCRE   offers   a   second   matching   function,
2402           pcre_dfa_exec(),  which matches using a different algorithm that is not
2403           Perl-compatible. The advantages and disadvantages  of  the  alternative
2404           function, and how it differs from the normal function, are discussed in
2405           the pcrematching page.
2406    
2407         A regular expression is a pattern that is  matched  against  a  subject         A regular expression is a pattern that is  matched  against  a  subject
2408         string  from  left  to right. Most characters stand for themselves in a         string  from  left  to right. Most characters stand for themselves in a
2409         pattern, and match the corresponding characters in the  subject.  As  a         pattern, and match the corresponding characters in the  subject.  As  a
# Line 1843  PCRE REGULAR EXPRESSION DETAILS Line 2411  PCRE REGULAR EXPRESSION DETAILS
2411    
2412           The quick brown fox           The quick brown fox
2413    
2414         matches  a portion of a subject string that is identical to itself. The         matches a portion of a subject string that is identical to itself. When
2415         power of regular expressions comes from the ability to include alterna-         caseless matching is specified (the PCRE_CASELESS option), letters  are
2416         tives  and repetitions in the pattern. These are encoded in the pattern         matched  independently  of case. In UTF-8 mode, PCRE always understands
2417         by the use of metacharacters, which do not  stand  for  themselves  but         the concept of case for characters whose values are less than  128,  so
2418         instead are interpreted in some special way.         caseless  matching  is always possible. For characters with higher val-
2419           ues, the concept of case is supported if PCRE is compiled with  Unicode
2420         There  are  two different sets of metacharacters: those that are recog-         property  support,  but  not  otherwise.   If  you want to use caseless
2421         nized anywhere in the pattern except within square brackets, and  those         matching for characters 128 and above, you must  ensure  that  PCRE  is
2422         that  are  recognized  in square brackets. Outside square brackets, the         compiled with Unicode property support as well as with UTF-8 support.
2423    
2424           The  power  of  regular  expressions  comes from the ability to include
2425           alternatives and repetitions in the pattern. These are encoded  in  the
2426           pattern by the use of metacharacters, which do not stand for themselves
2427           but instead are interpreted in some special way.
2428    
2429           There are two different sets of metacharacters: those that  are  recog-
2430           nized  anywhere in the pattern except within square brackets, and those
2431           that are recognized in square brackets. Outside  square  brackets,  the
2432         metacharacters are as follows:         metacharacters are as follows:
2433    
2434           \      general escape character with several uses           \      general escape character with several uses
# Line 1870  PCRE REGULAR EXPRESSION DETAILS Line 2447  PCRE REGULAR EXPRESSION DETAILS
2447                  also "possessive quantifier"                  also "possessive quantifier"
2448           {      start min/max quantifier           {      start min/max quantifier
2449    
2450         Part of a pattern that is in square brackets  is  called  a  "character         Part  of  a  pattern  that is in square brackets is called a "character
2451         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
2452    
2453           \      general escape character           \      general escape character
# Line 1880  PCRE REGULAR EXPRESSION DETAILS Line 2457  PCRE REGULAR EXPRESSION DETAILS
2457                    syntax)                    syntax)
2458           ]      terminates the character class           ]      terminates the character class
2459    
2460         The  following sections describe the use of each of the metacharacters.         The following sections describe the use of each of the  metacharacters.
2461    
2462    
2463  BACKSLASH  BACKSLASH
2464    
2465         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2466         a  non-alphanumeric  character,  it takes away any special meaning that         a non-alphanumeric character, it takes away any  special  meaning  that
2467         character may have. This  use  of  backslash  as  an  escape  character         character  may  have.  This  use  of  backslash  as an escape character
2468         applies both inside and outside character classes.         applies both inside and outside character classes.
2469    
2470         For  example,  if  you want to match a * character, you write \* in the         For example, if you want to match a * character, you write  \*  in  the
2471         pattern.  This escaping action applies whether  or  not  the  following         pattern.   This  escaping  action  applies whether or not the following
2472         character  would  otherwise be interpreted as a metacharacter, so it is         character would otherwise be interpreted as a metacharacter, so  it  is
2473         always safe to precede a non-alphanumeric  with  backslash  to  specify         always  safe  to  precede  a non-alphanumeric with backslash to specify
2474         that  it stands for itself. In particular, if you want to match a back-         that it stands for itself. In particular, if you want to match a  back-
2475         slash, you write \\.         slash, you write \\.
2476    
2477         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
2478         the  pattern (other than in a character class) and characters between a         the pattern (other than in a character class) and characters between  a
2479         # outside a character class and the next newline character are ignored.         # outside a character class and the next newline character are ignored.
2480         An  escaping backslash can be used to include a whitespace or # charac-         An escaping backslash can be used to include a whitespace or #  charac-
2481         ter as part of the pattern.         ter as part of the pattern.
2482    
2483         If you want to remove the special meaning from a  sequence  of  charac-         If  you  want  to remove the special meaning from a sequence of charac-
2484         ters,  you can do so by putting them between \Q and \E. This is differ-         ters, you can do so by putting them between \Q and \E. This is  differ-
2485         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
2486         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
2487         tion. Note the following examples:         tion. Note the following examples:
2488    
2489           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 1916  BACKSLASH Line 2493  BACKSLASH
2493           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
2494           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
2495    
2496         The \Q...\E sequence is recognized both inside  and  outside  character         The  \Q...\E  sequence  is recognized both inside and outside character
2497         classes.         classes.
2498    
2499     Non-printing characters     Non-printing characters
2500    
2501         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
2502         acters in patterns in a visible manner. There is no restriction on  the         acters  in patterns in a visible manner. There is no restriction on the
2503         appearance  of non-printing characters, apart from the binary zero that         appearance of non-printing characters, apart from the binary zero  that
2504         terminates a pattern, but when a pattern  is  being  prepared  by  text         terminates  a  pattern,  but  when  a pattern is being prepared by text
2505         editing,  it  is  usually  easier  to  use  one of the following escape         editing, it is usually easier  to  use  one  of  the  following  escape
2506         sequences than the binary character it represents:         sequences than the binary character it represents:
2507    
2508           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 1937  BACKSLASH Line 2514  BACKSLASH
2514           \t        tab (hex 09)           \t        tab (hex 09)
2515           \ddd      character with octal code ddd, or backreference           \ddd      character with octal code ddd, or backreference
2516           \xhh      character with hex code hh           \xhh      character with hex code hh
2517           \x{hhh..} character with hex code hhh... (UTF-8 mode only)           \x{hhh..} character with hex code hhh..
2518    
2519         The precise effect of \cx is as follows: if x is a lower  case  letter,         The  precise  effect of \cx is as follows: if x is a lower case letter,
2520         it  is converted to upper case. Then bit 6 of the character (hex 40) is         it is converted to upper case. Then bit 6 of the character (hex 40)  is
2521         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
2522         becomes hex 7B.         becomes hex 7B.
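
The \cx rule can be written out directly. The function below is a minimal sketch with a hypothetical name, and it assumes an ASCII-compatible character set.

```c
/* Compute the character that \cx stands for: a lower case letter is
   first converted to upper case, then bit 6 (hex 40) is inverted. */
static unsigned char control_escape(unsigned char x)
{
    if (x >= 'a' && x <= 'z')
        x = (unsigned char)(x - 'a' + 'A');
    return (unsigned char)(x ^ 0x40);
}
```

With this definition, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, in line with the examples in the text.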
2523    
2524         After  \x, from zero to two hexadecimal digits are read (letters can be         After \x, from zero to two hexadecimal digits are read (letters can  be
2525         in upper or lower case). In UTF-8 mode, any number of hexadecimal  dig-         in  upper  or  lower case). Any number of hexadecimal digits may appear
2526         its  may  appear between \x{ and }, but the value of the character code         between \x{ and }, but the value of the character  code  must  be  less
2527         must be less than 2**31 (that is,  the  maximum  hexadecimal  value  is         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2528         7FFFFFFF).  If  characters other than hexadecimal digits appear between         the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than
2529         \x{ and }, or if there is no terminating }, this form of escape is  not         hexadecimal  digits  appear between \x{ and }, or if there is no termi-
2530         recognized. Instead, the initial \x will be interpreted as a basic hex-         nating }, this form of escape is not recognized.  Instead, the  initial
2531         adecimal escape, with no following digits,  giving  a  character  whose         \x will be interpreted as a basic hexadecimal escape, with no following
2532         value is zero.         digits, giving a character whose value is zero.
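
The recognition rule for \x{...} in the newer revision can be sketched as follows. This is not PCRE's own code: parse_braced_hex() and its result names are hypothetical, p is assumed to point just after the \x, and two behaviours that the text does not spell out are assumptions of the sketch, namely that an out-of-range value is reported as an error and that empty braces give zero.

```c
#include <ctype.h>
#include <stddef.h>

enum { HEX_NOT_BRACED = 0, HEX_OK = 1, HEX_OUT_OF_RANGE = -1 };

/* Try to read "{hhh..}" at p. On HEX_OK, *value holds the character
   code and *consumed the number of characters used, braces included.
   HEX_NOT_BRACED means the caller should fall back to the basic \x
   escape, which here has no digits and so gives the value zero. */
static int parse_braced_hex(const char *p, int utf8,
                            unsigned long *value, size_t *consumed)
{
    unsigned long limit = utf8 ? 0x7FFFFFFFUL : 0xFFUL;
    unsigned long v = 0;
    const char *q, *end;

    if (*p != '{') return HEX_NOT_BRACED;
    for (end = p + 1; isxdigit((unsigned char)*end); end++)
        ;
    if (*end != '}') return HEX_NOT_BRACED;  /* other character or no } */

    for (q = p + 1; q < end; q++) {
        unsigned long d = isdigit((unsigned char)*q)
            ? (unsigned long)(*q - '0')
            : (unsigned long)(tolower((unsigned char)*q) - 'a' + 10);
        if (v > (limit - d) / 16)            /* v * 16 + d would pass limit */
            return HEX_OUT_OF_RANGE;
        v = v * 16 + d;
    }
    *value = v;
    *consumed = (size_t)(end - p) + 1;
    return HEX_OK;
}
```

The form is validated before the value is computed so that a sequence with no terminating } is always treated as unrecognized, whatever digits it contains.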
2533    
2534         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2535         two syntaxes for \x when PCRE is in UTF-8 mode. There is no  difference         two  syntaxes  for  \x. There is no difference in the way they are han-
2536         in  the  way they are handled. For example, \xdc is exactly the same as         dled. For example, \xdc is exactly the same as \x{dc}.
        \x{dc}.  
2537    
2538         After \0 up to two further octal digits are read.  In  both  cases,  if         After \0 up to two further octal digits are read.  In  both  cases,  if
2539         there  are fewer than two digits, just those that are present are used.         there  are fewer than two digits, just those that are present are used.
# Line 2040  BACKSLASH Line 2616  BACKSLASH
2616    
2617         In  UTF-8 mode, characters with values greater than 128 never match \d,         In  UTF-8 mode, characters with values greater than 128 never match \d,
2618         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2619         code character property support is available.         code  character  property support is available. The use of locales with
2620           Unicode is discouraged.
2621    
2622     Unicode character properties     Unicode character properties
2623    
2624         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
2625         tional escape sequences to match generic character types are  available         tional  escape  sequences  to  match character properties are available
2626         when UTF-8 mode is selected. They are:         when UTF-8 mode is selected. They are:
2627    
2628          \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
2629          \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
2630          \X       an extended Unicode sequence           \X       an extended Unicode sequence
2631    
2632         The  property  names represented by xx above are limited to the Unicode         The property names represented by xx above are limited to  the  Unicode
2633         general category properties. Each character has exactly one such  prop-         script names, the general category properties, and "Any", which matches
2634         erty,  specified  by  a two-letter abbreviation. For compatibility with         any character (including newline). Other properties such as "InMusical-
2635         Perl, negation can be specified by including a circumflex  between  the         Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does
2636         opening  brace  and the property name. For example, \p{^Lu} is the same         not match any characters, so always causes a match failure.
2637         as \P{Lu}.  
2638           Sets of Unicode characters are defined as belonging to certain scripts.
2639         If only one letter is specified with \p or  \P,  it  includes  all  the         A  character from one of these sets can be matched using a script name.
2640         properties that start with that letter. In this case, in the absence of         For example:
2641         negation, the curly brackets in the escape sequence are optional; these  
2642         two examples have the same effect:           \p{Greek}
2643             \P{Han}
2644    
2645           Those that are not part of an identified script are lumped together  as
2646           "Common". The current list of scripts is:
2647    
2648           Arabic,  Armenian,  Bengali,  Bopomofo, Braille, Buginese, Buhid, Cana-
2649           dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic,  Deseret,
2650           Devanagari,  Ethiopic,  Georgian,  Glagolitic, Gothic, Greek, Gujarati,
2651           Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,  Inherited,  Kannada,
2652           Katakana,  Kharoshthi,  Khmer,  Lao, Latin, Limbu, Linear_B, Malayalam,
2653           Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
2654           Osmanya,  Runic,  Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-
2655           banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
2656           Ugaritic, Yi.
2657    
2658           Each  character has exactly one general category property, specified by
2659           a two-letter abbreviation. For compatibility with Perl, negation can be
2660           specified  by  including a circumflex between the opening brace and the
2661           property name. For example, \p{^Lu} is the same as \P{Lu}.
2662    
2663           If only one letter is specified with \p or \P, it includes all the gen-
2664           eral  category properties that start with that letter. In this case, in
2665           the absence of negation, the curly brackets in the escape sequence  are
2666           optional; these two examples have the same effect:
2667    
2668           \p{L}           \p{L}
2669           \pL           \pL
2670    
2671         The following property codes are supported:         The following general category property codes are supported:
2672    
2673           C     Other           C     Other
2674           Cc    Control           Cc    Control
# Line 2113  BACKSLASH Line 2714  BACKSLASH
2714           Zp    Paragraph separator           Zp    Paragraph separator
2715           Zs    Space separator           Zs    Space separator
2716    
2717         Extended  properties such as "Greek" or "InMusicalSymbols" are not sup-         The  special property L& is also supported: it matches a character that
2718         ported by PCRE.         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
2719           classified as a modifier or "other".
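
The general category rules (two-letter codes, one-letter prefixes, circumflex negation, and L&) can be modelled with Python's unicodedata module. This is only a sketch of the selection logic: has_property is a hypothetical name, script names and "Any" are not handled, and the Unicode tables in Python may be newer than those built into PCRE.

```python
import unicodedata


def has_property(ch, name):
    # Model of \p{name}: 'name' is a general category code such as "Lu",
    # a single letter such as "L", or the special "L&". A leading
    # circumflex negates the test, as in \p{^Lu}.
    negate = name.startswith("^")
    if negate:
        name = name[1:]
    category = unicodedata.category(ch)  # always a two-letter code
    if name == "L&":
        result = category in ("Lu", "Ll", "Lt")
    else:
        # A one-letter name selects every category starting with it.
        result = category.startswith(name)
    return result != negate


print(has_property("A", "Lu"), has_property("a", "^Lu"))
```

For example, U+02B0 (a modifier letter, category Lm) has the L property but not L&.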
2720    
2721           The  long  synonyms  for  these  properties that Perl supports (such as
2722           \p{Letter}) are not supported by PCRE. Nor is it  permitted  to  prefix
2723           any of these properties with "Is".
2724    
2725           No character that is in the Unicode table has the Cn (unassigned) prop-
2726           erty.  Instead, this property is assumed for any code point that is not
2727           in the Unicode table.
2728    
2729         Specifying caseless matching does not affect  these  escape  sequences.         Specifying  caseless  matching  does not affect these escape sequences.
2730         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
2731    
2732         The  \X  escape  matches  any number of Unicode characters that form an         The \X escape matches any number of Unicode  characters  that  form  an
2733         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
2734    
2735           (?>\PM\pM*)           (?>\PM\pM*)
2736    
2737         That is, it matches a character without the "mark"  property,  followed         That  is,  it matches a character without the "mark" property, followed
2738         by  zero  or  more  characters with the "mark" property, and treats the         by zero or more characters with the "mark"  property,  and  treats  the
2739         sequence as an atomic group (see below).  Characters  with  the  "mark"         sequence  as  an  atomic group (see below).  Characters with the "mark"
2740         property are typically accents that affect the preceding character.         property are typically accents that affect the preceding character.
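
To make the \X definition concrete, the following Python sketch measures one extended sequence by hand: one character without the "mark" property, then as many "mark" characters as follow. The function name is hypothetical and the category data comes from Python's unicodedata module, not from PCRE.

```python
import unicodedata


def is_mark(ch):
    # Categories Mc, Me and Mn all start with "M".
    return unicodedata.category(ch).startswith("M")


def extended_sequence_length(subject, pos=0):
    # Model of (?>\PM\pM*) applied at 'pos'. Returns the number of
    # characters matched, or 0 if there is no match at this position.
    if pos >= len(subject) or is_mark(subject[pos]):
        return 0
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1
    return end - pos


# "e" followed by U+0301 (combining acute accent), then "x"
print(extended_sequence_length("e\u0301x"))
```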
2741    
2742         Matching  characters  by Unicode property is not fast, because PCRE has         Matching characters by Unicode property is not fast, because  PCRE  has
2743         to search a structure that contains  data  for  over  fifteen  thousand         to  search  a  structure  that  contains data for over fifteen thousand
2744         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
2745         \w do not use Unicode properties in PCRE.         \w do not use Unicode properties in PCRE.
2746    
2747     Simple assertions     Simple assertions
2748    
2749         The fourth use of backslash is for certain simple assertions. An asser-         The fourth use of backslash is for certain simple assertions. An asser-
2750         tion  specifies a condition that has to be met at a particular point in         tion specifies a condition that has to be met at a particular point  in
2751         a match, without consuming any characters from the subject string.  The         a  match, without consuming any characters from the subject string. The
2752         use  of subpatterns for more complicated assertions is described below.         use of subpatterns for more complicated assertions is described  below.
2753         The backslashed assertions are:         The backslashed assertions are:
2754    
2755           \b     matches at a word boundary           \b     matches at a word boundary
# Line 2149  BACKSLASH Line 2759  BACKSLASH
2759           \z     matches at end of subject           \z     matches at end of subject
2760           \G     matches at first matching position in subject           \G     matches at first matching position in subject
2761    
2762         These assertions may not appear in character classes (but note that  \b         These  assertions may not appear in character classes (but note that \b
2763         has a different meaning, namely the backspace character, inside a char-         has a different meaning, namely the backspace character, inside a char-
2764         acter class).         acter class).
2765    
2766         A word boundary is a position in the subject string where  the  current         A  word  boundary is a position in the subject string where the current
2767         character  and  the previous character do not both match \w or \W (i.e.         character and the previous character do not both match \w or  \W  (i.e.
2768         one matches \w and the other matches \W), or the start or  end  of  the         one  matches  \w  and the other matches \W), or the start or end of the
2769         string if the first or last character matches \w, respectively.         string if the first or last character matches \w, respectively.
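
The word boundary definition translates directly into a comparison of the characters on either side of a position. A minimal Python sketch follows; it assumes the default meaning of \w (ASCII letters, digits, and underscore), and the function names are hypothetical.

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_word_char(ch):
    return ch in WORD_CHARS


def at_word_boundary(subject, pos):
    # True when exactly one of the characters around 'pos' matches \w.
    # Positions outside the string count as non-word, which gives the
    # start-of-string and end-of-string cases from the text.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after


print([p for p in range(6) if at_word_boundary("ab cd", p)])
```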
2770    
2771         The  \A,  \Z,  and \z assertions differ from the traditional circumflex         The \A, \Z, and \z assertions differ from  the  traditional  circumflex
2772         and dollar (described in the next section) in that they only ever match         and dollar (described in the next section) in that they only ever match
2773         at  the  very start and end of the subject string, whatever options are         at the very start and end of the subject string, whatever  options  are
2774         set. Thus, they are independent of multiline mode. These  three  asser-         set.  Thus,  they are independent of multiline mode. These three asser-
2775         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
2776         affect only the behaviour of the circumflex and dollar  metacharacters.         affect  only the behaviour of the circumflex and dollar metacharacters.
2777         However,  if the startoffset argument of pcre_exec() is non-zero, indi-         However, if the startoffset argument of pcre_exec() is non-zero,  indi-
2778         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
2779         the  subject,  \A  can never match. The difference between \Z and \z is         the subject, \A can never match. The difference between \Z  and  \z  is
2780         that \Z matches before a newline that is  the  last  character  of  the         that  \Z  matches  before  a  newline that is the last character of the
2781         string  as well as at the end of the string, whereas \z matches only at         string as well as at the end of the string, whereas \z matches only  at
2782         the end.         the end.
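
The difference between \Z and \z comes down to one extra accepted position. The two predicates below are a Python model of the description above, with hypothetical names. They are deliberately not written with Python's re module, whose own \Z behaves differently.

```python
def upper_z_matches(subject, pos):
    # Model of \Z: the end of the subject, or just before a newline
    # that is the last character of the subject.
    end = len(subject)
    return pos == end or (pos == end - 1 and subject[pos] == "\n")


def lower_z_matches(subject, pos):
    # Model of \z: only the very end of the subject.
    return pos == len(subject)


subject = "abc\n"
print(upper_z_matches(subject, 3), lower_z_matches(subject, 3))
```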
2783    
2784         The \G assertion is true only when the current matching position is  at         The  \G assertion is true only when the current matching position is at
2785         the  start point of the match, as specified by the startoffset argument         the start point of the match, as specified by the startoffset  argument
2786         of pcre_exec(). It differs from \A when the  value  of  startoffset  is         of  pcre_exec().  It  differs  from \A when the value of startoffset is
2787         non-zero.  By calling pcre_exec() multiple times with appropriate argu-         non-zero. By calling pcre_exec() multiple times with appropriate  argu-
2788         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
2789         mentation where \G can be useful.         mentation where \G can be useful.
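
The repeated-call technique can be sketched in Python, used here only as a stand-in for a pcre_exec() loop. Python's Pattern.match(subject, pos) tries a match at position pos only, which is close to what a pattern starting with \G does when startoffset is set to pos. The helper name is hypothetical.

```python
import re


def match_all_contiguous(pattern, subject):
    # Repeatedly match at the current offset, moving the offset to the
    # end of each match, in the manner of Perl's /g with \G.
    compiled = re.compile(pattern)
    pos = 0
    found = []
    while pos <= len(subject):
        m = compiled.match(subject, pos)
        if m is None or m.end() == pos:
            break  # no match here, or an empty match that cannot advance
        found.append(m.group())
        pos = m.end()
    return found


print(match_all_contiguous(r"\d\d", "1234x56"))
```

The loop stops at "x" because each match must begin exactly where the previous one ended; "56" is never reached.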
2790    
2791         Note,  however,  that  PCRE's interpretation of \G, as the start of the         Note, however, that PCRE's interpretation of \G, as the  start  of  the
2792         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
2793         end  of  the  previous  match. In Perl, these can be different when the         end of the previous match. In Perl, these can  be  different  when  the
2794         previously matched string was empty. Because PCRE does just  one  match         previously  matched  string was empty. Because PCRE does just one match
2795         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
2796    
2797         If  all  the alternatives of a pattern begin with \G, the expression is         If all the alternatives of a pattern begin with \G, the  expression  is
2798         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
2799         in the compiled regular expression.         in the compiled regular expression.
2800    
# Line 2192  BACKSLASH Line 2802  BACKSLASH
2802  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
2803    
2804         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
2805         character is an assertion that is true only  if  the  current  matching         character  is  an  assertion  that is true only if the current matching
2806         point  is  at the start of the subject string. If the startoffset argu-         point is at the start of the subject string. If the  startoffset  argu-
2807         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
2808         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
2809         has an entirely different meaning (see below).         has an entirely different meaning (see below).
2810    
2811         Circumflex need not be the first character of the pattern if  a  number         Circumflex  need  not be the first character of the pattern if a number
2812         of  alternatives are involved, but it should be the first thing in each         of alternatives are involved, but it should be the first thing in  each
2813         alternative in which it appears if the pattern is ever  to  match  that         alternative  in  which  it appears if the pattern is ever to match that
2814         branch.  If all possible alternatives start with a circumflex, that is,         branch. If all possible alternatives start with a circumflex, that  is,
2815         if the pattern is constrained to match only at the start  of  the  sub-         if  the  pattern  is constrained to match only at the start of the sub-
2816         ject,  it  is  said  to be an "anchored" pattern. (There are also other         ject, it is said to be an "anchored" pattern.  (There  are  also  other
2817         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
2818    
2819         A dollar character is an assertion that is true  only  if  the  current         A  dollar  character  is  an assertion that is true only if the current
2820         matching  point  is  at  the  end of the subject string, or immediately         matching point is at the end of  the  subject  string,  or  immediately
2821         before a newline character that is the last character in the string (by         before a newline character that is the last character in the string (by
2822         default).  Dollar  need  not  be the last character of the pattern if a         default). Dollar need not be the last character of  the  pattern  if  a
2823         number of alternatives are involved, but it should be the last item  in         number  of alternatives are involved, but it should be the last item in
2824         any  branch  in  which  it appears.  Dollar has no special meaning in a         any branch in which it appears.  Dollar has no  special  meaning  in  a
2825         character class.         character class.
2826    
2827         The meaning of dollar can be changed so that it  matches  only  at  the         The  meaning  of  dollar  can be changed so that it matches only at the
2828         very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
2829         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
2830    
2831         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
2832         PCRE_MULTILINE option is set. When this is the case, they match immedi-         PCRE_MULTILINE option is set. When this is the case, they match immedi-
2833         ately after and  immediately  before  an  internal  newline  character,         ately  after  and  immediately  before  an  internal newline character,
2834         respectively,  in addition to matching at the start and end of the sub-         respectively, in addition to matching at the start and end of the  sub-
2835         ject string. For example,  the  pattern  /^abc$/  matches  the  subject         ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
2836         string  "def\nabc"  (where \n represents a newline character) in multi-         string "def\nabc" (where \n represents a newline character)  in  multi-
2837         line mode, but not otherwise.  Consequently, patterns that are anchored         line mode, but not otherwise.  Consequently, patterns that are anchored
2838         in  single line mode because all branches start with ^ are not anchored         in single line mode because all branches start with ^ are not  anchored
2839         in multiline mode, and a match for  circumflex  is  possible  when  the         in  multiline  mode,  and  a  match for circumflex is possible when the
2840         startoffset   argument   of  pcre_exec()  is  non-zero.  The  PCRE_DOL-         startoffset  argument  of  pcre_exec()  is  non-zero.   The   PCRE_DOL-
2841         LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.         LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
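
The /^abc$/ example can be tried out with Python's re module, whose circumflex and dollar behave in the same way for this case. This is an illustration in another regex engine, not a PCRE call; re.MULTILINE plays the role of PCRE_MULTILINE.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches just after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

print(single)
print(multi.start() if multi else None)
```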
2842    
2843         Note that the sequences \A, \Z, and \z can be used to match  the  start         Note  that  the sequences \A, \Z, and \z can be used to match the start
2844         and  end of the subject in both modes, and if all branches of a pattern         and end of the subject in both modes, and if all branches of a  pattern
2845         start with \A it is always anchored, whether PCRE_MULTILINE is  set  or         start  with  \A it is always anchored, whether PCRE_MULTILINE is set or
2846         not.         not.
2847    
2848    
2849  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
2850    
2851         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
2852         ter in the subject, including a non-printing  character,  but  not  (by         ter  in  the  subject,  including a non-printing character, but not (by
2853         default)  newline.   In  UTF-8 mode, a dot matches any UTF-8 character,         default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
2854         which might be more than one byte long, except (by default) newline. If         which might be more than one byte long, except (by default) newline. If
2855         the  PCRE_DOTALL  option  is set, dots match newlines as well. The han-         the PCRE_DOTALL option is set, dots match newlines as  well.  The  han-
2856         dling of dot is entirely independent of the handling of circumflex  and         dling  of dot is entirely independent of the handling of circumflex and
2857         dollar,  the  only  relationship  being  that they both involve newline         dollar, the only relationship being  that  they  both  involve  newline
2858         characters. Dot has no special meaning in a character class.         characters. Dot has no special meaning in a character class.
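
The effect of the dot-matches-newline option can be seen in a short Python example. Python's re.DOTALL flag is used as an analogue of PCRE_DOTALL; the code does not call PCRE.

```python
import re

subject = "a\nb"

# By default a dot does not match a newline.
default_match = re.search(r"a.b", subject)

# With the dot-all flag set, it does.
dotall_match = re.search(r"a.b", subject, re.DOTALL)

print(default_match is None, dotall_match is not None)
```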
2859    
2860    
2861  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
2862    
2863         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
2864         both  in  and  out of UTF-8 mode. Unlike a dot, it can match a newline.         both in and out of UTF-8 mode. Unlike a dot, it can  match  a  newline.
2865         The feature is provided in Perl in order to match individual  bytes  in         The  feature  is provided in Perl in order to match individual bytes in
2866         UTF-8  mode.  Because  it  breaks  up  UTF-8 characters into individual         UTF-8 mode. Because it  breaks  up  UTF-8  characters  into  individual
2867         bytes, what remains in the string may be a malformed UTF-8 string.  For         bytes,  what remains in the string may be a malformed UTF-8 string. For
2868         this reason, the \C escape sequence is best avoided.         this reason, the \C escape sequence is best avoided.
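
The hazard with \C can be shown without a regex engine at all. The Python sketch below consumes a given number of bytes from the front of a UTF-8 string, as \C would, and checks whether the remainder is still well formed. The function name is hypothetical.

```python
def remainder_is_valid_utf8(subject, bytes_consumed):
    # Encode the subject, drop the first 'bytes_consumed' bytes, and
    # report whether what is left still decodes as UTF-8.
    data = subject.encode("utf-8")
    try:
        data[bytes_consumed:].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+00E9 occupies two bytes in UTF-8, so consuming one byte splits it.
print(remainder_is_valid_utf8("\u00e9a", 1))
print(remainder_is_valid_utf8("\u00e9a", 2))
```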
2869    
2870         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE does not allow \C to appear in  lookbehind  assertions  (described
2871         below), because in UTF-8 mode this would make it impossible  to  calcu-         below),  because  in UTF-8 mode this would make it impossible to calcu-
2872         late the length of the lookbehind.         late the length of the lookbehind.
2873    
2874    
# Line 2267  SQUARE BRACKETS AND CHARACTER CLASSES Line 2877  SQUARE BRACKETS AND CHARACTER CLASSES
2877         An opening square bracket introduces a character class, terminated by a         An opening square bracket introduces a character class, terminated by a
2878         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
2879         cial. If a closing square bracket is required as a member of the class,         cial. If a closing square bracket is required as a member of the class,
2880         it should be the first data character in the class  (after  an  initial         it  should  be  the first data character in the class (after an initial
2881         circumflex, if present) or escaped with a backslash.         circumflex, if present) or escaped with a backslash.
2882    
2883         A  character  class matches a single character in the subject. In UTF-8         A character class matches a single character in the subject.  In  UTF-8
2884         mode, the character may occupy more than one byte. A matched  character         mode,  the character may occupy more than one byte. A matched character
2885         must be in the set of characters defined by the class, unless the first         must be in the set of characters defined by the class, unless the first
2886         character in the class definition is a circumflex, in  which  case  the         character  in  the  class definition is a circumflex, in which case the
2887         subject  character  must  not  be in the set defined by the class. If a         subject character must not be in the set defined by  the  class.  If  a
2888         circumflex is actually required as a member of the class, ensure it  is         circumflex  is actually required as a member of the class, ensure it is
2889         not the first character, or escape it with a backslash.         not the first character, or escape it with a backslash.
2890    
2891         For  example, the character class [aeiou] matches any lower case vowel,         For example, the character class [aeiou] matches any lower case  vowel,
2892         while [^aeiou] matches any character that is not a  lower  case  vowel.         while  [^aeiou]  matches  any character that is not a lower case vowel.
2893         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
2894         characters that are in the class by enumerating those that are  not.  A         characters  that  are in the class by enumerating those that are not. A
2895         class  that starts with a circumflex is not an assertion: it still con-         class that starts with a circumflex is not an assertion: it still  con-
2896         sumes a character from the subject string, and therefore  it  fails  if         sumes  a  character  from the subject string, and therefore it fails if
2897         the current pointer is at the end of the string.         the current pointer is at the end of the string.
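
The point that a negated class still consumes a character is worth seeing in action. The example uses Python's re module as a stand-in, since it treats these simple classes in the same way; the negative lookahead in the last line is included only for contrast.

```python
import re

# A negated class needs a character to consume, so it fails at the end
# of the subject string.
class_at_end = re.search(r"a[^aeiou]", "a")
class_mid = re.search(r"a[^aeiou]", "ab")

# A negative lookahead is an assertion: it consumes nothing and so
# succeeds at the end of the string.
assertion_at_end = re.search(r"a(?![aeiou])", "a")

print(class_at_end is None, class_mid.group(), assertion_at_end.group())
```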
2898    
2899         In  UTF-8 mode, characters with values greater than 255 can be included         In UTF-8 mode, characters with values greater than 255 can be  included
2900         in a class as a literal string of bytes, or by using the  \x{  escaping         in  a  class as a literal string of bytes, or by using the \x{ escaping
2901         mechanism.         mechanism.
2902    
2903         When  caseless  matching  is set, any letters in a class represent both         When caseless matching is set, any letters in a  class  represent  both
2904         their upper case and lower case versions, so for  example,  a  caseless         their  upper  case  and lower case versions, so for example, a caseless
2905         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
2906         match "A", whereas a caseful version would. When running in UTF-8 mode,         match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
2907         PCRE  supports  the  concept of case for characters with values greater         understands the concept of case for characters whose  values  are  less
2908         than 128 only when it is compiled with Unicode property support.         than  128, so caseless matching is always possible. For characters with
2909           higher values, the concept of case is supported  if  PCRE  is  compiled
2910           with  Unicode  property support, but not otherwise.  If you want to use
2911           caseless matching for characters 128 and above, you  must  ensure  that
2912           PCRE  is  compiled  with Unicode property support as well as with UTF-8
2913           support.
2914    
2915         The newline character is never treated in any special way in  character         The newline character is never treated in any special way in  character
2916         classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE         classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE
# Line 3088  RECURSIVE PATTERNS Line 3703  RECURSIVE PATTERNS
3703         tion.)  The special item (?R) is a recursive call of the entire regular         tion.)  The special item (?R) is a recursive call of the entire regular
3704         expression.         expression.
3705    
3706         For example, this PCRE pattern solves the  nested  parentheses  problem         A recursive subpattern call is always treated as an atomic group.  That
3707         (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is         is,  once  it  has  matched some of the subject string, it is never re-
3708         ignored):         entered, even if it contains untried alternatives and there is a subse-
3709           quent matching failure.
3710    
3711           This  PCRE  pattern  solves  the nested parentheses problem (assume the
3712           PCRE_EXTENDED option is set so that white space is ignored):
3713    
3714           \( ( (?>[^()]+) | (?R) )* \)           \( ( (?>[^()]+) | (?R) )* \)
3715    
3716         First it matches an opening parenthesis. Then it matches any number  of         First it matches an opening parenthesis. Then it matches any number  of
3717         substrings  which  can  either  be  a sequence of non-parentheses, or a         substrings  which  can  either  be  a sequence of non-parentheses, or a
3718         recursive match of the pattern itself (that is  a  correctly  parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
3719         sized substring).  Finally there is a closing parenthesis.         sized substring).  Finally there is a closing parenthesis.
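
The structure of this pattern maps onto a small recursive function. The Python sketch below is a hand-written equivalent for illustration, not PCRE's matching algorithm; the name match_parenthesized is hypothetical. The comments mark which part of the pattern each step corresponds to.

```python
def match_parenthesized(subject, pos=0):
    # Returns the index just past a correctly parenthesized substring
    # starting at 'pos', or -1 if there is none.
    if pos >= len(subject) or subject[pos] != "(":
        return -1                       # \(
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            pos = match_parenthesized(subject, pos)   # (?R)
            if pos == -1:
                return -1
        else:
            # (?>[^()]+) : take the whole run of non-parentheses
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    if pos < len(subject):
        return pos + 1                  # \)
    return -1                           # ran out of subject before \)


print(match_parenthesized("(a(b)c)"), match_parenthesized("(a(b)"))
```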
3720    
3721         If  this  were  part of a larger pattern, you would not want to recurse         If  this  were  part of a larger pattern, you would not want to recurse
# Line 3180  SUBPATTERNS AS SUBROUTINES Line 3799  SUBPATTERNS AS SUBROUTINES
3799         two strings. Such references must, however, follow  the  subpattern  to         two strings. Such references must, however, follow  the  subpattern  to
3800         which they refer.         which they refer.
3801    
3802           Like recursive subpatterns, a "subroutine" call is always treated as an
3803           atomic group. That is, once it has matched some of the subject  string,
3804           it  is  never  re-entered, even if it contains untried alternatives and
3805           there is a subsequent matching failure.
3806    
3807    
3808  CALLOUTS  CALLOUTS
3809    
3810         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
3811         Perl code to be obeyed in the middle of matching a regular  expression.         Perl  code to be obeyed in the middle of matching a regular expression.
3812         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
3813         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
3814         tion.         tion.
3815    
3816         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
3817         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
3818         an  external function by putting its entry point in the global variable         an external function by putting its entry point in the global  variable
3819         pcre_callout.  By default, this variable contains NULL, which  disables         pcre_callout.   By default, this variable contains NULL, which disables
3820         all calling out.         all calling out.
3821    
3822         Within  a  regular  expression,  (?C) indicates the points at which the         Within a regular expression, (?C) indicates the  points  at  which  the
3823         external function is to be called. If you want  to  identify  different         external  function  is  to be called. If you want to identify different
3824         callout  points, you can put a number less than 256 after the letter C.         callout points, you can put a number less than 256 after the letter  C.
3825         The default value is zero.  For example, this pattern has  two  callout         The  default  value is zero.  For example, this pattern has two callout
3826         points:         points:
3827    
3828           (?C1)abc(?C2)def           (?C1)abc(?C2)def
3829    
3830         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
3831         automatically installed before each item in the pattern. They  are  all         automatically  installed  before each item in the pattern. They are all
3832         numbered 255.         numbered 255.
3833    
3834         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
3835         set), the external function is called. It is provided with  the  number         set),  the  external function is called. It is provided with the number
3836         of  the callout, the position in the pattern, and, optionally, one item         of the callout, the position in the pattern, and, optionally, one  item
3837         of data originally supplied by the caller of pcre_exec().  The  callout         of  data  originally supplied by the caller of pcre_exec(). The callout
3838         function  may cause matching to proceed, to backtrack, or to fail alto-         function may cause matching to proceed, to backtrack, or to fail  alto-
3839         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
3840         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
3841    
3842  Last updated: 09 September 2004  Last updated: 24 January 2006
3843  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
3844  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
3845    
 PCRE(3)                                                                PCRE(3)  
3846    
3847    PCREPARTIAL(3)                                                  PCREPARTIAL(3)
3848    
3849    
3850  NAME  NAME
3851         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
3852    
3853    
3854  PARTIAL MATCHING IN PCRE  PARTIAL MATCHING IN PCRE
3855    
3856         In  normal  use  of  PCRE,  if  the  subject  string  that is passed to         In  normal  use  of  PCRE,  if  the  subject  string  that is passed to
3857         pcre_exec() matches as far as it goes, but is too short  to  match  the         pcre_exec() or pcre_dfa_exec() matches as far as it goes,  but  is  too
3858         entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances         short  to  match  the  entire  pattern, PCRE_ERROR_NOMATCH is returned.
3859         where it might be helpful to distinguish this case from other cases  in         There are circumstances where it might be helpful to  distinguish  this
3860         which there is no match.         case from other cases in which there is no match.
3861    
3862         Consider, for example, an application where a human is required to type         Consider, for example, an application where a human is required to type
3863         in data for a field with specific formatting requirements.  An  example         in data for a field with specific formatting requirements.  An  example
# Line 3248  PARTIAL MATCHING IN PCRE Line 3873  PARTIAL MATCHING IN PCRE
3873         until the entire string has been entered.         until the entire string has been entered.
3874    
3875         PCRE supports the concept of partial matching by means of the PCRE_PAR-         PCRE supports the concept of partial matching by means of the PCRE_PAR-
3876         TIAL  option,  which  can be set when calling pcre_exec(). When this is         TIAL   option,   which   can   be   set  when  calling  pcre_exec()  or
3877         done,  the   return   code   PCRE_ERROR_NOMATCH   is   converted   into         pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code
3878         PCRE_ERROR_PARTIAL  if  at  any  time  during  the matching process the         PCRE_ERROR_NOMATCH  is converted into PCRE_ERROR_PARTIAL if at any time
3879         entire subject string matched part of the pattern. No captured data  is         during the matching process the last part of the subject string matched
3880         set when this occurs.         part  of  the  pattern. Unfortunately, for non-anchored matching, it is
3881           not possible to obtain the position of the start of the partial  match.
3882           No captured data is set when PCRE_ERROR_PARTIAL is returned.
3883    
3884           When   PCRE_PARTIAL   is  set  for  pcre_dfa_exec(),  the  return  code
3885           PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the  end  of
3886           the  subject is reached, there have been no complete matches, but there
3887           is still at least one matching possibility. The portion of  the  string
3888           that provided the partial match is set as the first matching string.
3889    
3890         Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers         Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
3891         the last literal byte in a pattern, and abandons  matching  immediately         the last literal byte in a pattern, and abandons  matching  immediately
# Line 3263  PARTIAL MATCHING IN PCRE Line 3896  PARTIAL MATCHING IN PCRE
3896  RESTRICTED PATTERNS FOR PCRE_PARTIAL  RESTRICTED PATTERNS FOR PCRE_PARTIAL
3897    
3898         Because of the way certain internal optimizations  are  implemented  in         Because of the way certain internal optimizations  are  implemented  in
3899         PCRE,  the  PCRE_PARTIAL  option  cannot  be  used  with  all patterns.         the  pcre_exec()  function, the PCRE_PARTIAL option cannot be used with
3900         Repeated single characters such as         all patterns. These restrictions do not apply when  pcre_dfa_exec()  is
3901           used.  For pcre_exec(), repeated single characters such as
3902    
3903           a{2,4}           a{2,4}
3904    
# Line 3272  RESTRICTED PATTERNS FOR PCRE_PARTIAL Line 3906  RESTRICTED PATTERNS FOR PCRE_PARTIAL
3906    
3907           \d+           \d+
3908    
3909         are not permitted if the maximum number of occurrences is greater  than         are  not permitted if the maximum number of occurrences is greater than
3910         one.  Optional items such as \d? (where the maximum is one) are permit-         one.  Optional items such as \d? (where the maximum is one) are permit-
3911         ted.  Quantifiers with any values are permitted after  parentheses,  so         ted.   Quantifiers  with any values are permitted after parentheses, so
3912         the invalid examples above can be coded thus:         the invalid examples above can be coded thus:
3913    
3914           (a){2,4}           (a){2,4}
3915           (\d)+           (\d)+
3916    
3917         These  constructions  run more slowly, but for the kinds of application         These constructions run more slowly, but for the kinds  of  application
3918         that are envisaged for this facility, this is not felt to  be  a  major         that  are  envisaged  for this facility, this is not felt to be a major
3919         restriction.         restriction.
3920    
3921         If  PCRE_PARTIAL  is  set  for  a  pattern that does not conform to the         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the
3922         restrictions, pcre_exec() returns the error code  PCRE_ERROR_BADPARTIAL         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
3923         (-13).         (-13).
3924    
3925    
3926  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
3927    
3928         If  the  escape  sequence  \P  is  present in a pcretest data line, the         If the escape sequence \P is present  in  a  pcretest  data  line,  the
3929         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
3930         uses the date example quoted above:         uses the date example quoted above:
3931    
3932             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
3933           data> 25jun04P           data> 25jun04\P
3934            0: 25jun04            0: 25jun04
3935            1: jun            1: jun
3936           data> 25dec3P           data> 25dec3\P
3937           Partial match           Partial match
3938           data> 3juP           data> 3ju\P
3939           Partial match           Partial match
3940           data> 3jujP           data> 3juj\P
3941           No match           No match
3942           data> jP           data> j\P
3943           No match           No match
3944    
3945         The  first  data  string  is  matched completely, so pcretest shows the         The first data string is matched  completely,  so  pcretest  shows  the
3946         matched substrings. The remaining four strings do not  match  the  com-         matched  substrings.  The  remaining four strings do not match the com-
3947         plete pattern, but the first two are partial matches.         plete pattern, but the first two are partial matches.  The  same  test,
3948           using  DFA  matching (by means of the \D escape sequence), produces the
3949           following output:
3950    
3951  Last updated: 08 September 2004             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
3952  Copyright (c) 1997-2004 University of Cambridge.           data> 25jun04\P\D
3953  -----------------------------------------------------------------------------            0: 25jun04
3954             data> 23dec3\P\D
3955             Partial match: 23dec3
3956             data> 3ju\P\D
3957             Partial match: 3ju
3958             data> 3juj\P\D
3959             No match
3960             data> j\P\D
3961             No match
3962    
3963  PCRE(3)                                                                PCRE(3)         Notice that in this case the portion of the string that was matched  is
3964           made available.
3965    
3966    
3967    MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
3968    
3969           When a partial match has been found using pcre_dfa_exec(), it is possi-
3970           ble to continue the match by  providing  additional  subject  data  and
3971           calling  pcre_dfa_exec() again with the PCRE_DFA_RESTART option and the
3972           same working space (where details of the  previous  partial  match  are
3973           stored).  Here  is  an  example  using  pcretest,  where  the \R escape
3974           sequence sets the PCRE_DFA_RESTART option and the  \D  escape  sequence
3975           requests the use of pcre_dfa_exec():
3976    
3977               re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
3978             data> 23ja\P\D
3979             Partial match: 23ja
3980             data> n05\R\D
3981              0: n05
3982    
3983           The  first  call has "23ja" as the subject, and requests partial match-
3984           ing; the second call  has  "n05"  as  the  subject  for  the  continued
3985           (restarted)  match.   Notice  that when the match is complete, only the
3986           last part is shown; PCRE does  not  retain  the  previously  partially-
3987           matched  string. It is up to the calling program to do that if it needs
3988           to.
3989    
3990           This facility can  be  used  to  pass  very  long  subject  strings  to
3991           pcre_dfa_exec(). However, some care is needed for certain types of pat-
3992           tern.
3993    
3994           1. If the pattern contains tests for the beginning or end  of  a  line,
3995           you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
3996           ate, when the subject string for any call does not contain  the  begin-
3997           ning or end of a line.
3998    
3999           2.  If  the  pattern contains backward assertions (including \b or \B),
4000           you need to arrange for some overlap in the subject  strings  to  allow
4001           for  this.  For example, you could pass the subject in chunks that were
4002           500 bytes long, but in a buffer of 700 bytes, with the starting  offset
4003           set to 200 and the previous 200 bytes at the start of the buffer.
4004    
4005           3.  Matching a subject string that is split into multiple segments does
4006           not always produce exactly the same result as matching over one  single
4007           long  string.   The  difference arises when there are multiple matching
4008           possibilities, because a partial match result is given only when  there
4009           are  no  completed  matches  in a call to pcre_dfa_exec(). This means
4010           that as soon as the shortest match has been found,  continuation  to  a
4011           new  subject  segment  is  no  longer possible.  Consider this pcretest
4012           example:
4013    
4014               re> /dog(sbody)?/
4015             data> do\P\D
4016             Partial match: do
4017             data> gsb\R\P\D
4018              0: g
4019             data> dogsbody\D
4020              0: dogsbody
4021              1: dog
4022    
4023           The pattern matches the words "dog" or "dogsbody". When the subject  is
4024           presented  in  several  parts  ("do" and "gsb" being the first two) the
4025           match stops when "dog" has been found, and it is not possible  to  con-
4026           tinue.  On  the  other  hand,  if  "dogsbody"  is presented as a single
4027           string, both matches are found.
4028    
4029           Because of this phenomenon, it does not usually make  sense  to  end  a
4030           pattern that is going to be matched in this way with a variable repeat.
4031    
4032           4. Patterns that contain alternatives at the top level which do not all
4033           start with the same pattern item may not work as expected. For example,
4034           consider this pattern:
4035    
4036             1234|3789
4037    
4038           If the first part of the subject is "ABC123", a partial  match  of  the
4039           first  alternative  is found at offset 3. There is no partial match for
4040           the second alternative, because such a match does not start at the same
4041           point  in  the  subject  string. Attempting to continue with the string
4042           "789" does not yield a match because only those alternatives that match
4043           at  one point in the subject are remembered. The problem arises because
4044           the start of the second alternative matches within the  first  alterna-
4045           tive. There is no problem with anchored patterns or patterns such as:
4046    
4047             1234|ABCD
4048    
4049           where no string can be a partial match for both alternatives.
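The buffer arrangement suggested in point 2 above (500-byte chunks in a 700-byte buffer, with the previous 200 bytes kept at the start) can be sketched as follows. This is only an illustration: the segment_buffer type and the segment_init() and segment_load() helpers are hypothetical names and are not part of PCRE. The sketch handles the buffer bookkeeping only and does not call the matcher.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE   500  /* new subject bytes examined per call */
#define OVERLAP_SIZE 200  /* earlier bytes kept for backward assertions */
#define BUFFER_SIZE  (OVERLAP_SIZE + CHUNK_SIZE)

/* Hypothetical helper type: a sliding window over a long subject. */
typedef struct {
    char   data[BUFFER_SIZE];
    size_t length;        /* number of valid bytes in data */
    size_t start_offset;  /* offset at which matching should start */
} segment_buffer;

static void segment_init(segment_buffer *sb)
{
    sb->length = 0;
    sb->start_offset = 0;
}

/* Load the next chunk. The tail of the previous contents (at most
   OVERLAP_SIZE bytes) is moved to the front of the buffer, and the
   new chunk is appended after it. */
static void segment_load(segment_buffer *sb, const char *chunk,
                         size_t chunk_len)
{
    size_t keep;

    assert(chunk_len <= CHUNK_SIZE);
    keep = sb->length < OVERLAP_SIZE ? sb->length : OVERLAP_SIZE;
    memmove(sb->data, sb->data + (sb->length - keep), keep);
    memcpy(sb->data + keep, chunk, chunk_len);
    sb->length = keep + chunk_len;
    sb->start_offset = keep;  /* the new data begins here */
}
```

After each call to segment_load(), the data, length and start_offset fields would be passed as the subject, length and startoffset arguments of pcre_dfa_exec(), together with PCRE_DFA_RESTART when continuing a partial match. For the first chunk there is no earlier text, so the starting offset is zero.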
4050    
4051    Last updated: 16 January 2006
4052    Copyright (c) 1997-2006 University of Cambridge.
4053    ------------------------------------------------------------------------------
4054    
4055    
4056    PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
4057    
4058    
4059  NAME  NAME
4060         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
4061    
4062    
4063  SAVING AND RE-USING PRECOMPILED PCRE PATTERNS  SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
4064    
4065         If  you  are running an application that uses a large number of regular         If  you  are running an application that uses a large number of regular
# Line 3391  SAVING A COMPILED PATTERN Line 4128  SAVING A COMPILED PATTERN
4128  RE-USING A PRECOMPILED PATTERN  RE-USING A PRECOMPILED PATTERN
4129    
4130         Re-using a precompiled pattern is straightforward. Having  reloaded  it         Re-using a precompiled pattern is straightforward. Having  reloaded  it
4131         into main memory, you pass its pointer to pcre_exec() in the usual way.         into   main   memory,   you   pass   its   pointer  to  pcre_exec()  or
4132         This should work even on another host, and even if that  host  has  the         pcre_dfa_exec() in the usual way. This  should  work  even  on  another
4133         opposite endianness to the one where the pattern was compiled.         host,  and  even  if  that  host has the opposite endianness to the one
4134           where the pattern was compiled.
4135         However,  if  you  passed a pointer to custom character tables when the  
4136         pattern was compiled (the tableptr  argument  of  pcre_compile()),  you         However, if you passed a pointer to custom character  tables  when  the
4137         must now pass a similar pointer to pcre_exec(), because the value saved         pattern  was  compiled  (the  tableptr argument of pcre_compile()), you
4138         with the compiled pattern will obviously be  nonsense.  A  field  in  a         must now pass a similar  pointer  to  pcre_exec()  or  pcre_dfa_exec(),
4139         pcre_extra()  block is used to pass this data, as described in the sec-         because  the  value  saved  with the compiled pattern will obviously be
4140         tion on matching a pattern in the pcreapi documentation.         nonsense. A field in a pcre_extra() block is used to pass this data, as
4141           described  in the section on matching a pattern in the pcreapi documen-
4142           tation.
4143    
4144         If you did not provide custom character tables  when  the  pattern  was         If you did not provide custom character tables  when  the  pattern  was
4145         compiled,  the  pointer  in  the compiled pattern is NULL, which causes         compiled,  the  pointer  in  the compiled pattern is NULL, which causes
# Line 3411  RE-USING A PRECOMPILED PATTERN Line 4150  RE-USING A PRECOMPILED PATTERN
4150         your own pcre_extra data block and set the study_data field to point to         your own pcre_extra data block and set the study_data field to point to
4151         the  reloaded  study  data. You must also set the PCRE_EXTRA_STUDY_DATA         the  reloaded  study  data. You must also set the PCRE_EXTRA_STUDY_DATA
4152         bit in the flags field to indicate that study  data  is  present.  Then         bit in the flags field to indicate that study  data  is  present.  Then
4153         pass the pcre_extra block to pcre_exec() in the usual way.         pass  the  pcre_extra  block  to  pcre_exec() or pcre_dfa_exec() in the
4154           usual way.
4155    
4156    
4157  COMPATIBILITY WITH DIFFERENT PCRE RELEASES  COMPATIBILITY WITH DIFFERENT PCRE RELEASES
4158    
4159         The  layout  of the control block that is at the start of the data that         The layout of the control block that is at the start of the  data  that
4160         makes up a compiled pattern was changed for release 5.0.  If  you  have         makes  up  a  compiled pattern was changed for release 5.0. If you have
4161         any  saved  patterns  that  were compiled with previous releases (not a         any saved patterns that were compiled with  previous  releases  (not  a
4162         facility that was previously advertised), you will  have  to  recompile         facility  that  was  previously advertised), you will have to recompile
4163         them  for  release  5.0. However, from now on, it should be possible to         them for release 5.0. However, from now on, it should  be  possible  to
4164         make changes in a compabible manner.         make changes in a compatible manner.
4165    
4166  Last updated: 10 September 2004         Notwithstanding the above, if you have any saved patterns in UTF-8 mode
4167  Copyright (c) 1997-2004 University of Cambridge.         that use \p or \P that were compiled with any release up to and includ-
4168  -----------------------------------------------------------------------------         ing 6.4, you will have to recompile them for release 6.5 and above.
4169    
4170    Last updated: 01 February 2006
4171    Copyright (c) 1997-2006 University of Cambridge.
4172    ------------------------------------------------------------------------------
4173    
 PCRE(3)                                                                PCRE(3)  
4174    
4175    PCREPERFORM(3)                                                  PCREPERFORM(3)
4176    
4177    
4178  NAME  NAME
4179         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
4180    
4181    
4182  PCRE PERFORMANCE  PCRE PERFORMANCE
4183    
4184         Certain  items  that may appear in regular expression patterns are more         Certain  items  that may appear in regular expression patterns are more
# Line 3469  PCRE PERFORMANCE Line 4214  PCRE PERFORMANCE
4214    
4215         If you are using such a pattern with subject strings that do  not  con-         If you are using such a pattern with subject strings that do  not  con-
4216         tain newlines, the best performance is obtained by setting PCRE_DOTALL,         tain newlines, the best performance is obtained by setting PCRE_DOTALL,
4217         or starting the pattern with ^.* to indicate explicit  anchoring.  That         or starting the pattern with ^.* or ^.*? to indicate  explicit  anchor-
4218         saves  PCRE from having to scan along the subject looking for a newline         ing.  That saves PCRE from having to scan along the subject looking for
4219         to restart at.         a newline to restart at.
4220    
4221         Beware of patterns that contain nested indefinite  repeats.  These  can         Beware of patterns that contain nested indefinite  repeats.  These  can
4222         take  a  long time to run when applied to a string that does not match.         take  a  long time to run when applied to a string that does not match.
# Line 3492  PCRE PERFORMANCE Line 4237  PCRE PERFORMANCE
4237           (a+)*b           (a+)*b
4238    
4239         where a literal character follows. Before  embarking  on  the  standard         where a literal character follows. Before  embarking  on  the  standard
4240         matching  procedure,  PCRE  checks  that  there  is  a "b" later in the         matching  procedure,  PCRE checks that there is a "b" later in the sub-
4241         subject string, and if there is not, it fails  the  match  immediately.         ject string, and if there is not, it fails the match immediately.  How-
4242         However, when there is no following literal this optimization cannot be         ever,  when  there  is no following literal this optimization cannot be
4243         used. You can see the difference by comparing the behaviour of         used. You can see the difference by comparing the behaviour of
4244    
4245           (a+)*\d           (a+)*\d
# Line 3506  PCRE PERFORMANCE Line 4251  PCRE PERFORMANCE
4251         In many cases, the solution to this kind of performance issue is to use         In many cases, the solution to this kind of performance issue is to use
4252         an atomic group or a possessive quantifier.         an atomic group or a possessive quantifier.
4253    
4254  Last updated: 09 September 2004  Last updated: 28 February 2005
4255  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2005 University of Cambridge.
4256  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
4257    
 PCRE(3)                                                                PCRE(3)  
4258    
4259    PCREPOSIX(3)                                                      PCREPOSIX(3)
4260    
4261    
4262  NAME  NAME
4263         PCRE - Perl-compatible regular expressions.         PCRE - Perl-compatible regular expressions.
4264    
4265    
4266  SYNOPSIS OF POSIX API  SYNOPSIS OF POSIX API
4267    
4268         #include <pcreposix.h>         #include <pcreposix.h>
# Line 3537  DESCRIPTION Line 4283  DESCRIPTION
4283    
4284         This  set  of  functions provides a POSIX-style API to the PCRE regular         This  set  of  functions provides a POSIX-style API to the PCRE regular
4285         expression package. See the pcreapi documentation for a description  of         expression package. See the pcreapi documentation for a description  of
4286         PCRE's native API, which contains additional functionality.         PCRE's native API, which contains much additional functionality.
4287    
4288         The functions described here are just wrapper functions that ultimately         The functions described here are just wrapper functions that ultimately
4289         call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the         call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
# Line 3547  DESCRIPTION Line 4293  DESCRIPTION
4293         functions call the native ones, it is also necessary to add -lpcre.         functions call the native ones, it is also necessary to add -lpcre.
4294    
4295         I have implemented only those option bits that can be reasonably mapped         I have implemented only those option bits that can be reasonably mapped
4296         to  PCRE  native  options.  In  addition,  the options REG_EXTENDED and         to PCRE native options. In addition, the option REG_EXTENDED is defined
4297         REG_NOSUB are defined with the value zero. They  have  no  effect,  but         with the value zero. This has no effect, but since  programs  that  are
4298         since  programs that are written to the POSIX interface often use them,         written  to  the  POSIX interface often use it, this makes it easier to
4299         this makes it easier to slot in PCRE as a  replacement  library.  Other         slot in PCRE as a replacement library. Other POSIX options are not even
4300         POSIX options are not even defined.         defined.
4301    
4302         When  PCRE  is  called  via these functions, it is only the API that is         When  PCRE  is  called  via these functions, it is only the API that is
4303         POSIX-like in style. The syntax and semantics of  the  regular  expres-         POSIX-like in style. The syntax and semantics of  the  regular  expres-
# Line 3576  COMPILING A PATTERN Line 4322  COMPILING A PATTERN
4322         form. The pattern is a C string terminated by a  binary  zero,  and  is         form. The pattern is a C string terminated by a  binary  zero,  and  is
4323         passed  in  the  argument  pattern. The preg argument is a pointer to a         passed  in  the  argument  pattern. The preg argument is a pointer to a
4324         regex_t structure that is used as a base for storing information  about         regex_t structure that is used as a base for storing information  about
4325         the compiled expression.         the compiled regular expression.
4326    
4327         The argument cflags is either zero, or contains one or more of the bits         The argument cflags is either zero, or contains one or more of the bits
4328         defined by the following macros:         defined by the following macros:
4329    
4330             REG_DOTALL
4331    
4332           The PCRE_DOTALL option is set when the regular expression is passed for
4333           compilation to the native function. Note that REG_DOTALL is not part of
4334           the POSIX standard.
4335    
4336           REG_ICASE           REG_ICASE
4337    
4338         The PCRE_CASELESS option is set when the expression is passed for  com-         The PCRE_CASELESS option is set when the regular expression  is  passed
4339         pilation to the native function.         for compilation to the native function.
4340    
4341           REG_NEWLINE           REG_NEWLINE
4342    
4343         The PCRE_MULTILINE option is set when the expression is passed for com-         The  PCRE_MULTILINE option is set when the regular expression is passed
4344         pilation to the native function. Note that  this  does  not  mimic  the         for compilation to the native function. Note that this does  not  mimic
4345         defined POSIX behaviour for REG_NEWLINE (see the following section).         the  defined  POSIX  behaviour  for REG_NEWLINE (see the following sec-
4346           tion).
4347    
4348             REG_NOSUB
4349    
4350           The PCRE_NO_AUTO_CAPTURE option is set when the regular  expression  is
4351           passed for compilation to the native function. In addition, when a pat-
4352           tern that is compiled with this flag is passed to regexec() for  match-
4353           ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
4354           strings are returned.
4355    
4356             REG_UTF8
4357    
4358           The PCRE_UTF8 option is set when the regular expression is  passed  for
4359           compilation  to the native function. This causes the pattern itself and
4360           all data strings used for matching it to be treated as  UTF-8  strings.
4361           Note that REG_UTF8 is not part of the POSIX standard.
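As a rough illustration of how the cflags bits are used, the sketch below compiles a pattern with REG_NOSUB (so that no captured strings are requested) plus any extra flags supplied by the caller. It assumes the system <regex.h>, which declares the same regcomp(), regexec() and regfree() interface; to use the PCRE wrapper, <pcreposix.h> would be included instead. The matches() helper is a hypothetical name, not part of either library.

```c
#include <assert.h>
#include <regex.h>   /* assumption: system header; PCRE's is <pcreposix.h> */
#include <stddef.h>

/* Hypothetical helper: returns 1 if subject matches pattern, 0 if it
   does not, and -1 on a compile error or any other matching error. */
static int matches(const char *pattern, const char *subject, int cflags)
{
    regex_t preg;
    int rc;

    assert(pattern != NULL && subject != NULL);

    /* REG_NOSUB: only success or failure is wanted, no substrings. */
    if (regcomp(&preg, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    /* With REG_NOSUB, the nmatch and pmatch arguments are ignored. */
    rc = regexec(&preg, subject, 0, NULL, 0);
    regfree(&preg);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

For example, matches("abc", "xABCx", REG_ICASE) reports a match, whereas the same call with cflags set to zero does not.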
4362    
4363         In  the  absence  of  these  flags, no options are passed to the native         In  the  absence  of  these  flags, no options are passed to the native
4364         function.  This means that the  regex  is  compiled  with  PCRE  default         function.  This means that the  regex  is  compiled  with  PCRE  default
# Line 3657  MATCHING A PATTERN Line 4425  MATCHING A PATTERN
4425         The PCRE_NOTEOL option is set when calling the underlying PCRE matching         The PCRE_NOTEOL option is set when calling the underlying PCRE matching
4426         function.         function.
4427    
4428         The  portion of the string that was matched, and also any captured sub-         If  the pattern was compiled with the REG_NOSUB flag, no data about any
4429         strings, are returned via the pmatch argument, which points to an array         matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
4430         of  nmatch  structures of type regmatch_t, containing the members rm_so         regexec() are ignored.
4431         and rm_eo. These contain the offset to the first character of each sub-  
4432         string and the offset to the first character after the end of each sub-         Otherwise, the portion of the string that was matched, and also any cap-
4433         string, respectively. The 0th element of  the  vector  relates  to  the         tured substrings, are returned via the pmatch argument, which points to
4434         entire  portion  of string that was matched; subsequent elements relate         an  array  of nmatch structures of type regmatch_t, containing the mem-
4435         to the capturing subpatterns of the regular expression. Unused  entries         bers rm_so and rm_eo. These contain the offset to the  first  character
4436         in the array have both structure members set to -1.         of  each  substring and the offset to the first character after the end
4437           of each substring, respectively. The 0th element of the vector  relates
4438           to  the  entire portion of string that was matched; subsequent elements
4439           relate to the capturing subpatterns of the regular  expression.  Unused
4440           entries in the array have both structure members set to -1.
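
The pmatch layout described above can be sketched as follows. This is a
minimal illustration, not part of the PCRE documentation: it is compiled
against the system's POSIX <regex.h> (an assumption; the pcreposix.h
header declares functions of the same names), and match_offsets is a
hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex.h>   // system POSIX interface, used here as a stand-in
#include <utility>
#include <vector>

// Hypothetical helper: returns one (rm_so, rm_eo) pair per requested slot,
// or an empty vector if the pattern does not compile or does not match.
std::vector<std::pair<int, int>> match_offsets(const char* pattern,
                                               const char* subject,
                                               std::size_t nmatch) {
  assert(pattern != nullptr && subject != nullptr);
  regex_t re;
  if (regcomp(&re, pattern, REG_EXTENDED) != 0) return {};

  std::vector<regmatch_t> pmatch(nmatch);
  std::vector<std::pair<int, int>> out;
  if (regexec(&re, subject, nmatch, pmatch.data(), 0) == 0) {
    // Element 0 is the whole match; later elements are the capturing
    // subpatterns; unused entries hold -1 in both members.
    for (const regmatch_t& m : pmatch) {
      out.emplace_back(static_cast<int>(m.rm_so), static_cast<int>(m.rm_eo));
    }
  }
  regfree(&re);
  return out;
}
```

Calling match_offsets("([a-z]+):([0-9]+)", "ruby:1234", 4) asks for one
slot more than the pattern has groups, so the last pair is the unused
entry.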
4441    
4442         A  successful  match  yields  a  zero  return;  various error codes are         A  successful  match  yields  a  zero  return;  various error codes are
4443         defined in the header file, of  which  REG_NOMATCH  is  the  "expected"         defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
# Line 3692  MEMORY USAGE Line 4464  MEMORY USAGE
4464    
4465  AUTHOR  AUTHOR
4466    
4467         Philip Hazel <ph10@cam.ac.uk>         Philip Hazel
4468         University Computing Service,         University Computing Service,
4469         Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
4470    
4471  Last updated: 07 September 2004  Last updated: 16 January 2006
4472  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
4473  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
4474    
 PCRE(3)                                                                PCRE(3)  
4475    
4476    PCRECPP(3)                                                          PCRECPP(3)
4477    
4478    
4479    NAME
4480           PCRE - Perl-compatible regular expressions.
4481    
4482    
4483    SYNOPSIS OF C++ WRAPPER
4484    
4485           #include <pcrecpp.h>
4486    
4487    
4488    DESCRIPTION
4489    
4490           The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
4491           functionality was added by Giuseppe Maxia. This brief man page was con-
4492           structed  from  the  notes  in the pcrecpp.h file, which should be con-
4493           sulted for further details.
4494    
4495    
4496    MATCHING INTERFACE
4497    
4498           The "FullMatch" operation checks that supplied text matches a  supplied
4499           pattern  exactly.  If pointer arguments are supplied, it copies matched
4500           sub-strings that match sub-patterns into them.
4501    
4502             Example: successful match
4503                pcrecpp::RE re("h.*o");
4504                re.FullMatch("hello");
4505    
4506             Example: unsuccessful match (requires full match):
4507                pcrecpp::RE re("e");
4508                !re.FullMatch("hello");
4509    
4510             Example: creating a temporary RE object:
4511                pcrecpp::RE("h.*o").FullMatch("hello");
4512    
4513           You can pass in a "const char*" or a "string" for "text". The  examples
4514           below  tend to use a const char*. You can, as in the different examples
4515           above, store the RE object explicitly in a variable or use a  temporary
4516           RE  object.  The  examples below use one mode or the other arbitrarily.
4517           Either could correctly be used for any of these examples.
4518    
4519           You must supply extra pointer arguments to extract matched subpieces.
4520    
4521             Example: extracts "ruby" into "s" and 1234 into "i"
4522                int i;
4523                string s;
4524                pcrecpp::RE re("(\\w+):(\\d+)");
4525                re.FullMatch("ruby:1234", &s, &i);
4526    
4527             Example: does not try to extract any extra sub-patterns
4528                re.FullMatch("ruby:1234", &s);
4529    
4530             Example: does not try to extract into NULL
4531                re.FullMatch("ruby:1234", NULL, &i);
4532    
4533             Example: integer overflow causes failure
4534                !re.FullMatch("ruby:1234567891234", NULL, &i);
4535    
4536             Example: fails because there aren't enough sub-patterns:
4537                !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
4538    
4539             Example: fails because string cannot be stored in integer
4540                !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
4541    
4542           The provided pointer arguments can be pointers to  any  scalar  numeric
4543           type, or one of:
4544    
4545              string        (matched piece is copied to string)
4546              StringPiece   (StringPiece is mutated to point to matched piece)
4547              T             (where "bool T::ParseFrom(const char*, int)" exists)
4548              NULL          (the corresponding matched sub-pattern is not copied)
4549    
4550           The  function returns true iff all of the following conditions are sat-
4551           isfied:
4552    
4553             a. "text" matches "pattern" exactly;
4554    
4555             b. The number of matched sub-patterns is >= number of supplied
4556                pointers;
4557    
4558             c. The "i"th argument has a suitable type for holding the
4559                string captured as the "i"th sub-pattern. If you pass in
4560                NULL for the "i"th argument, or pass fewer arguments than
4561                the number of sub-patterns, the "i"th captured sub-pattern
4562                is ignored.
4563    
4564           The matching interface supports at most 16 arguments per call.  If  you
4565           need    more,    consider    using    the    more   general   interface
4566           pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
4567    
4568    
4569    PARTIAL MATCHES
4570    
4571           You can use the "PartialMatch" operation when you want the  pattern  to
4572           match any substring of the text.
4573    
4574             Example: simple search for a string:
4575                pcrecpp::RE("ell").PartialMatch("hello");
4576    
4577             Example: find first number in a string:
4578                int number;
4579                pcrecpp::RE re("(\\d+)");
4580                re.PartialMatch("x*100 + 20", &number);
4581                assert(number == 100);
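
The difference between a full and a partial match can be tried without
the PCRE library. The sketch below uses std::regex as a stand-in
(std::regex_match for the "FullMatch" behaviour, std::regex_search for
"PartialMatch"); it is not the pcrecpp implementation, and the helper
names full_match, partial_match and first_number are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <stdexcept>
#include <string>

// True only if the pattern matches the whole of text.
bool full_match(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// True if the pattern matches any substring of text.
bool partial_match(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Stores the first run of digits found in text into *out. Returns false
// if there is no number or if it does not fit in an int.
bool first_number(const std::string& text, int* out) {
  assert(out != nullptr);
  static const std::regex digits("(\\d+)");
  std::smatch m;
  if (!std::regex_search(text, m, digits)) return false;
  try {
    *out = std::stoi(m[1].str());
  } catch (const std::out_of_range&) {
    return false;  // overflow is reported as a failed extraction
  }
  return true;
}
```

Here full_match("e", "hello") is false while partial_match("ell",
"hello") is true, which mirrors the examples above.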
4582    
4583    
4584    UTF-8 AND THE MATCHING INTERFACE
4585    
4586           By  default,  pattern  and text are plain text, one byte per character.
4587           The UTF8 flag, passed to  the  constructor,  causes  both  pattern  and
4588           string to be treated as UTF-8 text, still a byte stream but potentially
4589           multiple bytes per character. In practice, the text is likelier  to  be
4590           UTF-8  than  the pattern, but the match returned may depend on the UTF8
4591           flag, so always use it when matching UTF8 text. For example,  "."  will
4592           match  one  byte normally but with UTF8 set may match up to three bytes
4593           of a multi-byte character.
4594    
4595             Example:
4596                pcrecpp::RE_Options options;
4597                options.set_utf8();
4598                pcrecpp::RE re(utf8_pattern, options);
4599                re.FullMatch(utf8_string);
4600    
4601             Example: using the convenience function UTF8():
4602                pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
4603                re.FullMatch(utf8_string);
4604    
4605           NOTE: The UTF8 flag is ignored if pcre was not configured with the
4606                 --enable-utf8 flag.
4607    
4608    
4609    PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
4610    
4611           PCRE defines some modifiers to  change  the  behavior  of  the  regular
4612           expression   engine.  The  C++  wrapper  defines  an  auxiliary  class,
4613           RE_Options, as a vehicle to pass such modifiers to  a  RE  class.  Cur-
4614           rently, the following modifiers are supported:
4615    
4616              modifier              description               Perl equivalent
4617    
4618              PCRE_CASELESS         case insensitive match      /i
4619              PCRE_MULTILINE        multiple lines match        /m
4620              PCRE_DOTALL           dot matches newlines        /s
4621              PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
4622              PCRE_EXTRA            strict escape parsing       N/A
4623              PCRE_EXTENDED         ignore whitespace           /x
4624              PCRE_UTF8             handles UTF8 chars          built-in
4625              PCRE_UNGREEDY         reverses * and *?           N/A
4626              PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
4627    
4628           (*)  Both Perl and PCRE allow non capturing parentheses by means of the
4629           "?:" modifier within the pattern itself. e.g. (?:ab|cd) does  not  cap-
4630           ture, while (ab|cd) does.
4631    
4632           For  a  full  account of how each modifier works, please check the PCRE
4633           API reference page.
4634    
4635           For each modifier, there are two member functions whose names are  made
4636           from  the  modifier  name  in lowercase, without the "PCRE_" prefix. For
4637           instance, PCRE_CASELESS is handled by
4638    
4639             bool caseless()
4640    
4641           which returns true if the modifier is set, and
4642    
4643             RE_Options & set_caseless(bool)
4644    
4645           which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
4646           be  accessed  through  the  set_match_limit()  and match_limit() member
4647           functions. Setting match_limit to a non-zero value will limit the  exe-
4648           cution  of pcre to keep it from doing bad things like blowing the stack
4649           or taking an eternity to return a result.  A  value  of  5000  is  good
4650           enough  to stop stack blowup in a 2MB thread stack. Setting match_limit
4651           to  zero  disables  match  limiting.  Alternatively,   you   can   call
4652           match_limit_recursion()  which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to
4653           limit how much  PCRE  recurses.  match_limit()  limits  the  number  of
4654           matches PCRE does; match_limit_recursion() limits the depth of internal
4655           recursion, and therefore the amount of stack that is used.
4656    
4657           Normally, to pass one or more modifiers to a RE class,  you  declare  a
4658           RE_Options object, set the appropriate options, and pass this object to
4659           a RE constructor. Example:
4660    
4661              RE_Options opt;
4662              opt.set_caseless(true);
4663              if (RE("HELLO", opt).PartialMatch("hello world")) ...
4664    
4665           RE_Options has two constructors.  The  default  constructor  takes  no
4666           arguments  and  creates a set of flags that are all off. The other one
4667           accepts the parameter option_flags, which is there to ease the porting
4668           of legacy code from C programs.  This lets you do
4669    
4670              RE(pattern,
4671                RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
4672    
4673           However, new code is better off doing
4674    
4675              RE(pattern,
4676                RE_Options().set_caseless(true).set_multiline(true))
4677                  .PartialMatch(str);
4678    
4679           If you are going to pass one of the most used modifiers, there are some
4680           convenience functions that return a RE_Options class with the appropri-
4681           ate  modifier  already  set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
4682           and EXTENDED().
4683    
4684           If you need to set several options at once, and you don't  want  to  go
4685           to the trouble of declaring a RE_Options object and setting each option
4686           separately, there is a way to do it on the fly. You can  chain  several
4687           set_xxxxx()  member  functions,  since each of them returns a reference
4688           to its class object. For example, to pass PCRE_CASELESS,
4689           PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement,  you  may
4690           write:
4691    
4692              RE(" ^ xyz \\s+ .* blah$",
4693                RE_Options()
4694                  .set_caseless(true)
4695                  .set_extended(true)
4696                  .set_multiline(true)).PartialMatch(sometext);
4697    
4698    
4699    SCANNING TEXT INCREMENTALLY
4700    
4701           The "Consume" operation may be useful if you want to  repeatedly  match
4702           regular expressions at the front of a string and skip over them as they
4703           match. This requires use of the "StringPiece" type, which represents  a
4704           sub-range  of  a  real  string.  Like RE, StringPiece is defined in the
4705           pcrecpp namespace.
4706    
4707             Example: read lines of the form "var = value" from a string.
4708                string contents = ...;                 // Fill string somehow
4709                pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
4710    
4711                string var;
4712                int value;
4713                pcrecpp::RE re("(\\w+) = (\\d+)\n");
4714                while (re.Consume(&input, &var, &value)) {
4715                  ...;
4716                }
4717    
4718           Each successful call  to  "Consume"  will  set  "var/value",  and  also
4719           advance "input" so it points past the matched text.
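
The anchored, advancing behaviour of "Consume" can be sketched with
std::regex as a stand-in. This is an illustration only, not the pcrecpp
implementation: consume_assignment is a hypothetical helper, and a pair
of string iterators plays the role that StringPiece plays above.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: tries to match "var = value\n" exactly at *pos.
// On success it stores both captures and moves *pos past the match.
bool consume_assignment(std::string::const_iterator* pos,
                        std::string::const_iterator end,
                        std::string* var, int* value) {
  assert(pos != nullptr && var != nullptr && value != nullptr);
  static const std::regex line("(\\w+) = (\\d+)\n");
  std::smatch m;
  // match_continuous requires the match to start at *pos, which is the
  // anchoring that distinguishes Consume from FindAndConsume.
  if (!std::regex_search(*pos, end, m, line,
                         std::regex_constants::match_continuous)) {
    return false;
  }
  *var = m[1].str();
  *value = std::stoi(m[2].str());  // assumes the digits fit in an int
  *pos = m[0].second;              // skip over the consumed text
  return true;
}
```

Calling the helper in a loop reads one assignment per iteration and
stops at the first position where the pattern no longer matches.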
4720    
4721           The  "FindAndConsume"  operation  is  similar to "Consume" but does not
4722           anchor your match at the beginning of  the  string.  For  example,  you
4723           could extract all words from a string by repeatedly calling
4724    
4725             pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
4726    
4727    
4728    PARSING HEX/OCTAL/C-RADIX NUMBERS
4729    
4730           By default, if you pass a pointer to a numeric value, the corresponding
4731           text is interpreted as a base-10  number.  You  can  instead  wrap  the
4732           pointer with a call to one of the operators Hex(), Octal(), or CRadix()
4733           to interpret the text in another base. The CRadix  operator  interprets
4734           C-style  "0"  (base-8)  and  "0x"  (base-16)  prefixes, but defaults to
4735           base-10.
4736    
4737             Example:
4738               int a, b, c, d;
4739               pcrecpp::RE re("(.*) (.*) (.*) (.*)");
4740               re.FullMatch("100 40 0100 0x40",
4741                            pcrecpp::Octal(&a), pcrecpp::Hex(&b),
4742                            pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
4743    
4744           will leave 64 in a, b, c, and d.
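
The arithmetic behind that result can be checked with the standard
library alone. The sketch below uses std::strtol, where base 8
corresponds to Octal(), base 16 to Hex(), and base 0 (prefix detection)
to CRadix(). Whether pcrecpp itself is built on strtol is an assumption
here, and parse_int is a hypothetical helper name.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical helper: parses all of text in the given base.
// base 0 lets strtol pick base 8 for a "0" prefix, base 16 for "0x",
// and base 10 otherwise.
bool parse_int(const char* text, int base, long* out) {
  assert(text != nullptr && out != nullptr);
  char* end = nullptr;
  errno = 0;
  const long value = std::strtol(text, &end, base);
  // Fail on empty input, trailing junk, or a value out of range.
  if (end == text || *end != '\0' || errno == ERANGE) return false;
  *out = value;
  return true;
}
```

With this helper, "100" in base 8, "40" in base 16, and "0100" and
"0x40" in base 0 all parse to 64, matching the example above.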
4745    
4746    
4747    REPLACING PARTS OF STRINGS
4748    
4749           You can replace the first match of "pattern" in "str"  with  "rewrite".
4750           Within  "rewrite",  backslash-escaped  digits (\1 to \9) can be used to
4751           insert the text matching the corresponding parenthesized group from the
4752           pattern. \0 in "rewrite" refers to the entire matching text. For example:
4753    
4754             string s = "yabba dabba doo";
4755             pcrecpp::RE("b+").Replace("d", &s);
4756    
4757           will  leave  "s" containing "yada dabba doo". The result is true if the
4758           pattern matches and a replacement occurs, false otherwise.
4759    
4760           GlobalReplace is like Replace except that it replaces  all  occurrences
4761           of  the  pattern  in  the string with the rewrite. Replacements are not
4762           subject to re-matching. For example:
4763    
4764             string s = "yabba dabba doo";
4765             pcrecpp::RE("b+").GlobalReplace("d", &s);
4766    
4767           will leave "s" containing "yada dada doo". It  returns  the  number  of
4768           replacements made.
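
The first-match versus all-matches distinction can be reproduced with
std::regex as a stand-in. This is a sketch, not the pcrecpp
implementation: replace_first and replace_all are hypothetical helpers,
and std::regex rewrite strings use $1 rather than \1 for group
references.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Replaces only the first match; returns true if a replacement occurred.
bool replace_first(const std::string& pattern, const std::string& rewrite,
                   std::string* s) {
  assert(s != nullptr);
  const std::regex re(pattern);
  if (!std::regex_search(*s, re)) return false;
  *s = std::regex_replace(*s, re, rewrite,
                          std::regex_constants::format_first_only);
  return true;
}

// Replaces every match; returns the number of replacements made.
int replace_all(const std::string& pattern, const std::string& rewrite,
                std::string* s) {
  assert(s != nullptr);
  const std::regex re(pattern);
  // Count the matches before rewriting the string.
  const int count = static_cast<int>(std::distance(
      std::sregex_iterator(s->begin(), s->end(), re),
      std::sregex_iterator()));
  *s = std::regex_replace(*s, re, rewrite);
  return count;
}
```

Applied to "yabba dabba doo" with the pattern "b+" and the rewrite "d",
the two helpers produce the same strings as the Replace and
GlobalReplace examples above.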
4769    
4770           Extract  is like Replace, except that if the pattern matches, "rewrite"
4771           is copied into "out" (an additional argument) with substitutions.   The
4772           non-matching  portions  of "text" are ignored. Returns true iff a match
4773           occurred and the extraction happened successfully;  if no match occurs,
4774           the string is left unaffected.
4775    
4776    
4777    AUTHOR
4778    
4779           The C++ wrapper was contributed by Google Inc.
4780           Copyright (c) 2005 Google Inc.
4781    ------------------------------------------------------------------------------
4782    
4783    
4784    PCRESAMPLE(3)                                                    PCRESAMPLE(3)
4785    
4786    
4787  NAME  NAME
4788         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
4789    
4790    
4791  PCRE SAMPLE PROGRAM  PCRE SAMPLE PROGRAM
4792    
4793         A simple, complete demonstration program, to get you started with using         A simple, complete demonstration program, to get you started with using
# Line 3765  PCRE SAMPLE PROGRAM Line 4846  PCRE SAMPLE PROGRAM
4846    
4847  Last updated: 09 September 2004  Last updated: 09 September 2004
4848  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
4849  -----------------------------------------------------------------------------  ------------------------------------------------------------------------------
   
