Diff of /code/trunk/doc/pcre.txt

revision 211 by ph10, Thu Aug 9 09:52:43 2007 UTC revision 429 by ph10, Tue Sep 1 16:10:16 2009 UTC
# Line 2  Line 2 
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
5  synopses of each function in the library have not been included. There are  synopses of each function in the library have not been included. Neither has
6  separate text files for the pcregrep and pcretest commands.  the pcredemo program. There are separate text files for the pcregrep and
7    pcretest commands.
8  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
9    
10    
# Line 18  INTRODUCTION Line 19  INTRODUCTION
19    
20         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
21         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
22         just  a  few differences. (Certain features that appeared in Python and         just  a  few  differences. Certain features that appeared in Python and
23         PCRE before they appeared in Perl are also available using  the  Python         PCRE before they appeared in Perl are also available using  the  Python
24         syntax.)         syntax.  There is also some support for certain .NET and Oniguruma syn-
25           tax items, and there is an option for  requesting  some  minor  changes
26           that give better JavaScript compatibility.
27    
28         The  current  implementation of PCRE (release 7.x) corresponds approxi-         The  current implementation of PCRE (release 8.xx) corresponds approxi-
29         mately with Perl 5.10, including support for UTF-8 encoded strings  and         mately with Perl 5.10, including support for UTF-8 encoded strings  and
30         Unicode general category properties. However, UTF-8 and Unicode support         Unicode general category properties. However, UTF-8 and Unicode support
31         has to be explicitly enabled; it is not the default. The Unicode tables         has to be explicitly enabled; it is not the default. The Unicode tables
32         correspond to Unicode release 5.0.0.         correspond to Unicode release 5.1.
33    
34         In  addition to the Perl-compatible matching function, PCRE contains an         In  addition to the Perl-compatible matching function, PCRE contains an
35         alternative matching function that matches the same  compiled  patterns         alternative matching function that matches the same  compiled  patterns
# Line 69  USER DOCUMENTATION Line 72  USER DOCUMENTATION
72         The user documentation for PCRE comprises a number  of  different  sec-         The user documentation for PCRE comprises a number  of  different  sec-
73         tions.  In the "man" format, each of these is a separate "man page". In         tions.  In the "man" format, each of these is a separate "man page". In
74         the HTML format, each is a separate page, linked from the  index  page.         the HTML format, each is a separate page, linked from the  index  page.
75         In  the  plain text format, all the sections are concatenated, for ease         In  the  plain  text format, all the sections, except the pcredemo sec-
76         of searching. The sections are as follows:         tion, are concatenated, for ease of searching. The sections are as fol-
77           lows:
78    
79           pcre              this document           pcre              this document
80           pcre-config       show PCRE installation configuration information           pcre-config       show PCRE installation configuration information
# Line 79  USER DOCUMENTATION Line 83  USER DOCUMENTATION
83           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
84           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
85           pcrecpp           details of the C++ wrapper           pcrecpp           details of the C++ wrapper
86             pcredemo          a demonstration C program that uses PCRE
87           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
88           pcrematching      discussion of the two matching algorithms           pcrematching      discussion of the two matching algorithms
89           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
# Line 88  USER DOCUMENTATION Line 93  USER DOCUMENTATION
93           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
94           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API
95           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
96           pcresample        discussion of the sample program           pcresample        discussion of the pcredemo program
97           pcrestack         discussion of stack usage           pcrestack         discussion of stack usage
98           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
99    
# Line 134  UTF-8 AND UNICODE PROPERTY SUPPORT Line 139  UTF-8 AND UNICODE PROPERTY SUPPORT
139    
140         In  order  to process UTF-8 strings, you must build PCRE to include UTF-8         In  order  to process UTF-8 strings, you must build PCRE to include UTF-8
141         support in the code, and, in addition,  you  must  call  pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
142         with  the PCRE_UTF8 option flag. When you do this, both the pattern and         with  the  PCRE_UTF8  option  flag,  or the pattern must start with the
143         any subject strings that are matched against it are  treated  as  UTF-8         sequence (*UTF8). When either of these is the case,  both  the  pattern
144         strings instead of just strings of bytes.         and  any  subject  strings  that  are matched against it are treated as
145           UTF-8 strings instead of just strings of bytes.
146    
147         If  you compile PCRE with UTF-8 support, but do not use it at run time,         If you compile PCRE with UTF-8 support, but do not use it at run  time,
148         the library will be a bit bigger, but the additional run time  overhead         the  library will be a bit bigger, but the additional run time overhead
149         is limited to testing the PCRE_UTF8 flag occasionally, so should not be         is limited to testing the PCRE_UTF8 flag occasionally, so should not be
150         very big.         very big.
151    
152         If PCRE is built with Unicode character property support (which implies         If PCRE is built with Unicode character property support (which implies
153         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-         UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-
154         ported.  The available properties that can be tested are limited to the         ported.  The available properties that can be tested are limited to the
155         general  category  properties such as Lu for an upper case letter or Nd         general category properties such as Lu for an upper case letter  or  Nd
156         for a decimal number, the Unicode script names such as Arabic  or  Han,         for  a  decimal number, the Unicode script names such as Arabic or Han,
157         and  the  derived  properties  Any  and L&. A full list is given in the         and the derived properties Any and L&. A full  list  is  given  in  the
158         pcrepattern documentation. Only the short names for properties are sup-         pcrepattern documentation. Only the short names for properties are sup-
159         ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-         ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let-
160         ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may         ter},  is  not  supported.   Furthermore,  in Perl, many properties may
161         optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE         optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE
162         does not support this.         does not support this.
163    
164     Validity of UTF-8 strings     Validity of UTF-8 strings
165    
166         When you set the PCRE_UTF8 flag, the strings  passed  as  patterns  and         When  you  set  the  PCRE_UTF8 flag, the strings passed as patterns and
167         subjects are (by default) checked for validity on entry to the relevant         subjects are (by default) checked for validity on entry to the relevant
168         functions. From release 7.3 of PCRE, the check is according to the rules         functions.  From release 7.3 of PCRE, the check is according to the rules
169         of  RFC  3629, which are themselves derived from the Unicode specifica-         of RFC 3629, which are themselves derived from the  Unicode  specifica-
170         tion. Earlier releases of PCRE followed the rules of  RFC  2279,  which         tion.  Earlier  releases  of PCRE followed the rules of RFC 2279, which
171         allows  the  full range of 31-bit values (0 to 0x7FFFFFFF). The current         allows the full range of 31-bit values (0 to 0x7FFFFFFF).  The  current
172         check allows only values in the range U+0 to U+10FFFF, excluding U+D800         check allows only values in the range U+0 to U+10FFFF, excluding U+D800
173         to U+DFFF.         to U+DFFF.
174    
175         The  excluded  code  points are the "Low Surrogate Area" of Unicode, of         The excluded code points are the "Low Surrogate Area"  of  Unicode,  of
176         which the Unicode Standard says this: "The Low Surrogate Area does  not         which  the Unicode Standard says this: "The Low Surrogate Area does not
177         contain  any  character  assignments,  consequently  no  character code         contain any  character  assignments,  consequently  no  character  code
178         charts or namelists are provided for this area. Surrogates are reserved         charts or namelists are provided for this area. Surrogates are reserved
179         for  use  with  UTF-16 and then must be used in pairs." The code points         for use with UTF-16 and then must be used in pairs."  The  code  points
180         that are encoded by UTF-16 pairs  are  available  as  independent  code         that  are  encoded  by  UTF-16  pairs are available as independent code
181         points  in  the  UTF-8  encoding.  (In other words, the whole surrogate         points in the UTF-8 encoding. (In  other  words,  the  whole  surrogate
182         thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)         thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
183    
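The validity rule described above (code points U+0 to U+10FFFF, with U+D800 to U+DFFF excluded) can be illustrated with a small stand-alone check. This is only a sketch and not PCRE's own validation code; the function name utf8_is_valid_rfc3629 is hypothetical. It also rejects overlong encodings, which RFC 3629 requires but which the text above does not discuss.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if the buffer is well-formed UTF-8 whose code points lie in
   U+0..U+10FFFF excluding U+D800..U+DFFF, and 0 otherwise. */
int utf8_is_valid_rfc3629(const unsigned char *s, size_t len)
{
    size_t i = 0, k, extra;
    unsigned long cp, min;

    while (i < len) {
        unsigned char c = s[i];

        if (c < 0x80) { i++; continue; }          /* single-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;     /* stray continuation byte, or a 5/6-byte lead */

        if (len - i - 1 < extra) return 0;        /* truncated sequence */

        for (k = 1; k <= extra; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;     /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(t & 0x3F);
        }

        if (cp < min) return 0;                   /* overlong encoding */
        if (cp > 0x10FFFFUL) return 0;            /* beyond U+10FFFF */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;  /* surrogate */

        i += extra + 1;
    }
    return 1;
}
```

A caller that has already run a check of this kind on its data is in the situation where skipping PCRE's own check may be reasonable.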
184         If an  invalid  UTF-8  string  is  passed  to  PCRE,  an  error  return         If  an  invalid  UTF-8  string  is  passed  to  PCRE,  an  error return
185         (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know         (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
186         that your strings are valid, and therefore want to skip these checks in         that your strings are valid, and therefore want to skip these checks in
187         order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at         order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
188         compile time or at run time, PCRE assumes that the pattern  or  subject         compile  time  or at run time, PCRE assumes that the pattern or subject
189         it  is  given  (respectively)  contains only valid UTF-8 codes. In this         it is given (respectively) contains only valid  UTF-8  codes.  In  this
190         case, it does not diagnose an invalid UTF-8 string.         case, it does not diagnose an invalid UTF-8 string.
191    
192         If you pass an invalid UTF-8 string  when  PCRE_NO_UTF8_CHECK  is  set,         If  you  pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
193         what  happens  depends on why the string is invalid. If the string con-         what happens depends on why the string is invalid. If the  string  con-
194         forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a         forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
195         string  of  characters  in  the  range 0 to 0x7FFFFFFF. In other words,         string of characters in the range 0  to  0x7FFFFFFF.  In  other  words,
196         apart from the initial validity test, PCRE (when in UTF-8 mode) handles         apart from the initial validity test, PCRE (when in UTF-8 mode) handles
197         strings  according  to  the more liberal rules of RFC 2279. However, if         strings according to the more liberal rules of RFC  2279.  However,  if
198         the string does not even conform to RFC 2279, the result is  undefined.         the  string does not even conform to RFC 2279, the result is undefined.
199         Your program may crash.         Your program may crash.
200    
201         If  you  want  to  process  strings  of  values  in the full range 0 to         If you want to process strings  of  values  in  the  full  range  0  to
202         0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you  can         0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can
203         set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in         set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
204         this situation, you will have to apply your own validity check.         this situation, you will have to apply your own validity check.
205    
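As an illustration of the kind of check an application might apply itself in that situation, the following sketch tests only the lead-byte and continuation-byte structure of the old RFC 2279 scheme (sequences of 1 to 6 bytes, covering 0 to 0x7FFFFFFF). The function name utf8_count_rfc2279 is hypothetical, and the sketch assumes that overlong forms are acceptable to the caller; a stricter check would reject them.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns the number of characters in the buffer if every sequence has
   the RFC 2279 structure (1 to 6 bytes), or -1 if the structure is
   broken. Overlong forms are NOT rejected in this sketch. */
long utf8_count_rfc2279(const unsigned char *s, size_t len)
{
    size_t i = 0, k, extra;
    long count = 0;

    while (i < len) {
        unsigned char c = s[i];

        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else if ((c & 0xFC) == 0xF8) extra = 4;
        else if ((c & 0xFE) == 0xFC) extra = 5;
        else return -1;    /* 0x80..0xBF as a lead byte, or 0xFE/0xFF */

        if (len - i - 1 < extra) return -1;       /* truncated sequence */

        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return -1;

        i += extra + 1;
        count++;
    }
    return count;
}
```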
206     General comments about UTF-8 mode     General comments about UTF-8 mode
207    
208         1. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a         1.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a
209         two-byte UTF-8 character if the value is greater than 127.         two-byte UTF-8 character if the value is greater than 127.
210    
211         2.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8         2. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8
212         characters for values greater than \177.         characters for values greater than \177.
213    
214         3. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-         3.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-
215         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
216    
217         4.  The dot metacharacter matches one UTF-8 character instead of a sin-         4. The dot metacharacter matches one UTF-8 character instead of a  sin-
218         gle byte.         gle byte.
219    
220         5. The escape sequence \C can be used to match a single byte  in  UTF-8         5.  The  escape sequence \C can be used to match a single byte in UTF-8
221         mode,  but  its  use can lead to some strange effects. This facility is         mode, but its use can lead to some strange effects.  This  facility  is
222         not available in the alternative matching function, pcre_dfa_exec().         not available in the alternative matching function, pcre_dfa_exec().
223    
224         6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly         6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
225         test  characters of any code value, but the characters that PCRE recog-         test characters of any code value, but the characters that PCRE  recog-
226         nizes as digits, spaces, or word characters  remain  the  same  set  as         nizes  as  digits,  spaces,  or  word characters remain the same set as
227         before, all with values less than 256. This remains true even when PCRE         before, all with values less than 256. This remains true even when PCRE
228         includes Unicode property support, because to do otherwise  would  slow         includes  Unicode  property support, because to do otherwise would slow
229         down  PCRE in many common cases. If you really want to test for a wider         down PCRE in many common cases. If you really want to test for a  wider
230         sense of, say, "digit", you must use Unicode  property  tests  such  as         sense  of,  say,  "digit",  you must use Unicode property tests such as
231         \p{Nd}.         \p{Nd}. Note that this also applies to \b, because  it  is  defined  in
232           terms of \w and \W.
233    
234         7.  Similarly,  characters that match the POSIX named character classes         7.  Similarly,  characters that match the POSIX named character classes
235         are all low-valued characters.         are all low-valued characters.
# Line 256  AUTHOR Line 263  AUTHOR
263    
264  REVISION  REVISION
265    
266         Last updated: 09 August 2007         Last updated: 01 September 2009
267         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
268  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
269    
270    
271  PCREBUILD(3)                                                      PCREBUILD(3)  PCREBUILD(3)                                                      PCREBUILD(3)
272    
273    
# Line 271  NAME Line 278  NAME
278  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
279    
280         This  document  describes  the  optional  features  of PCRE that can be         This  document  describes  the  optional  features  of PCRE that can be
281         selected when the library is compiled. They are all selected, or  dese-         selected when the library is compiled. It assumes use of the  configure
282         lected, by providing options to the configure script that is run before         script,  where the optional features are selected or deselected by pro-
283         the make command. The complete list of  options  for  configure  (which         viding options to configure before running the make  command.  However,
284         includes  the  standard  ones such as the selection of the installation         the  same  options  can be selected in both Unix-like and non-Unix-like
285         directory) can be obtained by running         environments using the GUI facility of  CMakeSetup  if  you  are  using
286           CMake instead of configure to build PCRE.
287    
288           The complete list of options for configure (which includes the standard
289           ones such as the  selection  of  the  installation  directory)  can  be
290           obtained by running
291    
292           ./configure --help           ./configure --help
293    
294         The following sections include  descriptions  of  options  whose  names         The  following  sections  include  descriptions  of options whose names
295         begin with --enable or --disable. These settings specify changes to the         begin with --enable or --disable. These settings specify changes to the
296         defaults for the configure command. Because of the way  that  configure         defaults  for  the configure command. Because of the way that configure
297         works,  --enable  and --disable always come in pairs, so the complemen-         works, --enable and --disable always come in pairs, so  the  complemen-
298         tary option always exists as well, but as it specifies the default,  it         tary  option always exists as well, but as it specifies the default, it
299         is not described.         is not described.
300    
301    
# Line 300  C++ SUPPORT Line 312  C++ SUPPORT
312    
313  UTF-8 SUPPORT  UTF-8 SUPPORT
314    
315         To build PCRE with support for UTF-8 character strings, add         To build PCRE with support for UTF-8 Unicode character strings, add
316    
317           --enable-utf8           --enable-utf8
318    
319         to  the  configure  command.  Of  itself, this does not make PCRE treat         to the configure command. Of itself, this  does  not  make  PCRE  treat
320         strings as UTF-8. As well as compiling PCRE with this option, you  also         strings  as UTF-8. As well as compiling PCRE with this option, you also
321         have  to  set the PCRE_UTF8 option when you call the pcre_compile()         have to set the PCRE_UTF8 option when you call the  pcre_compile()
322         function.         function.
323    
324           If  you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
325           expects its input to be either ASCII or UTF-8 (depending on the runtime
326           option).  It  is not possible to support both EBCDIC and UTF-8 codes in
327           the same  version  of  the  library.  Consequently,  --enable-utf8  and
328           --enable-ebcdic are mutually exclusive.
329    
330    
331  UNICODE CHARACTER PROPERTY SUPPORT  UNICODE CHARACTER PROPERTY SUPPORT
332    
333         UTF-8 support allows PCRE to process character values greater than  255         UTF-8  support allows PCRE to process character values greater than 255
334         in  the  strings that it handles. On its own, however, it does not pro-         in the strings that it handles. On its own, however, it does  not  pro-
335         vide any facilities for accessing the properties of such characters. If         vide any facilities for accessing the properties of such characters. If
336         you  want  to  be able to use the pattern escapes \P, \p, and \X, which         you want to be able to use the pattern escapes \P, \p,  and  \X,  which
337         refer to Unicode character properties, you must add         refer to Unicode character properties, you must add
338    
339           --enable-unicode-properties           --enable-unicode-properties
340    
341         to the configure command. This implies UTF-8 support, even if you  have         to  the configure command. This implies UTF-8 support, even if you have
342         not explicitly requested it.         not explicitly requested it.
343    
344         Including  Unicode  property  support  adds around 30K of tables to the         Including Unicode property support adds around 30K  of  tables  to  the
345         PCRE library. Only the general category properties such as  Lu  and  Nd         PCRE  library.  Only  the general category properties such as Lu and Nd
346         are supported. Details are given in the pcrepattern documentation.         are supported. Details are given in the pcrepattern documentation.
347    
348    
349  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
350    
351         By  default,  PCRE interprets character 10 (linefeed, LF) as indicating         By default, PCRE interprets the linefeed (LF) character  as  indicating
352         the end of a line. This is the normal newline  character  on  Unix-like         the  end  of  a line. This is the normal newline character on Unix-like
353         systems. You can compile PCRE to use character 13 (carriage return, CR)         systems. You can compile PCRE to use carriage return (CR)  instead,  by
354         instead, by adding         adding
355    
356           --enable-newline-is-cr           --enable-newline-is-cr
357    
358         to the  configure  command.  There  is  also  a  --enable-newline-is-lf         to  the  configure  command.  There  is  also  a --enable-newline-is-lf
359         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
360    
361         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 349  CODE VALUE OF NEWLINE Line 367  CODE VALUE OF NEWLINE
367    
368           --enable-newline-is-anycrlf           --enable-newline-is-anycrlf
369    
370         which causes PCRE to recognize any of the three sequences  CR,  LF,  or         which  causes  PCRE  to recognize any of the three sequences CR, LF, or
371         CRLF as indicating a line ending. Finally, a fifth option, specified by         CRLF as indicating a line ending. Finally, a fifth option, specified by
372    
373           --enable-newline-is-any           --enable-newline-is-any
# Line 361  CODE VALUE OF NEWLINE Line 379  CODE VALUE OF NEWLINE
379         conventional to use the standard for your operating system.         conventional to use the standard for your operating system.
380    
381    
382    WHAT \R MATCHES
383    
384           By default, the sequence \R in a pattern matches  any  Unicode  newline
385           sequence,  whatever  has  been selected as the line ending sequence. If
386           you specify
387    
388             --enable-bsr-anycrlf
389    
390           the default is changed so that \R matches only CR, LF, or  CRLF.  What-
391           ever  is selected when PCRE is built can be overridden when the library
392           functions are called.
393    
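The restricted meaning of \R (only CR, LF, or CRLF) can be pictured with a small stand-alone function. This is a sketch of the rule as stated above, not PCRE's implementation; the name anycrlf_length is hypothetical, and the code assumes an ASCII-based environment in which '\r' and '\n' are CR and LF.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns the length in bytes of the newline sequence that starts at
   s[pos] under the CR, LF, or CRLF rule, or 0 if none starts there.
   CRLF is treated as a single sequence of length 2. */
size_t anycrlf_length(const char *s, size_t len, size_t pos)
{
    if (pos >= len) return 0;
    if (s[pos] == '\n') return 1;                  /* LF */
    if (s[pos] == '\r')                            /* CR or CRLF */
        return (pos + 1 < len && s[pos + 1] == '\n') ? 2 : 1;
    return 0;
}
```

Treating CRLF as one unit rather than as CR followed by LF is the design point: a caller advancing through the subject skips both bytes at once.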
394    
395  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
396    
397         The PCRE building process uses libtool to build both shared and  static         The PCRE building process uses libtool to build both shared and  static
# Line 496  USING EBCDIC CODE Line 527  USING EBCDIC CODE
527    
528         to the configure command. This setting implies --enable-rebuild-charta-         to the configure command. This setting implies --enable-rebuild-charta-
529         bles.  You  should  only  use  it if you know that you are in an EBCDIC         bles.  You  should  only  use  it if you know that you are in an EBCDIC
530         environment (for example, an IBM mainframe operating system).         environment (for example,  an  IBM  mainframe  operating  system).  The
531           --enable-ebcdic option is incompatible with --enable-utf8.
532    
533    
534    PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
535    
536           By default, pcregrep reads all files as plain text. You can build it so
537           that it recognizes files whose names end in .gz or .bz2, and reads them
538           with libz or libbz2, respectively, by adding one or both of
539    
540             --enable-pcregrep-libz
541             --enable-pcregrep-libbz2
542    
543           to the configure command. These options naturally require that the rel-
544           evant libraries are installed on your system. Configuration  will  fail
545           if they are not.
546    
547    
548    PCRETEST OPTION FOR LIBREADLINE SUPPORT
549    
550           If you add
551    
552             --enable-pcretest-libreadline
553    
554           to  the  configure  command,  pcretest  is  linked with the libreadline
555           library, and when its input is from a terminal, it reads it  using  the
556           readline() function. This provides line-editing and history facilities.
557           Note that libreadline is GPL-licenced, so if you distribute a binary of
558           pcretest linked in this way, there may be licensing issues.
559    
560           Setting  this  option  causes  the -lreadline option to be added to the
561           pcretest build. In many operating environments with  a  system-installed
562           libreadline this is sufficient. However, in some environments (e.g.  if
563           an unmodified distribution version of readline is in use),  some  extra
564           configuration  may  be necessary. The INSTALL file for libreadline says
565           this:
566    
567             "Readline uses the termcap functions, but does not link with the
568             termcap or curses library itself, allowing applications which link
569             with readline the to choose an appropriate library."
570    
571           If your environment has not been set up so that an appropriate  library
572           is automatically included, you may need to add something like
573    
574             LIBS="-lncurses"
575    
576           immediately before the configure command.
577    
578    
579  SEE ALSO  SEE ALSO
# Line 513  AUTHOR Line 590  AUTHOR
590    
591  REVISION  REVISION
592    
593         Last updated: 30 July 2007         Last updated: 17 March 2009
594         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
595  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
596    
597    
598  PCREMATCHING(3)                                                PCREMATCHING(3)  PCREMATCHING(3)                                                PCREMATCHING(3)
599    
600    
# Line 662  THE ALTERNATIVE MATCHING ALGORITHM Line 739  THE ALTERNATIVE MATCHING ALGORITHM
739         tive algorithm moves through the subject  string  one  character  at  a         tive algorithm moves through the subject  string  one  character  at  a
740         time, for all active paths through the tree.         time, for all active paths through the tree.
741    
742         8.  None  of  the  backtracking control verbs such as (*PRUNE) are sup-         8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
743         ported.         are not supported. (*FAIL) is supported, and  behaves  like  a  failing
744           negative assertion.
745    
746    
747  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
748    
749         Using the alternative matching algorithm provides the following  advan-         Using  the alternative matching algorithm provides the following advan-
750         tages:         tages:
751    
752         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
753         ically found, and in particular, the longest match is  found.  To  find         ically  found,  and  in particular, the longest match is found. To find
754         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
755         things with callouts.         things with callouts.
756    
757         2. There is much better support for partial matching. The  restrictions         2.  Because  the  alternative  algorithm  scans the subject string just
758         on  the content of the pattern that apply when using the standard algo-         once, and never needs to backtrack, it is possible to  pass  very  long
759         rithm for partial matching do not apply to the  alternative  algorithm.         subject  strings  to  the matching function in several pieces, checking
        For  non-anchored patterns, the starting position of a partial match is  
        available.  
   
        3. Because the alternative algorithm  scans  the  subject  string  just  
        once,  and  never  needs to backtrack, it is possible to pass very long  
        subject strings to the matching function in  several  pieces,  checking  
760         for partial matching each time.         for partial matching each time.
761    
762    
# Line 692  DISADVANTAGES OF THE ALTERNATIVE ALGORIT Line 764  DISADVANTAGES OF THE ALTERNATIVE ALGORIT
764    
765         The alternative algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
766    
767         1.  It  is  substantially  slower  than the standard algorithm. This is         1. It is substantially slower than  the  standard  algorithm.  This  is
768         partly because it has to search for all possible matches, but  is  also         partly  because  it has to search for all possible matches, but is also
769         because it is less susceptible to optimization.         because it is less susceptible to optimization.
770    
771         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
# Line 711  AUTHOR Line 783  AUTHOR
783    
784  REVISION  REVISION
785    
786         Last updated: 08 August 2007         Last updated: 25 August 2009
787         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
788  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
789    
790    
791  PCREAPI(3)                                                          PCREAPI(3)  PCREAPI(3)                                                          PCREAPI(3)
792    
793    
# Line 823  PCRE API OVERVIEW Line 895  PCRE API OVERVIEW
895         pcre_exec() are used for compiling and matching regular expressions  in         pcre_exec() are used for compiling and matching regular expressions  in
896         a  Perl-compatible  manner. A sample program that demonstrates the sim-         a  Perl-compatible  manner. A sample program that demonstrates the sim-
897         plest way of using them is provided in the file  called  pcredemo.c  in         plest way of using them is provided in the file  called  pcredemo.c  in
898         the  source distribution. The pcresample documentation describes how to         the PCRE source distribution. A listing of this program is given in the
899         run it.         pcredemo documentation, and the pcresample documentation describes  how
900           to compile and run it.
901    
902         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
903         ble,  is  also provided. This uses a different algorithm for the match-         ble, is also provided. This uses a different algorithm for  the  match-
904         ing. The alternative algorithm finds all possible matches (at  a  given         ing.  The  alternative algorithm finds all possible matches (at a given
905         point  in  the subject), and scans the subject just once. However, this         point in the subject), and scans the subject just once.  However,  this
906         algorithm does not return captured substrings. A description of the two         algorithm does not return captured substrings. A description of the two
907         matching  algorithms and their advantages and disadvantages is given in         matching algorithms and their advantages and disadvantages is given  in
908         the pcrematching documentation.         the pcrematching documentation.
909    
910         In addition to the main compiling and  matching  functions,  there  are         In  addition  to  the  main compiling and matching functions, there are
911         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
912         string that is matched by pcre_exec(). They are:         string that is matched by pcre_exec(). They are:
913    
# Line 849  PCRE API OVERVIEW Line 922  PCRE API OVERVIEW
922         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
923         to free the memory used for extracted strings.         to free the memory used for extracted strings.
924    
925         The  function  pcre_maketables()  is  used  to build a set of character         The function pcre_maketables() is used to  build  a  set  of  character
926         tables  in  the  current  locale   for   passing   to   pcre_compile(),         tables   in   the   current   locale  for  passing  to  pcre_compile(),
927         pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is         pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is
928         provided for specialist use.  Most  commonly,  no  special  tables  are         provided  for  specialist  use.  Most  commonly,  no special tables are
929         passed,  in  which case internal tables that are generated when PCRE is         passed, in which case internal tables that are generated when  PCRE  is
930         built are used.         built are used.
931    
932         The function pcre_fullinfo() is used to find out  information  about  a         The  function  pcre_fullinfo()  is used to find out information about a
933         compiled  pattern; pcre_info() is an obsolete version that returns only         compiled pattern; pcre_info() is an obsolete version that returns  only
934         some of the available information, but is retained for  backwards  com-         some  of  the available information, but is retained for backwards com-
935         patibility.   The function pcre_version() returns a pointer to a string         patibility.  The function pcre_version() returns a pointer to a  string
936         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
937    
938         The function pcre_refcount() maintains a  reference  count  in  a  data         The  function  pcre_refcount()  maintains  a  reference count in a data
939         block  containing  a compiled pattern. This is provided for the benefit         block containing a compiled pattern. This is provided for  the  benefit
940         of object-oriented applications.         of object-oriented applications.
941    
942         The global variables pcre_malloc and pcre_free  initially  contain  the         The  global  variables  pcre_malloc and pcre_free initially contain the
943         entry  points  of  the  standard malloc() and free() functions, respec-         entry points of the standard malloc()  and  free()  functions,  respec-
944         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
945         so  a  calling  program  can replace them if it wishes to intercept the         so a calling program can replace them if it  wishes  to  intercept  the
946         calls. This should be done before calling any PCRE functions.         calls. This should be done before calling any PCRE functions.
947    
948         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
949         indirections  to  memory  management functions. These special functions         indirections to memory management functions.  These  special  functions
950         are used only when PCRE is compiled to use  the  heap  for  remembering         are  used  only  when  PCRE is compiled to use the heap for remembering
951         data, instead of recursive function calls, when running the pcre_exec()         data, instead of recursive function calls, when running the pcre_exec()
952         function. See the pcrebuild documentation for  details  of  how  to  do         function.  See  the  pcrebuild  documentation  for details of how to do
953         this.  It  is  a non-standard way of building PCRE, for use in environ-         this. It is a non-standard way of building PCRE, for  use  in  environ-
954         ments that have limited stacks. Because of the greater  use  of  memory         ments  that  have  limited stacks. Because of the greater use of memory
955         management,  it  runs  more  slowly. Separate functions are provided so         management, it runs more slowly. Separate  functions  are  provided  so
956         that special-purpose external code can be  used  for  this  case.  When         that  special-purpose  external  code  can  be used for this case. When
957         used,  these  functions  are always called in a stack-like manner (last         used, these functions are always called in a  stack-like  manner  (last
958         obtained, first freed), and always for memory blocks of the same  size.         obtained,  first freed), and always for memory blocks of the same size.
959         There  is  a discussion about PCRE's stack usage in the pcrestack docu-         There is a discussion about PCRE's stack usage in the  pcrestack  docu-
960         mentation.         mentation.
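
The guarantee described above (blocks are released in last-obtained,
first-freed order and are always the same size) means that a replacement
pair of functions can simply cache released blocks for reuse. The
following is a minimal sketch under that assumption, not code from PCRE.
The names my_stack_malloc, my_stack_free, and my_stack_fresh_allocations
are hypothetical, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cache of released blocks, kept as a singly linked list. */
typedef struct cached_block {
    struct cached_block *next;
} cached_block;

static cached_block *free_list = NULL;      /* released blocks, ready for reuse */
static size_t block_size = 0;               /* size fixed by the first request  */
static unsigned long fresh_allocations = 0; /* number of real malloc() calls    */

/* Same shape as the function that pcre_stack_malloc points to. */
void *my_stack_malloc(size_t size)
{
    /* A cached block must be able to hold the list link. */
    size_t needed = size < sizeof(cached_block) ? sizeof(cached_block) : size;

    if (block_size == 0)
        block_size = needed;                /* first request fixes the size */
    assert(needed == block_size);           /* relies on "always the same size" */

    if (free_list != NULL) {                /* reuse the most recently freed block */
        cached_block *b = free_list;
        free_list = b->next;
        return b;
    }
    fresh_allocations++;
    return malloc(block_size);
}

/* Same shape as the function that pcre_stack_free points to. */
void my_stack_free(void *block)
{
    cached_block *b = block;
    if (b == NULL)
        return;
    b->next = free_list;                    /* push onto the cache, do not free() */
    free_list = b;
}

/* Reports how many blocks actually came from malloc(). */
unsigned long my_stack_fresh_allocations(void)
{
    return fresh_allocations;
}
```

A program built against a heap-using PCRE would install the pair by
assigning my_stack_malloc to pcre_stack_malloc and my_stack_free to
pcre_stack_free before calling any other PCRE function. Because these
variables are shared by all threads, a multithreaded program would need
to add its own locking or use per-thread caches.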
961    
962         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
963         by  the  caller  to  a "callout" function, which PCRE will then call at         by the caller to a "callout" function, which PCRE  will  then  call  at
964         specified points during a matching operation. Details are given in  the         specified  points during a matching operation. Details are given in the
965         pcrecallout documentation.         pcrecallout documentation.
966    
967    
968  NEWLINES  NEWLINES
969    
970         PCRE  supports five different conventions for indicating line breaks in         PCRE supports five different conventions for indicating line breaks  in
971         strings: a single CR (carriage return) character, a  single  LF  (line-         strings:  a  single  CR (carriage return) character, a single LF (line-
972         feed) character, the two-character sequence CRLF, any of the three pre-         feed) character, the two-character sequence CRLF, any of the three pre-
973         ceding, or any Unicode newline sequence. The Unicode newline  sequences         ceding,  or any Unicode newline sequence. The Unicode newline sequences
974         are  the  three just mentioned, plus the single characters VT (vertical         are the three just mentioned, plus the single characters  VT  (vertical
975         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line         tab,  U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
976         separator, U+2028), and PS (paragraph separator, U+2029).         separator, U+2028), and PS (paragraph separator, U+2029).
977    
978         Each  of  the first three conventions is used by at least one operating         Each of the first three conventions is used by at least  one  operating
979         system as its standard newline sequence. When PCRE is built, a  default         system  as its standard newline sequence. When PCRE is built, a default
980         can  be  specified.  The default default is LF, which is the Unix stan-         can be specified.  The default default is LF, which is the  Unix  stan-
981         dard. When PCRE is run, the default can be overridden,  either  when  a         dard.  When  PCRE  is run, the default can be overridden, either when a
982         pattern is compiled, or when it is matched.         pattern is compiled, or when it is matched.
983    
984           At compile time, the newline convention can be specified by the options
985           argument  of  pcre_compile(), or it can be specified by special text at
986           the start of the pattern itself; this overrides any other settings. See
987           the pcrepattern page for details of the special character sequences.
988    
989         In the PCRE documentation the word "newline" is used to mean "the char-         In the PCRE documentation the word "newline" is used to mean "the char-
990         acter or pair of characters that indicate a line break". The choice  of         acter or pair of characters that indicate a line break". The choice  of
991         newline  convention  affects  the  handling of the dot, circumflex, and         newline  convention  affects  the  handling of the dot, circumflex, and
992         dollar metacharacters, the handling of #-comments in /x mode, and, when         dollar metacharacters, the handling of #-comments in /x mode, and, when
993         CRLF  is a recognized line ending sequence, the match position advance-         CRLF  is a recognized line ending sequence, the match position advance-
994         ment for a non-anchored pattern. The choice of newline convention  does         ment for a non-anchored pattern. There is more detail about this in the
995         not affect the interpretation of the \n or \r escape sequences.         section on pcre_exec() options below.
996    
997           The  choice of newline convention does not affect the interpretation of
998           the \n or \r escape sequences, nor does  it  affect  what  \R  matches,
999           which is controlled in a similar way, but by separate options.
1000    
1001    
1002  MULTITHREADING  MULTITHREADING
# Line 924  MULTITHREADING Line 1006  MULTITHREADING
1006         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1007         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
1008    
1009         The  compiled form of a regular expression is not altered during match-         The compiled form of a regular expression is not altered during  match-
1010         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
1011         at once.         at once.
1012    
# Line 932  MULTITHREADING Line 1014  MULTITHREADING
1014  SAVING PRECOMPILED PATTERNS FOR LATER USE  SAVING PRECOMPILED PATTERNS FOR LATER USE
1015    
1016         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
1017         later time, possibly by a different program, and even on a  host  other         later  time,  possibly by a different program, and even on a host other
1018         than  the  one  on  which  it  was  compiled.  Details are given in the         than the one on which  it  was  compiled.  Details  are  given  in  the
1019         pcreprecompile documentation. However, compiling a  regular  expression         pcreprecompile  documentation.  However, compiling a regular expression
1020         with  one version of PCRE for use with a different version is not guar-         with one version of PCRE for use with a different version is not  guar-
1021         anteed to work and may cause crashes.         anteed to work and may cause crashes.
1022    
1023    
# Line 943  CHECKING BUILD-TIME OPTIONS Line 1025  CHECKING BUILD-TIME OPTIONS
1025    
1026         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
1027    
1028         The function pcre_config() makes it possible for a PCRE client to  dis-         The  function pcre_config() makes it possible for a PCRE client to dis-
1029         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
1030         The pcrebuild documentation has more details about these optional  fea-         The  pcrebuild documentation has more details about these optional fea-
1031         tures.         tures.
1032    
1033         The  first  argument  for pcre_config() is an integer, specifying which         The first argument for pcre_config() is an  integer,  specifying  which
1034         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
1035         into  which  the  information  is  placed. The following information is         into which the information is  placed.  The  following  information  is
1036         available:         available:
1037    
1038           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
1039    
1040         The output is an integer that is set to one if UTF-8 support is  avail-         The  output is an integer that is set to one if UTF-8 support is avail-
1041         able; otherwise it is set to zero.         able; otherwise it is set to zero.
1042    
1043           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
1044    
1045         The  output  is  an  integer  that is set to one if support for Unicode         The output is an integer that is set to  one  if  support  for  Unicode
1046         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
1047    
1048           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
1049    
1050         The output is an integer whose value specifies  the  default  character         The  output  is  an integer whose value specifies the default character
1051         sequence  that is recognized as meaning "newline". The five values that         sequence that is recognized as meaning "newline". The five values  that
1052         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
1053         and  -1  for  ANY. The default should normally be the standard sequence         and -1 for ANY.  Though they are derived from ASCII,  the  same  values
1054         for your operating system.         are returned in EBCDIC environments. The default should normally corre-
1055           spond to the standard sequence for your operating system.
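
The values listed above can be turned into readable names with a small
helper. This is an illustrative sketch only: the function name
newline_config_name is hypothetical and not part of the PCRE API. The
value 3338 is the two ASCII codes packed together, that is, 13 (CR)
times 256 plus 10 (LF).

```c
#include <stddef.h>

/* Map a value obtained via PCRE_CONFIG_NEWLINE to a readable name.
   Returns NULL for a value that is not one of the documented ones. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";      /* (13 << 8) | 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return NULL;
    }
}
```

In a real program the argument would be the integer filled in by
pcre_config(PCRE_CONFIG_NEWLINE, &value).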
1056    
1057             PCRE_CONFIG_BSR
1058    
1059           The output is an integer whose value indicates what character sequences
1060           the  \R  escape sequence matches by default. A value of 0 means that \R
1061           matches any Unicode line ending sequence; a value of 1  means  that  \R
1062           matches only CR, LF, or CRLF. The default can be overridden when a pat-
1063           tern is compiled or matched.
1064    
1065           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
1066    
# Line 988  CHECKING BUILD-TIME OPTIONS Line 1079  CHECKING BUILD-TIME OPTIONS
1079    
1080           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
1081    
1082         The output is an integer that gives the default limit for the number of         The  output is a long integer that gives the default limit for the num-
1083         internal matching function calls in a  pcre_exec()  execution.  Further         ber of internal matching function calls  in  a  pcre_exec()  execution.
1084         details are given with pcre_exec() below.         Further details are given with pcre_exec() below.
1085    
1086           PCRE_CONFIG_MATCH_LIMIT_RECURSION           PCRE_CONFIG_MATCH_LIMIT_RECURSION
1087    
1088         The  output is an integer that gives the default limit for the depth of         The output is a long integer that gives the default limit for the depth
1089         recursion when calling the internal matching function in a  pcre_exec()         of  recursion  when  calling  the  internal  matching  function  in   a
1090         execution. Further details are given with pcre_exec() below.         pcre_exec()  execution.  Further  details  are  given  with pcre_exec()
1091           below.
1092    
1093           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
1094    
1095         The  output is an integer that is set to one if internal recursion when         The output is an integer that is set to one if internal recursion  when
1096         running pcre_exec() is implemented by recursive function calls that use         running pcre_exec() is implemented by recursive function calls that use
1097         the  stack  to remember their state. This is the usual way that PCRE is         the stack to remember their state. This is the usual way that  PCRE  is
1098         compiled. The output is zero if PCRE was compiled to use blocks of data         compiled. The output is zero if PCRE was compiled to use blocks of data
1099         on  the  heap  instead  of  recursive  function  calls.  In  this case,         on the  heap  instead  of  recursive  function  calls.  In  this  case,
1100         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory
1101         blocks on the heap, thus avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
1102    
1103    
# Line 1022  COMPILING A PATTERN Line 1114  COMPILING A PATTERN
1114    
1115         Either of the functions pcre_compile() or pcre_compile2() can be called         Either of the functions pcre_compile() or pcre_compile2() can be called
1116         to compile a pattern into an internal form. The only difference between         to compile a pattern into an internal form. The only difference between
1117         the  two interfaces is that pcre_compile2() has an additional argument,         the two interfaces is that pcre_compile2() has an additional  argument,
1118         errorcodeptr, via which a numerical error code can be returned.         errorcodeptr, via which a numerical error code can be returned.
1119    
1120         The pattern is a C string terminated by a binary zero, and is passed in         The pattern is a C string terminated by a binary zero, and is passed in
1121         the  pattern  argument.  A  pointer to a single block of memory that is         the pattern argument. A pointer to a single block  of  memory  that  is
1122         obtained via pcre_malloc is returned. This contains the  compiled  code         obtained  via  pcre_malloc is returned. This contains the compiled code
1123         and related data. The pcre type is defined for the returned block; this         and related data. The pcre type is defined for the returned block; this
1124         is a typedef for a structure whose contents are not externally defined.         is a typedef for a structure whose contents are not externally defined.
1125         It is up to the caller to free the memory (via pcre_free) when it is no         It is up to the caller to free the memory (via pcre_free) when it is no
1126         longer required.         longer required.
1127    
1128         Although the compiled code of a PCRE regex is relocatable, that is,  it         Although  the compiled code of a PCRE regex is relocatable, that is, it
1129         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
1130         fully relocatable, because it may contain a copy of the tableptr  argu-         fully  relocatable, because it may contain a copy of the tableptr argu-
1131         ment, which is an address (see below).         ment, which is an address (see below).
1132    
1133         The options argument contains various bit settings that affect the com-         The options argument contains various bit settings that affect the com-
1134         pilation. It should be zero if no options are required.  The  available         pilation.  It  should be zero if no options are required. The available
1135         options  are  described  below. Some of them, in particular, those that         options are described below. Some of them (in  particular,  those  that
1136         are compatible with Perl, can also be set and  unset  from  within  the         are  compatible  with  Perl,  but also some others) can also be set and
1137         pattern  (see  the  detailed  description in the pcrepattern documenta-         unset from within the pattern (see  the  detailed  description  in  the
1138         tion). For these options, the contents of the options  argument  speci-         pcrepattern  documentation). For those options that can be different in
1139         fies  their initial settings at the start of compilation and execution.         different parts of the pattern, the contents of  the  options  argument
1140         The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the  time         specifies their initial settings at the start of compilation and execu-
1141         of matching as well as at compile time.         tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the
1142           time of matching as well as at compile time.
1143    
1144         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1145         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
# Line 1100  COMPILING A PATTERN Line 1193  COMPILING A PATTERN
1193         all with number 255, before each pattern item. For  discussion  of  the         all with number 255, before each pattern item. For  discussion  of  the
1194         callout facility, see the pcrecallout documentation.         callout facility, see the pcrecallout documentation.
1195    
1196             PCRE_BSR_ANYCRLF
1197             PCRE_BSR_UNICODE
1198    
1199           These options (which are mutually exclusive) control what the \R escape
1200           sequence matches. The choice is either to match only CR, LF,  or  CRLF,
1201           or to match any Unicode newline sequence. The default is specified when
1202           PCRE is built. It can be overridden from within the pattern, or by set-
1203           ting an option when a compiled pattern is matched.
1204    
1205           PCRE_CASELESS           PCRE_CASELESS
1206    
1207         If  this  bit is set, letters in the pattern match both upper and lower         If  this  bit is set, letters in the pattern match both upper and lower
# Line 1173  COMPILING A PATTERN Line 1275  COMPILING A PATTERN
1275         before  or  at  the  first  newline  in  the subject string, though the         before  or  at  the  first  newline  in  the subject string, though the
1276         matched text may continue over the newline.         matched text may continue over the newline.
1277    
1278             PCRE_JAVASCRIPT_COMPAT
1279    
1280           If this option is set, PCRE's behaviour is changed in some ways so that
1281           it  is  compatible with JavaScript rather than Perl. The changes are as
1282           follows:
1283    
1284           (1) A lone closing square bracket in a pattern  causes  a  compile-time
1285           error,  because this is illegal in JavaScript (by default it is treated
1286           as a data character). Thus, the pattern AB]CD becomes illegal when this
1287           option is set.
1288    
1289           (2)  At run time, a back reference to an unset subpattern group matches
1290           an empty string (by default this causes the current  matching  alterna-
1291           tive  to  fail). A pattern such as (\1)(a) succeeds when this option is
1292           set (assuming it can find an "a" in the subject), whereas it  fails  by
1293           default, for Perl compatibility.
1294    
1295           PCRE_MULTILINE           PCRE_MULTILINE
1296    
1297         By default, PCRE treats the subject string as consisting  of  a  single         By  default,  PCRE  treats the subject string as consisting of a single
1298         line  of characters (even if it actually contains newlines). The "start         line of characters (even if it actually contains newlines). The  "start
1299         of line" metacharacter (^) matches only at the  start  of  the  string,         of  line"  metacharacter  (^)  matches only at the start of the string,
1300         while  the  "end  of line" metacharacter ($) matches only at the end of         while the "end of line" metacharacter ($) matches only at  the  end  of
1301         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1302         is set). This is the same as Perl.         is set). This is the same as Perl.
1303    
1304         When  PCRE_MULTILINE  is  set,  the "start of line" and "end of line"         When PCRE_MULTILINE is set, the  "start of line" and  "end  of  line"
1305         constructs match immediately following or immediately  before  internal         constructs  match  immediately following or immediately before internal
1306         newlines  in  the  subject string, respectively, as well as at the very         newlines in the subject string, respectively, as well as  at  the  very
1307         start and end. This is equivalent to Perl's /m option, and  it  can  be         start  and  end.  This is equivalent to Perl's /m option, and it can be
1308         changed within a pattern by a (?m) option setting. If there are no new-         changed within a pattern by a (?m) option setting. If there are no new-
1309         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,         lines  in  a  subject string, or no occurrences of ^ or $ in a pattern,
1310         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
1311    
1312           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 1196  COMPILING A PATTERN Line 1315  COMPILING A PATTERN
1315           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
1316           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1317    
1318         These  options  override the default newline definition that was chosen         These options override the default newline definition that  was  chosen
1319         when PCRE was built. Setting the first or the second specifies  that  a         when  PCRE  was built. Setting the first or the second specifies that a
1320         newline  is  indicated  by a single character (CR or LF, respectively).         newline is indicated by a single character (CR  or  LF,  respectively).
1321         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the         Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
1322         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies         two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies
1323         that any of the three preceding sequences should be recognized. Setting         that any of the three preceding sequences should be recognized. Setting
1324         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be         PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be
1325         recognized. The Unicode newline sequences are the three just mentioned,         recognized. The Unicode newline sequences are the three just mentioned,
1326         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,         plus the single characters VT (vertical  tab,  U+000B),  FF  (formfeed,
1327         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS         U+000C),  NEL  (next line, U+0085), LS (line separator, U+2028), and PS
1328         (paragraph  separator,  U+2029).  The  last  two are recognized only in         (paragraph separator, U+2029). The last  two  are  recognized  only  in
1329         UTF-8 mode.         UTF-8 mode.
1330    
1331         The newline setting in the  options  word  uses  three  bits  that  are         The  newline  setting  in  the  options  word  uses three bits that are
1332         treated as a number, giving eight possibilities. Currently only six are         treated as a number, giving eight possibilities. Currently only six are
1333         used (default plus the five values above). This means that if  you  set         used  (default  plus the five values above). This means that if you set
1334         more  than one newline option, the combination may or may not be sensi-         more than one newline option, the combination may or may not be  sensi-
1335         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1336         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and         PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and
1337         cause an error.         cause an error.
1338    
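The bit arithmetic described above can be sketched in C as follows. This is only an illustration, not part of PCRE: the NL_* names and the helper newline_bits_are_used() are hypothetical, and the constant values are assumed to mirror the PCRE_NEWLINE_* definitions in pcre.h.

```c
/* Hypothetical stand-ins; values assumed to match PCRE_NEWLINE_* in pcre.h. */
#define NL_CR       0x00100000ul
#define NL_LF       0x00200000ul
#define NL_CRLF     0x00300000ul
#define NL_ANY      0x00400000ul
#define NL_ANYCRLF  0x00500000ul
#define NL_MASK     0x00700000ul   /* the three newline bits */

/* Return 1 if the newline bits in options hold one of the six values in use
   (zero meaning "use the build-time default"), otherwise 0. */
static int newline_bits_are_used(unsigned long options)
{
  switch (options & NL_MASK)
    {
    case 0:
    case NL_CR:
    case NL_LF:
    case NL_CRLF:
    case NL_ANY:
    case NL_ANYCRLF:
      return 1;
    default:
      return 0;   /* one of the two unused numbers */
    }
}
```

Under these assumed values, NL_CR | NL_LF yields the same number as NL_CRLF, whereas NL_LF | NL_ANY yields a number that is not in use.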
1339         The only time that a line break is specially recognized when  compiling         The  only time that a line break is specially recognized when compiling
1340         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a         a pattern is if PCRE_EXTENDED is set, and  an  unescaped  #  outside  a
1341         character class is encountered. This indicates  a  comment  that  lasts         character  class  is  encountered.  This indicates a comment that lasts
1342         until  after the next line break sequence. In other circumstances, line         until after the next line break sequence. In other circumstances,  line
1343         break  sequences  are  treated  as  literal  data,   except   that   in         break   sequences   are   treated  as  literal  data,  except  that  in
1344         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1345         and are therefore ignored.         and are therefore ignored.
1346    
1347         The newline option that is set at compile time becomes the default that         The newline option that is set at compile time becomes the default that
1348         is  used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.         is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
1349    
1350           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1351    
# Line 1285  COMPILATION ERROR CODES Line 1404  COMPILATION ERROR CODES
1404            9  nothing to repeat            9  nothing to repeat
1405           10  [this code is not in use]           10  [this code is not in use]
1406           11  internal error: unexpected repeat           11  internal error: unexpected repeat
1407           12  unrecognized character after (?           12  unrecognized character after (? or (?-
1408           13  POSIX named classes are supported only within a class           13  POSIX named classes are supported only within a class
1409           14  missing )           14  missing )
1410           15  reference to non-existent subpattern           15  reference to non-existent subpattern
# Line 1293  COMPILATION ERROR CODES Line 1412  COMPILATION ERROR CODES
1412           17  unknown option bit(s) set           17  unknown option bit(s) set
1413           18  missing ) after comment           18  missing ) after comment
1414           19  [this code is not in use]           19  [this code is not in use]
1415           20  regular expression too large           20  regular expression is too large
1416           21  failed to get memory           21  failed to get memory
1417           22  unmatched parentheses           22  unmatched parentheses
1418           23  internal error: code overflow           23  internal error: code overflow
# Line 1322  COMPILATION ERROR CODES Line 1441  COMPILATION ERROR CODES
1441           46  malformed \P or \p sequence           46  malformed \P or \p sequence
1442           47  unknown property name after \P or \p           47  unknown property name after \P or \p
1443           48  subpattern name is too long (maximum 32 characters)           48  subpattern name is too long (maximum 32 characters)
1444           49  too many named subpatterns (maximum 10,000)           49  too many named subpatterns (maximum 10000)
1445           50  [this code is not in use]           50  [this code is not in use]
1446           51  octal value is greater than \377 (not in UTF-8 mode)           51  octal value is greater than \377 (not in UTF-8 mode)
1447           52  internal error: overran compiling workspace           52  internal error: overran compiling workspace
# Line 1330  COMPILATION ERROR CODES Line 1449  COMPILATION ERROR CODES
1449         found         found
1450           54  DEFINE group contains more than one branch           54  DEFINE group contains more than one branch
1451           55  repeating a DEFINE group is not allowed           55  repeating a DEFINE group is not allowed
1452           56  inconsistent NEWLINE options"           56  inconsistent NEWLINE options
1453           57  \g is not followed by a braced name or an optionally braced           57  \g is not followed by a braced, angle-bracketed, or quoted
1454                 non-zero number                 name/number or by a plain number
1455           58  (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number           58  a numbered reference must not be zero
1456             59  (*VERB) with an argument is not supported
1457             60  (*VERB) not recognized
1458             61  number is too big
1459             62  subpattern name expected
1460             63  digit expected after (?+
1461             64  ] is an invalid data character in JavaScript compatibility mode
1462    
1463           The  numbers  32  and 10000 in errors 48 and 49 are defaults; different
1464           values may be used if the limits were changed when PCRE was built.
1465    
1466    
1467  STUDYING A PATTERN  STUDYING A PATTERN
# Line 1341  STUDYING A PATTERN Line 1469  STUDYING A PATTERN
1469         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1470              const char **errptr);              const char **errptr);
1471    
1472         If  a  compiled  pattern is going to be used several times, it is worth         If a compiled pattern is going to be used several times,  it  is  worth
1473         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
1474         matching.  The function pcre_study() takes a pointer to a compiled pat-         matching. The function pcre_study() takes a pointer to a compiled  pat-
1475         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
1476         information  that  will  help speed up matching, pcre_study() returns a         information that will help speed up matching,  pcre_study()  returns  a
1477         pointer to a pcre_extra block, in which the study_data field points  to         pointer  to a pcre_extra block, in which the study_data field points to
1478         the results of the study.         the results of the study.
1479    
1480         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1481         pcre_exec(). However, a pcre_extra block  also  contains  other  fields         pcre_exec().  However,  a  pcre_extra  block also contains other fields
1482         that  can  be  set  by the caller before the block is passed; these are         that can be set by the caller before the block  is  passed;  these  are
1483         described below in the section on matching a pattern.         described below in the section on matching a pattern.
1484    
1485         If studying the pattern does not  produce  any  additional  information,         If  studying  the  pattern  does not produce any additional information,
1486         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study() returns NULL. In that circumstance, if the calling program
1487         wants to pass any of the other fields to pcre_exec(), it  must  set  up         wants  to  pass  any of the other fields to pcre_exec(), it must set up
1488         its own pcre_extra block.         its own pcre_extra block.
1489    
1490         The  second  argument of pcre_study() contains option bits. At present,         The second argument of pcre_study() contains option bits.  At  present,
1491         no options are defined, and this argument should always be zero.         no options are defined, and this argument should always be zero.
1492    
1493         The third argument for pcre_study() is a pointer for an error  message.         The  third argument for pcre_study() is a pointer for an error message.
1494         If  studying  succeeds  (even  if no data is returned), the variable it         If studying succeeds (even if no data is  returned),  the  variable  it
1495         points to is set to NULL. Otherwise it is set to  point  to  a  textual         points  to  is  set  to NULL. Otherwise it is set to point to a textual
1496         error message. This is a static string that is part of the library. You         error message. This is a static string that is part of the library. You
1497         must not try to free it. You should test the  error  pointer  for  NULL         must  not  try  to  free it. You should test the error pointer for NULL
1498         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
1499    
1500         This is a typical call to pcre_study():         This is a typical call to pcre_study():
# Line 1378  STUDYING A PATTERN Line 1506  STUDYING A PATTERN
1506             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1507    
1508         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1509         that do not have a single fixed starting character. A bitmap of  possi-         that  do not have a single fixed starting character. A bitmap of possi-
1510         ble starting bytes is created.         ble starting bytes is created.
1511    
1512    
1513  LOCALE SUPPORT  LOCALE SUPPORT
1514    
1515         PCRE  handles  caseless matching, and determines whether characters are         PCRE handles caseless matching, and determines whether  characters  are
1516         letters, digits, or whatever, by reference to a set of tables,  indexed         letters,  digits, or whatever, by reference to a set of tables, indexed
1517         by  character  value.  When running in UTF-8 mode, this applies only to         by character value. When running in UTF-8 mode, this  applies  only  to
1518         characters with codes less than 128. Higher-valued  codes  never  match         characters  with  codes  less than 128. Higher-valued codes never match
1519         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1520         with Unicode character property support. The use of locales  with  Uni-         with  Unicode  character property support. The use of locales with Uni-
1521         code  is discouraged. If you are handling characters with codes greater         code is discouraged. If you are handling characters with codes  greater
1522         than 128, you should either use UTF-8 and Unicode, or use locales,  but         than  128, you should either use UTF-8 and Unicode, or use locales, but
1523         not try to mix the two.         not try to mix the two.
1524    
1525         PCRE  contains  an  internal set of tables that are used when the final         PCRE contains an internal set of tables that are used  when  the  final
1526         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1527         applications.  Normally, the internal tables recognize only ASCII char-         applications.  Normally, the internal tables recognize only ASCII char-
1528         acters. However, when PCRE is built, it is possible to cause the inter-         acters. However, when PCRE is built, it is possible to cause the inter-
1529         nal tables to be rebuilt in the default "C" locale of the local system,         nal tables to be rebuilt in the default "C" locale of the local system,
1530         which may cause them to be different.         which may cause them to be different.
1531    
1532         The internal tables can always be overridden by tables supplied by  the         The  internal tables can always be overridden by tables supplied by the
1533         application that calls PCRE. These may be created in a different locale         application that calls PCRE. These may be created in a different locale
1534         from the default. As more and more applications change  to  using  Uni-         from  the  default.  As more and more applications change to using Uni-
1535         code, the need for this locale support is expected to die away.         code, the need for this locale support is expected to die away.
1536    
1537         External  tables  are  built by calling the pcre_maketables() function,         External tables are built by calling  the  pcre_maketables()  function,
1538         which has no arguments, in the relevant locale. The result can then  be         which  has no arguments, in the relevant locale. The result can then be
1539         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For         passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
1540         example, to build and use tables that are appropriate  for  the  French         example,  to  build  and use tables that are appropriate for the French
1541         locale  (where  accented  characters  with  values greater than 128 are         locale (where accented characters with  values  greater  than  128  are
1542         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1543    
1544           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1545           tables = pcre_maketables();           tables = pcre_maketables();
1546           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1547    
1548         The locale name "fr_FR" is used on Linux and other  Unix-like  systems;         The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1549         if you are using Windows, the name for the French locale is "french".         if you are using Windows, the name for the French locale is "french".
1550    
1551         When  pcre_maketables()  runs,  the  tables are built in memory that is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1552         obtained via pcre_malloc. It is the caller's responsibility  to  ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1553         that  the memory containing the tables remains available for as long as         that the memory containing the tables remains available for as long  as
1554         it is needed.         it is needed.
1555    
1556         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
1557         pattern,  and the same tables are used via this pointer by pcre_study()         pattern, and the same tables are used via this pointer by  pcre_study()
1558         and normally also by pcre_exec(). Thus, by default, for any single pat-         and normally also by pcre_exec(). Thus, by default, for any single pat-
1559         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
1560         but different patterns can be compiled in different locales.         but different patterns can be compiled in different locales.
1561    
1562         It is possible to pass a table pointer or NULL (indicating the  use  of         It  is  possible to pass a table pointer or NULL (indicating the use of
1563         the  internal  tables)  to  pcre_exec(). Although not intended for this         the internal tables) to pcre_exec(). Although  not  intended  for  this
1564         purpose, this facility could be used to match a pattern in a  different         purpose,  this facility could be used to match a pattern in a different
1565         locale from the one in which it was compiled. Passing table pointers at         locale from the one in which it was compiled. Passing table pointers at
1566         run time is discussed below in the section on matching a pattern.         run time is discussed below in the section on matching a pattern.
1567    
# Line 1443  INFORMATION ABOUT A PATTERN Line 1571  INFORMATION ABOUT A PATTERN
1571         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1572              int what, void *where);              int what, void *where);
1573    
1574         The pcre_fullinfo() function returns information about a compiled  pat-         The  pcre_fullinfo() function returns information about a compiled pat-
1575         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1576         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1577    
1578         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled
1579         pattern.  The second argument is the result of pcre_study(), or NULL if         pattern. The second argument is the result of pcre_study(), or NULL  if
1580         the pattern was not studied. The third argument specifies  which  piece         the  pattern  was not studied. The third argument specifies which piece
1581         of  information  is required, and the fourth argument is a pointer to a         of information is required, and the fourth argument is a pointer  to  a
1582         variable to receive the data. The yield of the  function  is  zero  for         variable  to  receive  the  data. The yield of the function is zero for
1583         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1584    
1585           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 1459  INFORMATION ABOUT A PATTERN Line 1587  INFORMATION ABOUT A PATTERN
1587           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1588           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1589    
1590         The  "magic  number" is placed at the start of each compiled pattern as         The "magic number" is placed at the start of each compiled  pattern  as
1591         a simple check against passing an arbitrary memory pointer. Here is  a         a  simple check against passing an arbitrary memory pointer. Here is a
1592         typical  call  of pcre_fullinfo(), to obtain the length of the compiled         typical call of pcre_fullinfo(), to obtain the length of  the  compiled
1593         pattern:         pattern:
1594    
1595           int rc;           int rc;
# Line 1472  INFORMATION ABOUT A PATTERN Line 1600  INFORMATION ABOUT A PATTERN
1600             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1601             &length);         /* where to put the data */             &length);         /* where to put the data */
1602    
1603         The possible values for the third argument are defined in  pcre.h,  and         The  possible  values for the third argument are defined in pcre.h, and
1604         are as follows:         are as follows:
1605    
1606           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1607    
1608         Return  the  number  of  the highest back reference in the pattern. The         Return the number of the highest back reference  in  the  pattern.  The
1609         fourth argument should point to an int variable. Zero  is  returned  if         fourth  argument  should  point to an int variable. Zero is returned if
1610         there are no back references.         there are no back references.
1611    
1612           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1613    
1614         Return  the  number of capturing subpatterns in the pattern. The fourth         Return the number of capturing subpatterns in the pattern.  The  fourth
1615         argument should point to an int variable.         argument should point to an int variable.
1616    
1617           PCRE_INFO_DEFAULT_TABLES           PCRE_INFO_DEFAULT_TABLES
1618    
1619         Return a pointer to the internal default character tables within  PCRE.         Return  a pointer to the internal default character tables within PCRE.
1620         The  fourth  argument should point to an unsigned char * variable. This         The fourth argument should point to an unsigned char *  variable.  This
1621         information call is provided for internal use by the pcre_study() func-         information call is provided for internal use by the pcre_study() func-
1622         tion.  External  callers  can  cause PCRE to use its internal tables by         tion. External callers can cause PCRE to use  its  internal  tables  by
1623         passing a NULL table pointer.         passing a NULL table pointer.
1624    
1625           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1626    
1627         Return information about the first byte of any matched  string,  for  a         Return  information  about  the first byte of any matched string, for a
1628         non-anchored  pattern. The fourth argument should point to an int vari-         non-anchored pattern. The fourth argument should point to an int  vari-
1629         able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
1630         is still recognized for backwards compatibility.)         is still recognized for backwards compatibility.)
1631    
1632         If  there  is  a  fixed first byte, for example, from a pattern such as         If there is a fixed first byte, for example, from  a  pattern  such  as
1633         (cat|cow|coyote), its value is returned. Otherwise, if either         (cat|cow|coyote), its value is returned. Otherwise, if either
1634    
1635         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
1636         branch starts with "^", or         branch starts with "^", or
1637    
1638         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1639         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1640    
1641         -1 is returned, indicating that the pattern matches only at  the  start         -1  is  returned, indicating that the pattern matches only at the start
1642         of  a  subject string or after any newline within the string. Otherwise         of a subject string or after any newline within the  string.  Otherwise
1643         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
1644    
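The three kinds of value that PCRE_INFO_FIRSTBYTE can yield, as described above, might be told apart with a small helper such as the one below. This is a sketch only: the enum and the function classify_first_byte() are hypothetical names introduced for illustration and are not part of the PCRE API. The argument is the int that pcre_fullinfo() stored via its fourth argument.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind
  {
  FB_FIXED,        /* value is the fixed first byte of any match */
  FB_LINE_START,   /* -1: matches only at subject start or after a newline */
  FB_NONE          /* -2: no first-byte information (includes anchored patterns) */
  };

static enum first_byte_kind classify_first_byte(int value)
{
  if (value >= 0) return FB_FIXED;
  if (value == -1) return FB_LINE_START;
  return FB_NONE;   /* -2; any other negative value is treated the same way here */
}
```

For the pattern (cat|cow|coyote) mentioned above, the stored value would be the byte 'c', which this helper classifies as FB_FIXED.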
1645           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1646    
1647         If the pattern was studied, and this resulted in the construction of  a         If  the pattern was studied, and this resulted in the construction of a
1648         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1649         matching string, a pointer to the table is returned. Otherwise NULL  is         matching  string, a pointer to the table is returned. Otherwise NULL is
1650         returned.  The fourth argument should point to an unsigned char * vari-         returned. The fourth argument should point to an unsigned char *  vari-
1651         able.         able.
1652    
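A 256-bit table occupies 32 bytes, one bit per possible byte value. The sketch below shows one way such a table could be consulted. It assumes, without confirmation from the text above, that bit (c & 7) of byte (c / 8) corresponds to byte value c; the helper names are hypothetical and are not PCRE functions.

```c
#include <stddef.h>

/* Assumed layout: bit (c & 7) of table[c / 8] is set when byte c can
   start a match. The table is 32 bytes (256 bits) long. */
static int can_start_match(const unsigned char *table, unsigned char c)
{
  if (table == NULL) return 1;   /* no table: any byte might start a match */
  return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Set the bit for byte c in a caller-owned 32-byte table (test helper). */
static void mark_start_byte(unsigned char *table, unsigned char c)
{
  table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```

When pcre_fullinfo() yields NULL for PCRE_INFO_FIRSTTABLE, no table exists, so the first helper treats every byte as a possible starting byte.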
1653             PCRE_INFO_HASCRORLF
1654    
1655           Return  1  if  the  pattern  contains any explicit matches for CR or LF
1656           characters, otherwise 0. The fourth argument should  point  to  an  int
1657           variable.  An explicit match is either a literal CR or LF character, or
1658           \r or \n.
1659    
1660           PCRE_INFO_JCHANGED           PCRE_INFO_JCHANGED
1661    
1662         Return 1 if the (?J) option setting is used in the  pattern,  otherwise         Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
1663         0. The fourth argument should point to an int variable. The (?J) inter-         otherwise  0. The fourth argument should point to an int variable. (?J)
1664         nal option setting changes the local PCRE_DUPNAMES option.         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1665    
1666           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1667    
# Line 1585  INFORMATION ABOUT A PATTERN Line 1720  INFORMATION ABOUT A PATTERN
1720           PCRE_INFO_OKPARTIAL           PCRE_INFO_OKPARTIAL
1721    
1722         Return 1 if the pattern can be used for partial matching, otherwise  0.         Return 1 if the pattern can be used for partial matching, otherwise  0.
1723         The  fourth  argument  should point to an int variable. The pcrepartial         The fourth argument should point to an int variable. From release 8.00,
1724         documentation lists the restrictions that apply to patterns  when  par-         this always returns 1, because the restrictions that previously applied
1725         tial matching is used.         to  partial  matching  have  been lifted. The pcrepartial documentation
1726           gives details of partial matching.
1727    
1728           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1729    
1730         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1731         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1732         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1733         by any top-level option settings at the start of the pattern itself. In         by any top-level option settings at the start of the pattern itself. In
1734         other  words,  they are the options that will be in force when matching         other words, they are the options that will be in force  when  matching
1735         starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with         starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1736         the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,         the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1737         and PCRE_EXTENDED.         and PCRE_EXTENDED.
1738    
1739         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1740         alternatives begin with one of the following:         alternatives begin with one of the following:
1741    
1742           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1614  INFORMATION ABOUT A PATTERN Line 1750  INFORMATION ABOUT A PATTERN
1750    
1751           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1752    
1753         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1754         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1755         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1756         size_t variable.         size_t variable.
# Line 1622  INFORMATION ABOUT A PATTERN Line 1758  INFORMATION ABOUT A PATTERN
1758           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1759    
1760         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1761         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1762         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1763         created by pcre_study(). The fourth argument should point to  a  size_t         created  by  pcre_study(). The fourth argument should point to a size_t
1764         variable.         variable.
1765    
1766    
# Line 1632  OBSOLETE INFO FUNCTION Line 1768  OBSOLETE INFO FUNCTION
1768    
1769         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1770    
1771         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1772         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1773         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1774         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1775         lowing negative numbers:         lowing negative numbers:
1776    
1777           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1778           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1779    
1780         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1781         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1782         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1783    
1784         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1785         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1786         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1787    
1788    
# Line 1654  REFERENCE COUNTS Line 1790  REFERENCE COUNTS
1790    
1791         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1792    
1793         The  pcre_refcount()  function is used to maintain a reference count in         The pcre_refcount() function is used to maintain a reference  count  in
1794         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1795         benefit  of  applications  that  operate  in an object-oriented manner,         benefit of applications that  operate  in  an  object-oriented  manner,
1796         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1797         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1798    
1799         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1800         zero.  It is changed only by calling this function, whose action is  to         zero.   It is changed only by calling this function, whose action is to
1801         add  the  adjust  value  (which may be positive or negative) to it. The         add the adjust value (which may be positive or  negative)  to  it.  The
1802         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1803         is  constrained to lie between 0 and 65535, inclusive. If the new value         is constrained to lie between 0 and 65535, inclusive. If the new  value
1804         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
1805    
1806         Except when it is zero, the reference count is not correctly  preserved         Except  when it is zero, the reference count is not correctly preserved
1807         if  a  pattern  is  compiled on one host and then transferred to a host         if a pattern is compiled on one host and then  transferred  to  a  host
1808         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
1809    
1810    
# Line 1762  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1898  MATCHING A PATTERN: THE TRADITIONAL FUNC
1898         the  total number of calls, because not all calls to match() are recur-         the  total number of calls, because not all calls to match() are recur-
1899         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
1900    
1901         Limiting  the  recursion  depth  limits the amount of stack that can be         Limiting the recursion depth limits the amount of  stack  that  can  be
1902         used, or, when PCRE has been compiled to use memory on the heap instead         used, or, when PCRE has been compiled to use memory on the heap instead
1903         of the stack, the amount of heap memory that can be used.         of the stack, the amount of heap memory that can be used.
1904    
1905         The  default  value  for  match_limit_recursion can be set when PCRE is         The default value for match_limit_recursion can be  set  when  PCRE  is
1906         built; the default default  is  the  same  value  as  the  default  for         built;  the  default  default  is  the  same  value  as the default for
1907         match_limit. You can override the default by supplying pcre_exec() with         match_limit. You can override the default by supplying pcre_exec() with
1908         a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and         a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1909         PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1910         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1911    
1912         The pcre_callout field is used in conjunction with the  "callout"  fea-         The  pcre_callout  field is used in conjunction with the "callout" fea-
1913         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1914    
1915         The  tables  field  is  used  to  pass  a  character  tables pointer to         The tables field  is  used  to  pass  a  character  tables  pointer  to
1916         pcre_exec(); this overrides the value that is stored with the  compiled         pcre_exec();  this overrides the value that is stored with the compiled
1917         pattern.  A  non-NULL value is stored with the compiled pattern only if         pattern. A non-NULL value is stored with the compiled pattern  only  if
1918         custom tables were supplied to pcre_compile() via  its  tableptr  argu-         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1919         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1920         PCRE's internal tables to be used. This facility is  helpful  when  re-         PCRE's  internal  tables  to be used. This facility is helpful when re-
1921         using  patterns  that  have been saved after compiling with an external         using patterns that have been saved after compiling  with  an  external
1922         set of tables, because the external tables  might  be  at  a  different         set  of  tables,  because  the  external tables might be at a different
1923         address  when  pcre_exec() is called. See the pcreprecompile documenta-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
1924         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
1925    
1926     Option bits for pcre_exec()     Option bits for pcre_exec()
1927    
1928         The unused bits of the options argument for pcre_exec() must  be  zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
1929         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1930         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and         PCRE_NOTBOL,    PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_START_OPTIMIZE,
1931         PCRE_PARTIAL.         PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.
1932    
1933           PCRE_ANCHORED           PCRE_ANCHORED
1934    
1935         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1936         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1937         turned  out to be anchored by virtue of its contents, it cannot be made         turned out to be anchored by virtue of its contents, it cannot be  made
1938         unanchored at matching time.         unanchored at matching time.
1939    
1940             PCRE_BSR_ANYCRLF
1941             PCRE_BSR_UNICODE
1942    
1943           These options (which are mutually exclusive) control what the \R escape
1944           sequence matches. The choice is either to match only CR, LF,  or  CRLF,
1945           or  to  match  any Unicode newline sequence. These options override the
1946           choice that was made or defaulted when the pattern was compiled.
1947    
1948           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1949           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1950           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
# Line 1812  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1956  MATCHING A PATTERN: THE TRADITIONAL FUNC
1956         tion of pcre_compile()  above.  During  matching,  the  newline  choice         tion of pcre_compile()  above.  During  matching,  the  newline  choice
1957         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
1958         ters. It may also alter the way the match position is advanced after  a         ters. It may also alter the way the match position is advanced after  a
1959         match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,         match failure for an unanchored pattern.
1960         PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a  match  attempt  
1961         fails  when the current position is at a CRLF sequence, the match posi-         When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
1962         tion is advanced by two characters instead of one, in other  words,  to         set, and a match attempt for an unanchored pattern fails when the  cur-
1963         after the CRLF.         rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
1964           explicit matches for  CR  or  LF  characters,  the  match  position  is
1965           advanced by two characters instead of one, in other words, to after the
1966           CRLF.
1967    
1968           The above rule is a compromise that makes the most common cases work as
1969           expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
1970           option is not set), it does not match the string "\r\nA" because, after
1971           failing  at the start, it skips both the CR and the LF before retrying.
1972           However, the pattern [\r\n]A does match that string,  because  it  con-
1973           tains an explicit CR or LF reference, and so advances only by one char-
1974           acter after the first failure.
1975    
1976           An explicit match for CR or LF is either a literal appearance of one of
1977           those  characters,  or  one  of the \r or \n escape sequences. Implicit
1978           matches such as [^X] do not count, nor does \s (which includes  CR  and
1979           LF in the characters that it matches).
1980    
1981           Notwithstanding  the above, anomalous effects may still occur when CRLF
1982           is a valid newline sequence and explicit \r or \n escapes appear in the
1983           pattern.
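
The advance rule for a failed match attempt at a CRLF sequence can be sketched as below. This is a simplified, byte-oriented model under stated assumptions: the function name and its two flag parameters are hypothetical, the caller is assumed to know whether CRLF is a valid newline and whether the pattern contains an explicit CR or LF, and UTF-8 multi-byte advances are ignored.

```c
#include <stddef.h>

/* Hypothetical model of where the next match attempt starts after a
   failure at `pos` for an unanchored pattern.
   crlf_is_newline            - non-zero when PCRE_NEWLINE_CRLF, _ANYCRLF
                                or _ANY is in force
   pattern_has_explicit_cr_lf - non-zero when the pattern contains a
                                literal CR or LF, or a \r or \n escape */
size_t model_next_start(const char *subject, size_t length, size_t pos,
                        int crlf_is_newline, int pattern_has_explicit_cr_lf)
{
    if (crlf_is_newline && !pattern_has_explicit_cr_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;    /* skip the whole CRLF pair */

    return pos + 1;        /* ordinary one-character advance */
}
```

For the subject "\r\nA", the model advances from offset 0 to offset 2 for a pattern such as .+A, but only to offset 1 for [\r\n]A, which is why the second pattern can still match at the LF.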
1984    
1985           PCRE_NOTBOL           PCRE_NOTBOL
1986    
# Line 1856  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2020  MATCHING A PATTERN: THE TRADITIONAL FUNC
2020         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
2021         if  that  fails by advancing the starting offset (see below) and trying         if  that  fails by advancing the starting offset (see below) and trying
2022         an ordinary match again. There is some code that demonstrates how to do         an ordinary match again. There is some code that demonstrates how to do
2023         this in the pcredemo.c sample program.         this in the pcredemo sample program.
2024    
2025             PCRE_NO_START_OPTIMIZE
2026    
2027           There  are a number of optimizations that pcre_exec() uses at the start
2028           of a match, in order to speed up the process. For  example,  if  it  is
2029           known  that  a  match must start with a specific character, it searches
2030           the subject for that character, and fails immediately if it cannot find
2031           it,  without actually running the main matching function. When callouts
2032           are in use, these optimizations can cause  them  to  be  skipped.  This
2033           option  disables  the  "start-up" optimizations, causing performance to
2034           suffer, but ensuring that the callouts do occur.
2035    
2036           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
2037    
2038         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
2039         UTF-8 string is automatically checked when pcre_exec() is  subsequently         UTF-8  string is automatically checked when pcre_exec() is subsequently
2040         called.   The  value  of  startoffset is also checked to ensure that it         called.  The value of startoffset is also checked  to  ensure  that  it
2041         points to the start of a UTF-8 character. There is a  discussion  about         points  to  the start of a UTF-8 character. There is a discussion about
2042         the  validity  of  UTF-8 strings in the section on UTF-8 support in the         the validity of UTF-8 strings in the section on UTF-8  support  in  the
2043         main pcre page. If  an  invalid  UTF-8  sequence  of  bytes  is  found,         main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
2044         pcre_exec()  returns  the error PCRE_ERROR_BADUTF8. If startoffset con-         pcre_exec() returns the error PCRE_ERROR_BADUTF8. If  startoffset  con-
2045         tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.         tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
2046    
2047         If you already know that your subject is valid, and you  want  to  skip         If  you  already  know that your subject is valid, and you want to skip
2048         these    checks    for   performance   reasons,   you   can   set   the         these   checks   for   performance   reasons,   you   can    set    the
2049         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
2050         do  this  for the second and subsequent calls to pcre_exec() if you are         do this for the second and subsequent calls to pcre_exec() if  you  are
2051         making repeated calls to find all  the  matches  in  a  single  subject         making  repeated  calls  to  find  all  the matches in a single subject
2052         string.  However,  you  should  be  sure  that the value of startoffset         string. However, you should be  sure  that  the  value  of  startoffset
2053         points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
2054         set,  the  effect of passing an invalid UTF-8 string as a subject, or a         set, the effect of passing an invalid UTF-8 string as a subject,  or  a
2055         value of startoffset that does not point to the start of a UTF-8  char-         value  of startoffset that does not point to the start of a UTF-8 char-
2056         acter, is undefined. Your program may crash.         acter, is undefined. Your program may crash.
2057    
2058           PCRE_PARTIAL           PCRE_PARTIAL_HARD
2059             PCRE_PARTIAL_SOFT
2060    
2061         This  option  turns  on  the  partial  matching feature. If the subject         These options turn on the partial matching feature. For backwards  com-
2062         string fails to match the pattern, but at some point during the  match-         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2063         ing  process  the  end of the subject was reached (that is, the subject         match occurs if the end of the subject string is reached  successfully,
2064         partially matches the pattern and the failure to  match  occurred  only         but  there  are not enough subject characters to complete the match. If
2065         because  there were not enough subject characters), pcre_exec() returns         this happens when PCRE_PARTIAL_HARD  is  set,  pcre_exec()  immediately
2066         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is         returns  PCRE_ERROR_PARTIAL.  Otherwise,  if  PCRE_PARTIAL_SOFT is set,
2067         used,  there  are restrictions on what may appear in the pattern. These         matching continues by testing any other alternatives. Only if they  all
2068         are discussed in the pcrepartial documentation.         fail  is  PCRE_ERROR_PARTIAL  returned (instead of PCRE_ERROR_NOMATCH).
2069           The portion of the string that provided the partial match is set as the
2070           first  matching  string.  There  is  a  more detailed discussion in the
2071           pcrepartial documentation.
2072    
2073     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2074    
2075         The subject string is passed to pcre_exec() as a pointer in subject,  a         The subject string is passed to pcre_exec() as a pointer in subject,  a
2076         length  in  length, and a starting byte offset in startoffset. In UTF-8         length (in bytes) in length, and a starting byte offset in startoffset.
2077         mode, the byte offset must point to the start  of  a  UTF-8  character.         In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
2078         Unlike  the  pattern string, the subject may contain binary zero bytes.         acter.  Unlike  the pattern string, the subject may contain binary zero
2079         When the starting offset is zero, the search for a match starts at  the         bytes. When the starting offset is zero, the search for a match  starts
2080         beginning of the subject, and this is by far the most common case.         at  the  beginning  of  the subject, and this is by far the most common
2081           case.
2082         A  non-zero  starting offset is useful when searching for another match  
2083         in the same subject by calling pcre_exec() again after a previous  suc-         A non-zero starting offset is useful when searching for  another  match
2084         cess.   Setting  startoffset differs from just passing over a shortened         in  the same subject by calling pcre_exec() again after a previous suc-
2085         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins         cess.  Setting startoffset differs from just passing over  a  shortened
2086           string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
2087         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
2088    
2089           \Biss\B           \Biss\B
2090    
2091         which  finds  occurrences  of "iss" in the middle of words. (\B matches         which finds occurrences of "iss" in the middle of  words.  (\B  matches
2092         only if the current position in the subject is not  a  word  boundary.)         only  if  the  current position in the subject is not a word boundary.)
2093         When  applied  to the string "Mississipi" the first call to pcre_exec()         When applied to the string "Mississipi" the first call  to  pcre_exec()
2094         finds the first occurrence. If pcre_exec() is called  again  with  just         finds  the  first  occurrence. If pcre_exec() is called again with just
2095         the  remainder  of  the  subject,  namely  "issipi", it does not match,         the remainder of the subject,  namely  "issipi",  it  does  not  match,
2096         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
2097         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to be a word boundary. However, if pcre_exec()  is  passed  the  entire
2098         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
2099         rence  of "iss" because it is able to look behind the starting point to         rence of "iss" because it is able to look behind the starting point  to
2100         discover that it is preceded by a letter.         discover that it is preceded by a letter.
2101    
2102         If a non-zero starting offset is passed when the pattern  is  anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
2103         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
2104         if the pattern does not require the match to be at  the  start  of  the         if  the  pattern  does  not require the match to be at the start of the
2105         subject.         subject.
2106    
2107     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
2108    
2109         In  general, a pattern matches a certain portion of the subject, and in         In general, a pattern matches a certain portion of the subject, and  in
2110         addition, further substrings from the subject  may  be  picked  out  by         addition,  further  substrings  from  the  subject may be picked out by
2111         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
2112         this is called "capturing" in what follows, and the  phrase  "capturing         this  is  called "capturing" in what follows, and the phrase "capturing
2113         subpattern"  is  used for a fragment of a pattern that picks out a sub-         subpattern" is used for a fragment of a pattern that picks out  a  sub-
2114         string. PCRE supports several other kinds of  parenthesized  subpattern         string.  PCRE  supports several other kinds of parenthesized subpattern
2115         that do not cause substrings to be captured.         that do not cause substrings to be captured.
2116    
2117         Captured  substrings are returned to the caller via a vector of integer         Captured substrings are returned to the caller via a vector of integers
2118         offsets whose address is passed in ovector. The number of  elements  in         whose  address is passed in ovector. The number of elements in the vec-
2119         the  vector is passed in ovecsize, which must be a non-negative number.         tor is passed in ovecsize, which must be a non-negative  number.  Note:
2120         Note: this argument is NOT the size of ovector in bytes.         this argument is NOT the size of ovector in bytes.
2121    
2122         The first two-thirds of the vector is used to pass back  captured  sub-         The  first  two-thirds of the vector is used to pass back captured sub-
2123         strings,  each  substring using a pair of integers. The remaining third         strings, each substring using a pair of integers. The  remaining  third
2124         of the vector is used as workspace by pcre_exec() while  matching  cap-         of  the  vector is used as workspace by pcre_exec() while matching cap-
2125         turing  subpatterns, and is not available for passing back information.         turing subpatterns, and is not available for passing back  information.
2126         The length passed in ovecsize should always be a multiple of three.  If         The  number passed in ovecsize should always be a multiple of three. If
2127         it is not, it is rounded down.         it is not, it is rounded down.
2128    
2129         When  a  match  is successful, information about captured substrings is         When a match is successful, information about  captured  substrings  is
2130         returned in pairs of integers, starting at the  beginning  of  ovector,         returned  in  pairs  of integers, starting at the beginning of ovector,
2131         and  continuing  up  to two-thirds of its length at the most. The first         and continuing up to two-thirds of its length at the  most.  The  first
2132         element of a pair is set to the offset of the first character in a sub-         element  of  each pair is set to the byte offset of the first character
2133         string,  and  the  second  is  set to the offset of the first character         in a substring, and the second is set to the byte offset of  the  first
2134         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-         character  after  the end of a substring. Note: these values are always
2135         tor[1],  identify  the  portion  of  the  subject string matched by the         byte offsets, even in UTF-8 mode. They are not character counts.
2136         entire pattern. The next pair is used for the first  capturing  subpat-  
2137         tern, and so on. The value returned by pcre_exec() is one more than the         The first pair of integers, ovector[0]  and  ovector[1],  identify  the
2138         highest numbered pair that has been set. For example, if two substrings         portion  of  the subject string matched by the entire pattern. The next
2139         have  been captured, the returned value is 3. If there are no capturing         pair is used for the first capturing subpattern, and so on.  The  value
2140         subpatterns, the return value from a successful match is 1,  indicating         returned by pcre_exec() is one more than the highest numbered pair that
2141         that just the first pair of offsets has been set.         has been set.  For example, if two substrings have been  captured,  the
2142           returned  value is 3. If there are no capturing subpatterns, the return
2143           value from a successful match is 1, indicating that just the first pair
2144           of offsets has been set.
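
Given the return value and the offset pairs, a captured substring can be copied out of the subject as sketched below. The function copy_pair is a hypothetical illustration, not part of the PCRE API (the library has its own substring functions). It assumes rc is a positive return value from a successful match and that an unset pair holds -1 in both elements.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the substring for pair `n` (0 is the whole
   match) into buf as a zero-terminated string. Returns the substring
   length in bytes, or -1 if the pair was not set or buf is too small.
   The offsets are byte offsets into subject. */
int copy_pair(const char *subject, const int *ovector, int rc, int n,
              char *buf, size_t bufsize)
{
    int start, end;
    size_t len;

    if (n < 0 || n >= rc)
        return -1;                    /* rc is one more than the highest pair */
    start = ovector[2 * n];           /* offset of first byte of substring */
    end = ovector[2 * n + 1];         /* offset of first byte after it */
    if (start < 0 || end < start)
        return -1;                    /* subpattern did not participate */
    len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -1;                    /* no room for data plus terminator */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

Because memcpy() is used rather than a string function, the sketch copies substrings that contain binary zero bytes, although the zero-terminated result then needs the returned length to be interpreted correctly.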
2145    
2146         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
2147         of the string that it matched that is returned.         of the string that it matched that is returned.
2148    
2149         If the vector is too small to hold all the captured substring  offsets,         If the vector is too small to hold all the captured substring  offsets,
2150         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
2151         function returns a value of zero. In particular, if the substring  off-         function returns a value of zero. If the substring offsets are  not  of
2152         sets are not of interest, pcre_exec() may be called with ovector passed         interest,  pcre_exec()  may  be  called with ovector passed as NULL and
2153         as NULL and ovecsize as zero. However, if  the  pattern  contains  back         ovecsize as zero. However, if the pattern contains back references  and
2154         references  and  the  ovector is not big enough to remember the related         the  ovector is not big enough to remember the related substrings, PCRE
2155         substrings, PCRE has to get additional memory for use during  matching.         has to get additional memory for use during matching. Thus it  is  usu-
2156         Thus it is usually advisable to supply an ovector.         ally advisable to supply an ovector.
2157    
2158         The  pcre_info()  function  can  be used to find out how many capturing         The  pcre_info()  function  can  be used to find out how many capturing
2159         subpatterns there are in a compiled  pattern.  The  smallest  size  for         subpatterns there are in a compiled  pattern.  The  smallest  size  for
# Line 2071  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2254  MATCHING A PATTERN: THE TRADITIONAL FUNC
2254    
2255           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2256    
2257         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing         This code is no longer in  use.  It  was  formerly  returned  when  the
2258         items  that are not supported for partial matching. See the pcrepartial         PCRE_PARTIAL  option  was used with a compiled pattern containing items
2259         documentation for details of partial matching.         that were  not  supported  for  partial  matching.  From  release  8.00
2260           onwards, there are no restrictions on partial matching.
2261    
2262           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2263    
2264         An unexpected internal error has occurred. This error could  be  caused         An  unexpected  internal error has occurred. This error could be caused
2265         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2266    
2267           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
2268    
2269         This  error is given if the value of the ovecsize argument is negative.         This error is given if the value of the ovecsize argument is negative.
2270    
2271           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
2272    
# Line 2232  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2416  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2416         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2417         behaviour may not be what you want (see the next section).         behaviour may not be what you want (see the next section).
2418    
2419           Warning: If the pattern uses the "(?|" feature to set up multiple  sub-
2420           patterns  with  the  same  number,  you cannot use names to distinguish
2421           them, because names are not included in the compiled code. The matching
2422           process uses only numbers.
2423    
2424    
2425  DUPLICATE SUBPATTERN NAMES  DUPLICATE SUBPATTERN NAMES
2426    
2427         int pcre_get_stringtable_entries(const pcre *code,         int pcre_get_stringtable_entries(const pcre *code,
2428              const char *name, char **first, char **last);              const char *name, char **first, char **last);
2429    
2430         When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for         When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
2431         subpatterns  are  not  required  to  be unique. Normally, patterns with         subpatterns are not required to  be  unique.  Normally,  patterns  with
2432         duplicate names are such that in any one match, only one of  the  named         duplicate  names  are such that in any one match, only one of the named
2433         subpatterns  participates. An example is shown in the pcrepattern docu-         subpatterns participates. An example is shown in the pcrepattern  docu-
2434         mentation.         mentation.
2435    
2436         When   duplicates   are   present,   pcre_copy_named_substring()    and         When    duplicates   are   present,   pcre_copy_named_substring()   and
2437         pcre_get_named_substring()  return the first substring corresponding to         pcre_get_named_substring() return the first substring corresponding  to
2438         the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING         the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
2439         (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()         (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
2440         function returns one of the numbers that are associated with the  name,         function  returns one of the numbers that are associated with the name,
2441         but it is not defined which it is.         but it is not defined which it is.
2442    
2443         If  you want to get full details of all captured substrings for a given         If you want to get full details of all captured substrings for a  given
2444         name, you must use  the  pcre_get_stringtable_entries()  function.  The         name,  you  must  use  the pcre_get_stringtable_entries() function. The
2445         first argument is the compiled pattern, and the second is the name. The         first argument is the compiled pattern, and the second is the name. The
2446         third and fourth are pointers to variables which  are  updated  by  the         third  and  fourth  are  pointers to variables which are updated by the
2447         function. After it has run, they point to the first and last entries in         function. After it has run, they point to the first and last entries in
2448         the name-to-number table  for  the  given  name.  The  function  itself         the  name-to-number  table  for  the  given  name.  The function itself
2449         returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if         returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
2450         there are none. The format of the table is described above in the  sec-         there  are none. The format of the table is described above in the sec-
2451         tion  entitled  Information  about  a  pattern.  Given all the relevant         tion entitled Information about a  pattern.   Given  all  the  relevant
2452         entries for the name, you can extract each of their numbers, and  hence         entries  for the name, you can extract each of their numbers, and hence
2453         the captured data, if any.         the captured data, if any.
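The last step above (extracting the numbers from the entries) can be sketched as follows. This is only an illustration: collect_group_numbers() is a hypothetical helper, not a PCRE function. It assumes the entry layout used by the name-to-number table, in which each entry starts with a two-byte group number, most significant byte first, followed by the zero-terminated name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: collect the group numbers stored in
   a run of name-table entries. 'first' and 'last' are the pointers filled in
   by pcre_get_stringtable_entries(), and 'entry_size' is its return value.
   Each entry is assumed to begin with a two-byte group number, most
   significant byte first, followed by the zero-terminated name. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max_numbers)
{
    int count = 0;
    const char *entry;

    assert(first != NULL && last != NULL && entry_size > 2);

    for (entry = first; entry <= last; entry += entry_size) {
        const unsigned char *p = (const unsigned char *)entry;
        if (count == max_numbers)
            break;                      /* caller's array is full */
        numbers[count++] = (p[0] << 8) | p[1];
    }
    return count;                       /* how many numbers were stored */
}
```

Given a number n obtained this way, the captured data (if any) is delimited by the offsets ovector[2*n] and ovector[2*n+1] from the matching call; a group that did not take part in the match has both offsets set to -1.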
2454    
2455    
2456  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2457    
2458         The  traditional  matching  function  uses a similar algorithm to Perl,         The traditional matching function uses a  similar  algorithm  to  Perl,
2459         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2460         the  subject.  If you want to find all possible matches, or the longest         the subject. If you want to find all possible matches, or  the  longest
2461         possible match, consider using the alternative matching  function  (see         possible  match,  consider using the alternative matching function (see
2462         below)  instead.  If you cannot use the alternative function, but still         below) instead. If you cannot use the alternative function,  but  still
2463         need to find all possible matches, you can kludge it up by  making  use         need  to  find all possible matches, you can kludge it up by making use
2464         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2465         tation.         tation.
2466    
2467         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2468         tern.   When your callout function is called, extract and save the cur-         tern.  When your callout function is called, extract and save the  cur-
2469         rent matched substring. Then return  1,  which  forces  pcre_exec()  to         rent  matched  substring.  Then  return  1, which forces pcre_exec() to
2470         backtrack  and  try other alternatives. Ultimately, when it runs out of         backtrack and try other alternatives. Ultimately, when it runs  out  of
2471         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2472    
2473    
# Line 2289  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2478  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2478              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2479              int *workspace, int wscount);              int *workspace, int wscount);
2480    
2481         The function pcre_dfa_exec()  is  called  to  match  a  subject  string         The  function  pcre_dfa_exec()  is  called  to  match  a subject string
2482         against  a  compiled pattern, using a matching algorithm that scans the         against a compiled pattern, using a matching algorithm that  scans  the
2483         subject string just once, and does not backtrack.  This  has  different         subject  string  just  once, and does not backtrack. This has different
2484         characteristics  to  the  normal  algorithm, and is not compatible with         characteristics to the normal algorithm, and  is  not  compatible  with
2485         Perl. Some of the features of PCRE patterns are not  supported.  Never-         Perl.  Some  of the features of PCRE patterns are not supported. Never-
2486         theless,  there are times when this kind of matching can be useful. For         theless, there are times when this kind of matching can be useful.  For
2487         a discussion of the two matching algorithms, see the pcrematching docu-         a discussion of the two matching algorithms, see the pcrematching docu-
2488         mentation.         mentation.
2489    
2490         The  arguments  for  the  pcre_dfa_exec()  function are the same as for         The arguments for the pcre_dfa_exec() function  are  the  same  as  for
2491         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
2492         ent  way,  and  this is described below. The other common arguments are         ent way, and this is described below. The other  common  arguments  are
2493         used in the same way as for pcre_exec(), so their  description  is  not         used  in  the  same way as for pcre_exec(), so their description is not
2494         repeated here.         repeated here.
2495    
2496         The  two  additional  arguments provide workspace for the function. The         The two additional arguments provide workspace for  the  function.  The
2497         workspace vector should contain at least 20 elements. It  is  used  for         workspace  vector  should  contain at least 20 elements. It is used for
2498         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2499         workspace will be needed for patterns and subjects where  there  are  a         workspace  will  be  needed for patterns and subjects where there are a
2500         lot of potential matches.         lot of potential matches.
2501    
2502         Here is an example of a simple call to pcre_dfa_exec():         Here is an example of a simple call to pcre_dfa_exec():
# Line 2329  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2518  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2518    
2519     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2520    
2521         The  unused  bits  of  the options argument for pcre_dfa_exec() must be         The unused bits of the options argument  for  pcre_dfa_exec()  must  be
2522         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-         zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
2523         LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,         LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,  PCRE_NO_UTF8_CHECK,
2524         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last         PCRE_PARTIAL_HARD,     PCRE_PARTIAL_SOFT,     PCRE_DFA_SHORTEST,    and
2525         three of these are the same as for pcre_exec(), so their description is         PCRE_DFA_RESTART. All but the last four of these are exactly  the  same
2526         not repeated here.         as for pcre_exec(), so their description is not repeated here.
2527    
2528           PCRE_PARTIAL           PCRE_PARTIAL_HARD
2529             PCRE_PARTIAL_SOFT
2530         This has the same general effect as it does for  pcre_exec(),  but  the  
2531         details   are   slightly   different.  When  PCRE_PARTIAL  is  set  for         These  have the same general effect as they do for pcre_exec(), but the
2532         pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is  converted  into         details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for
2533         PCRE_ERROR_PARTIAL  if  the  end  of the subject is reached, there have         pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-
2534         been no complete matches, but there is still at least one matching pos-         ject is reached and there is still at least  one  matching  possibility
2535         sibility.  The portion of the string that provided the partial match is         that requires additional characters. This happens even if some complete
2536         set as the first matching string.         matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
2537           code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
2538           of the subject is reached, there have been  no  complete  matches,  but
2539           there  is  still  at least one matching possibility. The portion of the
2540           string that provided the longest partial match  is  set  as  the  first
2541           matching string in both cases.
2542    
2543           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2544    
2545         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to         Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
2546         stop as soon as it has found one match. Because of the way the alterna-         stop as soon as it has found one match. Because of the way the alterna-
2547         tive algorithm works, this is necessarily the shortest  possible  match         tive  algorithm  works, this is necessarily the shortest possible match
2548         at the first possible matching point in the subject string.         at the first possible matching point in the subject string.
2549    
2550           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2551    
2552         When  pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option, and         When pcre_dfa_exec() returns a partial match, it is possible to call it
2553         returns a partial match, it is possible to call it  again,  with  addi-         again,  with  additional  subject characters, and have it continue with
2554         tional  subject  characters,  and have it continue with the same match.         the same match. The PCRE_DFA_RESTART option requests this action;  when
2555         The PCRE_DFA_RESTART option requests this action; when it is  set,  the         it  is  set,  the workspace and wscount options must reference the same
2556         workspace  and wscount options must reference the same vector as before         vector as before because data about the match so far is  left  in  them
2557         because data about the match so far is left in  them  after  a  partial         after a partial match. There is more discussion of this facility in the
2558         match.  There  is  more  discussion of this facility in the pcrepartial         pcrepartial documentation.
        documentation.  
2559    
2560     Successful returns from pcre_dfa_exec()     Successful returns from pcre_dfa_exec()
2561    
# Line 2439  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2632  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2632  SEE ALSO  SEE ALSO
2633    
2634         pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),         pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
2635         pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),  pcrestack(3).         pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
2636    
2637    
2638  AUTHOR  AUTHOR
# Line 2451  AUTHOR Line 2644  AUTHOR
2644    
2645  REVISION  REVISION
2646    
2647         Last updated: 09 August 2007         Last updated: 01 September 2009
2648         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
2649  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2650    
2651    
2652  PCRECALLOUT(3)                                                  PCRECALLOUT(3)  PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2653    
2654    
# Line 2503  PCRE CALLOUTS Line 2696  PCRE CALLOUTS
2696  MISSING CALLOUTS  MISSING CALLOUTS
2697    
2698         You  should  be  aware  that,  because of optimizations in the way PCRE         You  should  be  aware  that,  because of optimizations in the way PCRE
2699         matches patterns, callouts sometimes do not happen. For example, if the         matches patterns by default, callouts  sometimes  do  not  happen.  For
2700         pattern is         example, if the pattern is
2701    
2702           ab(?C4)cd           ab(?C4)cd
2703    
# Line 2513  MISSING CALLOUTS Line 2706  MISSING CALLOUTS
2706         ever  start,  and  the  callout is never reached. However, with "abyd",         ever  start,  and  the  callout is never reached. However, with "abyd",
2707         though the result is still no match, the callout is obeyed.         though the result is still no match, the callout is obeyed.
2708    
2709           You can disable these optimizations by passing the  PCRE_NO_START_OPTI-
2710           MIZE  option  to  pcre_exec()  or  pcre_dfa_exec(). This slows down the
2711           matching process, but does ensure that callouts  such  as  the  example
2712           above are obeyed.
2713    
2714    
2715  THE CALLOUT INTERFACE  THE CALLOUT INTERFACE
2716    
2717         During matching, when PCRE reaches a callout point, the external  func-         During  matching, when PCRE reaches a callout point, the external func-
2718         tion  defined by pcre_callout is called (if it is set). This applies to         tion defined by pcre_callout is called (if it is set). This applies  to
2719         both the pcre_exec() and the pcre_dfa_exec()  matching  functions.  The         both  the  pcre_exec()  and the pcre_dfa_exec() matching functions. The
2720         only  argument  to  the callout function is a pointer to a pcre_callout         only argument to the callout function is a pointer  to  a  pcre_callout
2721         block. This structure contains the following fields:         block. This structure contains the following fields:
2722    
2723           int          version;           int          version;
# Line 2535  THE CALLOUT INTERFACE Line 2733  THE CALLOUT INTERFACE
2733           int          pattern_position;           int          pattern_position;
2734           int          next_item_length;           int          next_item_length;
2735    
2736         The version field is an integer containing the version  number  of  the         The  version  field  is an integer containing the version number of the
2737         block  format. The initial version was 0; the current version is 1. The         block format. The initial version was 0; the current version is 1.  The
2738         version number will change again in future  if  additional  fields  are         version  number  will  change  again in future if additional fields are
2739         added, but the intention is never to remove any of the existing fields.         added, but the intention is never to remove any of the existing fields.
2740    
2741         The callout_number field contains the number of the  callout,  as  com-         The callout_number field contains the number of the  callout,  as  com-
# Line 2622  AUTHOR Line 2820  AUTHOR
2820    
2821  REVISION  REVISION
2822    
2823         Last updated: 29 May 2007         Last updated: 15 March 2009
2824         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
2825  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2826    
2827    
2828  PCRECOMPAT(3)                                                    PCRECOMPAT(3)  PCRECOMPAT(3)                                                    PCRECOMPAT(3)
2829    
2830    
# Line 2639  DIFFERENCES BETWEEN PCRE AND PERL Line 2837  DIFFERENCES BETWEEN PCRE AND PERL
2837         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2838         handle regular expressions. The differences described here  are  mainly         handle regular expressions. The differences described here  are  mainly
2839         with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain         with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
2840         some features that are expected to be in the forthcoming Perl 5.10.         some features that are in Perl 5.10.
2841    
2842         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
2843         of  what  it does have are given in the section on UTF-8 support in the         of  what  it does have are given in the section on UTF-8 support in the
# Line 2736  DIFFERENCES BETWEEN PCRE AND PERL Line 2934  DIFFERENCES BETWEEN PCRE AND PERL
2934         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
2935         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
2936    
2937         (g) The callout facility is PCRE-specific.         (g) The \R escape sequence can be restricted to match only CR,  LF,  or
2938           CRLF by the PCRE_BSR_ANYCRLF option.
2939    
2940         (h) The partial matching facility is PCRE-specific.         (h) The callout facility is PCRE-specific.
2941    
2942         (i) Patterns compiled by PCRE can be saved and re-used at a later time,         (i) The partial matching facility is PCRE-specific.
2943    
2944           (j) Patterns compiled by PCRE can be saved and re-used at a later time,
2945         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
2946    
2947         (j)  The  alternative  matching function (pcre_dfa_exec()) matches in a         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a
2948         different way and is not Perl-compatible.         different way and is not Perl-compatible.
2949    
2950           (l)  PCRE  recognizes some special sequences such as (*CR) at the start
2951           of a pattern that set overall options that cannot be changed within the
2952           pattern.
2953    
2954    
2955  AUTHOR  AUTHOR
2956    
# Line 2756  AUTHOR Line 2961  AUTHOR
2961    
2962  REVISION  REVISION
2963    
2964         Last updated: 08 August 2007         Last updated: 25 August 2009
2965         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
2966  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2967    
2968    
2969  PCREPATTERN(3)                                                  PCREPATTERN(3)  PCREPATTERN(3)                                                  PCREPATTERN(3)
2970    
2971    
# Line 2772  PCRE REGULAR EXPRESSION DETAILS Line 2977  PCRE REGULAR EXPRESSION DETAILS
2977    
2978         The  syntax and semantics of the regular expressions that are supported         The  syntax and semantics of the regular expressions that are supported
2979         by PCRE are described in detail below. There is a quick-reference  syn-         by PCRE are described in detail below. There is a quick-reference  syn-
2980         tax  summary  in  the  pcresyntax  page. Perl's regular expressions are         tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
2981         described in its own documentation, and regular expressions in  general         semantics as closely as it can. PCRE  also  supports  some  alternative
2982         are  covered in a number of books, some of which have copious examples.         regular  expression  syntax (which does not conflict with the Perl syn-
2983         Jeffrey  Friedl's  "Mastering  Regular   Expressions",   published   by         tax) in order to provide some compatibility with regular expressions in
2984         O'Reilly,  covers regular expressions in great detail. This description         Python, .NET, and Oniguruma.
2985         of PCRE's regular expressions is intended as reference material.  
2986           Perl's  regular expressions are described in its own documentation, and
2987           regular expressions in general are covered in a number of  books,  some
2988           of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
2989           Expressions", published by  O'Reilly,  covers  regular  expressions  in
2990           great  detail.  This  description  of  PCRE's  regular  expressions  is
2991           intended as reference material.
2992    
2993         The original operation of PCRE was on strings of  one-byte  characters.         The original operation of PCRE was on strings of  one-byte  characters.
2994         However,  there is now also support for UTF-8 character strings. To use         However,  there is now also support for UTF-8 character strings. To use
2995         this, you must build PCRE to  include  UTF-8  support,  and  then  call         this, you must build PCRE to  include  UTF-8  support,  and  then  call
2996         pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern         pcre_compile()  with  the  PCRE_UTF8  option.  There  is also a special
2997         matching is mentioned in several places below. There is also a  summary         sequence that can be given at the start of a pattern:
2998         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre  
2999         page.           (*UTF8)
3000    
3001           Starting a pattern with this sequence  is  equivalent  to  setting  the
3002           PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
3003           UTF-8 mode affects pattern matching  is  mentioned  in  several  places
3004           below.  There  is  also  a  summary of UTF-8 features in the section on
3005           UTF-8 support in the main pcre page.
3006    
3007         The remainder of this document discusses the  patterns  that  are  sup-         The remainder of this document discusses the  patterns  that  are  sup-
3008         ported  by  PCRE when its main matching function, pcre_exec(), is used.         ported  by  PCRE when its main matching function, pcre_exec(), is used.
# Line 2797  PCRE REGULAR EXPRESSION DETAILS Line 3014  PCRE REGULAR EXPRESSION DETAILS
3014         discussed in the pcrematching page.         discussed in the pcrematching page.
3015    
3016    
3017    NEWLINE CONVENTIONS
3018    
3019           PCRE  supports five different conventions for indicating line breaks in
3020           strings: a single CR (carriage return) character, a  single  LF  (line-
3021           feed) character, the two-character sequence CRLF, any of the three pre-
3022           ceding, or any Unicode newline sequence. The pcreapi page  has  further
3023           discussion  about newlines, and shows how to set the newline convention
3024           in the options arguments for the compiling and matching functions.
3025    
3026           It is also possible to specify a newline convention by starting a  pat-
3027           tern string with one of the following five sequences:
3028    
3029             (*CR)        carriage return
3030             (*LF)        linefeed
3031             (*CRLF)      carriage return, followed by linefeed
3032             (*ANYCRLF)   any of the three above
3033             (*ANY)       all Unicode newline sequences
3034    
3035           These override the default and the options given to pcre_compile(). For
3036           example, on a Unix system where LF is the default newline sequence, the
3037           pattern
3038    
3039             (*CR)a.b
3040    
3041           changes the convention to CR. That pattern matches "a\nb" because LF is
3042           no longer a newline. Note that these special settings,  which  are  not
3043           Perl-compatible,  are  recognized  only at the very start of a pattern,
3044           and that they must be in upper case.  If  more  than  one  of  them  is
3045           present, the last one is used.
3046    
3047           The  newline  convention  does  not  affect what the \R escape sequence
3048           matches. By default, this is any Unicode  newline  sequence,  for  Perl
3049           compatibility.  However, this can be changed; see the description of \R
3050           in the section entitled "Newline sequences" below. A change of \R  set-
3051           ting can be combined with a change of newline convention.
3052    
3053    
3054  CHARACTERS AND METACHARACTERS  CHARACTERS AND METACHARACTERS
3055    
3056         A  regular  expression  is  a pattern that is matched against a subject         A  regular  expression  is  a pattern that is matched against a subject
# Line 2852  CHARACTERS AND METACHARACTERS Line 3106  CHARACTERS AND METACHARACTERS
3106                    syntax)                    syntax)
3107           ]      terminates the character class           ]      terminates the character class
3108    
3109         The  following sections describe the use of each of the metacharacters.         The following sections describe the use of each of the metacharacters.
3110    
3111    
3112  BACKSLASH  BACKSLASH
3113    
3114         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
3115         a  non-alphanumeric  character,  it takes away any special meaning that         a non-alphanumeric character, it takes away any  special  meaning  that
3116         character may have. This  use  of  backslash  as  an  escape  character         character  may  have.  This  use  of  backslash  as an escape character
3117         applies both inside and outside character classes.         applies both inside and outside character classes.
3118    
3119         For  example,  if  you want to match a * character, you write \* in the         For example, if you want to match a * character, you write  \*  in  the
3120         pattern.  This escaping action applies whether  or  not  the  following         pattern.   This  escaping  action  applies whether or not the following
3121         character  would  otherwise be interpreted as a metacharacter, so it is         character would otherwise be interpreted as a metacharacter, so  it  is
3122         always safe to precede a non-alphanumeric  with  backslash  to  specify         always  safe  to  precede  a non-alphanumeric with backslash to specify
3123         that  it stands for itself. In particular, if you want to match a back-         that it stands for itself. In particular, if you want to match a  back-
3124         slash, you write \\.         slash, you write \\.
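Because it is always safe to precede a non-alphanumeric with a backslash, a string can be turned into a literal pattern mechanically. The following is a minimal sketch of that idea; quote_literal() is a hypothetical helper, not a PCRE function. It assumes one-byte ASCII text and, as a choice made for this sketch only, copies bytes above 127 unchanged so that multi-byte sequences are not split.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copy 'src' to 'dst', putting a backslash before every
   ASCII character that is not a letter or digit. Bytes above 127 are copied
   unchanged (an assumption of this sketch). Returns the length written,
   excluding the terminating zero, or -1 if 'dst_size' is too small. */
int quote_literal(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int needs_escape = c < 128 && !isalnum(c);

        /* Keep room for this character and the terminating zero. */
        if (out + (needs_escape ? 2 : 1) >= dst_size)
            return -1;
        if (needs_escape)
            dst[out++] = '\\';
        dst[out++] = (char)c;
    }
    if (out >= dst_size)
        return -1;
    dst[out] = '\0';
    return (int)out;
}
```

For example, the input a*b produces a\*b, and a single backslash produces \\. Letters and digits are never escaped, because a backslash followed by an alphanumeric has a special meaning of its own.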
3125    
3126         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
3127         the  pattern (other than in a character class) and characters between a         the pattern (other than in a character class) and characters between  a
3128         # outside a character class and the next newline are ignored. An escap-         # outside a character class and the next newline are ignored. An escap-
3129         ing  backslash  can  be  used to include a whitespace or # character as         ing backslash can be used to include a whitespace  or  #  character  as
3130         part of the pattern.         part of the pattern.
3131    
3132         If you want to remove the special meaning from a  sequence  of  charac-         If  you  want  to remove the special meaning from a sequence of charac-
3133         ters,  you can do so by putting them between \Q and \E. This is differ-         ters, you can do so by putting them between \Q and \E. This is  differ-
3134         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
3135         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
3136         tion. Note the following examples:         tion. Note the following examples:
3137    
3138           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 2888  BACKSLASH Line 3142  BACKSLASH
3142           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
3143           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
3144    
3145         The \Q...\E sequence is recognized both inside  and  outside  character         The  \Q...\E  sequence  is recognized both inside and outside character
3146         classes.         classes.
3147    
3148     Non-printing characters     Non-printing characters
3149    
3150         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
3151         acters in patterns in a visible manner. There is no restriction on  the         acters  in patterns in a visible manner. There is no restriction on the
3152         appearance  of non-printing characters, apart from the binary zero that         appearance of non-printing characters, apart from the binary zero  that
3153         terminates a pattern, but when a pattern  is  being  prepared  by  text         terminates  a  pattern,  but  when  a pattern is being prepared by text
3154         editing,  it  is  usually  easier  to  use  one of the following escape         editing, it is usually easier  to  use  one  of  the  following  escape
3155         sequences than the binary character it represents:         sequences than the binary character it represents:
3156    
3157           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
3158           \cx       "control-x", where x is any character           \cx       "control-x", where x is any character
3159           \e        escape (hex 1B)           \e        escape (hex 1B)
3160           \f        formfeed (hex 0C)           \f        formfeed (hex 0C)
3161           \n        newline (hex 0A)           \n        linefeed (hex 0A)
3162           \r        carriage return (hex 0D)           \r        carriage return (hex 0D)
3163           \t        tab (hex 09)           \t        tab (hex 09)
3164           \ddd      character with octal code ddd, or backreference           \ddd      character with octal code ddd, or backreference
3165           \xhh      character with hex code hh           \xhh      character with hex code hh
3166           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh..
3167    
3168         The precise effect of \cx is as follows: if x is a lower  case  letter,         The  precise  effect of \cx is as follows: if x is a lower case letter,
3169         it  is converted to upper case. Then bit 6 of the character (hex 40) is         it is converted to upper case. Then bit 6 of the character (hex 40)  is
3170         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
3171         becomes hex 7B.         becomes hex 7B.
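
The \cx rule above can be sketched in a few lines of C. This is only an illustration of the arithmetic, assuming an ASCII character set; the function names control_escape and check_control_escape are made up for this sketch and are not part of the PCRE API.

```c
#include <assert.h>

/* Sketch of the \cx rule, assuming ASCII: a lower case letter is first
   converted to upper case, then bit 6 (hex 40) is inverted.
   control_escape is a hypothetical name, not a PCRE function. */
unsigned char control_escape(unsigned char x)
{
    if (x >= 'a' && x <= 'z')
        x = (unsigned char)(x - 'a' + 'A');
    return (unsigned char)(x ^ 0x40);
}

/* The three examples given in the text. */
void check_control_escape(void)
{
    assert(control_escape('z') == 0x1A);   /* \cz */
    assert(control_escape('{') == 0x3B);   /* \c{ */
    assert(control_escape(';') == 0x7B);   /* \c; */
}
```

Because the bit is inverted rather than cleared, characters that already have bit 6 unset (such as ";") come out with it set.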
3172    
3173         After  \x, from zero to two hexadecimal digits are read (letters can be         After \x, from zero to two hexadecimal digits are read (letters can  be
3174         in upper or lower case). Any number of hexadecimal  digits  may  appear         in  upper  or  lower case). Any number of hexadecimal digits may appear
3175         between  \x{  and  },  but the value of the character code must be less         between \x{ and }, but the value of the character  code  must  be  less
3176         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
3177         the  maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger         the maximum value in hexadecimal is 7FFFFFFF. Note that this is  bigger
3178         than the largest Unicode code point, which is 10FFFF.         than the largest Unicode code point, which is 10FFFF.
3179    
3180         If characters other than hexadecimal digits appear between \x{  and  },         If  characters  other than hexadecimal digits appear between \x{ and },
3181         or if there is no terminating }, this form of escape is not recognized.         or if there is no terminating }, this form of escape is not recognized.
3182         Instead, the initial \x will be  interpreted  as  a  basic  hexadecimal         Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal
3183         escape,  with  no  following  digits, giving a character whose value is         escape, with no following digits, giving a  character  whose  value  is
3184         zero.         zero.
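
The rules for the braced form \x{hhh..} can be summarized in a small parser. The sketch below is an assumption-laden illustration, not PCRE's own code: the name parse_braced_hex, its return codes, and the decision to accept empty braces as zero are all choices made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API. p points just after "\x".
   Returns  1 if the braced form is recognized (*value and *used are set),
            0 if it is not recognized, in which case the caller treats \x
              as a basic hex escape with no digits, giving value zero,
           -1 if the form is recognized but the value is too large
              (256 or more in non-UTF-8 mode, 2**31 or more in UTF-8 mode).
   Empty braces are accepted here and give zero; that is an assumption. */
int parse_braced_hex(const char *p, int utf8,
                     unsigned long *value, size_t *used)
{
    const unsigned long limit = utf8 ? 0x7FFFFFFFul : 0xFFul;
    unsigned long v = 0;
    size_t i = 1;
    int too_big = 0;

    *value = 0;
    *used = 0;
    if (p[0] != '{')
        return 0;

    /* Any number of hexadecimal digits, in upper or lower case. */
    while (isxdigit((unsigned char)p[i])) {
        int c = tolower((unsigned char)p[i]);
        unsigned long d = (unsigned long)((c >= 'a') ? c - 'a' + 10 : c - '0');
        if (v > (limit >> 4))
            too_big = 1;            /* next shift would exceed the limit */
        else
            v = (v << 4) | d;
        i++;
    }

    /* A non-hex character or a missing '}' means "not recognized". */
    if (p[i] != '}')
        return 0;
    if (too_big)
        return -1;

    *value = v;
    *used = i + 1;                  /* includes both braces */
    return 1;
}

/* \x{dc} has the same value as \xdc in either mode. */
void check_parse_braced_hex(void)
{
    unsigned long v;
    size_t n;
    assert(parse_braced_hex("{dc}", 0, &v, &n) == 1 && v == 0xDC && n == 4);
    assert(parse_braced_hex("{100}", 0, &v, &n) == -1);
    assert(parse_braced_hex("{100}", 1, &v, &n) == 1 && v == 0x100);
}
```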
3185    
3186         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
3187         two  syntaxes  for  \x. There is no difference in the way they are han-         two syntaxes for \x. There is no difference in the way  they  are  han-
3188         dled. For example, \xdc is exactly the same as \x{dc}.         dled. For example, \xdc is exactly the same as \x{dc}.
3189    
3190         After \0 up to two further octal digits are read. If  there  are  fewer         After  \0  up  to two further octal digits are read. If there are fewer
3191         than  two  digits,  just  those  that  are  present  are used. Thus the         than two digits, just  those  that  are  present  are  used.  Thus  the
3192         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
3193         (code  value 7). Make sure you supply two digits after the initial zero         (code value 7). Make sure you supply two digits after the initial  zero
3194         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
3195    
3196         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
3197         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
3198         its as a decimal number. If the number is less than  10,  or  if  there         its  as  a  decimal  number. If the number is less than 10, or if there
3199         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
3200         expression, the entire  sequence  is  taken  as  a  back  reference.  A         expression,  the  entire  sequence  is  taken  as  a  back reference. A
3201         description  of how this works is given later, following the discussion         description of how this works is given later, following the  discussion
3202         of parenthesized subpatterns.         of parenthesized subpatterns.
3203    
3204         Inside a character class, or if the decimal number is  greater  than  9         Inside  a  character  class, or if the decimal number is greater than 9
3205         and  there have not been that many capturing subpatterns, PCRE re-reads         and there have not been that many capturing subpatterns, PCRE  re-reads
3206         up to three octal digits following the backslash, and uses them to gen-         up to three octal digits following the backslash, and uses them to gen-
3207         erate  a data character. Any subsequent digits stand for themselves. In         erate a data character. Any subsequent digits stand for themselves.  In
3208         non-UTF-8 mode, the value of a character specified  in  octal  must  be         non-UTF-8  mode,  the  value  of a character specified in octal must be
3209         less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For         less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
3210         example:         example:
3211    
3212           \040   is another way of writing a space           \040   is another way of writing a space
# Line 2970  BACKSLASH Line 3224  BACKSLASH
3224           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
3225                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
3226    
3227         Note that octal values of 100 or greater must not be  introduced  by  a         Note  that  octal  values of 100 or greater must not be introduced by a
3228         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
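
The decision between a back reference and an octal character, as described above for a backslash followed by a digit other than 0, can be sketched as follows. This is a simplified model written for this note: the type and function names are hypothetical, the first digit is assumed to be 1 to 9, and the upper limits on octal values (\400 or \777) are not checked.

```c
#include <assert.h>
#include <stddef.h>

enum escape_kind { ESC_BACK_REFERENCE, ESC_OCTAL_CHARACTER };

struct digit_escape {
    enum escape_kind kind;
    unsigned int value;   /* group number, or character code */
    size_t used;          /* digits consumed after the backslash */
};

/* Hypothetical sketch, not PCRE's code. digits points at the first digit
   after the backslash (assumed to be 1-9); groups_so_far is the number of
   capturing left parentheses seen earlier in the pattern. */
struct digit_escape classify_digit_escape(const char *digits,
                                          unsigned int groups_so_far,
                                          int in_class)
{
    struct digit_escape r;
    unsigned int number = 0;
    size_t n = 0;

    if (!in_class) {
        /* Read all following digits as one decimal number. */
        while (digits[n] >= '0' && digits[n] <= '9') {
            if (number < 1000000u)          /* saturate to avoid overflow */
                number = number * 10u + (unsigned int)(digits[n] - '0');
            n++;
        }
        if (number < 10u || number <= groups_so_far) {
            r.kind = ESC_BACK_REFERENCE;
            r.value = number;
            r.used = n;
            return r;
        }
    }

    /* Otherwise re-read up to three octal digits; anything after them
       (or a leading 8 or 9) stands for itself. */
    r.kind = ESC_OCTAL_CHARACTER;
    r.value = 0;
    r.used = 0;
    while (r.used < 3 && digits[r.used] >= '0' && digits[r.used] <= '7') {
        r.value = r.value * 8u + (unsigned int)(digits[r.used] - '0');
        r.used++;
    }
    return r;
}

/* \81 with no capturing groups: a binary zero, with "8" and "1" left
   over as literal characters. */
void check_classify_digit_escape(void)
{
    struct digit_escape r = classify_digit_escape("81", 0, 0);
    assert(r.kind == ESC_OCTAL_CHARACTER && r.value == 0 && r.used == 0);
}
```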
3229    
3230         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
3231         inside and outside character classes. In addition, inside  a  character         inside  and  outside character classes. In addition, inside a character
3232         class,  the  sequence \b is interpreted as the backspace character (hex         class, the sequence \b is interpreted as the backspace  character  (hex
3233         08), and the sequences \R and \X are interpreted as the characters  "R"         08),  and the sequences \R and \X are interpreted as the characters "R"
3234         and  "X", respectively. Outside a character class, these sequences have         and "X", respectively. Outside a character class, these sequences  have
3235         different meanings (see below).         different meanings (see below).
3236    
3237     Absolute and relative back references     Absolute and relative back references
3238    
3239         The sequence \g followed by an unsigned or a negative  number,  option-         The  sequence  \g followed by an unsigned or a negative number, option-
3240         ally  enclosed  in braces, is an absolute or relative back reference. A         ally enclosed in braces, is an absolute or relative back  reference.  A
3241         named back reference can be coded as \g{name}. Back references are dis-         named back reference can be coded as \g{name}. Back references are dis-
3242         cussed later, following the discussion of parenthesized subpatterns.         cussed later, following the discussion of parenthesized subpatterns.
3243    
3244       Absolute and relative subroutine calls
3245    
3246           For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
3247           name or a number enclosed either in angle brackets or single quotes, is
3248           an alternative syntax for referencing a subpattern as  a  "subroutine".
3249           Details  are  discussed  later.   Note  that  \g{...} (Perl syntax) and
3250           \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back
3251           reference; the latter is a subroutine call.
3252    
3253     Generic character types     Generic character types
3254    
3255         Another use of backslash is for specifying generic character types. The         Another use of backslash is for specifying generic character types. The
# Line 3022  BACKSLASH Line 3285  BACKSLASH
3285         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
3286         code  character  property  support is available. These sequences retain         code  character  property  support is available. These sequences retain
3287         their original meanings from before UTF-8 support was available, mainly         their original meanings from before UTF-8 support was available, mainly
3288         for efficiency reasons.         for  efficiency  reasons. Note that this also affects \b, because it is
3289           defined in terms of \w and \W.
3290    
3291         The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to         The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
3292         the other sequences, these do match certain high-valued  codepoints  in         the  other  sequences, these do match certain high-valued codepoints in
3293         UTF-8 mode.  The horizontal space characters are:         UTF-8 mode.  The horizontal space characters are:
3294    
3295           U+0009     Horizontal tab           U+0009     Horizontal tab
# Line 3059  BACKSLASH Line 3323  BACKSLASH
3323           U+2029     Paragraph separator           U+2029     Paragraph separator
3324    
3325         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
3326         is a letter or digit. The definition of  letters  and  digits  is  con-         is  a  letter  or  digit.  The definition of letters and digits is con-
3327         trolled  by PCRE's low-valued character tables, and may vary if locale-         trolled by PCRE's low-valued character tables, and may vary if  locale-
3328         specific matching is taking place (see "Locale support" in the  pcreapi         specific  matching is taking place (see "Locale support" in the pcreapi
3329         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
3330         systems, or "french" in Windows, some character codes greater than  128         systems,  or "french" in Windows, some character codes greater than 128
3331         are  used for accented letters, and these are matched by \w. The use of         are used for accented letters, and these are matched by \w. The use  of
3332         locales with Unicode is discouraged.         locales with Unicode is discouraged.
3333    
3334     Newline sequences     Newline sequences
3335    
3336         Outside a character class, the escape sequence \R matches  any  Unicode         Outside  a  character class, by default, the escape sequence \R matches
3337         newline  sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is         any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
3338         equivalent to the following:         mode \R is equivalent to the following:
3339    
3340           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
3341    
3342         This is an example of an "atomic group", details  of  which  are  given         This  is  an  example  of an "atomic group", details of which are given
3343         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
3344         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
3345         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3346         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
3347         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
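
As a rough illustration of the group above, the following C sketch reports how many bytes \R would consume at the start of a buffer in non-UTF-8 mode. The function name match_backslash_R and the anycrlf flag (standing in for a CR, LF, or CRLF only setting) are inventions for this example and are not PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of \R in non-UTF-8 mode. Returns the number of
   bytes consumed at s[0], or 0 if there is no newline sequence there.
   When anycrlf is non-zero, only CR, LF, and CRLF are accepted. */
size_t match_backslash_R(const unsigned char *s, size_t len, int anycrlf)
{
    if (len == 0)
        return 0;

    /* CRLF is tried first and taken as a unit: like the atomic group,
       this function never falls back to matching the CR on its own. */
    if (len >= 2 && s[0] == '\r' && s[1] == '\n')
        return 2;

    switch (s[0]) {
    case '\n':                  /* LF,  U+000A */
    case '\r':                  /* CR,  U+000D */
        return 1;
    case 0x0b:                  /* VT,  U+000B */
    case 0x0c:                  /* FF,  U+000C */
    case 0x85:                  /* NEL, U+0085 */
        return anycrlf ? 0 : 1;
    default:
        return 0;
    }
}

void check_match_backslash_R(void)
{
    assert(match_backslash_R((const unsigned char *)"\r\nx", 3, 0) == 2);
    assert(match_backslash_R((const unsigned char *)"\x0b", 1, 0) == 1);
    assert(match_backslash_R((const unsigned char *)"\x0b", 1, 1) == 0);
}
```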
3348    
3349         In  UTF-8  mode, two additional characters whose codepoints are greater         In UTF-8 mode, two additional characters whose codepoints  are  greater
3350         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3351         rator,  U+2029).   Unicode character property support is not needed for         rator, U+2029).  Unicode character property support is not  needed  for
3352         these characters to be recognized.         these characters to be recognized.
3353    
3354           It is possible to restrict \R to match only CR, LF, or CRLF (instead of
3355           the complete set  of  Unicode  line  endings)  by  setting  the  option
3356           PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
3357           (BSR is an abbreviation for "backslash R".) This can be made the default
3358           when  PCRE  is  built;  if this is the case, the other behaviour can be
3359           requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
3360           specify  these  settings  by  starting a pattern string with one of the
3361           following sequences:
3362    
3363             (*BSR_ANYCRLF)   CR, LF, or CRLF only
3364             (*BSR_UNICODE)   any Unicode newline sequence
3365    
3366           These override the default and the options given to pcre_compile(), but
3367           they can be overridden by options given to pcre_exec(). Note that these
3368           special settings, which are not Perl-compatible, are recognized only at
3369           the  very  start  of a pattern, and that they must be in upper case. If
3370           more than one of them is present, the last one is  used.  They  can  be
3371           combined  with  a  change of newline convention, for example, a pattern
3372           can start with:
3373    
3374             (*ANY)(*BSR_ANYCRLF)
3375    
3376         Inside a character class, \R matches the letter "R".         Inside a character class, \R matches the letter "R".
3377    
3378     Unicode character properties     Unicode character properties
# Line 3526  POSIX CHARACTER CLASSES Line 3812  POSIX CHARACTER CLASSES
3812    
3813  VERTICAL BAR  VERTICAL BAR
3814    
3815         Vertical  bar characters are used to separate alternative patterns. For         Vertical bar characters are used to separate alternative patterns.  For
3816         example, the pattern         example, the pattern
3817    
3818           gilbert|sullivan           gilbert|sullivan
3819    
3820         matches either "gilbert" or "sullivan". Any number of alternatives  may         matches  either "gilbert" or "sullivan". Any number of alternatives may
3821         appear,  and  an  empty  alternative  is  permitted (matching the empty         appear, and an empty  alternative  is  permitted  (matching  the  empty
3822         string). The matching process tries each alternative in turn, from left         string). The matching process tries each alternative in turn, from left
3823         to  right, and the first one that succeeds is used. If the alternatives         to right, and the first one that succeeds is used. If the  alternatives
3824         are within a subpattern (defined below), "succeeds" means matching  the         are  within a subpattern (defined below), "succeeds" means matching the
3825         rest  of the main pattern as well as the alternative in the subpattern.         rest of the main pattern as well as the alternative in the subpattern.
3826    
3827    
3828  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
3829    
3830         The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and         The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
3831         PCRE_EXTENDED  options  can  be  changed  from  within the pattern by a         PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from
3832         sequence of Perl option letters enclosed  between  "(?"  and  ")".  The         within the pattern by  a  sequence  of  Perl  option  letters  enclosed
3833         option letters are         between "(?" and ")".  The option letters are
3834    
3835           i  for PCRE_CASELESS           i  for PCRE_CASELESS
3836           m  for PCRE_MULTILINE           m  for PCRE_MULTILINE
# Line 3558  INTERNAL OPTION SETTING Line 3844  INTERNAL OPTION SETTING
3844         is  also  permitted.  If  a  letter  appears  both before and after the         is  also  permitted.  If  a  letter  appears  both before and after the
3845         hyphen, the option is unset.         hyphen, the option is unset.
3846    
3847         When an option change occurs at top level (that is, not inside  subpat-         The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
3848         tern  parentheses),  the change applies to the remainder of the pattern         can  be changed in the same way as the Perl-compatible options by using
3849         that follows.  If the change is placed right at the start of a pattern,         the characters J, U and X respectively.
3850         PCRE extracts it into the global options (and it will therefore show up  
3851         in data extracted by the pcre_fullinfo() function).         When one of these option changes occurs at  top  level  (that  is,  not
3852           inside  subpattern parentheses), the change applies to the remainder of
3853           the pattern that follows. If the change is placed right at the start of
3854           a pattern, PCRE extracts it into the global options (and it will there-
3855           fore show up in data extracted by the pcre_fullinfo() function).
3856    
3857         An option change within a subpattern (see below for  a  description  of         An option change within a subpattern (see below for  a  description  of
3858         subpatterns) affects only that part of the current pattern that follows         subpatterns) affects only that part of the current pattern that follows
# Line 3583  INTERNAL OPTION SETTING Line 3873  INTERNAL OPTION SETTING
3873         the effects of option settings happen at compile time. There  would  be         the effects of option settings happen at compile time. There  would  be
3874         some very weird behaviour otherwise.         some very weird behaviour otherwise.
3875    
3876         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA         Note:  There  are  other  PCRE-specific  options that can be set by the
3877         can be changed in the same way as the Perl-compatible options by  using         application when the compile or match functions  are  called.  In  some
3878         the characters J, U and X respectively.         cases the pattern can contain special leading sequences such as (*CRLF)
3879           to override what the application has set or what  has  been  defaulted.
3880           Details  are  given  in the section entitled "Newline sequences" above.
3881           There is also the (*UTF8) leading sequence that  can  be  used  to  set
3882           UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option.
3883    
3884    
3885  SUBPATTERNS  SUBPATTERNS
# Line 3724  NAMED SUBPATTERNS Line 4018  NAMED SUBPATTERNS
4018         lowest  number  is used. For further details of the interfaces for han-         lowest  number  is used. For further details of the interfaces for han-
4019         dling named subpatterns, see the pcreapi documentation.         dling named subpatterns, see the pcreapi documentation.
4020    
4021           Warning: You cannot use different names to distinguish between two sub-
4022           patterns  with  the same number (see the previous section) because PCRE
4023           uses only the numbers when matching.
4024    
4025    
4026  REPETITION  REPETITION
4027    
# Line 3764  REPETITION Line 4062  REPETITION
4062         the syntax of a quantifier, is taken as a literal character. For  exam-         the syntax of a quantifier, is taken as a literal character. For  exam-
4063         ple, {,6} is not a quantifier, but a literal string of four characters.         ple, {,6} is not a quantifier, but a literal string of four characters.
4064    
4065         In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to         In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
4066         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
4067         acters, each of which is represented by a two-byte sequence. Similarly,         acters, each of which is represented by a two-byte sequence. Similarly,
4068         when Unicode property support is available, \X{3} matches three Unicode         when Unicode property support is available, \X{3} matches three Unicode
4069         extended sequences, each of which may be several bytes long  (and  they         extended  sequences,  each of which may be several bytes long (and they
4070         may be of different lengths).         may be of different lengths).
4071    
4072         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
4073         the previous item and the quantifier were not present.         the previous item and the quantifier were not present. This may be use-
4074           ful for subpatterns that are referenced as subroutines  from  elsewhere
4075           in the pattern. Items other than subpatterns that have a {0} quantifier
4076           are omitted from the compiled pattern.
4077    
4078         For convenience, the three most common quantifiers have  single-charac-         For convenience, the three most common quantifiers have  single-charac-
4079         ter abbreviations:         ter abbreviations:
# Line 3902  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 4203  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
4203    
4204           (?>\d+)foo           (?>\d+)foo
4205    
4206         This kind of parenthesis "locks up" the  part of the  pattern  it  con-         This  kind  of  parenthesis "locks up" the  part of the pattern it con-
4207         tains  once  it  has matched, and a failure further into the pattern is         tains once it has matched, and a failure further into  the  pattern  is
4208         prevented from backtracking into it. Backtracking past it  to  previous         prevented  from  backtracking into it. Backtracking past it to previous
4209         items, however, works as normal.         items, however, works as normal.
4210    
4211         An  alternative  description  is that a subpattern of this type matches         An alternative description is that a subpattern of  this  type  matches
4212         the string of characters that an  identical  standalone  pattern  would         the  string  of  characters  that an identical standalone pattern would
4213         match, if anchored at the current point in the subject string.         match, if anchored at the current point in the subject string.
4214    
4215         Atomic grouping subpatterns are not capturing subpatterns. Simple cases         Atomic grouping subpatterns are not capturing subpatterns. Simple cases
4216         such as the above example can be thought of as a maximizing repeat that         such as the above example can be thought of as a maximizing repeat that
4217         must  swallow  everything  it can. So, while both \d+ and \d+? are pre-         must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
4218         pared to adjust the number of digits they match in order  to  make  the         pared  to  adjust  the number of digits they match in order to make the
4219         rest of the pattern match, (?>\d+) can only match an entire sequence of         rest of the pattern match, (?>\d+) can only match an entire sequence of
4220         digits.         digits.
4221    
4222         Atomic groups in general can of course contain arbitrarily  complicated         Atomic  groups in general can of course contain arbitrarily complicated
4223         subpatterns,  and  can  be  nested. However, when the subpattern for an         subpatterns, and can be nested. However, when  the  subpattern  for  an
4224         atomic group is just a single repeated item, as in the example above, a         atomic group is just a single repeated item, as in the example above, a
4225         simpler  notation,  called  a "possessive quantifier", can be used. This         simpler notation, called a "possessive quantifier", can be  used.  This
4226         consists of an additional + character  following  a  quantifier.  Using         consists  of  an  additional  + character following a quantifier. Using
4227         this notation, the previous example can be rewritten as         this notation, the previous example can be rewritten as
4228    
4229           \d++foo           \d++foo
# Line 3932  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 4233  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
4233    
4234           (abc|xyz){2,3}+           (abc|xyz){2,3}+
4235    
4236         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Possessive   quantifiers   are   always  greedy;  the  setting  of  the
4237         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
4238         simpler forms of atomic group. However, there is no difference  in  the         simpler  forms  of atomic group. However, there is no difference in the
4239         meaning  of  a  possessive  quantifier and the equivalent atomic group,         meaning of a possessive quantifier and  the  equivalent  atomic  group,
4240         though there may be a performance  difference;  possessive  quantifiers         though  there  may  be a performance difference; possessive quantifiers
4241         should be slightly faster.         should be slightly faster.
4242    
4243         The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-         The possessive quantifier syntax is an extension to the Perl  5.8  syn-
4244         tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first         tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
4245         edition of his book. Mike McCloskey liked it, so implemented it when he         edition of his book. Mike McCloskey liked it, so implemented it when he
4246         built Sun's Java package, and PCRE copied it from there. It  ultimately         built  Sun's Java package, and PCRE copied it from there. It ultimately
4247         found its way into Perl at release 5.10.         found its way into Perl at release 5.10.
4248    
4249         PCRE has an optimization that automatically "possessifies" certain sim-         PCRE has an optimization that automatically "possessifies" certain sim-
4250         ple pattern constructs. For example, the sequence  A+B  is  treated  as         ple  pattern  constructs.  For  example, the sequence A+B is treated as
4251         A++B  because  there is no point in backtracking into a sequence of A's         A++B because there is no point in backtracking into a sequence  of  A's
4252         when B must follow.         when B must follow.
4253    
4254         When a pattern contains an unlimited repeat inside  a  subpattern  that         When  a  pattern  contains an unlimited repeat inside a subpattern that
4255         can  itself  be  repeated  an  unlimited number of times, the use of an         can itself be repeated an unlimited number of  times,  the  use  of  an
4256         atomic group is the only way to avoid some  failing  matches  taking  a         atomic  group  is  the  only way to avoid some failing matches taking a
4257         very long time indeed. The pattern         very long time indeed. The pattern
4258    
4259           (\D+|<\d+>)*[!?]           (\D+|<\d+>)*[!?]
4260    
4261         matches  an  unlimited number of substrings that either consist of non-         matches an unlimited number of substrings that either consist  of  non-
4262         digits, or digits enclosed in <>, followed by either ! or  ?.  When  it         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
4263         matches, it runs quickly. However, if it is applied to         matches, it runs quickly. However, if it is applied to
4264    
4265           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
4266    
4267         it  takes  a  long  time  before reporting failure. This is because the         it takes a long time before reporting  failure.  This  is  because  the
4268         string can be divided between the internal \D+ repeat and the  external         string  can be divided between the internal \D+ repeat and the external
4269         *  repeat  in  a  large  number of ways, and all have to be tried. (The         * repeat in a large number of ways, and all  have  to  be  tried.  (The
4270         example uses [!?] rather than a single character at  the  end,  because         example  uses  [!?]  rather than a single character at the end, because
4271         both  PCRE  and  Perl have an optimization that allows for fast failure         both PCRE and Perl have an optimization that allows  for  fast  failure
4272         when a single character is used. They remember the last single  charac-         when  a single character is used. They remember the last single charac-
4273         ter  that  is required for a match, and fail early if it is not present         ter that is required for a match, and fail early if it is  not  present
4274         in the string.) If the pattern is changed so that  it  uses  an  atomic         in  the  string.)  If  the pattern is changed so that it uses an atomic
4275         group, like this:         group, like this:
4276    
4277           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
4278    
4279         sequences  of non-digits cannot be broken, and failure happens quickly.         sequences of non-digits cannot be broken, and failure happens quickly.
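
The effect can be tried outside PCRE. The following is only a sketch under stated assumptions: it uses Python's re module as a stand-in backtracking engine (it is not PCRE), and the helper name timed_search is hypothetical. Python accepts (?>...) only from version 3.11, so for older versions the sketch falls back to the lookahead-plus-backreference idiom, which is assumed to behave the same way for this pattern. The non-atomic pattern should be tried on short subjects only, because the running time roughly doubles with each extra character.

```python
import re
import sys
import time

# Non-atomic pattern from the text: \D+ can give characters back when
# the matcher backtracks, so a run of non-digits can be split in many ways.
NAIVE = re.compile(r"(\D+|<\d+>)*[!?]")

# Atomic version. Python's re accepts (?>...) from 3.11 onwards; on older
# versions it is emulated here with a lookahead plus a back reference
# (an assumption of this sketch, not something PCRE requires).
if sys.version_info >= (3, 11):
    ATOMIC = re.compile(r"((?>\D+)|<\d+>)*[!?]")
else:
    ATOMIC = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")


def timed_search(regex, subject):
    """Return (found, seconds) for an unanchored search of subject."""
    start = time.perf_counter()
    found = regex.search(subject) is not None
    return found, time.perf_counter() - start
```

Calling timed_search(NAIVE, "a" * 16) and timed_search(ATOMIC, "a" * 52) and comparing the reported times shows the difference; both searches report that nothing was found.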
4280    
4281    
4282  BACK REFERENCES  BACK REFERENCES
# Line 4550  SUBPATTERNS AS SUBROUTINES Line 4851  SUBPATTERNS AS SUBROUTINES
4851         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
4852    
4853    
4854    ONIGURUMA SUBROUTINE SYNTAX
4855    
4856           For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
4857           name or a number enclosed either in angle brackets or single quotes, is
4858           an  alternative  syntax  for  referencing a subpattern as a subroutine,
4859           possibly recursively. Here are two of the examples used above,  rewrit-
4860           ten using this syntax:
4861    
4862             (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4863             (sens|respons)e and \g'1'ibility
4864    
4865           PCRE  supports  an extension to Oniguruma: if a number is preceded by a
4866           plus or a minus sign it is taken as a relative reference. For example:
4867    
4868             (abc)(?i:\g<-1>)
4869    
4870           Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
4871           synonymous.  The former is a back reference; the latter is a subroutine
4872           call.
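
The difference between a back reference and a subroutine call can be illustrated with a small sketch. It assumes Python's re module as a stand-in engine; that module has numbered back references but accepts neither \g{...} nor \g<...> inside a pattern, so the subroutine call is imitated by writing the subpattern out a second time, which is equivalent only in this non-recursive case. The names BACKREF, SUBROUTINE and accepted_by are hypothetical.

```python
import re

# Back reference: \1 must match the same text that group 1 matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Subroutine call, imitated by repeating the subpattern text: the second
# occurrence is matched afresh, so it may match different text from group 1.
# (In PCRE this would be written (sens|respons)e and \g'1'ibility.)
SUBROUTINE = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")


def accepted_by(regex, subject):
    """True if the whole subject matches the given compiled pattern."""
    return regex.fullmatch(subject) is not None
```

With these definitions, "sense and responsibility" is rejected by BACKREF but accepted by SUBROUTINE, while "sense and sensibility" is accepted by both.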
4873    
4874    
4875  CALLOUTS  CALLOUTS
4876    
4877         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
4878         Perl code to be obeyed in the middle of matching a regular  expression.         Perl  code to be obeyed in the middle of matching a regular expression.
4879         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
4880         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
4881         tion.         tion.
4882    
4883         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
4884         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
4885         an  external function by putting its entry point in the global variable         an external function by putting its entry point in the global  variable
4886         pcre_callout.  By default, this variable contains NULL, which  disables         pcre_callout.   By default, this variable contains NULL, which disables
4887         all calling out.         all calling out.
4888    
4889         Within  a  regular  expression,  (?C) indicates the points at which the         Within a regular expression, (?C) indicates the  points  at  which  the
4890         external function is to be called. If you want  to  identify  different         external  function  is  to be called. If you want to identify different
4891         callout  points, you can put a number less than 256 after the letter C.         callout points, you can put a number less than 256 after the letter  C.
4892         The default value is zero.  For example, this pattern has  two  callout         The  default  value is zero.  For example, this pattern has two callout
4893         points:         points:
4894    
4895           (?C1)abc(?C2)def           (?C1)abc(?C2)def
4896    
4897         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
4898         automatically installed before each item in the pattern. They  are  all         automatically  installed  before each item in the pattern. They are all
4899         numbered 255.         numbered 255.
4900    
4901         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
4902         set), the external function is called. It is provided with  the  number         set),  the  external function is called. It is provided with the number
4903         of  the callout, the position in the pattern, and, optionally, one item         of the callout, the position in the pattern, and, optionally, one  item
4904         of data originally supplied by the caller of pcre_exec().  The  callout         of  data  originally supplied by the caller of pcre_exec(). The callout
4905         function  may cause matching to proceed, to backtrack, or to fail alto-         function may cause matching to proceed, to backtrack, or to fail  alto-
4906         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
4907         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
4908    
4909    
4910  BACTRACKING CONTROL  BACKTRACKING CONTROL
4911    
4912         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
4913         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
4914         ject  to  change or removal in a future version of Perl". It goes on to         ject to change or removal in a future version of Perl". It goes  on  to
4915         say: "Their usage in production code should be noted to avoid  problems         say:  "Their usage in production code should be noted to avoid problems
4916         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
4917         in this section.         in this section.
4918    
4919         Since these verbs are specifically related to backtracking, they can be         Since  these  verbs  are  specifically related to backtracking, most of
4920         used  only  when  the pattern is to be matched using pcre_exec(), which         them can be  used  only  when  the  pattern  is  to  be  matched  using
4921         uses a backtracking algorithm. They cause an error  if  encountered  by         pcre_exec(), which uses a backtracking algorithm. With the exception of
4922         pcre_dfa_exec().         (*FAIL), which behaves like a failing negative assertion, they cause an
4923           error if encountered by pcre_dfa_exec().
4924    
4925         The  new verbs make use of what was previously invalid syntax: an open-         The  new verbs make use of what was previously invalid syntax: an open-
4926         ing parenthesis followed by an asterisk. In Perl, they are generally of         ing parenthesis followed by an asterisk. In Perl, they are generally of
# Line 4716  AUTHOR Line 5039  AUTHOR
5039    
5040  REVISION  REVISION
5041    
5042         Last updated: 09 August 2007         Last updated: 11 April 2009
5043         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5044  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5045    
5046    
5047  PCRESYNTAX(3)                                                    PCRESYNTAX(3)  PCRESYNTAX(3)                                                    PCRESYNTAX(3)
5048    
5049    
# Line 4829  GENERAL CATEGORY PROPERTY CODES FOR \p a Line 5152  GENERAL CATEGORY PROPERTY CODES FOR \p a
5152  SCRIPT NAMES FOR \p AND \P  SCRIPT NAMES FOR \p AND \P
5153    
5154         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
5155         Buhid,  Canadian_Aboriginal,  Cherokee,  Common,   Coptic,   Cuneiform,         Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cu-
5156         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,         neiform,  Cypriot,  Cyrillic,  Deseret, Devanagari, Ethiopic, Georgian,
5157         Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-         Glagolitic, Gothic, Greek, Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,
5158         gana,  Inherited,  Kannada,  Katakana,  Kharoshthi,  Khmer, Lao, Latin,         Hebrew,  Hiragana,  Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi,
5159         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,         Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian,  Malayalam,
5160         Ogham,  Old_Italic,  Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,         Mongolian,  Myanmar,  New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian,
5161         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,         Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurash-
5162         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.         tra, Shavian,  Sinhala,  Sundanese, Syloti_Nagri, Syriac, Tagalog, Tag-
5163           banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
5164           Ugaritic, Vai, Yi.
5165    
5166    
5167  CHARACTER CLASSES  CHARACTER CLASSES
# Line 4845  CHARACTER CLASSES Line 5170  CHARACTER CLASSES
5170           [^...]      negative character class           [^...]      negative character class
5171           [x-y]       range (can be used for hex characters)           [x-y]       range (can be used for hex characters)
5172           [[:xxx:]]   positive POSIX named set           [[:xxx:]]   positive POSIX named set
5173           [[^:xxx:]]  negative POSIX named set           [[:^xxx:]]  negative POSIX named set
5174    
5175           alnum       alphanumeric           alnum       alphanumeric
5176           alpha       alphabetic           alpha       alphabetic
# Line 4888  QUANTIFIERS Line 5213  QUANTIFIERS
5213    
5214  ANCHORS AND SIMPLE ASSERTIONS  ANCHORS AND SIMPLE ASSERTIONS
5215    
5216           \b          word boundary           \b          word boundary (only ASCII letters recognized)
5217           \B          not a word boundary           \B          not a word boundary
5218           ^           start of subject           ^           start of subject
5219                        also after internal newline in multiline mode                        also after internal newline in multiline mode
# Line 4914  ALTERNATION Line 5239  ALTERNATION
5239    
5240  CAPTURING  CAPTURING
5241    
5242           (...)          capturing group           (...)           capturing group
5243           (?<name>...)   named capturing group (Perl)           (?<name>...)    named capturing group (Perl)
5244           (?'name'...)   named capturing group (Perl)           (?'name'...)    named capturing group (Perl)
5245           (?P<name>...)  named capturing group (Python)           (?P<name>...)   named capturing group (Python)
5246           (?:...)        non-capturing group           (?:...)         non-capturing group
5247           (?|...)        non-capturing group; reset group numbers for           (?|...)         non-capturing group; reset group numbers for
5248                           capturing groups in each alternative                            capturing groups in each alternative
5249    
5250    
5251  ATOMIC GROUPS  ATOMIC GROUPS
5252    
5253           (?>...)        atomic, non-capturing group           (?>...)         atomic, non-capturing group
5254    
5255    
5256  COMMENT  COMMENT
5257    
5258           (?#....)       comment (not nestable)           (?#....)        comment (not nestable)
5259    
5260    
5261  OPTION SETTING  OPTION SETTING
5262    
5263           (?i)           caseless           (?i)            caseless
5264           (?J)           allow duplicate names           (?J)            allow duplicate names
5265           (?m)           multiline           (?m)            multiline
5266           (?s)           single line (dotall)           (?s)            single line (dotall)
5267           (?U)           default ungreedy (lazy)           (?U)            default ungreedy (lazy)
5268           (?x)           extended (ignore white space)           (?x)            extended (ignore white space)
5269           (?-...)        unset option(s)           (?-...)         unset option(s)
5270    
5271           The following is recognized only at the start of a pattern or after one
5272           of the newline-setting options with similar syntax:
5273    
5274             (*UTF8)         set UTF-8 mode
5275    
5276    
5277  LOOKAHEAD AND LOOKBEHIND ASSERTIONS  LOOKAHEAD AND LOOKBEHIND ASSERTIONS
5278    
5279           (?=...)        positive look ahead           (?=...)         positive look ahead
5280           (?!...)        negative look ahead           (?!...)         negative look ahead
5281           (?<=...)       positive look behind           (?<=...)        positive look behind
5282           (?<!...)       negative look behind           (?<!...)        negative look behind
5283    
5284         Each top-level branch of a look behind must be of a fixed length.         Each top-level branch of a look behind must be of a fixed length.
5285    
5286    
5287  BACKREFERENCES  BACKREFERENCES
5288    
5289           \n             reference by number (can be ambiguous)           \n              reference by number (can be ambiguous)
5290           \gn            reference by number           \gn             reference by number
5291           \g{n}          reference by number           \g{n}           reference by number
5292           \g{-n}         relative reference by number           \g{-n}          relative reference by number
5293           \k<name>       reference by name (Perl)           \k<name>        reference by name (Perl)
5294           \k'name'       reference by name (Perl)           \k'name'        reference by name (Perl)
5295           \g{name}       reference by name (Perl)           \g{name}        reference by name (Perl)
5296           \k{name}       reference by name (.NET)           \k{name}        reference by name (.NET)
5297           (?P=name)      reference by name (Python)           (?P=name)       reference by name (Python)
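
As an illustration of the Python-style entries in the two tables above, the sketch below uses Python's own re module, which accepts (?P<name>...) and (?P=name) directly. It is not PCRE, and the names DOUBLED and first_doubled_word are hypothetical.

```python
import re

# (?P<word>...) names the group; (?P=word) is a back reference to it by name.
DOUBLED = re.compile(r"\b(?P<word>\w+) (?P=word)\b")


def first_doubled_word(text):
    """Return the first word that is immediately repeated, or None."""
    match = DOUBLED.search(text)
    return match.group("word") if match else None
```

For example, first_doubled_word("it was the the best") picks out the repeated "the".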
5298    
5299    
5300  SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)  SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
5301    
5302           (?R)           recurse whole pattern           (?R)            recurse whole pattern
5303           (?n)           call subpattern by absolute number           (?n)            call subpattern by absolute number
5304           (?+n)          call subpattern by relative number           (?+n)           call subpattern by relative number
5305           (?-n)          call subpattern by relative number           (?-n)           call subpattern by relative number
5306           (?&name)       call subpattern by name (Perl)           (?&name)        call subpattern by name (Perl)
5307           (?P>name)      call subpattern by name (Python)           (?P>name)       call subpattern by name (Python)
5308             \g<name>        call subpattern by name (Oniguruma)
5309             \g'name'        call subpattern by name (Oniguruma)
5310             \g<n>           call subpattern by absolute number (Oniguruma)
5311             \g'n'           call subpattern by absolute number (Oniguruma)
5312             \g<+n>          call subpattern by relative number (PCRE extension)
5313             \g'+n'          call subpattern by relative number (PCRE extension)
5314             \g<-n>          call subpattern by relative number (PCRE extension)
5315             \g'-n'          call subpattern by relative number (PCRE extension)
5316    
5317    
5318  CONDITIONAL PATTERNS  CONDITIONAL PATTERNS
# Line 4982  CONDITIONAL PATTERNS Line 5320  CONDITIONAL PATTERNS
5320           (?(condition)yes-pattern)           (?(condition)yes-pattern)
5321           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
5322    
5323           (?(n)...       absolute reference condition           (?(n)...        absolute reference condition
5324           (?(+n)...      relative reference condition           (?(+n)...       relative reference condition
5325           (?(-n)...      relative reference condition           (?(-n)...       relative reference condition
5326           (?(<name>)...  named reference condition (Perl)           (?(<name>)...   named reference condition (Perl)
5327           (?('name')...  named reference condition (Perl)           (?('name')...   named reference condition (Perl)
5328           (?(name)...    named reference condition (PCRE)           (?(name)...     named reference condition (PCRE)
5329           (?(R)...       overall recursion condition           (?(R)...        overall recursion condition
5330           (?(Rn)...      specific group recursion condition           (?(Rn)...       specific group recursion condition
5331           (?(R&name)...  specific recursion condition           (?(R&name)...   specific recursion condition
5332           (?(DEFINE)...  define subpattern for reference           (?(DEFINE)...   define subpattern for reference
5333           (?(assert)...  assertion condition           (?(assert)...   assertion condition
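
A short sketch of the absolute reference condition follows. It assumes Python's re module as a stand-in, which supports the (?(n)yes-pattern|no-pattern) form; the names BRACKETED and is_balanced_token are hypothetical.

```python
import re

# (?(1)>) : if group 1 (the optional "<") took part in the match, a ">"
# is required; otherwise the (empty) no-pattern is used.
BRACKETED = re.compile(r"(<)?\w+(?(1)>)")


def is_balanced_token(text):
    """True for "word" or "<word>", false when only one bracket is present."""
    return BRACKETED.fullmatch(text) is not None
```

Here "abc" and "<abc>" are accepted, whereas "<abc" and "abc>" are not.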
5334    
5335    
5336  BACKTRACKING CONTROL  BACKTRACKING CONTROL
5337    
5338         The following act immediately they are reached:         The following act immediately they are reached:
5339    
5340           (*ACCEPT)      force successful match           (*ACCEPT)       force successful match
5341           (*FAIL)        force backtrack; synonym (*F)           (*FAIL)         force backtrack; synonym (*F)
5342    
5343         The following act only when a subsequent match failure causes  a  back-         The  following  act only when a subsequent match failure causes a back-
5344         track to reach them. They all force a match failure, but they differ in         track to reach them. They all force a match failure, but they differ in
5345         what happens afterwards. Those that advance the start-of-match point do         what happens afterwards. Those that advance the start-of-match point do
5346         so only if the pattern is not anchored.         so only if the pattern is not anchored.
5347    
5348           (*COMMIT)      overall failure, no advance of starting point           (*COMMIT)       overall failure, no advance of starting point
5349           (*PRUNE)       advance to next starting character           (*PRUNE)        advance to next starting character
5350           (*SKIP)        advance start to current matching position           (*SKIP)         advance start to current matching position
5351           (*THEN)        local failure, backtrack to next alternation           (*THEN)         local failure, backtrack to next alternation
5352    
5353    
5354    NEWLINE CONVENTIONS
5355    
5356           These are recognized only at the very start of the pattern or  after  a
5357           (*BSR_...) or (*UTF8) option.
5358    
5359             (*CR)           carriage return only
5360             (*LF)           linefeed only
5361             (*CRLF)         carriage return followed by linefeed
5362             (*ANYCRLF)      all three of the above
5363             (*ANY)          any Unicode newline sequence
5364    
5365    
5366    WHAT \R MATCHES
5367    
5368           These  are  recognized only at the very start of the pattern or after a
5369           (*...) option that sets the newline convention or UTF-8 mode.
5370    
5371             (*BSR_ANYCRLF)  CR, LF, or CRLF
5372             (*BSR_UNICODE)  any Unicode newline sequence
5373    
5374    
5375  CALLOUTS  CALLOUTS
# Line 5033  AUTHOR Line 5392  AUTHOR
5392    
5393  REVISION  REVISION
5394    
5395         Last updated: 08 August 2007         Last updated: 11 April 2009
5396         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5397  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5398    
5399    
5400  PCREPARTIAL(3)                                                  PCREPARTIAL(3)  PCREPARTIAL(3)                                                  PCREPARTIAL(3)
5401    
5402    
# Line 5061  PARTIAL MATCHING IN PCRE Line 5420  PARTIAL MATCHING IN PCRE
5420    
5421         If the application sees the user's keystrokes one by one, and can check         If the application sees the user's keystrokes one by one, and can check
5422         that what has been typed so far is potentially valid,  it  is  able  to         that what has been typed so far is potentially valid,  it  is  able  to
5423         raise  an  error as soon as a mistake is made, possibly beeping and not         raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
5424         reflecting the character that has been typed. This  immediate  feedback         reflecting the character that has been typed, for example. This immedi-
5425         is  likely  to  be a better user interface than a check that is delayed         ate  feedback is likely to be a better user interface than a check that
5426         until the entire string has been entered.         is delayed until the entire string has been entered.  Partial  matching
5427           can  also  sometimes be useful when the subject string is very long and
5428         PCRE supports the concept of partial matching by means of the PCRE_PAR-         is not all available at once.
5429         TIAL   option,   which   can   be   set  when  calling  pcre_exec()  or  
5430         pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code         PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
5431         PCRE_ERROR_NOMATCH  is converted into PCRE_ERROR_PARTIAL if at any time         PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
5432         during the matching process the last part of the subject string matched         pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
5433         part  of  the  pattern. Unfortunately, for non-anchored matching, it is         for PCRE_PARTIAL_SOFT. The essential difference between the two options
5434         not possible to obtain the position of the start of the partial  match.         is whether or not a partial match is preferred to an  alternative  com-
5435         No captured data is set when PCRE_ERROR_PARTIAL is returned.         plete  match,  though the details differ between the two matching func-
5436           tions. If both options are set, PCRE_PARTIAL_HARD takes precedence.
5437         When   PCRE_PARTIAL   is  set  for  pcre_dfa_exec(),  the  return  code  
5438         PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the  end  of         Setting a partial matching option disables one of PCRE's optimizations.
5439         the  subject is reached, there have been no complete matches, but there         PCRE  remembers the last literal byte in a pattern, and abandons match-
5440         is still at least one matching possibility. The portion of  the  string         ing immediately if such a byte is not present in  the  subject  string.
5441         that provided the partial match is set as the first matching string.         This  optimization cannot be used for a subject string that might match
5442           only partially.
5443         Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers  
5444         the last literal byte in a pattern, and abandons  matching  immediately  
5445         if  such a byte is not present in the subject string. This optimization  PARTIAL MATCHING USING pcre_exec()
5446         cannot be used for a subject string that might match only partially.  
5447           A partial match occurs during a call to pcre_exec() whenever the end of
5448           the  subject  string  is reached successfully, but matching cannot con-
5449  RESTRICTED PATTERNS FOR PCRE_PARTIAL         tinue because more characters are needed. However, at least one charac-
5450           ter  must have been matched. (In other words, a partial match can never
5451         Because of the way certain internal optimizations  are  implemented  in         be an empty string.)
5452         the  pcre_exec()  function, the PCRE_PARTIAL option cannot be used with  
5453         all patterns. These restrictions do not apply when  pcre_dfa_exec()  is         If PCRE_PARTIAL_SOFT is set,  the  partial  match  is  remembered,  but
5454         used.  For pcre_exec(), repeated single characters such as         matching continues as normal, and other alternatives in the pattern are
5455           tried.  If  no  complete  match  can  be  found,  pcre_exec()   returns
5456           a{2,4}         PCRE_ERROR_PARTIAL  instead  of PCRE_ERROR_NOMATCH, and if there are at
5457           least two slots in the offsets vector, they are filled in with the off-
5458         and repeated single metasequences such as         sets  of  the longest string that partially matched. Consider this pat-
5459           tern:
5460           \d+  
5461             /123\w+X|dogY/
5462         are  not permitted if the maximum number of occurrences is greater than  
5463         one.  Optional items such as \d? (where the maximum is one) are permit-         If this is matched against the subject string "abc123dog", both  alter-
5464         ted.   Quantifiers  with any values are permitted after parentheses, so         natives  fail  to  match,  but the end of the subject is reached during
5465         the invalid examples above can be coded thus:         matching,   so    PCRE_ERROR_PARTIAL    is    returned    instead    of
5466           PCRE_ERROR_NOMATCH.  The  offsets  are  set  to  3  and  9, identifying
5467           (a){2,4}         "123dog" as the longest partial match that was found. (In this example,
5468           (\d)+         there  are  two  partial  matches,  because  "dog" on its own partially
5469           matches the second alternative.)
5470         These constructions run more slowly, but for the kinds  of  application  
5471         that  are  envisaged  for this facility, this is not felt to be a major         If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-
5472         restriction.         TIAL  as soon as a partial match is found, without continuing to search
5473           for possible complete matches. The difference between the  two  options
5474         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the         can be illustrated by a pattern such as:
5475         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL  
5476         (-13).  You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo()  to           /dog(sbody)?/
5477         find out if a compiled pattern can be used for partial matching.  
5478           This  matches either "dog" or "dogsbody", greedily (that is, it prefers
5479           the longer string if possible). If it is  matched  against  the  string
5480           "dog"  with  PCRE_PARTIAL_SOFT,  it  yields a complete match for "dog".
5481           However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
5482           On  the  other hand, if the pattern is made ungreedy the result is dif-
5483           ferent:
5484    
5485             /dog(sbody)??/
5486    
5487           In this case the result is always a complete match because  pcre_exec()
5488           finds  that  first,  and  it  never continues after finding a match. It
5489           might be easier to follow this explanation by thinking of the two  pat-
5490           terns like this:
5491    
5492             /dog(sbody)?/    is the same as  /dogsbody|dog/
5493             /dog(sbody)??/   is the same as  /dog|dogsbody/
5494    
5495           The  second  pattern  will  never  match "dogsbody" when pcre_exec() is
5496           used, because it will always find the shorter match first.
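
The ordering argument above can be followed with a toy model. The sketch below is not PCRE code and does not call the PCRE library: it assumes an anchored pattern that is nothing more than an ordered list of literal alternatives, tried from left to right, and the name toy_exec and the result strings are hypothetical.

```python
# Toy model (not PCRE code): an anchored pattern written as an ordered list
# of literal alternatives, tried left to right as a backtracking matcher would.
COMPLETE, PARTIAL, NOMATCH = "complete", "partial", "nomatch"


def toy_exec(alternatives, subject, hard=False):
    """Model pcre_exec() with PCRE_PARTIAL_HARD (hard=True) or PCRE_PARTIAL_SOFT."""
    saw_partial = False
    for alt in alternatives:
        if subject.startswith(alt):
            return COMPLETE      # first complete match wins; the search stops
        # The subject ran out part-way through this alternative, with at
        # least one character matched: that is a partial match.
        if subject and alt.startswith(subject):
            if hard:
                return PARTIAL   # HARD: report the first partial match at once
            saw_partial = True   # SOFT: remember it and keep trying
    return PARTIAL if saw_partial else NOMATCH
```

With the subject "dog", toy_exec(["dogsbody", "dog"], "dog") gives a complete match, toy_exec(["dogsbody", "dog"], "dog", hard=True) gives a partial match, and toy_exec(["dog", "dogsbody"], "dog", hard=True) gives a complete match because the shorter alternative is found first.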
5497    
5498    
5499    PARTIAL MATCHING USING pcre_dfa_exec()
5500    
5501           The pcre_dfa_exec() function moves along the subject  string  character
5502           by  character, without backtracking, searching for all possible matches
5503           simultaneously. If the end of the subject is reached before the end  of
5504           the  pattern,  there  is the possibility of a partial match, again pro-
5505           vided that at least one character has matched.
5506    
5507           When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned  only  if
5508           there  have  been  no complete matches. Otherwise, the complete matches
5509           are returned.  However, if PCRE_PARTIAL_HARD is set,  a  partial  match
5510           takes  precedence  over any complete matches. The portion of the string
5511           that provided the longest partial match is set as  the  first  matching
5512           string, provided there are at least two slots in the offsets vector.
5513    
5514           Because  pcre_dfa_exec()  always searches for all possible matches, and
5515           there is no difference between greedy and ungreedy repetition, its  be-
5516           haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-
5517           sider the string "dog"  matched  against  the  ungreedy  pattern  shown
5518           above:
5519    
5520             /dog(sbody)??/
5521    
5522           Whereas  pcre_exec()  stops  as soon as it finds the complete match for
5523           "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
5524           so returns that when PCRE_PARTIAL_HARD is set.
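
The same kind of toy model shows why the order of alternatives does not matter here. Again this is only a sketch and not PCRE code: the pattern is assumed to be an anchored set of literal alternatives, all of which are examined, and the name toy_dfa_exec and the result strings are hypothetical.

```python
# Toy model (not PCRE code): every literal alternative is examined, in the
# way a matcher that searches for all possible matches would behave.
COMPLETE, PARTIAL, NOMATCH = "complete", "partial", "nomatch"


def toy_dfa_exec(alternatives, subject, hard=False):
    """Model pcre_dfa_exec() with PCRE_PARTIAL_HARD (hard=True) or PCRE_PARTIAL_SOFT."""
    complete = any(subject.startswith(alt) for alt in alternatives)
    # A partial match needs at least one matched character and an
    # alternative that continues beyond the end of the subject.
    partial = bool(subject) and any(
        alt.startswith(subject) and alt != subject for alt in alternatives
    )
    if partial and (hard or not complete):
        return PARTIAL           # HARD: partial beats complete; SOFT: only if no complete
    return COMPLETE if complete else NOMATCH
```

For the subject "dog", toy_dfa_exec(["dog", "dogsbody"], "dog", hard=True) gives a partial match whichever way round the alternatives are listed, while the soft setting gives a complete match.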
5525    
5526    
5527    PARTIAL MATCHING AND WORD BOUNDARIES
5528    
5529           If a pattern ends with one of the sequences \b or \B, which test for word
5530           boundaries, partial matching with PCRE_PARTIAL_SOFT can  give  counter-
5531           intuitive results. Consider this pattern:
5532    
5533             /\bcat\b/
5534    
5535           This matches "cat", provided there is a word boundary at either end. If
5536           the subject string is "the cat", the comparison of the final "t" with a
5537           following  character  cannot  take  place, so a partial match is found.
5538           However, pcre_exec() carries on with normal matching, which matches  \b
5539           at  the  end  of  the subject when the last character is a letter, thus
5540           finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
5541           TIAL.  The  same  thing  happens  with pcre_dfa_exec(), because it also
5542           finds the complete match.
5543    
5544           Using PCRE_PARTIAL_HARD in this  case  does  yield  PCRE_ERROR_PARTIAL,
5545           because then the partial match takes precedence.
5546    
5547    
5548    FORMERLY RESTRICTED PATTERNS
5549    
5550           For releases of PCRE prior to 8.00, because of the way certain internal
5551           optimizations  were  implemented  in  the  pcre_exec()  function,   the
5552           PCRE_PARTIAL  option  (predecessor  of  PCRE_PARTIAL_SOFT) could not be
5553           used with all patterns. From release 8.00 onwards, the restrictions  no
5554           longer  apply,  and  partial matching with pcre_exec() can be requested
5555           for any pattern.
5556    
5557           Items that were formerly restricted were repeated single characters and
5558           repeated  metasequences. If PCRE_PARTIAL was set for a pattern that did
5559           not conform to the restrictions, pcre_exec() returned  the  error  code
5560           PCRE_ERROR_BADPARTIAL  (-13).  This error code is no longer in use. The
5561           PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if  a  compiled
5562           pattern can be used for partial matching now always returns 1.
5563    
5564    
5565  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
5566    
5567         If  the  escape  sequence  \P  is  present in a pcretest data line, the         If  the  escape  sequence  \P  is  present in a pcretest data line, the
5568         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that         PCRE_PARTIAL_SOFT option is used for  the  match.  Here  is  a  run  of
5569         uses the date example quoted above:         pcretest that uses the date example quoted above:
5570    
5571             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5572           data> 25jun04\P           data> 25jun04\P
5573            0: 25jun04            0: 25jun04
5574            1: jun            1: jun
5575           data> 25dec3\P           data> 25dec3\P
5576           Partial match           Partial match: 25dec3
5577           data> 3ju\P           data> 3ju\P
5578           Partial match           Partial match: 3ju
5579           data> 3juj\P           data> 3juj\P
5580           No match           No match
5581           data> j\P           data> j\P
# Line 5139  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 5583  EXAMPLE OF PARTIAL MATCHING USING PCRETE
5583    
5584         The  first  data  string  is  matched completely, so pcretest shows the         The  first  data  string  is  matched completely, so pcretest shows the
5585         matched substrings. The remaining four strings do not  match  the  com-         matched substrings. The remaining four strings do not  match  the  com-
5586         plete  pattern,  but  the first two are partial matches. The same test,         plete pattern, but the first two are partial matches. Similar output is
5587         using pcre_dfa_exec() matching (by means of the  \D  escape  sequence),         obtained when pcre_dfa_exec() is used.
        produces the following output:  
5588    
5589             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/         If the escape sequence \P is present more than once in a pcretest  data
5590           data> 25jun04\P\D         line, the PCRE_PARTIAL_HARD option is set for the match.
           0: 25jun04  
          data> 23dec3\P\D  
          Partial match: 23dec3  
          data> 3ju\P\D  
          Partial match: 3ju  
          data> 3juj\P\D  
          No match  
          data> j\P\D  
          No match  
   
        Notice  that in this case the portion of the string that was matched is  
        made available.  
5591    
5592    
5593  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
5594    
5595         When a partial match has been found using pcre_dfa_exec(), it is possi-         When a partial match has been found using pcre_dfa_exec(), it is possi-
5596         ble  to  continue  the  match  by providing additional subject data and         ble to continue the match by  providing  additional  subject  data  and
5597         calling pcre_dfa_exec() again with the same  compiled  regular  expres-         calling  pcre_dfa_exec()  again  with the same compiled regular expres-
5598         sion, this time setting the PCRE_DFA_RESTART option. You must also pass         sion, this time setting the PCRE_DFA_RESTART option. You must pass  the
5599         the same working space as before, because this is where details of  the         same working space as before, because this is where details of the pre-
5600         previous  partial  match are stored. Here is an example using pcretest,         vious partial match are stored. Here  is  an  example  using  pcretest,
5601         using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and         using  the  \R  escape  sequence to set the PCRE_DFA_RESTART option (\D
5602         \D are as above):         specifies the use of pcre_dfa_exec()):
5603    
5604             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5605           data> 23ja\P\D           data> 23ja\P\D
# Line 5176  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 5607  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
5607           data> n05\R\D           data> n05\R\D
5608            0: n05            0: n05
5609    
5610         The  first  call has "23ja" as the subject, and requests partial match-         The first call has "23ja" as the subject, and requests  partial  match-
5611         ing; the second call  has  "n05"  as  the  subject  for  the  continued         ing;  the  second  call  has  "n05"  as  the  subject for the continued
5612         (restarted)  match.   Notice  that when the match is complete, only the         (restarted) match.  Notice that when the match is  complete,  only  the
5613         last part is shown; PCRE does  not  retain  the  previously  partially-         last  part  is  shown;  PCRE  does not retain the previously partially-
5614         matched  string. It is up to the calling program to do that if it needs         matched string. It is up to the calling program to do that if it  needs
5615         to.         to.
5616    
5617         You can set PCRE_PARTIAL  with  PCRE_DFA_RESTART  to  continue  partial         You  can  set  the  PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
5618         matching over multiple segments. This facility can be used to pass very         PCRE_DFA_RESTART to continue partial matching over  multiple  segments.
5619         long subject strings to pcre_dfa_exec(). However, some care  is  needed         This  facility  can  be  used  to  pass  very  long  subject strings to
5620         for certain types of pattern.         pcre_dfa_exec().
5621    
5622         1.  If  the  pattern contains tests for the beginning or end of a line,  
5623         you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-  MULTI-SEGMENT MATCHING WITH pcre_exec()
5624         ate,  when  the subject string for any call does not contain the begin-  
5625           From release 8.00, pcre_exec() can also be  used  to  do  multi-segment
5626           matching.  Unlike  pcre_dfa_exec(),  it  is not possible to restart the
5627           previous match with a new segment of data. Instead, new  data  must  be
5628           added  to  the  previous  subject  string, and the entire match re-run,
5629           starting from the point where the partial match occurred. Earlier  data
5630           can be discarded.  Consider an unanchored pattern that matches dates:
5631    
5632               re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
5633             data> The date is 23ja\P
5634             Partial match: 23ja
5635    
5636           At this stage, an application could discard the text preceding "23ja",
5637           add on text from the next segment, and call pcre_exec()  again.  Unlike
5638           pcre_dfa_exec(),  the  entire matching string must always be available,
5639           and the complete matching process occurs for each call, so more  memory
5640           and more processing time are needed.
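
The buffer handling described above for pcre_exec() (discard the text before the partial match, then add the next segment) might be sketched as follows. This is only an illustration: carry_over() and its parameters are hypothetical names, not part of PCRE, and the sketch assumes the caller already knows the offset at which the partial match started.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a PCRE function). After a partial match that
 * starts at partial_start, drop the text before it and append the next
 * segment, so that the whole candidate match is available when the
 * match is run again from the start of buf.
 *
 * Returns the new subject length, or (size_t)-1 if buf is too small.
 */
size_t carry_over(char *buf, size_t cap, size_t len,
                  size_t partial_start,
                  const char *next, size_t next_len)
{
    assert(partial_start <= len && len <= cap);

    size_t kept = len - partial_start;         /* bytes of the partial match */
    if (kept + next_len > cap)
        return (size_t)-1;

    memmove(buf, buf + partial_start, kept);   /* e.g. keep "23ja" */
    memcpy(buf + kept, next, next_len);        /* append the new segment */
    return kept + next_len;
}
```

With the subject "The date is 23ja" and a partial match starting at offset 12, adding the segment "n05 ok" leaves "23jan05 ok" in the buffer, which would then be passed to pcre_exec() as the new subject.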
5641    
5642    
5643    ISSUES WITH MULTI-SEGMENT MATCHING
5644    
5645           Certain types of pattern may give problems with multi-segment matching,
5646           whichever matching function is used.
5647    
5648           1. If the pattern contains tests for the beginning or end  of  a  line,
5649           you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
5650           ate, when the subject string for any call does not contain  the  begin-
5651         ning or end of a line.         ning or end of a line.
5652    
5653         2. If the pattern contains backward assertions (including  \b  or  \B),         2.  If  the  pattern contains backward assertions (including \b or \B),
5654         you  need  to  arrange for some overlap in the subject strings to allow         you need to arrange for some overlap in the subject  strings  to  allow
5655         for this. For example, you could pass the subject in  chunks  that  are         for  them  to  be  correctly tested at the start of each substring. For
5656         500  bytes long, but in a buffer of 700 bytes, with the starting offset         example, using pcre_dfa_exec(), you could pass the  subject  in  chunks
5657         set to 200 and the previous 200 bytes at the start of the buffer.         that  are 500 bytes long, but in a buffer of 700 bytes, with the start-
5658           ing offset set to 200 and the previous 200 bytes at the  start  of  the
5659         3. Matching a subject string that is split into multiple segments  does         buffer.
5660         not  always produce exactly the same result as matching over one single  
5661         long string.  The difference arises when there  are  multiple  matching         3.  Matching  a subject string that is split into multiple segments may
5662         possibilities,  because a partial match result is given only when there         not always produce exactly the same result as matching over one  single
5663         are no completed matches in a call to pcre_dfa_exec(). This means  that         long  string,  especially  when  PCRE_PARTIAL_SOFT is used. The section
5664         as  soon  as  the  shortest match has been found, continuation to a new         "Partial Matching and Word Boundaries" above describes  an  issue  that
5665         subject segment is no longer possible.  Consider this pcretest example:         arises  if  the  pattern ends with \b or \B. Another kind of difference
5666           may occur when there are multiple  matching  possibilities,  because  a
5667           partial match result is given only when there are no completed matches.
5668           This means that as soon as the shortest match has been found, continua-
5669           tion  to  a  new subject segment is no longer possible.  Consider again
5670           this pcretest example:
5671    
5672             re> /dog(sbody)?/             re> /dog(sbody)?/
5673             data> dogsb\P
5674              0: dog
5675           data> do\P\D           data> do\P\D
5676           Partial match: do           Partial match: do
5677           data> gsb\R\P\D           data> gsb\R\P\D
# Line 5216  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 5680  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
5680            0: dogsbody            0: dogsbody
5681            1: dog            1: dog
5682    
5683         The  pattern matches the words "dog" or "dogsbody". When the subject is         The first data line passes the string "dogsb" to  pcre_exec(),  setting
5684         presented in several parts ("do" and "gsb" being  the  first  two)  the         the  PCRE_PARTIAL_SOFT  option.  Although the string is a partial match
5685         match  stops  when "dog" has been found, and it is not possible to con-         for "dogsbody", the  result  is  not  PCRE_ERROR_PARTIAL,  because  the
5686         tinue. On the other hand,  if  "dogsbody"  is  presented  as  a  single         shorter  string  "dog" is a complete match. Similarly, when the subject
5687         string, both matches are found.         is presented to pcre_dfa_exec() in several parts ("do" and "gsb"  being
5688           the first two) the match stops when "dog" has been found, and it is not
5689           possible to continue. On the other hand, if "dogsbody" is presented  as
5690           a single string, pcre_dfa_exec() finds both matches.
5691    
5692           Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
5693           when matching multi-segment data. The example above then  behaves  dif-
5694           ferently:
5695    
5696               re> /dog(sbody)?/
5697             data> dogsb\P\P
5698             Partial match: dogsb
5699             data> do\P\D
5700             Partial match: do
5701             data> gsb\R\P\P\D
5702             Partial match: gsb
5703    
        Because  of  this  phenomenon,  it does not usually make sense to end a  
        pattern that is going to be matched in this way with a variable repeat.  
5704    
5705         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
5706         start with the same pattern item may not work as expected. For example,         start with the  same  pattern  item  may  not  work  as  expected  when
5707         consider this pattern:         pcre_dfa_exec() is used. For example, consider this pattern:
5708    
5709           1234|3789           1234|3789
5710    
# Line 5235  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 5712  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
5712         first alternative is found at offset 3. There is no partial  match  for         first alternative is found at offset 3. There is no partial  match  for
5713         the second alternative, because such a match does not start at the same         the second alternative, because such a match does not start at the same
5714         point in the subject string. Attempting to  continue  with  the  string         point in the subject string. Attempting to  continue  with  the  string
5715         "789" does not yield a match because only those alternatives that match         "7890"  does  not  yield  a  match because only those alternatives that
5716         at one point in the subject are remembered. The problem arises  because         match at one point in the subject are remembered.  The  problem  arises
5717         the  start  of the second alternative matches within the first alterna-         because  the  start  of the second alternative matches within the first
5718         tive. There is no problem with anchored patterns or patterns such as:         alternative. There is no problem with  anchored  patterns  or  patterns
5719           such as:
5720    
5721           1234|ABCD           1234|ABCD
5722    
5723         where no string can be a partial match for both alternatives.         where  no  string can be a partial match for both alternatives. This is
5724           not a problem if pcre_exec() is used, because the entire match  has  to
5725           be rerun each time:
5726    
5727               re> /1234|3789/
5728             data> ABC123\P
5729             Partial match: 123
5730             data> 1237890
5731              0: 3789
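
The overlap arrangement suggested in item 2 above (500-byte chunks in a 700-byte buffer, with the previous 200 bytes at the start) could be managed along these lines. This is a sketch under stated assumptions: load_chunk() and the constant names are hypothetical, and the sizes are simply the ones used in the text. The value stored in *start_offset is what would be passed as the starting offset of the matching call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sizes taken from the example in the text; the names are illustrative. */
enum { OVERLAP = 200, CHUNK = 500, BUF_BYTES = OVERLAP + CHUNK };

/*
 * Hypothetical helper (not a PCRE function). Prepares buf for the next
 * matching call: the last OVERLAP bytes of the data already in buf are
 * moved to the front, and the new chunk is copied in after them.
 *
 * prev_len is the number of bytes currently in buf (0 on the first call).
 * Returns the number of bytes now in buf; *start_offset is where
 * matching should begin so that backward assertions can see old data.
 */
size_t load_chunk(char *buf, size_t prev_len,
                  const char *chunk, size_t chunk_len,
                  size_t *start_offset)
{
    assert(prev_len <= BUF_BYTES && chunk_len <= CHUNK);

    size_t keep = prev_len < OVERLAP ? prev_len : OVERLAP;
    memmove(buf, buf + (prev_len - keep), keep);  /* retain the overlap */
    memcpy(buf + keep, chunk, chunk_len);         /* add the new chunk  */

    *start_offset = keep;
    return keep + chunk_len;
}
```

On the first call there is no earlier data, so the starting offset is 0. On later calls with full chunks the buffer holds 700 bytes and the starting offset is 200, as in the text.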
5732    
5733    
5734  AUTHOR  AUTHOR
# Line 5254  AUTHOR Line 5740  AUTHOR
5740    
5741  REVISION  REVISION
5742    
5743         Last updated: 04 June 2007         Last updated: 31 August 2009
5744         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5745  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5746    
5747    
5748  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
5749    
5750    
# Line 5381  REVISION Line 5867  REVISION
5867         Last updated: 13 June 2007         Last updated: 13 June 2007
5868         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
5869  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5870    
5871    
5872  PCREPERFORM(3)                                                  PCREPERFORM(3)  PCREPERFORM(3)                                                  PCREPERFORM(3)
5873    
5874    
# Line 5531  REVISION Line 6017  REVISION
6017         Last updated: 06 March 2007         Last updated: 06 March 2007
6018         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
6019  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6020    
6021    
6022  PCREPOSIX(3)                                                      PCREPOSIX(3)  PCREPOSIX(3)                                                      PCREPOSIX(3)
6023    
6024    
# Line 5569  DESCRIPTION Line 6055  DESCRIPTION
6055         command  for  linking  an application that uses them. Because the POSIX         command  for  linking  an application that uses them. Because the POSIX
6056         functions call the native ones, it is also necessary to add -lpcre.         functions call the native ones, it is also necessary to add -lpcre.
6057    
6058         I have implemented only those option bits that can be reasonably mapped         I have implemented only those POSIX option bits that can be  reasonably
6059         to PCRE native options. In addition, the option REG_EXTENDED is defined         mapped  to PCRE native options. In addition, the option REG_EXTENDED is
6060         with the value zero. This has no effect, but since  programs  that  are         defined with the value zero. This has no  effect,  but  since  programs
6061         written  to  the  POSIX interface often use it, this makes it easier to         that  are  written  to  the POSIX interface often use it, this makes it
6062         slot in PCRE as a replacement library. Other POSIX options are not even         easier to slot in PCRE as a replacement library.  Other  POSIX  options
6063         defined.         are not even defined.
6064    
6065         When  PCRE  is  called  via these functions, it is only the API that is         When  PCRE  is  called  via these functions, it is only the API that is
6066         POSIX-like in style. The syntax and semantics of  the  regular  expres-         POSIX-like in style. The syntax and semantics of  the  regular  expres-
# Line 5650  COMPILING A PATTERN Line 6136  COMPILING A PATTERN
6136         is  public: re_nsub contains the number of capturing subpatterns in the         is  public: re_nsub contains the number of capturing subpatterns in the
6137         regular expression. Various error codes are defined in the header file.         regular expression. Various error codes are defined in the header file.
6138    
6139           NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
6140           use the contents of the preg structure. If, for example, you pass it to
6141           regexec(), the result is undefined and your program is likely to crash.
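
The NOTE above amounts to checking the yield of regcomp() before doing anything else with preg. A minimal sketch follows; checked_match() is a hypothetical wrapper, and the example includes the system <regex.h> so that it is self-contained. To use PCRE's POSIX wrapper instead, the assumption is that pcreposix.h would be included and the program linked with -lpcreposix -lpcre.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical wrapper: returns 1 if subject matches pattern, 0 if it
 * does not, and -1 if the pattern fails to compile. When regcomp()
 * yields non-zero, preg is never passed to regexec() or regfree().
 */
int checked_match(const char *pattern, const char *subject)
{
    regex_t preg;
    int rc;

    assert(pattern != NULL && subject != NULL);

    rc = regcomp(&preg, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &preg, msg, sizeof msg);   /* map code to a message */
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return -1;
    }

    rc = regexec(&preg, subject, 0, NULL, 0);
    regfree(&preg);                             /* only after success */
    return rc == 0 ? 1 : 0;
}
```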
6142    
6143    
6144  MATCHING NEWLINE CHARACTERS  MATCHING NEWLINE CHARACTERS
6145    
6146         This area is not simple, because POSIX and Perl take different views of         This area is not simple, because POSIX and Perl take different views of
6147         things.  It is not possible to get PCRE to obey  POSIX  semantics,  but         things.   It  is  not possible to get PCRE to obey POSIX semantics, but
6148         then  PCRE was never intended to be a POSIX engine. The following table         then PCRE was never intended to be a POSIX engine. The following  table
6149         lists the different possibilities for matching  newline  characters  in         lists  the  different  possibilities for matching newline characters in
6150         PCRE:         PCRE:
6151    
6152                                   Default   Change with                                   Default   Change with
# Line 5678  MATCHING NEWLINE CHARACTERS Line 6168  MATCHING NEWLINE CHARACTERS
6168           ^ matches \n in middle     no     REG_NEWLINE           ^ matches \n in middle     no     REG_NEWLINE
6169    
6170         PCRE's behaviour is the same as Perl's, except that there is no equiva-         PCRE's behaviour is the same as Perl's, except that there is no equiva-
6171         lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,  there  is         lent  for  PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
6172         no way to stop newline from matching [^a].         no way to stop newline from matching [^a].
6173    
6174         The   default  POSIX  newline  handling  can  be  obtained  by  setting         The  default  POSIX  newline  handling  can  be  obtained  by   setting
6175         PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to  make  PCRE         PCRE_DOTALL  and  PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
6176         behave exactly as for the REG_NEWLINE action.         behave exactly as for the REG_NEWLINE action.
6177    
6178    
6179  MATCHING A PATTERN  MATCHING A PATTERN
6180    
6181         The  function  regexec()  is  called  to  match a compiled pattern preg         The function regexec() is called  to  match  a  compiled  pattern  preg
6182         against a given string, which is terminated by a zero byte, subject  to         against  a  given string, which is by default terminated by a zero byte
6183         the options in eflags. These can be:         (but see REG_STARTEND below), subject to the options in  eflags.  These
6184           can be:
6185    
6186           REG_NOTBOL           REG_NOTBOL
6187    
6188         The PCRE_NOTBOL option is set when calling the underlying PCRE matching         The PCRE_NOTBOL option is set when calling the underlying PCRE matching
6189         function.         function.
6190    
6191             REG_NOTEMPTY
6192    
6193           The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
6194           ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
6195           However, setting this option can give more POSIX-like behaviour in some
6196           situations.
6197    
6198           REG_NOTEOL           REG_NOTEOL
6199    
6200         The PCRE_NOTEOL option is set when calling the underlying PCRE matching         The PCRE_NOTEOL option is set when calling the underlying PCRE matching
6201         function.         function.
6202    
6203             REG_STARTEND
6204    
6205           The string is considered to start at string +  pmatch[0].rm_so  and  to
6206           have  a terminating NUL located at string + pmatch[0].rm_eo (there need
6207           not actually be a NUL at that location), regardless  of  the  value  of
6208           nmatch.  This  is a BSD extension, compatible with but not specified by
6209           IEEE Standard 1003.2 (POSIX.2), and should  be  used  with  caution  in
6210           software intended to be portable to other systems. Note that a non-zero
6211           rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
6212           of the string, not how it is matched.
6213    
6214         If  the pattern was compiled with the REG_NOSUB flag, no data about any         If  the pattern was compiled with the REG_NOSUB flag, no data about any
6215         matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of         matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
6216         regexec() are ignored.         regexec() are ignored.
# Line 5748  AUTHOR Line 6257  AUTHOR
6257    
6258  REVISION  REVISION
6259    
6260         Last updated: 06 March 2007         Last updated: 15 August 2009
6261         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
6262  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6263    
6264    
6265  PCRECPP(3)                                                          PCRECPP(3)  PCRECPP(3)                                                          PCRECPP(3)
6266    
6267    
# Line 5837  MATCHING INTERFACE Line 6346  MATCHING INTERFACE
6346    
6347           c. The "i"th argument has a suitable type for holding the           c. The "i"th argument has a suitable type for holding the
6348              string captured as the "i"th sub-pattern. If you pass in              string captured as the "i"th sub-pattern. If you pass in
6349              NULL for the "i"th argument, or pass fewer arguments than              void * NULL for the "i"th argument, or a non-void * NULL
6350                of the correct type, or pass fewer arguments than the
6351              number of sub-patterns, "i"th captured sub-pattern is              number of sub-patterns, "i"th captured sub-pattern is
6352              ignored.              ignored.
6353    
# Line 5852  MATCHING INTERFACE Line 6362  MATCHING INTERFACE
6362         need    more,    consider    using    the    more   general   interface         need    more,    consider    using    the    more   general   interface
6363         pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.         pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
6364    
6365           NOTE: Do not use no_arg, which is used internally to mark the end of  a
6366           list  of optional arguments, as a placeholder for missing arguments, as
6367           this can lead to segfaults.
6368    
6369    
6370  QUOTING METACHARACTERS  QUOTING METACHARACTERS
6371    
# Line 6085  AUTHOR Line 6599  AUTHOR
6599    
6600  REVISION  REVISION
6601    
6602         Last updated: 06 March 2007         Last updated: 17 March 2009
6603  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6604    
6605    
6606  PCRESAMPLE(3)                                                    PCRESAMPLE(3)  PCRESAMPLE(3)                                                    PCRESAMPLE(3)
6607    
6608    
# Line 6099  NAME Line 6613  NAME
6613  PCRE SAMPLE PROGRAM  PCRE SAMPLE PROGRAM
6614    
6615         A simple, complete demonstration program, to get you started with using         A simple, complete demonstration program, to get you started with using
6616         PCRE, is supplied in the file pcredemo.c in the PCRE distribution.         PCRE, is supplied in the file pcredemo.c in the  PCRE  distribution.  A
6617           listing  of this program is given in the pcredemo documentation. If you
6618           do not have a copy of the PCRE distribution, you can save this  listing
6619           to re-create pcredemo.c.
6620    
6621         The program compiles the regular expression that is its first argument,         The program compiles the regular expression that is its first argument,
6622         and  matches  it  against the subject string in its second argument. No         and matches it against the subject string in its  second  argument.  No
6623         PCRE options are set, and default character tables are used. If  match-         PCRE  options are set, and default character tables are used. If match-
6624         ing  succeeds,  the  program  outputs  the  portion of the subject that         ing succeeds, the program outputs  the  portion  of  the  subject  that
6625         matched, together with the contents of any captured substrings.         matched, together with the contents of any captured substrings.
6626    
6627         If the -g option is given on the command line, the program then goes on         If the -g option is given on the command line, the program then goes on
6628         to check for further matches of the same regular expression in the same         to check for further matches of the same regular expression in the same
6629         subject string. The logic is a little bit tricky because of the  possi-         subject  string. The logic is a little bit tricky because of the possi-
6630         bility  of  matching an empty string. Comments in the code explain what         bility of matching an empty string. Comments in the code  explain  what
6631         is going on.         is going on.
6632    
6633         The demonstration program is automatically built if you use  "./config-         If  PCRE  is  installed in the standard include and library directories
6634         ure;make"  to  build PCRE. Otherwise, if PCRE is installed in the stan-         for your system, you should be able to compile the  demonstration  pro-
6635         dard include and library directories for your  system,  you  should  be         gram using this command:
        able to compile the demonstration program using this command:  
6636    
6637           gcc -o pcredemo pcredemo.c -lpcre           gcc -o pcredemo pcredemo.c -lpcre
6638    
# Line 6139  PCRE SAMPLE PROGRAM Line 6655  PCRE SAMPLE PROGRAM
6655         expressions and the PCRE library. The pcredemo program is provided as a         expressions and the PCRE library. The pcredemo program is provided as a
6656         simple coding example.         simple coding example.
6657    
6658         On some operating systems (e.g. Solaris), when PCRE is not installed in         When you try to run pcredemo when PCRE is not installed in the standard
6659         the standard library directory, you may get an error like this when you         library  directory,  you  may  get an error like this on some operating
6660         try to run pcredemo:         systems (e.g. Solaris):
6661    
6662           ld.so.1: a.out: fatal: libpcre.so.0: open failed:  No  such  file  or           ld.so.1: a.out: fatal: libpcre.so.0: open failed:  No  such  file  or
6663         directory         directory
# Line 6163  AUTHOR Line 6679  AUTHOR
6679    
6680  REVISION  REVISION
6681    
6682         Last updated: 13 June 2007         Last updated: 01 September 2009
6683         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
6684  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6685  PCRESTACK(3)                                                      PCRESTACK(3)  PCRESTACK(3)                                                      PCRESTACK(3)
6686    
# Line 6230  PCRE DISCUSSION OF STACK USAGE Line 6746  PCRE DISCUSSION OF STACK USAGE
6746         ing long subject strings is to write repeated parenthesized subpatterns         ing long subject strings is to write repeated parenthesized subpatterns
6747         to match more than one character whenever possible.         to match more than one character whenever possible.
6748    
6749       Compiling PCRE to use heap instead of stack
6750    
6751         In environments where stack memory is constrained, you  might  want  to         In environments where stack memory is constrained, you  might  want  to
6752         compile  PCRE to use heap memory instead of stack for remembering back-         compile  PCRE to use heap memory instead of stack for remembering back-
6753         up points. This makes it run a lot more slowly, however. Details of how         up points. This makes it run a lot more slowly, however. Details of how
# Line 6242  PCRE DISCUSSION OF STACK USAGE Line 6760  PCRE DISCUSSION OF STACK USAGE
6760         freed in reverse order, it may be possible to implement customized mem-         freed in reverse order, it may be possible to implement customized mem-
6761         ory handlers that are more efficient than the standard functions.         ory handlers that are more efficient than the standard functions.
6762    
6763       Limiting PCRE's stack usage
6764    
6765           PCRE has an internal counter that can be used to  limit  the  depth  of
6766           recursion,  and  thus cause pcre_exec() to give an error code before it
6767           runs out of stack. By default, the limit is very  large,  and  unlikely
6768           ever  to operate. It can be changed when PCRE is built, and it can also
6769           be set when pcre_exec() is called. For details of these interfaces, see
6770           the pcrebuild and pcreapi documentation.
6771    
6772           As a very rough rule of thumb, you should reckon on about 500 bytes per
6773           recursion. Thus, if you want to limit your  stack  usage  to  8Mb,  you
6774           should  set  the  limit at 16000 recursions. A 64Mb stack, on the other
6775           hand, can support around 128000 recursions. The pcretest  test  program
6776           has a command line option (-S) that can be used to increase the size of
6777           its stack.
6778    
6779       Changing stack size in Unix-like systems
6780    
6781         In Unix-like environments, there is not often a problem with the  stack         In Unix-like environments, there is not often a problem with the  stack
6782         unless  very  long  strings  are  involved, though the default limit on         unless  very  long  strings  are  involved, though the default limit on
6783         stack size varies from system to system. Values from 8Mb  to  64Mb  are         stack size varies from system to system. Values from 8Mb  to  64Mb  are
# Line 6262  PCRE DISCUSSION OF STACK USAGE Line 6798  PCRE DISCUSSION OF STACK USAGE
6798         attempts to increase the soft limit to  100Mb  using  setrlimit().  You         attempts to increase the soft limit to  100Mb  using  setrlimit().  You
6799         must do this before calling pcre_exec().         must do this before calling pcre_exec().
6800    
6801         PCRE  has  an  internal  counter that can be used to limit the depth of     Changing stack size in Mac OS X
        recursion, and thus cause pcre_exec() to give an error code  before  it  
        runs  out  of  stack. By default, the limit is very large, and unlikely  
        ever to operate. It can be changed when PCRE is built, and it can  also  
        be set when pcre_exec() is called. For details of these interfaces, see  
        the pcrebuild and pcreapi documentation.  
6802    
6803         As a very rough rule of thumb, you should reckon on about 500 bytes per         Using setrlimit(), as described above, should also work on Mac OS X. It
6804         recursion.  Thus,  if  you  want  to limit your stack usage to 8Mb, you         is also possible to set a stack size when linking a program. There is a
6805         should set the limit at 16000 recursions. A 64Mb stack,  on  the  other         discussion   about   stack  sizes  in  Mac  OS  X  at  this  web  site:
6806         hand,  can  support around 128000 recursions. The pcretest test program         http://developer.apple.com/qa/qa2005/qa1419.html.
        has a command line option (-S) that can be used to increase the size of  
        its stack.  
6807