Diff of /code/trunk/doc/pcre.txt

revision 406 by ph10, Mon Mar 23 12:05:43 2009 UTC revision 453 by ph10, Fri Sep 18 19:12:35 2009 UTC
# Line 2  Line 2 
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
5  synopses of each function in the library have not been included. There are  synopses of each function in the library have not been included. Neither has
6  separate text files for the pcregrep and pcretest commands.  the pcredemo program. There are separate text files for the pcregrep and
7    pcretest commands.
8  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
9    
10    
# Line 24  INTRODUCTION Line 25  INTRODUCTION
25         tax items, and there is an option for  requesting  some  minor  changes         tax items, and there is an option for  requesting  some  minor  changes
26         that give better JavaScript compatibility.         that give better JavaScript compatibility.
27    
28         The  current  implementation of PCRE (release 7.x) corresponds approxi-         The  current implementation of PCRE (release 8.xx) corresponds approxi-
29         mately with Perl 5.10, including support for UTF-8 encoded strings  and         mately with Perl 5.10, including support for UTF-8 encoded strings  and
30         Unicode general category properties. However, UTF-8 and Unicode support         Unicode general category properties. However, UTF-8 and Unicode support
31         has to be explicitly enabled; it is not the default. The Unicode tables         has to be explicitly enabled; it is not the default. The Unicode tables
32         correspond to Unicode release 5.0.0.         correspond to Unicode release 5.1.
33    
34         In  addition to the Perl-compatible matching function, PCRE contains an         In  addition to the Perl-compatible matching function, PCRE contains an
35         alternative matching function that matches the same  compiled  patterns         alternative matching function that matches the same  compiled  patterns
# Line 71  USER DOCUMENTATION Line 72  USER DOCUMENTATION
72         The user documentation for PCRE comprises a number  of  different  sec-         The user documentation for PCRE comprises a number  of  different  sec-
73         tions.  In the "man" format, each of these is a separate "man page". In         tions.  In the "man" format, each of these is a separate "man page". In
74         the HTML format, each is a separate page, linked from the  index  page.         the HTML format, each is a separate page, linked from the  index  page.
75         In  the  plain text format, all the sections are concatenated, for ease         In  the  plain  text format, all the sections, except the pcredemo sec-
76         of searching. The sections are as follows:         tion, are concatenated, for ease of searching. The sections are as fol-
77           lows:
78    
79           pcre              this document           pcre              this document
80           pcre-config       show PCRE installation configuration information           pcre-config       show PCRE installation configuration information
# Line 81  USER DOCUMENTATION Line 83  USER DOCUMENTATION
83           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
84           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
85           pcrecpp           details of the C++ wrapper           pcrecpp           details of the C++ wrapper
86             pcredemo          a demonstration C program that uses PCRE
87           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
88           pcrematching      discussion of the two matching algorithms           pcrematching      discussion of the two matching algorithms
89           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
# Line 90  USER DOCUMENTATION Line 93  USER DOCUMENTATION
93           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
94           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API
95           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
96           pcresample        discussion of the sample program           pcresample        discussion of the pcredemo program
97           pcrestack         discussion of stack usage           pcrestack         discussion of stack usage
98           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
99    
100         In addition, in the "man" and HTML formats, there is a short  page  for         In  addition,  in the "man" and HTML formats, there is a short page for
101         each C library function, listing its arguments and results.         each C library function, listing its arguments and results.
102    
103    
104  LIMITATIONS  LIMITATIONS
105    
106         There  are some size limitations in PCRE but it is hoped that they will         There are some size limitations in PCRE but it is hoped that they  will
107         never in practice be relevant.         never in practice be relevant.
108    
109         The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE         The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
110         is compiled with the default internal linkage size of 2. If you want to         is compiled with the default internal linkage size of 2. If you want to
111         process regular expressions that are truly enormous,  you  can  compile         process  regular  expressions  that are truly enormous, you can compile
112         PCRE  with  an  internal linkage size of 3 or 4 (see the README file in         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
113         the source distribution and the pcrebuild documentation  for  details).         the  source  distribution and the pcrebuild documentation for details).
114         In  these  cases the limit is substantially larger.  However, the speed         In these cases the limit is substantially larger.  However,  the  speed
115         of execution is slower.         of execution is slower.
116    
117         All values in repeating quantifiers must be less than 65536.         All values in repeating quantifiers must be less than 65536.
# Line 119  LIMITATIONS Line 122  LIMITATIONS
122         The maximum length of name for a named subpattern is 32 characters, and         The maximum length of name for a named subpattern is 32 characters, and
123         the maximum number of named subpatterns is 10000.         the maximum number of named subpatterns is 10000.
124    
125         The maximum length of a subject string is the largest  positive  number         The  maximum  length of a subject string is the largest positive number
126         that  an integer variable can hold. However, when using the traditional         that an integer variable can hold. However, when using the  traditional
127         matching function, PCRE uses recursion to handle subpatterns and indef-         matching function, PCRE uses recursion to handle subpatterns and indef-
128         inite  repetition.  This means that the available stack space may limit         inite repetition.  This means that the available stack space may  limit
129         the size of a subject string that can be processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
130         For a discussion of stack issues, see the pcrestack documentation.         For a discussion of stack issues, see the pcrestack documentation.
131    
132    
133  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
134    
135         From  release  3.3,  PCRE  has  had  some support for character strings         From release 3.3, PCRE has  had  some  support  for  character  strings
136         encoded in the UTF-8 format. For release 4.0 this was greatly  extended         encoded  in the UTF-8 format. For release 4.0 this was greatly extended
137         to  cover  most common requirements, and in release 5.0 additional sup-         to cover most common requirements, and in release 5.0  additional  sup-
138         port for Unicode general category properties was added.         port for Unicode general category properties was added.
139    
140         In order to process UTF-8 strings, you must build PCRE to include UTF-8         In order to process UTF-8 strings, you must build PCRE to include UTF-8
141         support  in  the  code,  and, in addition, you must call pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
142         with the PCRE_UTF8 option flag. When you do this, both the pattern  and         with  the  PCRE_UTF8  option  flag,  or the pattern must start with the
143         any  subject  strings  that are matched against it are treated as UTF-8         sequence (*UTF8). When either of these is the case,  both  the  pattern
144         strings instead of just strings of bytes.         and  any  subject  strings  that  are matched against it are treated as
145           UTF-8 strings instead of just strings of bytes.
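
The difference between "UTF-8 strings" and "strings of bytes" can be illustrated without PCRE itself. The following is a minimal sketch, not PCRE code: the function name utf8_char_count is hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a NUL-terminated UTF-8 string by counting
   every byte that is not a continuation byte (bit pattern 10xxxxxx).
   Hypothetical helper for illustration; no validation is performed. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "caf\xC3\xA9" this returns 4, although the string occupies 5 bytes. A pattern item that matches one character therefore consumes two bytes at the end of this subject in UTF-8 mode, but only one byte when the subject is treated as plain bytes.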
146    
147         If you compile PCRE with UTF-8 support, but do not use it at run  time,         If you compile PCRE with UTF-8 support, but do not use it at run  time,
148         the  library will be a bit bigger, but the additional run time overhead         the  library will be a bit bigger, but the additional run time overhead
# Line 259  AUTHOR Line 263  AUTHOR
263    
264  REVISION  REVISION
265    
266         Last updated: 18 March 2009         Last updated: 01 September 2009
267         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
268  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
269    
270    
271  PCREBUILD(3)                                                      PCREBUILD(3)  PCREBUILD(3)                                                      PCREBUILD(3)
272    
273    
# Line 278  PCRE BUILD-TIME OPTIONS Line 282  PCRE BUILD-TIME OPTIONS
282         script,  where the optional features are selected or deselected by pro-         script,  where the optional features are selected or deselected by pro-
283         viding options to configure before running the make  command.  However,         viding options to configure before running the make  command.  However,
284         the  same  options  can be selected in both Unix-like and non-Unix-like         the  same  options  can be selected in both Unix-like and non-Unix-like
285         environments using the GUI facility of  CMakeSetup  if  you  are  using         environments using the GUI facility of cmake-gui if you are using CMake
286         CMake instead of configure to build PCRE.         instead of configure to build PCRE.
287    
288           There  is  a  lot more information about building PCRE in non-Unix-like
289           environments in the file called NON_UNIX_USE, which is part of the PCRE
290           distribution.  You  should consult this file as well as the README file
291           if you are building in a non-Unix-like environment.
292    
293         The complete list of options for configure (which includes the standard         The complete list of options for configure (which includes the standard
294         ones such as the  selection  of  the  installation  directory)  can  be         ones  such  as  the  selection  of  the  installation directory) can be
295         obtained by running         obtained by running
296    
297           ./configure --help           ./configure --help
298    
299         The  following  sections  include  descriptions  of options whose names         The following sections include  descriptions  of  options  whose  names
300         begin with --enable or --disable. These settings specify changes to the         begin with --enable or --disable. These settings specify changes to the
301         defaults  for  the configure command. Because of the way that configure         defaults for the configure command. Because of the way  that  configure
302         works, --enable and --disable always come in pairs, so  the  complemen-         works,  --enable  and --disable always come in pairs, so the complemen-
303         tary  option always exists as well, but as it specifies the default, it         tary option always exists as well, but as it specifies the default,  it
304         is not described.         is not described.
305    
306    
# Line 312  UTF-8 SUPPORT Line 321  UTF-8 SUPPORT
321    
322           --enable-utf8           --enable-utf8
323    
324         to the configure command. Of itself, this  does  not  make  PCRE  treat         to  the  configure  command.  Of  itself, this does not make PCRE treat
325         strings  as UTF-8. As well as compiling PCRE with this option, you also         strings as UTF-8. As well as compiling PCRE with this option, you  also
326         have to set the PCRE_UTF8 option when you call the  pcre_compile()         have to set the PCRE_UTF8 option when you call the pcre_compile()
327         function.         function.
328    
329         If  you set --enable-utf8 when compiling in an EBCDIC environment, PCRE         If you set --enable-utf8 when compiling in an EBCDIC environment,  PCRE
330         expects its input to be either ASCII or UTF-8 (depending on the runtime         expects its input to be either ASCII or UTF-8 (depending on the runtime
331         option).  It  is not possible to support both EBCDIC and UTF-8 codes in         option). It is not possible to support both EBCDIC and UTF-8  codes  in
332         the same  version  of  the  library.  Consequently,  --enable-utf8  and         the  same  version  of  the  library.  Consequently,  --enable-utf8 and
333         --enable-ebcdic are mutually exclusive.         --enable-ebcdic are mutually exclusive.
334    
335    
336  UNICODE CHARACTER PROPERTY SUPPORT  UNICODE CHARACTER PROPERTY SUPPORT
337    
338         UTF-8  support allows PCRE to process character values greater than 255         UTF-8 support allows PCRE to process character values greater than  255
339         in the strings that it handles. On its own, however, it does  not  pro-         in  the  strings that it handles. On its own, however, it does not pro-
340         vide any facilities for accessing the properties of such characters. If         vide any facilities for accessing the properties of such characters. If
341         you want to be able to use the pattern escapes \P, \p,  and  \X,  which         you  want  to  be able to use the pattern escapes \P, \p, and \X, which
342         refer to Unicode character properties, you must add         refer to Unicode character properties, you must add
343    
344           --enable-unicode-properties           --enable-unicode-properties
345    
346         to  the configure command. This implies UTF-8 support, even if you have         to the configure command. This implies UTF-8 support, even if you  have
347         not explicitly requested it.         not explicitly requested it.
348    
349         Including Unicode property support adds around 30K  of  tables  to  the         Including  Unicode  property  support  adds around 30K of tables to the
350         PCRE  library.  Only  the general category properties such as Lu and Nd         PCRE library. Only the general category properties such as  Lu  and  Nd
351         are supported. Details are given in the pcrepattern documentation.         are supported. Details are given in the pcrepattern documentation.
352    
353    
354  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
355    
356         By default, PCRE interprets the linefeed (LF) character  as  indicating         By  default,  PCRE interprets the linefeed (LF) character as indicating
357         the  end  of  a line. This is the normal newline character on Unix-like         the end of a line. This is the normal newline  character  on  Unix-like
358         systems. You can compile PCRE to use carriage return (CR)  instead,  by         systems.  You  can compile PCRE to use carriage return (CR) instead, by
359         adding         adding
360    
361           --enable-newline-is-cr           --enable-newline-is-cr
362    
363         to  the  configure  command.  There  is  also  a --enable-newline-is-lf         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
364         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
365    
366         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 363  CODE VALUE OF NEWLINE Line 372  CODE VALUE OF NEWLINE
372    
373           --enable-newline-is-anycrlf           --enable-newline-is-anycrlf
374    
375         which  causes  PCRE  to recognize any of the three sequences CR, LF, or         which causes PCRE to recognize any of the three sequences  CR,  LF,  or
376         CRLF as indicating a line ending. Finally, a fifth option, specified by         CRLF as indicating a line ending. Finally, a fifth option, specified by
377    
378           --enable-newline-is-any           --enable-newline-is-any
379    
380         causes PCRE to recognize any Unicode newline sequence.         causes PCRE to recognize any Unicode newline sequence.
381    
382         Whatever line ending convention is selected when PCRE is built  can  be         Whatever  line  ending convention is selected when PCRE is built can be
383         overridden  when  the library functions are called. At build time it is         overridden when the library functions are called. At build time  it  is
384         conventional to use the standard for your operating system.         conventional to use the standard for your operating system.
385    
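The CR, LF, CRLF, and "anycrlf" conventions described above can be sketched as a small recognizer. This is an illustration only: the enum and function names are hypothetical, PCRE's internals differ, and the "any Unicode newline" convention is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for four of the conventions named in the text. */
enum newline_mode { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Return the length in bytes of the line ending that starts at s[pos],
   or 0 if no line ending starts there under the given convention. */
size_t newline_length(const char *s, size_t len, size_t pos,
                      enum newline_mode mode)
{
    int is_cr, is_lf, is_crlf;
    assert(s != NULL);
    if (pos >= len)
        return 0;
    is_cr = (s[pos] == '\r');
    is_lf = (s[pos] == '\n');
    is_crlf = is_cr && pos + 1 < len && s[pos + 1] == '\n';
    switch (mode) {
    case NL_CR:      return is_cr ? 1 : 0;
    case NL_LF:      return is_lf ? 1 : 0;
    case NL_CRLF:    return is_crlf ? 2 : 0;
    case NL_ANYCRLF: return is_crlf ? 2 : ((is_cr || is_lf) ? 1 : 0);
    }
    return 0;
}
```

Passing the mode as an argument mirrors the point made in the text: the build-time choice is only a default, and the convention can be chosen again for each call.
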
386    
387  WHAT \R MATCHES  WHAT \R MATCHES
388    
389         By default, the sequence \R in a pattern matches  any  Unicode  newline         By  default,  the  sequence \R in a pattern matches any Unicode newline
390         sequence,  whatever  has  been selected as the line ending sequence. If         sequence, whatever has been selected as the line  ending  sequence.  If
391         you specify         you specify
392    
393           --enable-bsr-anycrlf           --enable-bsr-anycrlf
394    
395         the default is changed so that \R matches only CR, LF, or  CRLF.  What-         the  default  is changed so that \R matches only CR, LF, or CRLF. What-
396         ever  is selected when PCRE is built can be overridden when the library         ever is selected when PCRE is built can be overridden when the  library
397         functions are called.         functions are called.
398    
399    
400  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
401    
402         The PCRE building process uses libtool to build both shared and  static         The  PCRE building process uses libtool to build both shared and static
403         Unix  libraries by default. You can suppress one of these by adding one         Unix libraries by default. You can suppress one of these by adding  one
404         of         of
405    
406           --disable-shared           --disable-shared
# Line 403  BUILDING SHARED AND STATIC LIBRARIES Line 412  BUILDING SHARED AND STATIC LIBRARIES
412  POSIX MALLOC USAGE  POSIX MALLOC USAGE
413    
414         When PCRE is called through the POSIX interface (see the pcreposix doc-         When PCRE is called through the POSIX interface (see the pcreposix doc-
415         umentation),  additional  working  storage  is required for holding the         umentation), additional working storage is  required  for  holding  the
416         pointers to capturing substrings, because PCRE requires three  integers         pointers  to capturing substrings, because PCRE requires three integers
417         per  substring,  whereas  the POSIX interface provides only two. If the         per substring, whereas the POSIX interface provides only  two.  If  the
418         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
419         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
420         The default threshold above which the stack is no longer used is 10; it         The default threshold above which the stack is no longer used is 10; it
# Line 418  POSIX MALLOC USAGE Line 427  POSIX MALLOC USAGE
427    
428  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
429    
430         Within  a  compiled  pattern,  offset values are used to point from one         Within a compiled pattern, offset values are used  to  point  from  one
431         part to another (for example, from an opening parenthesis to an  alter-         part  to another (for example, from an opening parenthesis to an alter-
432         nation  metacharacter).  By default, two-byte values are used for these         nation metacharacter). By default, two-byte values are used  for  these
433         offsets, leading to a maximum size for a  compiled  pattern  of  around         offsets,  leading  to  a  maximum size for a compiled pattern of around
434         64K.  This  is sufficient to handle all but the most gigantic patterns.         64K. This is sufficient to handle all but the most  gigantic  patterns.
435         Nevertheless, some people do want to process enormous patterns,  so  it         Nevertheless,  some  people do want to process enormous patterns, so it
436         is  possible  to compile PCRE to use three-byte or four-byte offsets by         is possible to compile PCRE to use three-byte or four-byte  offsets  by
437         adding a setting such as         adding a setting such as
438    
439           --with-link-size=3           --with-link-size=3
440    
441         to the configure command. The value given must be 2,  3,  or  4.  Using         to  the  configure  command.  The value given must be 2, 3, or 4. Using
442         longer  offsets slows down the operation of PCRE because it has to load         longer offsets slows down the operation of PCRE because it has to  load
443         additional bytes when handling them.         additional bytes when handling them.
444    
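Why a two-byte offset leads to a limit of around 64K, and why wider offsets cost extra byte loads, can be seen in a short sketch. The function names are hypothetical and the most-significant-byte-first layout is an assumption made for this example, not a statement about PCRE's compiled format.

```c
#include <assert.h>

/* Store `value` in `width` bytes, most significant byte first.
   The byte order is an assumption made for this sketch. */
void put_offset(unsigned char *p, int width, unsigned long value)
{
    int i;
    assert(width >= 2 && width <= 4);
    for (i = width - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Read an offset back; one load and one shift per byte of width. */
unsigned long get_offset(const unsigned char *p, int width)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < width; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset representable in `width` bytes. */
unsigned long max_offset(int width)
{
    return (width >= 4) ? 0xFFFFFFFFUL : (1UL << (8 * width)) - 1;
}
```

With a width of 2 the largest representable offset is 65535, which matches the "around 64K" figure above. These values show only what the offset field can address; the limits PCRE actually enforces for link sizes 3 and 4 may be lower.
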
445    
446  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
447    
448         When matching with the pcre_exec() function, PCRE implements backtrack-         When matching with the pcre_exec() function, PCRE implements backtrack-
449         ing  by  making recursive calls to an internal function called match().         ing by making recursive calls to an internal function  called  match().
450         In environments where the size of the stack is limited,  this  can  se-         In  environments  where  the size of the stack is limited, this can se-
451         verely  limit  PCRE's operation. (The Unix environment does not usually         verely limit PCRE's operation. (The Unix environment does  not  usually
452         suffer from this problem, but it may sometimes be necessary to increase         suffer from this problem, but it may sometimes be necessary to increase
453         the  maximum  stack size.  There is a discussion in the pcrestack docu-         the maximum stack size.  There is a discussion in the  pcrestack  docu-
454         mentation.) An alternative approach to recursion that uses memory  from         mentation.)  An alternative approach to recursion that uses memory from
455         the  heap  to remember data, instead of using recursive function calls,         the heap to remember data, instead of using recursive  function  calls,
456         has been implemented to work round the problem of limited  stack  size.         has  been  implemented to work round the problem of limited stack size.
457         If you want to build a version of PCRE that works this way, add         If you want to build a version of PCRE that works this way, add
458    
459           --disable-stack-for-recursion           --disable-stack-for-recursion
460    
461         to  the  configure  command. With this configuration, PCRE will use the         to the configure command. With this configuration, PCRE  will  use  the
462         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-         pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
463         ment  functions. By default these point to malloc() and free(), but you         ment functions. By default these point to malloc() and free(), but  you
464         can replace the pointers so that your own functions are used.         can replace the pointers so that your own functions are used.
465    
466         Separate functions are  provided  rather  than  using  pcre_malloc  and         Separate  functions  are  provided  rather  than  using pcre_malloc and
467         pcre_free  because  the  usage  is  very  predictable:  the block sizes         pcre_free because the  usage  is  very  predictable:  the  block  sizes
468         requested are always the same, and  the  blocks  are  always  freed  in         requested  are  always  the  same,  and  the blocks are always freed in
469         reverse  order.  A calling program might be able to implement optimized         reverse order. A calling program might be able to  implement  optimized
470         functions that perform better  than  malloc()  and  free().  PCRE  runs         functions  that  perform  better  than  malloc()  and free(). PCRE runs
471         noticeably more slowly when built in this way. This option affects only         noticeably more slowly when built in this way. This option affects only
472         the  pcre_exec()  function;  it  is  not  relevant  for  the         the  pcre_exec()  function;  it  is  not  relevant  for  the
473         pcre_dfa_exec() function.         pcre_dfa_exec() function.
474    
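Because the blocks are all the same size and are freed in reverse order, a replacement allocator can be as simple as a stack of fixed-size slots. The following is a minimal sketch under stated assumptions: BLOCK_SIZE and BLOCK_COUNT are hypothetical values, the function names are invented for this example, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE  256   /* hypothetical fixed block size */
#define BLOCK_COUNT 64    /* hypothetical pool capacity */

/* The union member `align` only forces a conservative alignment. */
static union {
    unsigned char bytes[BLOCK_SIZE];
    long double align;
} pool[BLOCK_COUNT];
static size_t pool_top = 0;   /* number of blocks currently handed out */

/* Hand out the next free slot, or NULL if the request does not fit
   or the pool is exhausted. */
void *lifo_malloc(size_t size)
{
    if (size > BLOCK_SIZE || pool_top == BLOCK_COUNT)
        return NULL;
    return pool[pool_top++].bytes;
}

/* Release the most recently allocated block; the assertion checks
   that blocks really are freed in reverse order. */
void lifo_free(void *block)
{
    assert(pool_top > 0 && block == (void *)pool[pool_top - 1].bytes);
    pool_top--;
}
```

In a program that includes pcre.h and links against a library built with --disable-stack-for-recursion, functions of this shape could be assigned to the pcre_stack_malloc and pcre_stack_free variables named above. A production version would probably fall back to malloc() when the pool is exhausted rather than return NULL.
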
475    
476  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
477    
478         Internally,  PCRE has a function called match(), which it calls repeat-         Internally, PCRE has a function called match(), which it calls  repeat-
479         edly  (sometimes  recursively)  when  matching  a  pattern   with   the         edly   (sometimes   recursively)  when  matching  a  pattern  with  the
480         pcre_exec()  function.  By controlling the maximum number of times this         pcre_exec() function. By controlling the maximum number of  times  this
481         function may be called during a single matching operation, a limit  can         function  may be called during a single matching operation, a limit can
482         be  placed  on  the resources used by a single call to pcre_exec(). The         be placed on the resources used by a single call  to  pcre_exec().  The
483         limit can be changed at run time, as described in the pcreapi  documen-         limit  can be changed at run time, as described in the pcreapi documen-
484         tation.  The default is 10 million, but this can be changed by adding a         tation. The default is 10 million, but this can be changed by adding  a
485         setting such as         setting such as
486    
487           --with-match-limit=500000           --with-match-limit=500000
488    
489         to  the  configure  command.  This  setting  has  no  effect   on   the         to   the   configure  command.  This  setting  has  no  effect  on  the
490         pcre_dfa_exec() matching function.         pcre_dfa_exec() matching function.
491    
492         In  some  environments  it is desirable to limit the depth of recursive         In some environments it is desirable to limit the  depth  of  recursive
493         calls of match() more strictly than the total number of calls, in order         calls of match() more strictly than the total number of calls, in order
494         to  restrict  the maximum amount of stack (or heap, if --disable-stack-         to restrict the maximum amount of stack (or heap,  if  --disable-stack-
495         for-recursion is specified) that is used. A second limit controls this;         for-recursion is specified) that is used. A second limit controls this;
496         it  defaults  to  the  value  that is set for --with-match-limit, which         it defaults to the value that  is  set  for  --with-match-limit,  which
497         imposes no additional constraints. However, you can set a  lower  limit         imposes  no  additional constraints. However, you can set a lower limit
498         by adding, for example,         by adding, for example,
499    
500           --with-match-limit-recursion=10000           --with-match-limit-recursion=10000
501    
502         to  the  configure  command.  This  value can also be overridden at run         to the configure command. This value can  also  be  overridden  at  run
503         time.         time.
504    
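The two limits can be illustrated with a toy backtracking matcher that counts its own calls and tracks its nesting depth. This is a sketch only: the matcher supports just literals, '.', and a postfix '*', and the names and error values are hypothetical rather than PCRE's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum { MATCHED = 1, NO_MATCH = 0,
       ERR_MATCH_LIMIT = -1, ERR_RECURSION_LIMIT = -2 };

struct limits {
    unsigned long match_limit;      /* maximum total number of calls */
    unsigned long recursion_limit;  /* maximum nesting depth */
    unsigned long calls;            /* calls made so far */
};

/* Match the whole of `text` against `pat`, giving up with an error
   code as soon as either limit is exceeded. */
int toy_match(const char *pat, const char *text,
              unsigned long depth, struct limits *lim)
{
    int rc;
    if (++lim->calls > lim->match_limit)
        return ERR_MATCH_LIMIT;
    if (depth > lim->recursion_limit)
        return ERR_RECURSION_LIMIT;
    if (*pat == '\0')
        return (*text == '\0') ? MATCHED : NO_MATCH;
    if (pat[1] == '*') {
        /* Try zero repetitions first, then consume one character. */
        rc = toy_match(pat + 2, text, depth + 1, lim);
        if (rc != NO_MATCH)
            return rc;
        if (*text != '\0' && (*pat == '.' || *pat == *text))
            return toy_match(pat, text + 1, depth + 1, lim);
        return NO_MATCH;
    }
    if (*text != '\0' && (*pat == '.' || *pat == *text))
        return toy_match(pat + 1, text + 1, depth + 1, lim);
    return NO_MATCH;
}
```

Matching "a*b" against "aaab" takes 9 calls and reaches depth 5 in this sketch, so it succeeds with generous limits, fails with ERR_MATCH_LIMIT when match_limit is 5, and fails with ERR_RECURSION_LIMIT when recursion_limit is 3. The total call count bounds the time spent, while the depth bounds the stack (or heap) in use at any one moment.
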
505    
506  CREATING CHARACTER TABLES AT BUILD TIME  CREATING CHARACTER TABLES AT BUILD TIME
507    
508         PCRE uses fixed tables for processing characters whose code values  are         PCRE  uses fixed tables for processing characters whose code values are
509         less  than 256. By default, PCRE is built with a set of tables that are         less than 256. By default, PCRE is built with a set of tables that  are
510         distributed in the file pcre_chartables.c.dist. These  tables  are  for         distributed  in  the  file pcre_chartables.c.dist. These tables are for
511         ASCII codes only. If you add         ASCII codes only. If you add
512    
513           --enable-rebuild-chartables           --enable-rebuild-chartables
514    
515         to  the  configure  command, the distributed tables are no longer used.         to the configure command, the distributed tables are  no  longer  used.
516         Instead, a program called dftables is compiled and  run.  This  outputs         Instead,  a  program  called dftables is compiled and run. This outputs
517         the source for a new set of tables, created in the default locale of your         the source for a new set of tables, created in the default locale of your
518         C runtime system. (This method of replacing the tables does not work if         C runtime system. (This method of replacing the tables does not work if
519         you  are cross compiling, because dftables is run on the local host. If         you are cross compiling, because dftables is run on the local host.  If
520         you need to create alternative tables when cross  compiling,  you  will         you  need  to  create alternative tables when cross compiling, you will
521         have to do so "by hand".)         have to do so "by hand".)
522    
523    
524  USING EBCDIC CODE  USING EBCDIC CODE
525    
526         PCRE  assumes  by  default that it will run in an environment where the         PCRE assumes by default that it will run in an  environment  where  the
527         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).
528         This  is  the  case for most computer operating systems. PCRE can, how-         This is the case for most computer operating systems.  PCRE  can,  how-
529         ever, be compiled to run in an EBCDIC environment by adding         ever, be compiled to run in an EBCDIC environment by adding
530    
531           --enable-ebcdic           --enable-ebcdic
532    
533         to the configure command. This setting implies --enable-rebuild-charta-         to the configure command. This setting implies --enable-rebuild-charta-
534         bles.  You  should  only  use  it if you know that you are in an EBCDIC         bles. You should only use it if you know that  you  are  in  an  EBCDIC
535         environment (for example,  an  IBM  mainframe  operating  system).  The         environment  (for  example,  an  IBM  mainframe  operating system). The
536         --enable-ebcdic option is incompatible with --enable-utf8.         --enable-ebcdic option is incompatible with --enable-utf8.
537    
538    
# Line 537  PCREGREP OPTIONS FOR COMPRESSED FILE SUP Line 546  PCREGREP OPTIONS FOR COMPRESSED FILE SUP
546           --enable-pcregrep-libbz2           --enable-pcregrep-libbz2
547    
548         to the configure command. These options naturally require that the rel-         to the configure command. These options naturally require that the rel-
549         evant libraries are installed on your system. Configuration  will  fail         evant  libraries  are installed on your system. Configuration will fail
550         if they are not.         if they are not.
551    
552    
# Line 547  PCRETEST OPTION FOR LIBREADLINE SUPPORT Line 556  PCRETEST OPTION FOR LIBREADLINE SUPPORT
556    
557           --enable-pcretest-libreadline           --enable-pcretest-libreadline
558    
559         to  the  configure  command,  pcretest  is  linked with the libreadline         to the configure command,  pcretest  is  linked  with  the  libreadline
560         library, and when its input is from a terminal, it reads it  using  the         library,  and  when its input is from a terminal, it reads it using the
561         readline() function. This provides line-editing and history facilities.         readline() function. This provides line-editing and history facilities.
562         Note that libreadline is GPL-licenced, so if you distribute a binary of         Note that libreadline is GPL-licenced, so if you distribute a binary of
563         pcretest linked in this way, there may be licensing issues.         pcretest linked in this way, there may be licensing issues.
564    
565         Setting  this  option  causes  the -lreadline option to be added to the         Setting this option causes the -lreadline option to  be  added  to  the
566         pcretest build. In many operating environments with  a  system-installed         pcretest  build.  In many operating environments with a system-installed
567         libreadline this is sufficient. However, in some environments (e.g.  if         libreadline this is sufficient. However, in some environments (e.g.  if
568         an unmodified distribution version of readline is in use),  some  extra         an  unmodified  distribution version of readline is in use), some extra
569         configuration  may  be necessary. The INSTALL file for libreadline says         configuration may be necessary. The INSTALL file for  libreadline  says
570         this:         this:
571    
572           "Readline uses the termcap functions, but does not link with the           "Readline uses the termcap functions, but does not link with the
573           termcap or curses library itself, allowing applications which link           termcap or curses library itself, allowing applications which link
574           with readline the to choose an appropriate library."           with readline the to choose an appropriate library."
575    
576         If your environment has not been set up so that an appropriate  library         If  your environment has not been set up so that an appropriate library
577         is automatically included, you may need to add something like         is automatically included, you may need to add something like
578    
579           LIBS="-lncurses"           LIBS="-lncurses"
# Line 586  AUTHOR Line 595  AUTHOR
595    
596  REVISION  REVISION
597    
598         Last updated: 17 March 2009         Last updated: 06 September 2009
599         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
600  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
601    
602    
603  PCREMATCHING(3)                                                PCREMATCHING(3)  PCREMATCHING(3)                                                PCREMATCHING(3)
604    
605    
# Line 692  THE ALTERNATIVE MATCHING ALGORITHM Line 701  THE ALTERNATIVE MATCHING ALGORITHM
701         at the fourth character of the subject. The algorithm does not automat-         at the fourth character of the subject. The algorithm does not automat-
702         ically move on to find matches that start at later positions.         ically move on to find matches that start at later positions.
703    
704           Although the general principle of this matching algorithm  is  that  it
705           scans  the subject string only once, without backtracking, there is one
706           exception: when a lookbehind assertion is  encountered,  the  preceding
707           characters have to be re-inspected.
708    
709         There are a number of features of PCRE regular expressions that are not         There are a number of features of PCRE regular expressions that are not
710         supported by the alternative matching algorithm. They are as follows:         supported by the alternative matching algorithm. They are as follows:
711    
712         1.  Because  the  algorithm  finds  all possible matches, the greedy or         1. Because the algorithm finds all  possible  matches,  the  greedy  or
713         ungreedy nature of repetition quantifiers is not relevant.  Greedy  and         ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
714         ungreedy quantifiers are treated in exactly the same way. However, pos-         ungreedy quantifiers are treated in exactly the same way. However, pos-
715         sessive quantifiers can make a difference when what follows could  also         sessive  quantifiers can make a difference when what follows could also
716         match what is quantified, for example in a pattern like this:         match what is quantified, for example in a pattern like this:
717    
718           ^a++\w!           ^a++\w!
719    
720         This  pattern matches "aaab!" but not "aaa!", which would be matched by         This pattern matches "aaab!" but not "aaa!", which would be matched  by
721         a non-possessive quantifier. Similarly, if an atomic group is  present,         a  non-possessive quantifier. Similarly, if an atomic group is present,
722         it  is matched as if it were a standalone pattern at the current point,         it is matched as if it were a standalone pattern at the current  point,
723         and the longest match is then "locked in" for the rest of  the  overall         and  the  longest match is then "locked in" for the rest of the overall
724         pattern.         pattern.
725    
726         2. When dealing with multiple paths through the tree simultaneously, it         2. When dealing with multiple paths through the tree simultaneously, it
727         is not straightforward to keep track of  captured  substrings  for  the         is  not  straightforward  to  keep track of captured substrings for the
728         different  matching  possibilities,  and  PCRE's implementation of this         different matching possibilities, and  PCRE's  implementation  of  this
729         algorithm does not attempt to do this. This means that no captured sub-         algorithm does not attempt to do this. This means that no captured sub-
730         strings are available.         strings are available.
731    
732         3.  Because no substrings are captured, back references within the pat-         3. Because no substrings are captured, back references within the  pat-
733         tern are not supported, and cause errors if encountered.         tern are not supported, and cause errors if encountered.
734    
735         4. For the same reason, conditional expressions that use  a  backrefer-         4.  For  the same reason, conditional expressions that use a backrefer-
736         ence  as  the  condition or test for a specific group recursion are not         ence as the condition or test for a specific group  recursion  are  not
737         supported.         supported.
738    
739         5. Because many paths through the tree may be  active,  the  \K  escape         5.  Because  many  paths  through the tree may be active, the \K escape
740         sequence, which resets the start of the match when encountered (but may         sequence, which resets the start of the match when encountered (but may
741         be on some paths and not on others), is not  supported.  It  causes  an         be  on  some  paths  and not on others), is not supported. It causes an
742         error if encountered.         error if encountered.
743    
744         6.  Callouts  are  supported, but the value of the capture_top field is         6. Callouts are supported, but the value of the  capture_top  field  is
745         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
746    
747         7. The \C escape sequence, which (in the standard algorithm) matches  a         7.  The \C escape sequence, which (in the standard algorithm) matches a
748         single  byte, even in UTF-8 mode, is not supported because the alterna-         single byte, even in UTF-8 mode, is not supported because the  alterna-
749         tive algorithm moves through the subject  string  one  character  at  a         tive  algorithm  moves  through  the  subject string one character at a
750         time, for all active paths through the tree.         time, for all active paths through the tree.
751    
752         8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
753         are not supported. (*FAIL) is supported, and  behaves  like  a  failing         are  not  supported.  (*FAIL)  is supported, and behaves like a failing
754         negative assertion.         negative assertion.
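
The possessive-quantifier behaviour described in item 1 above can be illustrated without the library. The following is only a sketch: the two functions are hand-written stand-ins for the patterns ^a++\w! and ^a+\w!, their names are hypothetical, and \w is assumed to mean an ASCII letter, digit, or underscore. This is not how PCRE itself matches these patterns.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* \w for ASCII subjects: a letter, a digit, or an underscore (assumption). */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hand-written stand-in for ^a++\w! (hypothetical helper, not a PCRE call).
   The run of 'a's is consumed once and is never given back. */
int match_possessive(const char *s)
{
    size_t n = strspn(s, "a");          /* length of the leading run of 'a's */
    if (n == 0)
        return 0;                       /* a++ needs at least one 'a' */
    return is_word_char((unsigned char)s[n]) && s[n + 1] == '!';
}

/* Hand-written stand-in for ^a+\w! (hypothetical helper, not a PCRE call).
   The longest run is tried first, then one 'a' at a time is given back. */
int match_backtracking(const char *s)
{
    size_t max = strspn(s, "a");
    for (size_t n = max; n >= 1; n--) {
        if (is_word_char((unsigned char)s[n]) && s[n + 1] == '!')
            return 1;
    }
    return 0;
}
```

With these definitions, match_possessive("aaab!") succeeds and match_possessive("aaa!") fails, because the possessive run has already taken the third "a" that \w would need. match_backtracking("aaa!") succeeds by giving that "a" back.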
755    
756    
757  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
758    
759         Using  the alternative matching algorithm provides the following advan-         Using the alternative matching algorithm provides the following  advan-
760         tages:         tages:
761    
762         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
763         ically  found,  and  in particular, the longest match is found. To find         ically found, and in particular, the longest match is  found.  To  find
764         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
765         things with callouts.         things with callouts.
766    
767         2.  There is much better support for partial matching. The restrictions         2. Because the alternative algorithm  scans  the  subject  string  just
768         on the content of the pattern that apply when using the standard  algo-         once,  and  never  needs to backtrack, it is possible to pass very long
769         rithm  for  partial matching do not apply to the alternative algorithm.         subject strings to the matching function in  several  pieces,  checking
        For non-anchored patterns, the starting position of a partial match  is  
        available.  
   
        3.  Because  the  alternative  algorithm  scans the subject string just  
        once, and never needs to backtrack, it is possible to  pass  very  long  
        subject  strings  to  the matching function in several pieces, checking  
770         for partial matching each time.         for partial matching each time.
771    
772    
# Line 766  DISADVANTAGES OF THE ALTERNATIVE ALGORIT Line 774  DISADVANTAGES OF THE ALTERNATIVE ALGORIT
774    
775         The alternative algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
776    
777         1. It is substantially slower than  the  standard  algorithm.  This  is         1.  It  is  substantially  slower  than the standard algorithm. This is
778         partly  because  it has to search for all possible matches, but is also         partly because it has to search for all possible matches, but  is  also
779         because it is less susceptible to optimization.         because it is less susceptible to optimization.
780    
781         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
# Line 785  AUTHOR Line 793  AUTHOR
793    
794  REVISION  REVISION
795    
796         Last updated: 19 April 2008         Last updated: 05 September 2009
797         Copyright (c) 1997-2008 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
798  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
799    
800    
801  PCREAPI(3)                                                          PCREAPI(3)  PCREAPI(3)                                                          PCREAPI(3)
802    
803    
# Line 897  PCRE API OVERVIEW Line 905  PCRE API OVERVIEW
905         pcre_exec() are used for compiling and matching regular expressions  in         pcre_exec() are used for compiling and matching regular expressions  in
906         a  Perl-compatible  manner. A sample program that demonstrates the sim-         a  Perl-compatible  manner. A sample program that demonstrates the sim-
907         plest way of using them is provided in the file  called  pcredemo.c  in         plest way of using them is provided in the file  called  pcredemo.c  in
908         the  source distribution. The pcresample documentation describes how to         the PCRE source distribution. A listing of this program is given in the
909         compile and run it.         pcredemo documentation, and the pcresample documentation describes  how
910           to compile and run it.
911    
912         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
913         ble,  is  also provided. This uses a different algorithm for the match-         ble, is also provided. This uses a different algorithm for  the  match-
914         ing. The alternative algorithm finds all possible matches (at  a  given         ing.  The  alternative algorithm finds all possible matches (at a given
915         point  in  the subject), and scans the subject just once. However, this         point in the subject), and scans the subject just  once  (unless  there
916         algorithm does not return captured substrings. A description of the two         are  lookbehind  assertions).  However,  this algorithm does not return
917         matching  algorithms and their advantages and disadvantages is given in         captured substrings. A description of the two matching  algorithms  and
918         the pcrematching documentation.         their  advantages  and disadvantages is given in the pcrematching docu-
919           mentation.
920    
921         In addition to the main compiling and  matching  functions,  there  are         In addition to the main compiling and  matching  functions,  there  are
922         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
# Line 1133  COMPILING A PATTERN Line 1143  COMPILING A PATTERN
1143    
1144         The options argument contains various bit settings that affect the com-         The options argument contains various bit settings that affect the com-
1145         pilation. It should be zero if no options are required.  The  available         pilation. It should be zero if no options are required.  The  available
1146         options  are  described  below. Some of them, in particular, those that         options  are  described  below. Some of them (in particular, those that
1147         are compatible with Perl, can also be set and  unset  from  within  the         are compatible with Perl, but also some others) can  also  be  set  and
1148         pattern  (see  the  detailed  description in the pcrepattern documenta-         unset  from  within  the  pattern  (see the detailed description in the
1149         tion). For these options, the contents of the options  argument  speci-         pcrepattern documentation). For those options that can be different  in
1150         fies  their initial settings at the start of compilation and execution.         different  parts  of  the pattern, the contents of the options argument
1151         The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the  time         specifies their initial settings at the start of compilation and execu-
1152         of matching as well as at compile time.         tion.  The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
1153           time of matching as well as at compile time.
1154    
1155         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1156         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1157         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
1158         sage. This is a static string that is part of the library. You must not         sage. This is a static string that is part of the library. You must not
1159         try to free it. The offset from the start of the pattern to the charac-         try to free it. The offset from the start of the pattern to the charac-
1160         ter where the error was discovered is placed in the variable pointed to         ter where the error was discovered is placed in the variable pointed to
1161         by  erroffset,  which must not be NULL. If it is, an immediate error is         by erroffset, which must not be NULL. If it is, an immediate  error  is
1162         given.         given.
1163    
1164         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
1165         codeptr  argument is not NULL, a non-zero error code number is returned         codeptr argument is not NULL, a non-zero error code number is  returned
1166         via this argument in the event of an error. This is in addition to  the         via  this argument in the event of an error. This is in addition to the
1167         textual error message. Error codes and messages are listed below.         textual error message. Error codes and messages are listed below.
1168    
1169         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of         If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
1170         character tables that are  built  when  PCRE  is  compiled,  using  the         character  tables  that  are  built  when  PCRE  is compiled, using the
1171         default  C  locale.  Otherwise, tableptr must be an address that is the         default C locale. Otherwise, tableptr must be an address  that  is  the
1172         result of a call to pcre_maketables(). This value is  stored  with  the         result  of  a  call to pcre_maketables(). This value is stored with the
1173         compiled  pattern,  and used again by pcre_exec(), unless another table         compiled pattern, and used again by pcre_exec(), unless  another  table
1174         pointer is passed to it. For more discussion, see the section on locale         pointer is passed to it. For more discussion, see the section on locale
1175         support below.         support below.
1176    
1177         This  code  fragment  shows a typical straightforward call to pcre_com-         This code fragment shows a typical straightforward  call  to  pcre_com-
1178         pile():         pile():
1179    
1180           pcre *re;           pcre *re;
# Line 1176  COMPILING A PATTERN Line 1187  COMPILING A PATTERN
1187             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
1188             NULL);            /* use default character tables */             NULL);            /* use default character tables */
1189    
1190         The following names for option bits are defined in  the  pcre.h  header         The  following  names  for option bits are defined in the pcre.h header
1191         file:         file:
1192    
1193           PCRE_ANCHORED           PCRE_ANCHORED
1194    
1195         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
1196         is constrained to match only at the first matching point in the  string         is  constrained to match only at the first matching point in the string
1197         that  is being searched (the "subject string"). This effect can also be         that is being searched (the "subject string"). This effect can also  be
1198         achieved by appropriate constructs in the pattern itself, which is  the         achieved  by appropriate constructs in the pattern itself, which is the
1199         only way to do it in Perl.         only way to do it in Perl.
1200    
1201           PCRE_AUTO_CALLOUT           PCRE_AUTO_CALLOUT
1202    
1203         If this bit is set, pcre_compile() automatically inserts callout items,         If this bit is set, pcre_compile() automatically inserts callout items,
1204         all with number 255, before each pattern item. For  discussion  of  the         all  with  number  255, before each pattern item. For discussion of the
1205         callout facility, see the pcrecallout documentation.         callout facility, see the pcrecallout documentation.
1206    
1207           PCRE_BSR_ANYCRLF           PCRE_BSR_ANYCRLF
1208           PCRE_BSR_UNICODE           PCRE_BSR_UNICODE
1209    
1210         These options (which are mutually exclusive) control what the \R escape         These options (which are mutually exclusive) control what the \R escape
1211         sequence matches. The choice is either to match only CR, LF,  or  CRLF,         sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1212         or to match any Unicode newline sequence. The default is specified when         or to match any Unicode newline sequence. The default is specified when
1213         PCRE is built. It can be overridden from within the pattern, or by set-         PCRE is built. It can be overridden from within the pattern, or by set-
1214         ting an option when a compiled pattern is matched.         ting an option when a compiled pattern is matched.
1215    
1216           PCRE_CASELESS           PCRE_CASELESS
1217    
1218         If  this  bit is set, letters in the pattern match both upper and lower         If this bit is set, letters in the pattern match both upper  and  lower
1219         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be         case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
1220         changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE         changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
1221         always understands the concept of case for characters whose values  are         always  understands the concept of case for characters whose values are
1222         less  than 128, so caseless matching is always possible. For characters         less than 128, so caseless matching is always possible. For  characters
1223         with higher values, the concept of case is supported if  PCRE  is  com-         with  higher  values,  the concept of case is supported if PCRE is com-
1224         piled  with Unicode property support, but not otherwise. If you want to         piled with Unicode property support, but not otherwise. If you want  to
1225         use caseless matching for characters 128 and  above,  you  must  ensure         use  caseless  matching  for  characters 128 and above, you must ensure
1226         that  PCRE  is  compiled  with Unicode property support as well as with         that PCRE is compiled with Unicode property support  as  well  as  with
1227         UTF-8 support.         UTF-8 support.
1228    
1229           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
1230    
1231         If this bit is set, a dollar metacharacter in the pattern matches  only         If  this bit is set, a dollar metacharacter in the pattern matches only
1232         at  the  end  of the subject string. Without this option, a dollar also         at the end of the subject string. Without this option,  a  dollar  also
1233         matches immediately before a newline at the end of the string (but  not         matches  immediately before a newline at the end of the string (but not
1234         before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored         before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
1235         if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in         if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
1236         Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
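
The rule for where a dollar may match can be written out as a small predicate. This is a sketch under stated assumptions, not code from the library: the function name and its parameters are hypothetical, PCRE_MULTILINE is assumed to be unset, and the newline convention is assumed to be a single LF character.

```c
#include <stddef.h>

/* Returns 1 if a dollar metacharacter could match at offset pos in a subject
   of length len. Hypothetical helper: assumes PCRE_MULTILINE is not set and
   that newline is a single LF. dollar_endonly plays the role of the
   PCRE_DOLLAR_ENDONLY bit. */
int dollar_matches_at(const char *subject, size_t len, size_t pos,
                      int dollar_endonly)
{
    if (pos == len)
        return 1;                 /* the very end of the subject always qualifies */
    if (dollar_endonly)
        return 0;                 /* with the option set, nothing else does */
    /* Without the option, also allow the position just before a final newline. */
    return pos + 1 == len && subject[pos] == '\n';
}
```

For the four-character subject "abc\n", offset 3 qualifies only when dollar_endonly is zero, while offset 4 qualifies in both cases. A newline that is not the last character of the subject never qualifies.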
1237    
1238           PCRE_DOTALL           PCRE_DOTALL
1239    
1240         If this bit is set, a dot metacharacter in the pattern matches all char-         If this bit is set, a dot metacharacter in the pattern matches all char-
1241         acters, including those that indicate newline. Without it, a  dot  does         acters,  including  those that indicate newline. Without it, a dot does
1242         not  match  when  the  current position is at a newline. This option is         not match when the current position is at a  newline.  This  option  is
1243         equivalent to Perl's /s option, and it can be changed within a  pattern         equivalent  to Perl's /s option, and it can be changed within a pattern
1244         by  a (?s) option setting. A negative class such as [^a] always matches         by a (?s) option setting. A negative class such as [^a] always  matches
1245         newline characters, independent of the setting of this option.         newline characters, independent of the setting of this option.
1246    
1247           PCRE_DUPNAMES           PCRE_DUPNAMES
1248    
1249         If this bit is set, names used to identify capturing  subpatterns  need         If  this  bit is set, names used to identify capturing subpatterns need
1250         not be unique. This can be helpful for certain types of pattern when it         not be unique. This can be helpful for certain types of pattern when it
1251         is known that only one instance of the named  subpattern  can  ever  be         is  known  that  only  one instance of the named subpattern can ever be
1252         matched.  There  are  more details of named subpatterns below; see also         matched. There are more details of named subpatterns  below;  see  also
1253         the pcrepattern documentation.         the pcrepattern documentation.
1254    
1255           PCRE_EXTENDED           PCRE_EXTENDED
1256    
1257         If this bit is set, whitespace  data  characters  in  the  pattern  are         If  this  bit  is  set,  whitespace  data characters in the pattern are
1258         totally ignored except when escaped or inside a character class. White-         totally ignored except when escaped or inside a character class. White-
1259         space does not include the VT character (code 11). In addition, charac-         space does not include the VT character (code 11). In addition, charac-
1260         ters between an unescaped # outside a character class and the next new-         ters between an unescaped # outside a character class and the next new-
1261         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x         line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
1262         option,  and  it  can be changed within a pattern by a (?x) option set-         option, and it can be changed within a pattern by a  (?x)  option  set-
1263         ting.         ting.
1264    
1265         This option makes it possible to include  comments  inside  complicated         This  option  makes  it possible to include comments inside complicated
1266         patterns.   Note,  however,  that this applies only to data characters.         patterns.  Note, however, that this applies only  to  data  characters.
1267         Whitespace  characters  may  never  appear  within  special   character         Whitespace   characters  may  never  appear  within  special  character
1268         sequences  in  a  pattern,  for  example  within the sequence (?( which         sequences in a pattern, for  example  within  the  sequence  (?(  which
1269         introduces a conditional subpattern.         introduces a conditional subpattern.
1270    
1271           PCRE_EXTRA           PCRE_EXTRA
1272    
1273         This option was invented in order to turn on  additional  functionality         This  option  was invented in order to turn on additional functionality
1274         of  PCRE  that  is  incompatible with Perl, but it is currently of very         of PCRE that is incompatible with Perl, but it  is  currently  of  very
1275         little use. When set, any backslash in a pattern that is followed by  a         little  use. When set, any backslash in a pattern that is followed by a
1276         letter  that  has  no  special  meaning causes an error, thus reserving         letter that has no special meaning  causes  an  error,  thus  reserving
1277         these combinations for future expansion. By  default,  as  in  Perl,  a         these  combinations  for  future  expansion.  By default, as in Perl, a
1278         backslash  followed by a letter with no special meaning is treated as a         backslash followed by a letter with no special meaning is treated as  a
1279         literal. (Perl can, however, be persuaded to give a warning for  this.)         literal.  (Perl can, however, be persuaded to give a warning for this.)
1280         There  are  at  present no other features controlled by this option. It         There are at present no other features controlled by  this  option.  It
1281         can also be set by a (?X) option setting within a pattern.         can also be set by a (?X) option setting within a pattern.
1282    
1283           PCRE_FIRSTLINE           PCRE_FIRSTLINE
1284    
1285         If this option is set, an  unanchored  pattern  is  required  to  match         If  this  option  is  set,  an  unanchored pattern is required to match
1286         before  or  at  the  first  newline  in  the subject string, though the         before or at the first  newline  in  the  subject  string,  though  the
1287         matched text may continue over the newline.         matched text may continue over the newline.
1288    
1289           PCRE_JAVASCRIPT_COMPAT           PCRE_JAVASCRIPT_COMPAT
1290    
1291         If this option is set, PCRE's behaviour is changed in some ways so that         If this option is set, PCRE's behaviour is changed in some ways so that
1292         it  is  compatible with JavaScript rather than Perl. The changes are as         it is compatible with JavaScript rather than Perl. The changes  are  as
1293         follows:         follows:
1294    
1295         (1) A lone closing square bracket in a pattern  causes  a  compile-time         (1)  A  lone  closing square bracket in a pattern causes a compile-time
1296         error,  because this is illegal in JavaScript (by default it is treated         error, because this is illegal in JavaScript (by default it is  treated
1297         as a data character). Thus, the pattern AB]CD becomes illegal when this         as a data character). Thus, the pattern AB]CD becomes illegal when this
1298         option is set.         option is set.
1299    
1300         (2)  At run time, a back reference to an unset subpattern group matches         (2) At run time, a back reference to an unset subpattern group  matches
1301         an empty string (by default this causes the current  matching  alterna-         an  empty  string (by default this causes the current matching alterna-
1302         tive  to  fail). A pattern such as (\1)(a) succeeds when this option is         tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
1303         set (assuming it can find an "a" in the subject), whereas it  fails  by         set  (assuming  it can find an "a" in the subject), whereas it fails by
1304         default, for Perl compatibility.         default, for Perl compatibility.
1305    
1306           PCRE_MULTILINE           PCRE_MULTILINE
1307    
1308         By  default,  PCRE  treats the subject string as consisting of a single         By default, PCRE treats the subject string as consisting  of  a  single
1309         line of characters (even if it actually contains newlines). The  "start         line  of characters (even if it actually contains newlines). The "start
1310         of  line"  metacharacter  (^)  matches only at the start of the string,         of line" metacharacter (^) matches only at the  start  of  the  string,
1311         while the "end of line" metacharacter ($) matches only at  the  end  of         while  the  "end  of line" metacharacter ($) matches only at the end of
1312         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1313         is set). This is the same as Perl.         is set). This is the same as Perl.
1314    
1315         When PCRE_MULTILINE is set, the "start  of  line"  and  "end  of  line"         When  PCRE_MULTILINE  is  set,  the  "start of line" and  "end of line"
1316         constructs  match  immediately following or immediately before internal         constructs match immediately following or immediately  before  internal
1317         newlines in the subject string, respectively, as well as  at  the  very         newlines  in  the  subject string, respectively, as well as at the very
1318         start  and  end.  This is equivalent to Perl's /m option, and it can be         start and end. This is equivalent to Perl's /m option, and  it  can  be
1319         changed within a pattern by a (?m) option setting. If there are no new-         changed within a pattern by a (?m) option setting. If there are no new-
1320         lines  in  a  subject string, or no occurrences of ^ or $ in a pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1321         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
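
The rule described for PCRE_MULTILINE can be sketched in C as follows.
This is only an illustration of where ^ and $ are allowed to match, not
PCRE's implementation. The function names are hypothetical, only LF is
treated as a newline, PCRE_DOLLAR_ENDONLY is assumed to be unset, and a
newline that ends the subject is assumed not to count as "internal".

```c
#include <stddef.h>

/* Returns 1 if "^" may match at offset pos of subject[0..len).
   Assumption: only '\n' is a newline, and a newline that is the last
   character of the subject is not an internal newline. */
int caret_can_match(const char *subject, size_t len, size_t pos, int multiline)
{
    if (pos > len) return 0;
    if (pos == 0) return 1;            /* very start of the subject */
    if (!multiline) return 0;          /* default: start of subject only */
    return pos < len && subject[pos - 1] == '\n';
}

/* Returns 1 if "$" may match at offset pos of subject[0..len).
   Assumption: PCRE_DOLLAR_ENDONLY is not set. */
int dollar_can_match(const char *subject, size_t len, size_t pos, int multiline)
{
    if (pos > len) return 0;
    if (pos == len) return 1;          /* very end of the subject */
    if (subject[pos] != '\n') return 0;
    if (multiline) return 1;           /* before any newline */
    return pos == len - 1;             /* default: terminating newline only */
}
```

For the subject "ab\ncd", offset 3 (just after the newline) is a valid
position for ^ only when the multiline flag is set, and offset 2 (just
before the newline) is a valid position for $ only when it is set.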
1322    
1323           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 1315  COMPILING A PATTERN Line 1326  COMPILING A PATTERN
1326           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
1327           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1328    
1329         These options override the default newline definition that  was  chosen         These  options  override the default newline definition that was chosen
1330         when  PCRE  was built. Setting the first or the second specifies that a         when PCRE was built. Setting the first or the second specifies  that  a
1331         newline is indicated by a single character (CR  or  LF,  respectively).         newline  is  indicated  by a single character (CR or LF, respectively).
1332         Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1333         two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1334         that any of the three preceding sequences should be recognized. Setting         that any of the three preceding sequences should be recognized. Setting
1335         PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1336         recognized. The Unicode newline sequences are the three just mentioned,         recognized. The Unicode newline sequences are the three just mentioned,
1337         plus the single characters VT (vertical  tab,  U+000B),  FF  (formfeed,         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1338         U+000C),  NEL  (next line, U+0085), LS (line separator, U+2028), and PS         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1339         (paragraph separator, U+2029). The last  two  are  recognized  only  in         (paragraph  separator,  U+2029).  The  last  two are recognized only in
1340         UTF-8 mode.         UTF-8 mode.
1341    
1342         The  newline  setting  in  the  options  word  uses three bits that are         The newline setting in the  options  word  uses  three  bits  that  are
1343         treated as a number, giving eight possibilities. Currently only six are         treated as a number, giving eight possibilities. Currently only six are
1344         used  (default  plus the five values above). This means that if you set         used (default plus the five values above). This means that if  you  set
1345         more than one newline option, the combination may or may not be  sensi-         more  than one newline option, the combination may or may not be sensi-
1346         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1347         PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1348         cause an error.         cause an error.
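
The three-bit encoding can be illustrated with a small C sketch. The
NL_* constants below are stand-ins defined locally for the example; their
values are assumed to match the PCRE_NEWLINE_* definitions in pcre.h, so
check your own header before relying on them. The helper names are
hypothetical.

```c
/* Stand-ins for the PCRE_NEWLINE_* bits (values assumed from pcre.h). */
#define NL_CR       0x00100000UL
#define NL_LF       0x00200000UL
#define NL_CRLF     0x00300000UL
#define NL_ANY      0x00400000UL
#define NL_ANYCRLF  0x00500000UL
#define NL_MASK     0x00700000UL   /* the three newline bits */

/* Extract the newline setting as a number in the range 0..7. */
unsigned int newline_code(unsigned long options)
{
    return (unsigned int)((options & NL_MASK) >> 20);
}

/* Codes 0 (build-time default) to 5 are in use; 6 and 7 are not. */
int newline_code_is_used(unsigned int code)
{
    return code <= 5;
}
```

With these values, NL_CR | NL_LF yields the same code as NL_CRLF, while
NL_LF | NL_ANY yields code 6, one of the unused numbers that would cause
an error.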
1349    
1350         The  only time that a line break is specially recognized when compiling         The only time that a line break is specially recognized when  compiling
1351         a pattern is if PCRE_EXTENDED is set, and  an  unescaped  #  outside  a         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
1352         character  class  is  encountered.  This indicates a comment that lasts         character class is encountered. This indicates  a  comment  that  lasts
1353         until after the next line break sequence. In other circumstances,  line         until  after the next line break sequence. In other circumstances, line
1354         break   sequences   are   treated  as  literal  data,  except  that  in         break  sequences  are  treated  as  literal  data,   except   that   in
1355         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1356         and are therefore ignored.         and are therefore ignored.
1357    
# Line 1350  COMPILING A PATTERN Line 1361  COMPILING A PATTERN
1361           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1362    
1363         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
1364         theses  in the pattern. Any opening parenthesis that is not followed by         theses in the pattern. Any opening parenthesis that is not followed  by
1365         ? behaves as if it were followed by ?: but named parentheses can  still         ?  behaves as if it were followed by ?: but named parentheses can still
1366         be  used  for  capturing  (and  they acquire numbers in the usual way).         be used for capturing (and they acquire  numbers  in  the  usual  way).
1367         There is no equivalent of this option in Perl.         There is no equivalent of this option in Perl.
1368    
1369           PCRE_UNGREEDY           PCRE_UNGREEDY
1370    
1371         This option inverts the "greediness" of the quantifiers  so  that  they         This  option  inverts  the "greediness" of the quantifiers so that they
1372         are  not greedy by default, but become greedy if followed by "?". It is         are not greedy by default, but become greedy if followed by "?". It  is
1373         not compatible with Perl. It can also be set by a (?U)  option  setting         not  compatible  with Perl. It can also be set by a (?U) option setting
1374         within the pattern.         within the pattern.
1375    
1376           PCRE_UTF8           PCRE_UTF8
1377    
1378         This  option  causes PCRE to regard both the pattern and the subject as         This option causes PCRE to regard both the pattern and the  subject  as
1379         strings of UTF-8 characters instead of single-byte  character  strings.         strings  of  UTF-8 characters instead of single-byte character strings.
1380         However,  it is available only when PCRE is built to include UTF-8 sup-         However, it is available only when PCRE is built to include UTF-8  sup-
1381         port. If not, the use of this option provokes an error. Details of  how         port.  If not, the use of this option provokes an error. Details of how
1382         this  option  changes the behaviour of PCRE are given in the section on         this option changes the behaviour of PCRE are given in the  section  on
1383         UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
1384    
1385           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1386    
1387         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1388         automatically  checked.  There  is  a  discussion about the validity of         automatically checked. There is a  discussion  about  the  validity  of
1389         UTF-8 strings in the main pcre page. If an invalid  UTF-8  sequence  of         UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
1390         bytes  is  found,  pcre_compile() returns an error. If you already know         bytes is found, pcre_compile() returns an error. If  you  already  know
1391         that your pattern is valid, and you want to skip this check for perfor-         that your pattern is valid, and you want to skip this check for perfor-
1392         mance  reasons,  you  can set the PCRE_NO_UTF8_CHECK option. When it is         mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
1393         set, the effect of passing an invalid UTF-8  string  as  a  pattern  is         set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
1394         undefined.  It  may  cause your program to crash. Note that this option         undefined. It may cause your program to crash. Note  that  this  option
1395         can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress  the         can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
1396         UTF-8 validity checking of subject strings.         UTF-8 validity checking of subject strings.
1397    
1398    
1399  COMPILATION ERROR CODES  COMPILATION ERROR CODES
1400    
1401         The  following  table  lists  the  error  codes that may be returned by         The following table lists the error  codes  that  may  be  returned  by
1402         pcre_compile2(), along with the error messages that may be returned  by         pcre_compile2(),  along with the error messages that may be returned by
1403         both  compiling functions. As PCRE has developed, some error codes have         both compiling functions. As PCRE has developed, some error codes  have
1404         fallen out of use. To avoid confusion, they have not been re-used.         fallen out of use. To avoid confusion, they have not been re-used.
1405    
1406            0  no error            0  no error
# Line 1445  COMPILATION ERROR CODES Line 1456  COMPILATION ERROR CODES
1456           50  [this code is not in use]           50  [this code is not in use]
1457           51  octal value is greater than \377 (not in UTF-8 mode)           51  octal value is greater than \377 (not in UTF-8 mode)
1458           52  internal error: overran compiling workspace           52  internal error: overran compiling workspace
1459           53  internal  error:  previously-checked  referenced  subpattern  not           53   internal  error:  previously-checked  referenced  subpattern not
1460         found         found
1461           54  DEFINE group contains more than one branch           54  DEFINE group contains more than one branch
1462           55  repeating a DEFINE group is not allowed           55  repeating a DEFINE group is not allowed
# Line 1460  COMPILATION ERROR CODES Line 1471  COMPILATION ERROR CODES
1471           63  digit expected after (?+           63  digit expected after (?+
1472           64  ] is an invalid data character in JavaScript compatibility mode           64  ] is an invalid data character in JavaScript compatibility mode
1473    
1474         The  numbers  32  and 10000 in errors 48 and 49 are defaults; different         The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1475         values may be used if the limits were changed when PCRE was built.         values may be used if the limits were changed when PCRE was built.
1476    
1477    
# Line 1469  STUDYING A PATTERN Line 1480  STUDYING A PATTERN
1480         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1481              const char **errptr);              const char **errptr);
1482    
1483         If a compiled pattern is going to be used several times,  it  is  worth         If  a  compiled  pattern is going to be used several times, it is worth
1484         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
1485         matching. The function pcre_study() takes a pointer to a compiled  pat-         matching.  The function pcre_study() takes a pointer to a compiled pat-
1486         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
1487         information that will help speed up matching,  pcre_study()  returns  a         information  that  will  help speed up matching, pcre_study() returns a
1488         pointer  to a pcre_extra block, in which the study_data field points to         pointer to a pcre_extra block, in which the study_data field points  to
1489         the results of the study.         the results of the study.
1490    
1491         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1492         pcre_exec().  However,  a  pcre_extra  block also contains other fields         pcre_exec(). However, a pcre_extra block  also  contains  other  fields
1493         that can be set by the caller before the block  is  passed;  these  are         that  can  be  set  by the caller before the block is passed; these are
1494         described below in the section on matching a pattern.         described below in the section on matching a pattern.
1495    
1496         If  studying  the  pattern does not produce any additional information,         If studying the pattern does not produce  any  additional  information,
1497         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study() returns NULL. In that circumstance, if the calling program
1498         wants  to  pass  any of the other fields to pcre_exec(), it must set up         wants to pass any of the other fields to pcre_exec(), it  must  set  up
1499         its own pcre_extra block.         its own pcre_extra block.
1500    
1501         The second argument of pcre_study() contains option bits.  At  present,         The  second  argument of pcre_study() contains option bits. At present,
1502         no options are defined, and this argument should always be zero.         no options are defined, and this argument should always be zero.
1503    
1504         The  third argument for pcre_study() is a pointer for an error message.         The third argument for pcre_study() is a pointer for an error  message.
1505         If studying succeeds (even if no data is  returned),  the  variable  it         If  studying  succeeds  (even  if no data is returned), the variable it
1506         points  to  is  set  to NULL. Otherwise it is set to point to a textual         points to is set to NULL. Otherwise it is set to  point  to  a  textual
1507         error message. This is a static string that is part of the library. You         error message. This is a static string that is part of the library. You
1508         must  not  try  to  free it. You should test the error pointer for NULL         must not try to free it. You should test the  error  pointer  for  NULL
1509         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
1510    
1511         This is a typical call to pcre_study():         This is a typical call to pcre_study():
# Line 1506  STUDYING A PATTERN Line 1517  STUDYING A PATTERN
1517             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1518    
1519         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1520         that  do not have a single fixed starting character. A bitmap of possi-         that do not have a single fixed starting character. A bitmap of  possi-
1521         ble starting bytes is created.         ble starting bytes is created.
1522    
1523    
1524  LOCALE SUPPORT  LOCALE SUPPORT
1525    
1526         PCRE handles caseless matching, and determines whether  characters  are         PCRE  handles  caseless matching, and determines whether characters are
1527         letters,  digits, or whatever, by reference to a set of tables, indexed         letters, digits, or whatever, by reference to a set of tables,  indexed
1528         by character value. When running in UTF-8 mode, this  applies  only  to         by  character  value.  When running in UTF-8 mode, this applies only to
1529         characters  with  codes  less than 128. Higher-valued codes never match         characters with codes less than 128. Higher-valued  codes  never  match
1530         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built
1531         with  Unicode  character property support. The use of locales with Uni-         with Unicode character property support. The use of locales  with  Uni-
1532         code is discouraged. If you are handling characters with codes  greater         code  is discouraged. If you are handling characters with codes greater
1533         than  128, you should either use UTF-8 and Unicode, or use locales, but         than 128, you should either use UTF-8 and Unicode, or use locales,  but
1534         not try to mix the two.         not try to mix the two.
1535    
1536         PCRE contains an internal set of tables that are used  when  the  final         PCRE  contains  an  internal set of tables that are used when the final
1537         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
1538         applications.  Normally, the internal tables recognize only ASCII char-         applications.  Normally, the internal tables recognize only ASCII char-
1539         acters. However, when PCRE is built, it is possible to cause the inter-         acters. However, when PCRE is built, it is possible to cause the inter-
1540         nal tables to be rebuilt in the default "C" locale of the local system,         nal tables to be rebuilt in the default "C" locale of the local system,
1541         which may cause them to be different.         which may cause them to be different.
1542    
1543         The  internal tables can always be overridden by tables supplied by the         The internal tables can always be overridden by tables supplied by  the
1544         application that calls PCRE. These may be created in a different locale         application that calls PCRE. These may be created in a different locale
1545         from  the  default.  As more and more applications change to using Uni-         from the default. As more and more applications change  to  using  Uni-
1546         code, the need for this locale support is expected to die away.         code, the need for this locale support is expected to die away.
1547    
1548         External tables are built by calling  the  pcre_maketables()  function,         External  tables  are  built by calling the pcre_maketables() function,
1549         which  has no arguments, in the relevant locale. The result can then be         which has no arguments, in the relevant locale. The result can then  be
1550         passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1551         example,  to  build  and use tables that are appropriate for the French         example, to build and use tables that are appropriate  for  the  French
1552         locale (where accented characters with  values  greater  than  128  are         locale  (where  accented  characters  with  values greater than 128 are
1553         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1554    
1555           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1556           tables = pcre_maketables();           tables = pcre_maketables();
1557           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1558    
1559         The  locale  name "fr_FR" is used on Linux and other Unix-like systems;         The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1560         if you are using Windows, the name for the French locale is "french".         if you are using Windows, the name for the French locale is "french".
1561    
1562         When pcre_maketables() runs, the tables are built  in  memory  that  is         When  pcre_maketables()  runs,  the  tables are built in memory that is
1563         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained via pcre_malloc. It is the caller's responsibility  to  ensure
1564         that the memory containing the tables remains available for as long  as         that  the memory containing the tables remains available for as long as
1565         it is needed.         it is needed.
1566    
1567         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
1568         pattern, and the same tables are used via this pointer by  pcre_study()         pattern,  and the same tables are used via this pointer by pcre_study()
1569         and normally also by pcre_exec(). Thus, by default, for any single pat-         and normally also by pcre_exec(). Thus, by default, for any single pat-
1570         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
1571         but different patterns can be compiled in different locales.         but different patterns can be compiled in different locales.
1572    
1573         It  is  possible to pass a table pointer or NULL (indicating the use of         It is possible to pass a table pointer or NULL (indicating the  use  of
1574         the internal tables) to pcre_exec(). Although  not  intended  for  this         the  internal  tables)  to  pcre_exec(). Although not intended for this
1575         purpose,  this facility could be used to match a pattern in a different         purpose, this facility could be used to match a pattern in a  different
1576         locale from the one in which it was compiled. Passing table pointers at         locale from the one in which it was compiled. Passing table pointers at
1577         run time is discussed below in the section on matching a pattern.         run time is discussed below in the section on matching a pattern.
1578    
# Line 1571  INFORMATION ABOUT A PATTERN Line 1582  INFORMATION ABOUT A PATTERN
1582         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1583              int what, void *where);              int what, void *where);
1584    
1585         The  pcre_fullinfo() function returns information about a compiled pat-         The pcre_fullinfo() function returns information about a compiled  pat-
1586         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1587         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1588    
1589         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
1590         pattern. The second argument is the result of pcre_study(), or NULL  if         pattern.  The second argument is the result of pcre_study(), or NULL if
1591         the  pattern  was not studied. The third argument specifies which piece         the pattern was not studied. The third argument specifies  which  piece
1592         of information is required, and the fourth argument is a pointer  to  a         of  information  is required, and the fourth argument is a pointer to a
1593         variable  to  receive  the  data. The yield of the function is zero for         variable to receive the data. The yield of the  function  is  zero  for
1594         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1595    
1596           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 1587  INFORMATION ABOUT A PATTERN Line 1598  INFORMATION ABOUT A PATTERN
1598           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1599           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1600    
1601         The "magic number" is placed at the start of each compiled  pattern  as         The  "magic  number" is placed at the start of each compiled pattern as
1602         a  simple  check against passing an arbitrary memory pointer. Here is a         a  simple check against passing an arbitrary memory pointer. Here is  a
1603         typical call of pcre_fullinfo(), to obtain the length of  the  compiled         typical  call  of pcre_fullinfo(), to obtain the length of the compiled
1604         pattern:         pattern:
1605    
1606           int rc;           int rc;
# Line 1600  INFORMATION ABOUT A PATTERN Line 1611  INFORMATION ABOUT A PATTERN
1611             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1612             &length);         /* where to put the data */             &length);         /* where to put the data */
1613    
1614         The  possible  values for the third argument are defined in pcre.h, and         The possible values for the third argument are defined in  pcre.h,  and
1615         are as follows:         are as follows:
1616    
1617           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1618    
1619         Return the number of the highest back reference  in  the  pattern.  The         Return  the  number  of  the highest back reference in the pattern. The
1620         fourth  argument  should  point to an int variable. Zero is returned if         fourth argument should point to an int variable. Zero  is  returned  if
1621         there are no back references.         there are no back references.
1622    
1623           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1624    
1625         Return the number of capturing subpatterns in the pattern.  The  fourth         Return  the  number of capturing subpatterns in the pattern. The fourth
1626         argument should point to an int variable.         argument should point to an int variable.
1627    
1628           PCRE_INFO_DEFAULT_TABLES           PCRE_INFO_DEFAULT_TABLES
1629    
1630         Return  a pointer to the internal default character tables within PCRE.         Return a pointer to the internal default character tables within  PCRE.
1631         The fourth argument should point to an unsigned char *  variable.  This         The  fourth  argument should point to an unsigned char * variable. This
1632         information call is provided for internal use by the pcre_study() func-         information call is provided for internal use by the pcre_study() func-
1633         tion. External callers can cause PCRE to use  its  internal  tables  by         tion.  External  callers  can  cause PCRE to use its internal tables by
1634         passing a NULL table pointer.         passing a NULL table pointer.
1635    
1636           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1637    
1638         Return  information  about  the first byte of any matched string, for a         Return information about the first byte of any matched  string,  for  a
1639         non-anchored pattern. The fourth argument should point to an int  vari-         non-anchored  pattern. The fourth argument should point to an int vari-
1640         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name         able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name
1641         is still recognized for backwards compatibility.)         is still recognized for backwards compatibility.)
1642    
1643         If there is a fixed first byte, for example, from  a  pattern  such  as         If  there  is  a  fixed first byte, for example, from a pattern such as
1644         (cat|cow|coyote), its value is returned. Otherwise, if either         (cat|cow|coyote), its value is returned. Otherwise, if either
1645    
1646         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1647         branch starts with "^", or         branch starts with "^", or
1648    
1649         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1650         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1651    
1652         -1  is  returned, indicating that the pattern matches only at the start         -1 is returned, indicating that the pattern matches only at  the  start
1653         of a subject string or after any newline within the  string.  Otherwise         of  a  subject string or after any newline within the string. Otherwise
1654         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
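
The three kinds of value that PCRE_INFO_FIRSTBYTE can yield may be told apart as in the following minimal sketch. It assumes the int has already been filled in by pcre_fullinfo(); describe_firstbyte() is a hypothetical helper for illustration and is not part of PCRE.

```c
/* Classify a value obtained via PCRE_INFO_FIRSTBYTE.
   describe_firstbyte() is a hypothetical helper, not a PCRE function. */
const char *describe_firstbyte(int firstbyte)
{
    if (firstbyte >= 0)
        return "fixed first byte";        /* e.g. 'c' for (cat|cow|coyote) */
    if (firstbyte == -1)
        return "start of subject or after a newline";
    return "no first-byte information";   /* -2, also for anchored patterns */
}
```

A caller would typically test for a non-negative value before treating the result as a byte.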
1655    
1656           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1657    
1658         If  the pattern was studied, and this resulted in the construction of a         If the pattern was studied, and this resulted in the construction of  a
1659         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1660         matching  string, a pointer to the table is returned. Otherwise NULL is         matching string, a pointer to the table is returned. Otherwise NULL  is
1661         returned. The fourth argument should point to an unsigned char *  vari-         returned.  The fourth argument should point to an unsigned char * vari-
1662         able.         able.
1663    
1664           PCRE_INFO_HASCRORLF           PCRE_INFO_HASCRORLF
1665    
1666         Return  1  if  the  pattern  contains any explicit matches for CR or LF         Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
1667         characters, otherwise 0. The fourth argument should  point  to  an  int         characters,  otherwise  0.  The  fourth argument should point to an int
1668         variable.  An explicit match is either a literal CR or LF character, or         variable. An explicit match is either a literal CR or LF character,  or
1669         \r or \n.         \r or \n.
1670    
1671           PCRE_INFO_JCHANGED           PCRE_INFO_JCHANGED
1672    
1673         Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,         Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
1674         otherwise  0. The fourth argument should point to an int variable. (?J)         otherwise 0. The fourth argument should point to an int variable.  (?J)
1675         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1676    
1677           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1678    
1679         Return the value of the rightmost literal byte that must exist  in  any         Return  the  value of the rightmost literal byte that must exist in any
1680         matched  string,  other  than  at  its  start,  if such a byte has been         matched string, other than at its  start,  if  such  a  byte  has  been
1681         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
1682         is  no such byte, -1 is returned. For anchored patterns, a last literal         is no such byte, -1 is returned. For anchored patterns, a last  literal
1683         byte is recorded only if it follows something of variable  length.  For         byte  is  recorded only if it follows something of variable length. For
1684         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1685         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
1686    
# Line 1677  INFORMATION ABOUT A PATTERN Line 1688  INFORMATION ABOUT A PATTERN
1688           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
1689           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
1690    
1691         PCRE supports the use of named as well as numbered capturing  parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
1692         ses.  The names are just an additional way of identifying the parenthe-         ses. The names are just an additional way of identifying the  parenthe-
1693         ses, which still acquire numbers. Several convenience functions such as         ses, which still acquire numbers. Several convenience functions such as
1694         pcre_get_named_substring()  are  provided  for extracting captured sub-         pcre_get_named_substring() are provided for  extracting  captured  sub-
1695         strings by name. It is also possible to extract the data  directly,  by         strings  by  name. It is also possible to extract the data directly, by
1696         first  converting  the  name to a number in order to access the correct         first converting the name to a number in order to  access  the  correct
1697         pointers in the output vector (described with pcre_exec() below). To do         pointers in the output vector (described with pcre_exec() below). To do
1698         the  conversion,  you  need  to  use  the  name-to-number map, which is         the conversion, you need  to  use  the  name-to-number  map,  which  is
1699         described by these three values.         described by these three values.
1700    
1701         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1702         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1703         of each entry; both of these  return  an  int  value.  The  entry  size         of  each  entry;  both  of  these  return  an int value. The entry size
1704         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
1705         a pointer to the first entry of the table  (a  pointer  to  char).  The         a  pointer  to  the  first  entry of the table (a pointer to char). The
1706         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1707         sis, most significant byte first. The rest of the entry is  the  corre-         sis,  most  significant byte first. The rest of the entry is the corre-
1708         sponding  name,  zero  terminated. The names are in alphabetical order.         sponding name, zero terminated. The names are  in  alphabetical  order.
1709         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1710         theses  numbers.  For  example,  consider the following pattern (assume         theses numbers. For example, consider  the  following  pattern  (assume
1711         PCRE_EXTENDED is  set,  so  white  space  -  including  newlines  -  is         PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1712         ignored):         ignored):
1713    
1714           (?<date> (?<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
1715           (?<month>\d\d) - (?<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
1716    
1717         There  are  four  named subpatterns, so the table has four entries, and         There are four named subpatterns, so the table has  four  entries,  and
1718         each entry in the table is eight bytes long. The table is  as  follows,         each  entry  in the table is eight bytes long. The table is as follows,
1719         with non-printing bytes shown in hexadecimal, and undefined bytes shown         with non-printing bytes shown in hexadecimal, and undefined bytes shown
1720         as ??:         as ??:
1721    
# Line 1713  INFORMATION ABOUT A PATTERN Line 1724  INFORMATION ABOUT A PATTERN
1724           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1725           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1726    
1727         When writing code to extract data  from  named  subpatterns  using  the         When  writing  code  to  extract  data from named subpatterns using the
1728         name-to-number  map,  remember that the length of the entries is likely         name-to-number map, remember that the length of the entries  is  likely
1729         to be different for each compiled pattern.         to be different for each compiled pattern.
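
As a rough illustration of walking the name-to-number map, the sketch below looks up a name in a table with the layout described above. The helper name_to_number() and the array date_table are assumptions made for this example: in a real program the pointer, entry count and entry size would come from pcre_fullinfo() with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE, and the pointer (a pointer to char) would be cast to unsigned char. The sample table is built by hand from the date pattern, with undefined bytes set to zero.

```c
#include <string.h>

/* Look up a name in a name-to-number table: `count` entries of
   `entry_size` bytes each. The first two bytes of an entry hold the
   group number, most significant byte first; the rest is the
   zero-terminated name. Returns the group number, or -1 if the name
   is not present. Hypothetical helper, not part of PCRE. */
int name_to_number(const unsigned char *table, int count, int entry_size,
                   const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Hand-built table for the date pattern: four entries of eight bytes. */
static const unsigned char date_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

The search here is linear for simplicity. Because the names are stored in alphabetical order, a binary search would also work. When PCRE_DUPNAMES is in use, this sketch reports only the first entry for a duplicated name.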
1730    
1731           PCRE_INFO_OKPARTIAL           PCRE_INFO_OKPARTIAL
1732    
1733         Return 1 if the pattern can be used for partial matching, otherwise  0.         Return  1  if  the  pattern  can  be  used  for  partial  matching with
1734         The  fourth  argument  should point to an int variable. The pcrepartial         pcre_exec(), otherwise 0. The fourth argument should point  to  an  int
1735         documentation lists the restrictions that apply to patterns  when  par-         variable.  From  release  8.00,  this  always  returns  1,  because the
1736         tial matching is used.         restrictions that previously applied  to  partial  matching  have  been
1737           lifted.  The  pcrepartial documentation gives details of partial match-
1738           ing.
1739    
1740           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1741    
1742         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1743         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1744         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1745         by any top-level option settings at the start of the pattern itself. In         by any top-level option settings at the start of the pattern itself. In
1746         other  words,  they are the options that will be in force when matching         other words, they are the options that will be in force  when  matching
1747         starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with         starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1748         the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,         the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1749         and PCRE_EXTENDED.         and PCRE_EXTENDED.
1750    
1751         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1752         alternatives begin with one of the following:         alternatives begin with one of the following:
1753    
1754           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1749  INFORMATION ABOUT A PATTERN Line 1762  INFORMATION ABOUT A PATTERN
1762    
1763           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1764    
1765         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1766         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1767         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1768         size_t variable.         size_t variable.
# Line 1757  INFORMATION ABOUT A PATTERN Line 1770  INFORMATION ABOUT A PATTERN
1770           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1771    
1772         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1773         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1774         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1775         created by pcre_study(). The fourth argument should point to  a  size_t         created  by  pcre_study(). The fourth argument should point to a size_t
1776         variable.         variable.
1777    
1778    
# Line 1767  OBSOLETE INFO FUNCTION Line 1780  OBSOLETE INFO FUNCTION
1780    
1781         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1782    
1783         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1784         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1785         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1786         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1787         lowing negative numbers:         lowing negative numbers:
1788    
1789           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1790           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1791    
1792         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1793         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1794         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1795    
1796         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1797         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1798         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1799    
1800    
# Line 1789  REFERENCE COUNTS Line 1802  REFERENCE COUNTS
1802    
1803         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1804    
1805         The  pcre_refcount()  function is used to maintain a reference count in         The pcre_refcount() function is used to maintain a reference  count  in
1806         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1807         benefit  of  applications  that  operate  in an object-oriented manner,         benefit of applications that  operate  in  an  object-oriented  manner,
1808         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1809         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1810    
1811         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1812         zero.  It is changed only by calling this function, whose action is  to         zero.   It is changed only by calling this function, whose action is to
1813         add  the  adjust  value  (which may be positive or negative) to it. The         add the adjust value (which may be positive or  negative)  to  it.  The
1814         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1815         is  constrained to lie between 0 and 65535, inclusive. If the new value         is constrained to lie between 0 and 65535, inclusive. If the new  value
1816         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
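
The arithmetic described above can be modelled as follows. This is only a sketch of the documented behaviour, not PCRE's own code; adjusted_refcount() is a hypothetical name, and the current count is assumed to lie in the range 0 to 65535 already.

```c
/* Model of the documented rule: add `adjust` to the current count and
   force the result into the range 0..65535. The comparisons are
   arranged so that no intermediate sum can overflow an int. */
int adjusted_refcount(int current, int adjust)
{
    if (adjust > 65535 - current)
        return 65535;              /* forced to the upper limit */
    if (adjust < -current)
        return 0;                  /* forced to the lower limit */
    return current + adjust;
}
```

In particular, an adjust value of zero leaves the count unchanged, which suggests that pcre_refcount(code, 0) can be used to read the current count.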
1817    
1818         Except when it is zero, the reference count is not correctly  preserved         Except  when it is zero, the reference count is not correctly preserved
1819         if  a  pattern  is  compiled on one host and then transferred to a host         if a pattern is compiled on one host and then  transferred  to  a  host
1820         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
1821    
1822    
# Line 1813  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1826  MATCHING A PATTERN: THE TRADITIONAL FUNC
1826              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1827              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1828    
1829         The function pcre_exec() is called to match a subject string against  a         The  function pcre_exec() is called to match a subject string against a
1830         compiled  pattern, which is passed in the code argument. If the pattern         compiled pattern, which is passed in the code argument. If the  pattern
1831         has been studied, the result of the study should be passed in the extra         has been studied, the result of the study should be passed in the extra
1832         argument.  This  function is the main matching facility of the library,         argument. This function is the main matching facility of  the  library,
1833         and it operates in a Perl-like manner. For specialist use there is also         and it operates in a Perl-like manner. For specialist use there is also
1834         an  alternative matching function, which is described below in the sec-         an alternative matching function, which is described below in the  sec-
1835         tion about the pcre_dfa_exec() function.         tion about the pcre_dfa_exec() function.
1836    
1837         In most applications, the pattern will have been compiled (and  option-         In  most applications, the pattern will have been compiled (and option-
1838         ally  studied)  in the same process that calls pcre_exec(). However, it         ally studied) in the same process that calls pcre_exec().  However,  it
1839         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1840         later  in  different processes, possibly even on different hosts. For a         later in different processes, possibly even on different hosts.  For  a
1841         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1842    
1843         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1843  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1856  MATCHING A PATTERN: THE TRADITIONAL FUNC
1856    
1857     Extra data for pcre_exec()     Extra data for pcre_exec()
1858    
1859         If the extra argument is not NULL, it must point to a  pcre_extra  data         If  the  extra argument is not NULL, it must point to a pcre_extra data
1860         block.  The pcre_study() function returns such a block (when it doesn't         block. The pcre_study() function returns such a block (when it  doesn't
1861         return NULL), but you can also create one for yourself, and pass  addi-         return  NULL), but you can also create one for yourself, and pass addi-
1862         tional  information  in it. The pcre_extra block contains the following         tional information in it. The pcre_extra block contains  the  following
1863         fields (not necessarily in this order):         fields (not necessarily in this order):
1864    
1865           unsigned long int flags;           unsigned long int flags;
# Line 1856  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1869  MATCHING A PATTERN: THE TRADITIONAL FUNC
1869           void *callout_data;           void *callout_data;
1870           const unsigned char *tables;           const unsigned char *tables;
1871    
1872         The flags field is a bitmap that specifies which of  the  other  fields         The  flags  field  is a bitmap that specifies which of the other fields
1873         are set. The flag bits are:         are set. The flag bits are:
1874    
1875           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
# Line 1865  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1878  MATCHING A PATTERN: THE TRADITIONAL FUNC
1878           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1879           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1880    
1881         Other  flag  bits should be set to zero. The study_data field is set in         Other flag bits should be set to zero. The study_data field is  set  in
1882         the pcre_extra block that is returned by  pcre_study(),  together  with         the  pcre_extra  block  that is returned by pcre_study(), together with
1883         the appropriate flag bit. You should not set this yourself, but you may         the appropriate flag bit. You should not set this yourself, but you may
1884         add to the block by setting the other fields  and  their  corresponding         add  to  the  block by setting the other fields and their corresponding
1885         flag bits.         flag bits.
1886    
1887         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1888         a vast amount of resources when running patterns that are not going  to         a  vast amount of resources when running patterns that are not going to
1889         match,  but  which  have  a very large number of possibilities in their         match, but which have a very large number  of  possibilities  in  their
1890         search trees. The classic  example  is  the  use  of  nested  unlimited         search  trees.  The  classic  example  is  the  use of nested unlimited
1891         repeats.         repeats.
1892    
1893         Internally,  PCRE uses a function called match() which it calls repeat-         Internally, PCRE uses a function called match() which it calls  repeat-
1894         edly (sometimes recursively). The limit set by match_limit  is  imposed         edly  (sometimes  recursively). The limit set by match_limit is imposed
1895         on  the  number  of times this function is called during a match, which         on the number of times this function is called during  a  match,  which
1896         has the effect of limiting the amount of  backtracking  that  can  take         has  the  effect  of  limiting the amount of backtracking that can take
1897         place. For patterns that are not anchored, the count restarts from zero         place. For patterns that are not anchored, the count restarts from zero
1898         for each position in the subject string.         for each position in the subject string.
1899    
1900         The default value for the limit can be set  when  PCRE  is  built;  the         The  default  value  for  the  limit can be set when PCRE is built; the
1901         default  default  is 10 million, which handles all but the most extreme         default default is 10 million, which handles all but the  most  extreme
1902         cases. You can override the default  by  supplying pcre_exec()  with  a         cases.  You  can  override  the default by supplying pcre_exec() with a
1903         pcre_extra     block    in    which    match_limit    is    set,    and         pcre_extra    block    in    which    match_limit    is    set,     and
1904         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
1905         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1906    
1907         The  match_limit_recursion field is similar to match_limit, but instead         The match_limit_recursion field is similar to match_limit, but  instead
1908         of limiting the total number of times that match() is called, it limits         of limiting the total number of times that match() is called, it limits
1909         the  depth  of  recursion. The recursion depth is a smaller number than         the depth of recursion. The recursion depth is a  smaller  number  than
1910         the total number of calls, because not all calls to match() are  recur-         the  total number of calls, because not all calls to match() are recur-
1911         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
1912    
1913         Limiting  the  recursion  depth  limits the amount of stack that can be         Limiting the recursion depth limits the amount of  stack  that  can  be
1914         used, or, when PCRE has been compiled to use memory on the heap instead         used, or, when PCRE has been compiled to use memory on the heap instead
1915         of the stack, the amount of heap memory that can be used.         of the stack, the amount of heap memory that can be used.
1916    
1917         The  default  value  for  match_limit_recursion can be set when PCRE is         The default value for match_limit_recursion can be  set  when  PCRE  is
1918         built; the default default  is  the  same  value  as  the  default  for         built;  the  default  default  is  the  same  value  as the default for
1919         match_limit. You can override the default by supplying pcre_exec() with         match_limit. You can override the default by supplying pcre_exec() with
1920         a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and         a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1921         PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1922         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1923    
1924         The pcre_callout field is used in conjunction with the  "callout"  fea-         The  callout_data  field is used in conjunction with the "callout" fea-
1925         ture, which is described in the pcrecallout documentation.         ture, and is described in the pcrecallout documentation.
1926    
1927         The  tables  field  is  used  to  pass  a  character  tables pointer to         The tables field  is  used  to  pass  a  character  tables  pointer  to
1928         pcre_exec(); this overrides the value that is stored with the  compiled         pcre_exec();  this overrides the value that is stored with the compiled
1929         pattern.  A  non-NULL value is stored with the compiled pattern only if         pattern. A non-NULL value is stored with the compiled pattern  only  if
1930         custom tables were supplied to pcre_compile() via  its  tableptr  argu-         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1931         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1932         PCRE's internal tables to be used. This facility is  helpful  when  re-         PCRE's  internal  tables  to be used. This facility is helpful when re-
1933         using  patterns  that  have been saved after compiling with an external         using patterns that have been saved after compiling  with  an  external
1934         set of tables, because the external tables  might  be  at  a  different         set  of  tables,  because  the  external tables might be at a different
1935         address  when  pcre_exec() is called. See the pcreprecompile documenta-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
1936         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
1937    
1938     Option bits for pcre_exec()     Option bits for pcre_exec()
1939    
1940         The unused bits of the options argument for pcre_exec() must  be  zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
1941         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1942         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,    PCRE_NO_START_OPTIMIZE,         PCRE_NOTBOL,    PCRE_NOTEOL,    PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,
1943         PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.         PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_SOFT,   and
1944           PCRE_PARTIAL_HARD.
1945    
1946           PCRE_ANCHORED           PCRE_ANCHORED
1947    
# Line 2007  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2021  MATCHING A PATTERN: THE TRADITIONAL FUNC
2021    
2022           a?b?           a?b?
2023    
2024         is applied to a string not beginning with "a" or "b",  it  matches  the         is applied to a string not beginning with "a" or  "b",  it  matches  an
2025         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
2026         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
2027         rences of "a" or "b".         rences of "a" or "b".
2028    
2029         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-           PCRE_NOTEMPTY_ATSTART
2030         cial case of a pattern match of the empty  string  within  its  split()  
2031         function,  and  when  using  the /g modifier. It is possible to emulate         This  is  like PCRE_NOTEMPTY, except that an empty string match that is
2032         Perl's behaviour after matching a null string by first trying the match         not at the start of  the  subject  is  permitted.  If  the  pattern  is
2033         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         anchored, such a match can occur only if the pattern contains \K.
2034         if that fails by advancing the starting offset (see below)  and  trying  
2035         an ordinary match again. There is some code that demonstrates how to do         Perl     has    no    direct    equivalent    of    PCRE_NOTEMPTY    or
2036         this in the pcredemo.c sample program.         PCRE_NOTEMPTY_ATSTART, but it does make a special  case  of  a  pattern
2037           match  of  the empty string within its split() function, and when using
2038           the /g modifier. It is  possible  to  emulate  Perl's  behaviour  after
2039           matching a null string by first trying the match again at the same off-
2040           set with PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED,  and  then  if  that
2041           fails, by advancing the starting offset (see below) and trying an ordi-
2042           nary match again. There is some code that demonstrates how to  do  this
2043           in the pcredemo sample program.
2044    
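The retry-then-advance procedure described above can be sketched as follows. This is only an illustration of the control flow, not the code from pcredemo: the OPT_ flags, the match_fn callback type, and the toy matcher match_a_star() (which behaves like the pattern a*) are all hypothetical stand-ins, so that the loop can be compiled without the library. A real program would call pcre_exec() with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PCRE_ANCHORED and PCRE_NOTEMPTY_ATSTART. */
#define OPT_ANCHORED         0x1
#define OPT_NOTEMPTY_ATSTART 0x2

/* Hypothetical matcher interface: try to match at or after startoffset.
   On success, return 1 and store the match bounds in span[0] and span[1];
   otherwise return 0. A real program would call pcre_exec() here. */
typedef int (*match_fn)(const char *subject, int length, int startoffset,
                        int options, int span[2]);

/* Toy matcher that behaves like the pattern a*. */
int match_a_star(const char *subject, int length, int startoffset,
                 int options, int span[2])
{
    for (int pos = startoffset; pos <= length; pos++) {
        int end = pos;
        while (end < length && subject[end] == 'a')
            end++;
        int empty_at_start = (end == pos && pos == startoffset);
        if (!(empty_at_start && (options & OPT_NOTEMPTY_ATSTART))) {
            span[0] = pos;
            span[1] = end;
            return 1;
        }
        if (options & OPT_ANCHORED)
            break;          /* anchored: only the first position is tried */
    }
    return 0;
}

/* Find successive matches in the manner of Perl's /g modifier. Start/end
   pairs are written to spans; the number of matches (at most max) is
   returned. */
int find_all(match_fn match, const char *subject, int length,
             int spans[][2], int max)
{
    assert(match != NULL && subject != NULL);
    int count = 0;
    int offset = 0;
    int after_empty = 0;    /* the previous match was the empty string */

    while (count < max) {
        int options = 0;
        if (after_empty) {
            if (offset == length)
                break;      /* empty match at the very end: finished */
            /* First retry at the same offset, refusing an empty match. */
            options = OPT_NOTEMPTY_ATSTART | OPT_ANCHORED;
        }
        int span[2];
        if (!match(subject, length, offset, options, span)) {
            if (options == 0)
                break;      /* ordinary failure: no more matches */
            offset++;       /* the retry failed: advance and try normally */
            after_empty = 0;
            continue;
        }
        spans[count][0] = span[0];
        spans[count][1] = span[1];
        count++;
        offset = span[1];
        after_empty = (span[0] == span[1]);
    }
    return count;
}
```

The sketch advances by one byte after a failed retry. In UTF-8 mode a real program would have to advance by a whole character, so that startoffset still points to the start of a character.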
2045           PCRE_NO_START_OPTIMIZE           PCRE_NO_START_OPTIMIZE
2046    
2047         There are a number of optimizations that pcre_exec() uses at the  start         There  are a number of optimizations that pcre_exec() uses at the start
2048         of  a  match,  in  order to speed up the process. For example, if it is         of a match, in order to speed up the process. For  example,  if  it  is
2049         known that a match must start with a specific  character,  it  searches         known  that  a  match must start with a specific character, it searches
2050         the subject for that character, and fails immediately if it cannot find         the subject for that character, and fails immediately if it cannot find
2051         it, without actually running the main matching function. When  callouts         it,  without actually running the main matching function. When callouts
2052         are  in  use,  these  optimizations  can cause them to be skipped. This         are in use, these optimizations can cause  them  to  be  skipped.  This
2053         option disables the "start-up" optimizations,  causing  performance  to         option  disables  the  "start-up" optimizations, causing performance to
2054         suffer, but ensuring that the callouts do occur.         suffer, but ensuring that the callouts do occur.
2055    
2056           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
2057    
2058         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
2059         UTF-8 string is automatically checked when pcre_exec() is  subsequently         UTF-8  string is automatically checked when pcre_exec() is subsequently
2060         called.   The  value  of  startoffset is also checked to ensure that it         called.  The value of startoffset is also checked  to  ensure  that  it
2061         points to the start of a UTF-8 character. There is a  discussion  about         points  to  the start of a UTF-8 character. There is a discussion about
2062         the  validity  of  UTF-8 strings in the section on UTF-8 support in the         the validity of UTF-8 strings in the section on UTF-8  support  in  the
2063         main pcre page. If  an  invalid  UTF-8  sequence  of  bytes  is  found,         main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
2064         pcre_exec()  returns  the error PCRE_ERROR_BADUTF8. If startoffset con-         pcre_exec() returns the error PCRE_ERROR_BADUTF8. If  startoffset  con-
2065         tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.         tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
2066    
2067         If you already know that your subject is valid, and you  want  to  skip         If  you  already  know that your subject is valid, and you want to skip
2068         these    checks    for   performance   reasons,   you   can   set   the         these   checks   for   performance   reasons,   you   can    set    the
2069         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
2070         do  this  for the second and subsequent calls to pcre_exec() if you are         do this for the second and subsequent calls to pcre_exec() if  you  are
2071         making repeated calls to find all  the  matches  in  a  single  subject         making  repeated  calls  to  find  all  the matches in a single subject
2072         string.  However,  you  should  be  sure  that the value of startoffset         string. However, you should be  sure  that  the  value  of  startoffset
2073         points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
2074         set,  the  effect of passing an invalid UTF-8 string as a subject, or a         set, the effect of passing an invalid UTF-8 string as a subject,  or  a
2075         value of startoffset that does not point to the start of a UTF-8  char-         value  of startoffset that does not point to the start of a UTF-8 char-
2076         acter, is undefined. Your program may crash.         acter, is undefined. Your program may crash.
2077    
2078           PCRE_PARTIAL           PCRE_PARTIAL_HARD
2079             PCRE_PARTIAL_SOFT
2080    
2081         This  option  turns  on  the  partial  matching feature. If the subject         These options turn on the partial matching feature. For backwards  com-
2082         string fails to match the pattern, but at some point during the  match-         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2083         ing  process  the  end of the subject was reached (that is, the subject         match occurs if the end of the subject string is reached  successfully,
2084         partially matches the pattern and the failure to  match  occurred  only         but  there  are not enough subject characters to complete the match. If
2085         because  there were not enough subject characters), pcre_exec() returns         this happens when PCRE_PARTIAL_HARD  is  set,  pcre_exec()  immediately
2086         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is         returns  PCRE_ERROR_PARTIAL.  Otherwise,  if  PCRE_PARTIAL_SOFT is set,
2087         used,  there  are restrictions on what may appear in the pattern. These         matching continues by testing any other alternatives. Only if they  all
2088         are discussed in the pcrepartial documentation.         fail  is  PCRE_ERROR_PARTIAL  returned (instead of PCRE_ERROR_NOMATCH).
2089           The portion of the string that was inspected when the partial match was
2090           found  is  set  as  the first matching string. There is a more detailed
2091           discussion in the pcrepartial documentation.
2092    
2093     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2094    
# Line 2249  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2274  MATCHING A PATTERN: THE TRADITIONAL FUNC
2274    
2275           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2276    
2277         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing         This code is no longer in  use.  It  was  formerly  returned  when  the
2278         items  that are not supported for partial matching. See the pcrepartial         PCRE_PARTIAL  option  was used with a compiled pattern containing items
2279         documentation for details of partial matching.         that were  not  supported  for  partial  matching.  From  release  8.00
2280           onwards, there are no restrictions on partial matching.
2281    
2282           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2283    
2284         An unexpected internal error has occurred. This error could  be  caused         An  unexpected  internal error has occurred. This error could be caused
2285         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2286    
2287           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
# Line 2265  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2291  MATCHING A PATTERN: THE TRADITIONAL FUNC
2291           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
2292    
2293         The internal recursion limit, as specified by the match_limit_recursion         The internal recursion limit, as specified by the match_limit_recursion
2294         field in a pcre_extra structure (or defaulted)  was  reached.  See  the         field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2295         description above.         description above.
2296    
2297           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
# Line 2288  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2314  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2314         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2315              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2316    
2317         Captured  substrings  can  be  accessed  directly  by using the offsets         Captured substrings can be  accessed  directly  by  using  the  offsets
2318         returned by pcre_exec() in  ovector.  For  convenience,  the  functions         returned  by  pcre_exec()  in  ovector.  For convenience, the functions
2319         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2320         string_list() are provided for extracting captured substrings  as  new,         string_list()  are  provided for extracting captured substrings as new,
2321         separate,  zero-terminated strings. These functions identify substrings         separate, zero-terminated strings. These functions identify  substrings
2322         by number. The next section describes functions  for  extracting  named         by  number.  The  next section describes functions for extracting named
2323         substrings.         substrings.
2324    
2325         A  substring that contains a binary zero is correctly extracted and has         A substring that contains a binary zero is correctly extracted and  has
2326         a further zero added on the end, but the result is not, of course, a  C         a  further zero added on the end, but the result is not, of course, a C
2327         string.   However,  you  can  process such a string by referring to the         string.  However, you can process such a string  by  referring  to  the
2328         length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-         length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
2329         string().  Unfortunately, the interface to pcre_get_substring_list() is         string().  Unfortunately, the interface to pcre_get_substring_list() is
2330         not adequate for handling strings containing binary zeros, because  the         not  adequate for handling strings containing binary zeros, because the
2331         end of the final string is not independently indicated.         end of the final string is not independently indicated.
2332    
2333         The  first  three  arguments  are the same for all three of these func-         The first three arguments are the same for all  three  of  these  func-
2334         tions: subject is the subject string that has  just  been  successfully         tions:  subject  is  the subject string that has just been successfully
2335         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2336         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2337         were  captured  by  the match, including the substring that matched the         were captured by the match, including the substring  that  matched  the
2338         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
2339         it  is greater than zero. If pcre_exec() returned zero, indicating that         it is greater than zero. If pcre_exec() returned zero, indicating  that
2340         it ran out of space in ovector, the value passed as stringcount  should         it  ran out of space in ovector, the value passed as stringcount should
2341         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
2342    
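The rule for choosing stringcount can be written as a small helper. This is a minimal sketch; substring_count() is a hypothetical name, rc is assumed to be the value returned by pcre_exec(), and ovecsize is assumed to be the number of int elements in the vector.

```c
#include <assert.h>

/* Return the value to pass as stringcount to the substring functions.
   rc is the yield of pcre_exec(); ovecsize is the number of elements
   in ovector. */
int substring_count(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;              /* the number of substrings that were set */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small: use its capacity */
    return 0;                   /* negative yield: no match, or an error */
}
```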
2343         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The functions pcre_copy_substring() and pcre_get_substring() extract  a
2344         single substring, whose number is given as  stringnumber.  A  value  of         single  substring,  whose  number  is given as stringnumber. A value of
2345         zero  extracts  the  substring that matched the entire pattern, whereas         zero extracts the substring that matched the  entire  pattern,  whereas
2346         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
2347         string(),  the  string  is  placed  in buffer, whose length is given by         string(), the string is placed in buffer,  whose  length  is  given  by
2348         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize,  while  for  pcre_get_substring()  a new block of memory is
2349         obtained  via  pcre_malloc,  and its address is returned via stringptr.         obtained via pcre_malloc, and its address is  returned  via  stringptr.
2350         The yield of the function is the length of the  string,  not  including         The  yield  of  the function is the length of the string, not including
2351         the terminating zero, or one of these error codes:         the terminating zero, or one of these error codes:
2352    
2353           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2354    
2355         The  buffer  was too small for pcre_copy_substring(), or the attempt to         The buffer was too small for pcre_copy_substring(), or the  attempt  to
2356         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2357    
2358           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2359    
2360         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
2361    
2362         The pcre_get_substring_list()  function  extracts  all  available  sub-         The  pcre_get_substring_list()  function  extracts  all  available sub-
2363         strings  and  builds  a list of pointers to them. All this is done in a         strings and builds a list of pointers to them. All this is  done  in  a
2364         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
2365         the  memory  block  is returned via listptr, which is also the start of         the memory block is returned via listptr, which is also  the  start  of
2366         the list of string pointers. The end of the list is marked  by  a  NULL         the  list  of  string pointers. The end of the list is marked by a NULL
2367         pointer.  The  yield  of  the function is zero if all went well, or the         pointer. The yield of the function is zero if all  went  well,  or  the
2368         error code         error code
2369    
2370           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2371    
2372         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
2373    
2374         When any of these functions encounter a substring that is unset,  which         When  any of these functions encounter a substring that is unset, which
2375         can  happen  when  capturing subpattern number n+1 matches some part of         can happen when capturing subpattern number n+1 matches  some  part  of
2376         the subject, but subpattern n has not been used at all, they return  an         the  subject, but subpattern n has not been used at all, they return an
2377         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
2378         string by inspecting the appropriate offset in ovector, which is  nega-         string  by inspecting the appropriate offset in ovector, which is nega-
2379         tive for unset substrings.         tive for unset substrings.
2380    
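As an illustration of direct access, and of telling an unset substring from an empty one, the following sketch copies substring n by reading its pair of offsets. It assumes the layout documented for pcre_exec(), in which ovector[2*n] and ovector[2*n+1] hold the start and end offsets of substring n. The function name copy_capture() and its return codes are hypothetical, and are not those of pcre_copy_substring().

```c
#include <assert.h>
#include <string.h>

/* Copy substring number n out of subject, using the offsets in ovector.
   Returns the length of the substring, -1 if n is out of range or the
   buffer is too small, and -2 if the substring is unset. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int n, char *buffer, int buffersize)
{
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (n < 0 || n >= stringcount)
        return -1;

    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    if (start < 0 || end < 0)
        return -2;              /* negative offsets: the substring is unset */

    int length = end - start;
    if (length + 1 > buffersize)
        return -1;              /* no room for the string and its zero */
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

For example, if a pattern such as (a)?(b) is matched against the subject "b", the first subpattern is unset, and the offsets {0, 1, -1, -1, 0, 1} with a stringcount of 3 would give -2 for substring 1 and the string "b" for substrings 0 and 2.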
2381         The  two convenience functions pcre_free_substring() and pcre_free_sub-         The two convenience functions pcre_free_substring() and  pcre_free_sub-
2382         string_list() can be used to free the memory  returned  by  a  previous         string_list()  can  be  used  to free the memory returned by a previous
2383         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2384         tively. They do nothing more than  call  the  function  pointed  to  by         tively.  They  do  nothing  more  than  call the function pointed to by
2385         pcre_free,  which  of course could be called directly from a C program.         pcre_free, which of course could be called directly from a  C  program.
2386         However, PCRE is used in some situations where it is linked via a  spe-         However,  PCRE is used in some situations where it is linked via a spe-
2387         cial   interface  to  another  programming  language  that  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
2388         pcre_free directly; it is for these cases that the functions  are  pro-         pcre_free  directly;  it is for these cases that the functions are pro-
2389         vided.         vided.
2390    
2391    
# Line 2378  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2404  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2404              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2405              const char **stringptr);              const char **stringptr);
2406    
2407         To extract a substring by name, you first have to find the associated         To extract a substring by name, you first have to find the associated
2408         number. For example, for this pattern         number. For example, for this pattern
2409    
2410           (a+)b(?<xxx>\d+)...           (a+)b(?<xxx>\d+)...
# Line 2387  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2413  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2413         be unique (PCRE_DUPNAMES was not set), you can find the number from the         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2414         name by calling pcre_get_stringnumber(). The first argument is the com-         name by calling pcre_get_stringnumber(). The first argument is the com-
2415         piled pattern, and the second is the name. The yield of the function is         piled pattern, and the second is the name. The yield of the function is
2416         the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2417         subpattern of that name.         subpattern of that name.
2418    
2419         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2420         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
2421         are also two functions that do the whole job.         are also two functions that do the whole job.
2422    
2423         Most    of    the    arguments   of   pcre_copy_named_substring()   and         Most   of   the   arguments    of    pcre_copy_named_substring()    and
2424         pcre_get_named_substring() are the same  as  those  for  the  similarly         pcre_get_named_substring()  are  the  same  as  those for the similarly
2425         named  functions  that extract by number. As these are described in the         named functions that extract by number. As these are described  in  the
2426         previous section, they are not re-described here. There  are  just  two         previous  section,  they  are not re-described here. There are just two
2427         differences:         differences:
2428    
2429         First,  instead  of a substring number, a substring name is given. Sec-         First, instead of a substring number, a substring name is  given.  Sec-
2430         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2431         to  the compiled pattern. This is needed in order to gain access to the         to the compiled pattern. This is needed in order to gain access to  the
2432         name-to-number translation table.         name-to-number translation table.
2433    
2434         These functions call pcre_get_stringnumber(), and if it succeeds,  they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
2435         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2436         ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2437         behaviour may not be what you want (see the next section).         behaviour may not be what you want (see the next section).
2438    
2439         Warning:  If the pattern uses the "(?|" feature to set up multiple sub-         Warning: If the pattern uses the "(?|" feature to set up multiple  sub-
2440         patterns with the same number, you  cannot  use  names  to  distinguish         patterns  with  the  same  number,  you cannot use names to distinguish
2441         them, because names are not included in the compiled code. The matching         them, because names are not included in the compiled code. The matching
2442         process uses only numbers.         process uses only numbers.
2443    
# Line 2421  DUPLICATE SUBPATTERN NAMES Line 2447  DUPLICATE SUBPATTERN NAMES
2447         int pcre_get_stringtable_entries(const pcre *code,         int pcre_get_stringtable_entries(const pcre *code,
2448              const char *name, char **first, char **last);              const char *name, char **first, char **last);
2449    
2450         When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for         When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
2451         subpatterns  are  not  required  to  be unique. Normally, patterns with         subpatterns are not required to  be  unique.  Normally,  patterns  with
2452         duplicate names are such that in any one match, only one of  the  named         duplicate  names  are such that in any one match, only one of the named
2453         subpatterns  participates. An example is shown in the pcrepattern docu-         subpatterns participates. An example is shown in the pcrepattern  docu-
2454         mentation.         mentation.
2455    
2456         When   duplicates   are   present,   pcre_copy_named_substring()    and         When    duplicates   are   present,   pcre_copy_named_substring()   and
2457         pcre_get_named_substring()  return the first substring corresponding to         pcre_get_named_substring() return the first substring corresponding  to
2458         the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING         the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
2459         (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()         (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
2460         function returns one of the numbers that are associated with the  name,         function  returns one of the numbers that are associated with the name,
2461         but it is not defined which it is.         but it is not defined which it is.
2462    
2463         If  you want to get full details of all captured substrings for a given         If you want to get full details of all captured substrings for a  given
2464         name, you must use  the  pcre_get_stringtable_entries()  function.  The         name,  you  must  use  the pcre_get_stringtable_entries() function. The
2465         first argument is the compiled pattern, and the second is the name. The         first argument is the compiled pattern, and the second is the name. The
2466         third and fourth are pointers to variables which  are  updated  by  the         third  and  fourth  are  pointers to variables which are updated by the
2467         function. After it has run, they point to the first and last entries in         function. After it has run, they point to the first and last entries in
2468         the name-to-number table  for  the  given  name.  The  function  itself         the  name-to-number  table  for  the  given  name.  The function itself
2469         returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if         returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
2470         there are none. The format of the table is described above in the  sec-         there  are none. The format of the table is described above in the sec-
2471         tion  entitled  Information  about  a  pattern.  Given all the relevant         tion entitled Information about a  pattern.   Given  all  the  relevant
2472         entries for the name, you can extract each of their numbers, and  hence         entries  for the name, you can extract each of their numbers, and hence
2473         the captured data, if any.         the captured data, if any.
2474    
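Walking the entries between first and last might look like the sketch below. It assumes the table format of the 8-bit library, in which each entry starts with the subpattern number in two bytes, most significant byte first, followed by the zero-terminated name. The function name collect_group_numbers() is hypothetical; first, last, and entry_size are assumed to be the values obtained from pcre_get_stringtable_entries().

```c
#include <assert.h>
#include <stddef.h>

/* Collect the subpattern numbers of all name-table entries from first to
   last inclusive, where the entries are entry_size bytes apart. At most
   max numbers are stored; the count stored is returned. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max)
{
    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2);

    int count = 0;
    for (const char *entry = first; count < max; entry += entry_size) {
        const unsigned char *bytes = (const unsigned char *)entry;
        /* Two-byte subpattern number, most significant byte first. */
        numbers[count++] = (bytes[0] << 8) | bytes[1];
        if (entry == last)
            break;              /* that was the final entry for the name */
    }
    return count;
}
```

Each number obtained in this way can then be used to inspect the corresponding pair of offsets in ovector, or be passed to one of the functions that extract substrings by number.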
2475    
2476  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2477    
2478         The  traditional  matching  function  uses a similar algorithm to Perl,         The traditional matching function uses a  similar  algorithm  to  Perl,
2479         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2480         the  subject.  If you want to find all possible matches, or the longest         the subject. If you want to find all possible matches, or  the  longest
2481         possible match, consider using the alternative matching  function  (see         possible  match,  consider using the alternative matching function (see
2482         below)  instead.  If you cannot use the alternative function, but still         below) instead. If you cannot use the alternative function,  but  still
2483         need to find all possible matches, you can kludge it up by  making  use         need  to  find all possible matches, you can kludge it up by making use
2484         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2485         tation.         tation.
2486    
2487         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2488         tern.   When your callout function is called, extract and save the cur-         tern.  When your callout function is called, extract and save the  cur-
2489         rent matched substring. Then return  1,  which  forces  pcre_exec()  to         rent  matched  substring.  Then  return  1, which forces pcre_exec() to
2490         backtrack  and  try other alternatives. Ultimately, when it runs out of         backtrack and try other alternatives. Ultimately, when it runs  out  of
2491         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2492    
2493    
# Line 2472  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2498  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2498              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2499              int *workspace, int wscount);              int *workspace, int wscount);
2500    
2501         The function pcre_dfa_exec()  is  called  to  match  a  subject  string         The  function  pcre_dfa_exec()  is  called  to  match  a subject string
2502         against  a  compiled pattern, using a matching algorithm that scans the         against a compiled pattern, using a matching algorithm that  scans  the
2503         subject string just once, and does not backtrack.  This  has  different         subject  string  just  once, and does not backtrack. This has different
2504         characteristics  to  the  normal  algorithm, and is not compatible with         characteristics to the normal algorithm, and  is  not  compatible  with
2505         Perl. Some of the features of PCRE patterns are not  supported.  Never-         Perl.  Some  of the features of PCRE patterns are not supported. Never-
2506         theless,  there are times when this kind of matching can be useful. For         theless, there are times when this kind of matching can be useful.  For
2507         a discussion of the two matching algorithms, see the pcrematching docu-         a  discussion  of  the  two matching algorithms, and a list of features
2508         mentation.         that pcre_dfa_exec() does not support, see the pcrematching  documenta-
2509           tion.
2510    
2511         The  arguments  for  the  pcre_dfa_exec()  function are the same as for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2512         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
# Line 2514  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2541  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2541    
2542         The  unused  bits  of  the options argument for pcre_dfa_exec() must be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2543         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2544         LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,         LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
2545         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last         PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR-
2546         three of these are the same as for pcre_exec(), so their description is         TIAL_SOFT,  PCRE_DFA_SHORTEST,  and  PCRE_DFA_RESTART. All but the last
2547         not repeated here.         four of these are  exactly  the  same  as  for  pcre_exec(),  so  their
2548           description is not repeated here.
2549           PCRE_PARTIAL  
2550             PCRE_PARTIAL_HARD
2551         This has the same general effect as it does for  pcre_exec(),  but  the           PCRE_PARTIAL_SOFT
2552         details   are   slightly   different.  When  PCRE_PARTIAL  is  set  for  
2553         pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is  converted  into         These  have the same general effect as they do for pcre_exec(), but the
2554         PCRE_ERROR_PARTIAL  if  the  end  of the subject is reached, there have         details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for
2555         been no complete matches, but there is still at least one matching pos-         pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-
2556         sibility.  The portion of the string that provided the partial match is         ject is reached and there is still at least  one  matching  possibility
2557         set as the first matching string.         that requires additional characters. This happens even if some complete
2558           matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
2559           code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
2560           of the subject is reached, there have been  no  complete  matches,  but
2561           there  is  still  at least one matching possibility. The portion of the
2562           string that was inspected when the longest partial match was  found  is
2563           set as the first matching string in both cases.
2564    
2565           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2566    
2567         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to         Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
2568         stop as soon as it has found one match. Because of the way the alterna-         stop as soon as it has found one match. Because of the way the alterna-
2569         tive algorithm works, this is necessarily the shortest  possible  match         tive  algorithm  works, this is necessarily the shortest possible match
2570         at the first possible matching point in the subject string.         at the first possible matching point in the subject string.
2571    
2572           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2573    
2574         When  pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option, and         When pcre_dfa_exec() returns a partial match, it is possible to call it
2575         returns a partial match, it is possible to call it  again,  with  addi-         again,  with  additional  subject characters, and have it continue with
2576         tional  subject  characters,  and have it continue with the same match.         the same match. The PCRE_DFA_RESTART option requests this action;  when
2577         The PCRE_DFA_RESTART option requests this action; when it is  set,  the         it  is  set,  the workspace and wscount options must reference the same
2578         workspace  and wscount options must reference the same vector as before         vector as before because data about the match so far is  left  in  them
2579         because data about the match so far is left in  them  after  a  partial         after a partial match. There is more discussion of this facility in the
2580         match.  There  is  more  discussion of this facility in the pcrepartial         pcrepartial documentation.
        documentation.  
2581    
2582     Successful returns from pcre_dfa_exec()     Successful returns from pcre_dfa_exec()
2583    
# Line 2634  AUTHOR Line 2666  AUTHOR
2666    
2667  REVISION  REVISION
2668    
2669         Last updated: 17 March 2009         Last updated: 11 September 2009
2670         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
2671  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2672    
2673    
2674  PCRECALLOUT(3)                                                  PCRECALLOUT(3)  PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2675    
2676    
# Line 2813  REVISION Line 2845  REVISION
2845         Last updated: 15 March 2009         Last updated: 15 March 2009
2846         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
2847  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2848    
2849    
2850  PCRECOMPAT(3)                                                    PCRECOMPAT(3)  PCRECOMPAT(3)                                                    PCRECOMPAT(3)
2851    
2852    
# Line 2827  DIFFERENCES BETWEEN PCRE AND PERL Line 2859  DIFFERENCES BETWEEN PCRE AND PERL
2859         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2860         handle regular expressions. The differences described here  are  mainly         handle regular expressions. The differences described here  are  mainly
2861         with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain         with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
2862         some features that are expected to be in the forthcoming Perl 5.10.         some features that are in Perl 5.10.
2863    
2864         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
2865         of  what  it does have are given in the section on UTF-8 support in the         of  what  it does have are given in the section on UTF-8 support in the
# Line 2859  DIFFERENCES BETWEEN PCRE AND PERL Line 2891  DIFFERENCES BETWEEN PCRE AND PERL
2891         is  built  with Unicode character property support. The properties that         is  built  with Unicode character property support. The properties that
2892         can be tested with \p and \P are limited to the general category  prop-         can be tested with \p and \P are limited to the general category  prop-
2893         erties  such  as  Lu and Nd, script names such as Greek or Han, and the         erties  such  as  Lu and Nd, script names such as Greek or Han, and the
2894         derived properties Any and L&.         derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
2895           property,  which  Perl  does  not; the Perl documentation says "Because
2896           Perl hides the need for the user to understand the internal representa-
2897           tion  of Unicode characters, there is no need to implement the somewhat
2898           messy concept of surrogates."
2899    
2900         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2901         ters  in  between  are  treated as literals. This is slightly different         ters  in  between  are  treated as literals. This is slightly different
# Line 2879  DIFFERENCES BETWEEN PCRE AND PERL Line 2915  DIFFERENCES BETWEEN PCRE AND PERL
2915    
2916         8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})         8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
2917         constructions. However, there is support for recursive  patterns.  This         constructions. However, there is support for recursive  patterns.  This
2918         is  not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE         is  not  available  in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
2919         "callout" feature allows an external function to be called during  pat-         "callout" feature allows an external function to be called during  pat-
2920         tern matching. See the pcrecallout documentation for details.         tern matching. See the pcrecallout documentation for details.
2921    
2922         9.  Subpatterns  that  are  called  recursively or as "subroutines" are         9.  Subpatterns  that  are  called  recursively or as "subroutines" are
2923         always treated as atomic groups in  PCRE.  This  is  like  Python,  but         always treated as atomic groups in  PCRE.  This  is  like  Python,  but
2924         unlike Perl.         unlike  Perl. There is a discussion of an example that explains this in
2925           more detail in the section on recursion differences from  Perl  in  the
2926           pcrepattern page.
2927    
2928         10.  There are some differences that are concerned with the settings of         10.  There are some differences that are concerned with the settings of
2929         captured strings when part of  a  pattern  is  repeated.  For  example,         captured strings when part of  a  pattern  is  repeated.  For  example,
# Line 2894  DIFFERENCES BETWEEN PCRE AND PERL Line 2932  DIFFERENCES BETWEEN PCRE AND PERL
2932    
2933         11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),         11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),
2934         (*FAIL),  (*F),  (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in         (*FAIL),  (*F),  (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
2935         the forms without an  argument.  PCRE  does  not  support  (*MARK).  If         the forms without an argument. PCRE does not support (*MARK).
        (*ACCEPT)  is within capturing parentheses, PCRE does not set that cap-  
        ture group; this is different to Perl.  
2936    
2937         12. PCRE provides some extensions to the Perl regular expression facil-         12. PCRE provides some extensions to the Perl regular expression facil-
2938         ities.   Perl  5.10  will  include new features that are not in earlier         ities.   Perl  5.10  will  include new features that are not in earlier
# Line 2921  DIFFERENCES BETWEEN PCRE AND PERL Line 2957  DIFFERENCES BETWEEN PCRE AND PERL
2957         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2958         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
2959    
2960         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
2961         TURE options for pcre_exec() have no Perl equivalents.         and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no  Perl  equiva-
2962           lents.
2963    
2964         (g) The \R escape sequence can be restricted to match only CR,  LF,  or         (g)  The  \R escape sequence can be restricted to match only CR, LF, or
2965         CRLF by the PCRE_BSR_ANYCRLF option.         CRLF by the PCRE_BSR_ANYCRLF option.
2966    
2967         (h) The callout facility is PCRE-specific.         (h) The callout facility is PCRE-specific.
# Line 2934  DIFFERENCES BETWEEN PCRE AND PERL Line 2971  DIFFERENCES BETWEEN PCRE AND PERL
2971         (j) Patterns compiled by PCRE can be saved and re-used at a later time,         (j) Patterns compiled by PCRE can be saved and re-used at a later time,
2972         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
2973    
2974         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a         (k)  The  alternative  matching function (pcre_dfa_exec()) matches in a
2975         different way and is not Perl-compatible.         different way and is not Perl-compatible.
2976    
2977         (l)  PCRE  recognizes some special sequences such as (*CR) at the start         (l) PCRE recognizes some special sequences such as (*CR) at  the  start
2978         of a pattern that set overall options that cannot be changed within the         of a pattern that set overall options that cannot be changed within the
2979         pattern.         pattern.
2980    
# Line 2951  AUTHOR Line 2988  AUTHOR
2988    
2989  REVISION  REVISION
2990    
2991         Last updated: 11 September 2007         Last updated: 18 September 2009
2992         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
2993  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2994    
2995    
2996  PCREPATTERN(3)                                                  PCREPATTERN(3)  PCREPATTERN(3)                                                  PCREPATTERN(3)
2997    
2998    
# Line 2983  PCRE REGULAR EXPRESSION DETAILS Line 3020  PCRE REGULAR EXPRESSION DETAILS
3020         The original operation of PCRE was on strings of  one-byte  characters.         The original operation of PCRE was on strings of  one-byte  characters.
3021         However,  there is now also support for UTF-8 character strings. To use         However,  there is now also support for UTF-8 character strings. To use
3022         this, you must build PCRE to  include  UTF-8  support,  and  then  call         this, you must build PCRE to  include  UTF-8  support,  and  then  call
3023         pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern         pcre_compile()  with  the  PCRE_UTF8  option.  There  is also a special
3024         matching is mentioned in several places below. There is also a  summary         sequence that can be given at the start of a pattern:
3025         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre  
3026         page.           (*UTF8)
3027    
3028           Starting a pattern with this sequence  is  equivalent  to  setting  the
3029           PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
3030           UTF-8 mode affects pattern matching  is  mentioned  in  several  places
3031           below.  There  is  also  a  summary of UTF-8 features in the section on
3032           UTF-8 support in the main pcre page.
3033    
3034         The remainder of this document discusses the  patterns  that  are  sup-         The remainder of this document discusses the  patterns  that  are  sup-
3035         ported  by  PCRE when its main matching function, pcre_exec(), is used.         ported  by  PCRE when its main matching function, pcre_exec(), is used.
# Line 3464  BACKSLASH Line 3507  BACKSLASH
3507         U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see         U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see
3508         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
3509         ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in         ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in
3510         the pcreapi page).         the pcreapi page). Perl does not support the Cs property.
3511    
3512         The  long  synonyms  for  these  properties that Perl supports (such as         The  long  synonyms  for  property  names  that  Perl supports (such as
3513         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
3514         any of these properties with "Is".         any of these properties with "Is".
3515    
# Line 3832  INTERNAL OPTION SETTING Line 3875  INTERNAL OPTION SETTING
3875         can  be changed in the same way as the Perl-compatible options by using         can  be changed in the same way as the Perl-compatible options by using
3876         the characters J, U and X respectively.         the characters J, U and X respectively.
3877    
3878         When an option change occurs at top level (that is, not inside  subpat-         When one of these option changes occurs at  top  level  (that  is,  not
3879         tern  parentheses),  the change applies to the remainder of the pattern         inside  subpattern parentheses), the change applies to the remainder of
3880         that follows.  If the change is placed right at the start of a pattern,         the pattern that follows. If the change is placed right at the start of
3881         PCRE extracts it into the global options (and it will therefore show up         a pattern, PCRE extracts it into the global options (and it will there-
3882         in data extracted by the pcre_fullinfo() function).         fore show up in data extracted by the pcre_fullinfo() function).
3883    
3884         An option change within a subpattern (see below for  a  description  of         An option change within a subpattern (see below for  a  description  of
3885         subpatterns) affects only that part of the current pattern that follows         subpatterns) affects only that part of the current pattern that follows
# Line 3859  INTERNAL OPTION SETTING Line 3902  INTERNAL OPTION SETTING
3902    
3903         Note:  There  are  other  PCRE-specific  options that can be set by the         Note:  There  are  other  PCRE-specific  options that can be set by the
3904         application when the compile or match functions  are  called.  In  some         application when the compile or match functions  are  called.  In  some
3905         cases  the  pattern  can  contain special leading sequences to override         cases the pattern can contain special leading sequences such as (*CRLF)
3906         what the application has set or what has been  defaulted.  Details  are         to override what the application has set or what  has  been  defaulted.
3907         given in the section entitled "Newline sequences" above.         Details  are  given  in the section entitled "Newline sequences" above.
3908           There is also the (*UTF8) leading sequence that  can  be  used  to  set
3909           UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option.
3910    
3911    
3912  SUBPATTERNS  SUBPATTERNS
# Line 4689  RECURSIVE PATTERNS Line 4734  RECURSIVE PATTERNS
4734         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
4735         it  supports  special  syntax  for recursion of the entire pattern, and         it  supports  special  syntax  for recursion of the entire pattern, and
4736         also for individual subpattern recursion.  After  its  introduction  in         also for individual subpattern recursion.  After  its  introduction  in
4737         PCRE  and  Python,  this  kind of recursion was introduced into Perl at         PCRE  and  Python,  this  kind of recursion was subsequently introduced
4738         release 5.10.         into Perl at release 5.10.
4739    
4740         A special item that consists of (? followed by a  number  greater  than         A special item that consists of (? followed by a  number  greater  than
4741         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
# Line 4699  RECURSIVE PATTERNS Line 4744  RECURSIVE PATTERNS
4744         tion.) The special item (?R) or (?0) is a recursive call of the  entire         tion.) The special item (?R) or (?0) is a recursive call of the  entire
4745         regular expression.         regular expression.
4746    
4747         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is         This  PCRE  pattern  solves  the nested parentheses problem (assume the
        always treated as an atomic group. That is, once it has matched some of  
        the subject string, it is never re-entered, even if it contains untried  
        alternatives and there is a subsequent matching failure.  
   
        This PCRE pattern solves the nested  parentheses  problem  (assume  the  
4748         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
4749    
4750           \( ( (?>[^()]+) | (?R) )* \)           \( ( (?>[^()]+) | (?R) )* \)
4751    
4752         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
4753         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
4754         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
4755         sized substring).  Finally there is a closing parenthesis.         sized substring).  Finally there is a closing parenthesis.
4756    
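As a rough illustration, the pattern above can be read as a small recursive grammar. The following is a minimal hand-written sketch in Python, not PCRE itself and not using any regex engine. The function name match_parens is made up for this example, and it only tests for a match anchored at the given offset.

```python
def match_parens(s, i=0):
    # Hand-written equivalent of:  \( ( (?>[^()]+) | (?R) )* \)
    # Returns the offset just past the balanced group that starts at s[i],
    # or None if there is no such group at that offset.
    if i >= len(s) or s[i] != "(":
        return None                      # the leading \( did not match
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            end = match_parens(s, i)     # the (?R) alternative
            if end is None:
                return None
            i = end
        else:
            # the (?>[^()]+) alternative: take the whole run of
            # non-parentheses and never give any of it back
            while i < len(s) and s[i] not in "()":
                i += 1
    # the trailing \) must be present
    return i + 1 if i < len(s) else None
```

For instance, match_parens("(ab(cd)ef)") returns 10, the length of the whole string, whereas a string with an unclosed opening parenthesis gives None.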
4757         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
4758         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
4759    
4760           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
4761    
4762         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
4763         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
4764    
4765         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
4766         tricky.  This is made easier by the use of relative references. (A Perl         tricky. This is made easier by the use of relative references. (A  Perl
4767         5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write         5.10  feature.)   Instead  of  (?1)  in the pattern above you can write
4768         (?-2) to refer to the second most recently opened parentheses preceding         (?-2) to refer to the second most recently opened parentheses preceding
4769         the recursion. In other  words,  a  negative  number  counts  capturing         the  recursion.  In  other  words,  a  negative number counts capturing
4770         parentheses leftwards from the point at which it is encountered.         parentheses leftwards from the point at which it is encountered.
4771    
4772         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
4773         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
4774         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
4775         enced. They are always "subroutine" calls, as  described  in  the  next         enced.  They  are  always  "subroutine" calls, as described in the next
4776         section.         section.
4777    
4778         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
4779         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
4780         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
4781    
4782           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
4783    
4784         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
4785         one is used.         one is used.
4786    
4787         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
4788         nested  unlimited repeats, and so the use of atomic grouping for match-         nested unlimited repeats, and so the use of atomic grouping for  match-
4789         ing strings of non-parentheses is important when applying  the  pattern         ing  strings  of non-parentheses is important when applying the pattern
4790         to strings that do not match. For example, when this pattern is applied         to strings that do not match. For example, when this pattern is applied
4791         to         to
4792    
4793           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4794    
4795         it yields "no match" quickly. However, if atomic grouping is not  used,         it  yields "no match" quickly. However, if atomic grouping is not used,
4796         the  match  runs  for a very long time indeed because there are so many         the match runs for a very long time indeed because there  are  so  many
4797         different ways the + and * repeats can carve up the  subject,  and  all         different  ways  the  + and * repeats can carve up the subject, and all
4798         have to be tested before failure can be reported.         have to be tested before failure can be reported.
4799    
4800         At the end of a match, the values set for any capturing subpatterns are         At the end of a match, the values set for any capturing subpatterns are
4801         those from the outermost level of the recursion at which the subpattern         those from the outermost level of the recursion at which the subpattern
4802         value  is  set.   If  you want to obtain intermediate values, a callout         value is set.  If you want to obtain  intermediate  values,  a  callout
4803         function can be used (see below and the pcrecallout documentation).  If         function  can be used (see below and the pcrecallout documentation). If
4804         the pattern above is matched against         the pattern above is matched against
4805    
4806           (ab(cd)ef)           (ab(cd)ef)
4807    
4808         the  value  for  the  capturing  parentheses is "ef", which is the last         the value for the capturing parentheses is  "ef",  which  is  the  last
4809         value taken on at the top level. If additional parentheses  are  added,         value  taken  on at the top level. If additional parentheses are added,
4810         giving         giving
4811    
4812           \( ( ( (?>[^()]+) | (?R) )* ) \)           \( ( ( (?>[^()]+) | (?R) )* ) \)
4813              ^                        ^              ^                        ^
4814              ^                        ^              ^                        ^
4815    
4816         the  string  they  capture is "ab(cd)ef", the contents of the top level         the string they capture is "ab(cd)ef", the contents of  the  top  level
4817         parentheses. If there are more than 15 capturing parentheses in a  pat-         parentheses.  If there are more than 15 capturing parentheses in a pat-
4818         tern, PCRE has to obtain extra memory to store data during a recursion,         tern, PCRE has to obtain extra memory to store data during a recursion,
4819         which it does by using pcre_malloc, freeing  it  via  pcre_free  after-         which  it  does  by  using pcre_malloc, freeing it via pcre_free after-
4820         wards.  If  no  memory  can  be  obtained,  the  match  fails  with the         wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
4821         PCRE_ERROR_NOMEMORY error.         PCRE_ERROR_NOMEMORY error.
4822    
4823         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do  not  confuse  the (?R) item with the condition (R), which tests for
4824         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.  Consider this pattern, which matches text in  angle  brack-
4825         ets, allowing for arbitrary nesting. Only digits are allowed in  nested         ets,  allowing for arbitrary nesting. Only digits are allowed in nested
4826         brackets  (that is, when recursing), whereas any characters are permit-         brackets (that is, when recursing), whereas any characters are  permit-
4827         ted at the outer level.         ted at the outer level.
4828    
4829           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
4830    
4831         In this pattern, (?(R) is the start of a conditional  subpattern,  with         In  this  pattern, (?(R) is the start of a conditional subpattern, with
4832         two  different  alternatives for the recursive and non-recursive cases.         two different alternatives for the recursive and  non-recursive  cases.
4833         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
4834    
4835       Recursion difference from Perl
4836    
4837           In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
4838           always treated as an atomic group. That is, once it has matched some of
4839           the subject string, it is never re-entered, even if it contains untried
4840           alternatives and there is a subsequent matching failure.  This  can  be
4841           illustrated  by the following pattern, which purports to match a palin-
4842           dromic string that contains an odd number of characters  (for  example,
4843           "a", "aba", "abcba", "abcdcba"):
4844    
4845             ^(.|(.)(?1)\2)$
4846    
4847           The idea is that it either matches a single character, or two identical
4848           characters surrounding a sub-palindrome. In Perl, this  pattern  works;
4849           in  PCRE  it  does  not if the subject is longer than three characters.
4850           Consider the subject string "abcba":
4851    
4852           At the top level, the first character is matched, but as it is  not  at
4853           the end of the string, the first alternative fails; the second alterna-
4854           tive is taken and the recursion kicks in. The recursive call to subpat-
4855           tern  1  successfully  matches the next character ("b"). (Note that the
4856           beginning and end of line tests are not part of the recursion).
4857    
4858           Back at the top level, the next character ("c") is compared  with  what
4859           subpattern  2 matched, which was "a". This fails. Because the recursion
4860           is treated as an atomic group, there are now  no  backtracking  points,
4861           and  so  the  entire  match fails. (Perl is able, at this point, to re-
4862           enter the recursion and try the second alternative.)  However,  if  the
4863           pattern is written with the alternatives in the other order, things are
4864           different:
4865    
4866             ^((.)(?1)\2|.)$
4867    
4868           This time, the recursing alternative is tried first, and  continues  to
4869           recurse  until  it runs out of characters, at which point the recursion
4870           fails. But this time we do have  another  alternative  to  try  at  the
4871           higher  level.  That  is  the  big difference: in the previous case the
4872           remaining alternative is at a deeper recursion level, which PCRE cannot
4873           use.
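
The difference described above can be reproduced with a small toy model. The following Python sketch is not PCRE's matching engine; it is a hand-written matcher for the two patterns only, in which the hypothetical names call(), single_first(), recursive_first() and matches() are inventions of this example. A recursive call either offers every way of matching (a Perl-like model) or only the first one (a model of the atomic treatment described here).

```python
from typing import Callable, Iterator

# A group matcher takes (subject, start offset, atomic flag) and yields
# every end offset at which group 1 can finish, in backtracking order.
Matcher = Callable[[str, int, bool], Iterator[int]]


def call(group: Matcher, s: str, i: int, atomic: bool) -> Iterator[int]:
    # Models the recursive call (?1) starting at offset i.
    ends = group(s, i, atomic)
    if atomic:
        # Atomic model: once the call has matched, it is never re-entered,
        # so only its first successful end offset is offered.
        first = next(ends, None)
        if first is not None:
            yield first
    else:
        # Backtracking model: every way of matching remains available.
        yield from ends


def single_first(s: str, i: int, atomic: bool) -> Iterator[int]:
    # Group 1 of ^(.|(.)(?1)\2)$ : the single character is tried first.
    if i < len(s):
        yield i + 1                                     # alternative: .
        for j in call(single_first, s, i + 1, atomic):  # alternative: (.)(?1)\2
            if j < len(s) and s[j] == s[i]:
                yield j + 1


def recursive_first(s: str, i: int, atomic: bool) -> Iterator[int]:
    # Group 1 of ^((.)(?1)\2|.)$ : the recursing alternative is tried first.
    if i < len(s):
        for j in call(recursive_first, s, i + 1, atomic):
            if j < len(s) and s[j] == s[i]:
                yield j + 1
        yield i + 1                                     # alternative: .


def matches(group: Matcher, s: str, atomic: bool) -> bool:
    # ^...$ : the top-level group is an ordinary group, so all of its
    # alternatives may be tried; only recursive calls are atomic.
    return any(end == len(s) for end in group(s, 0, atomic))


if __name__ == "__main__":
    for name, group in (("single first", single_first),
                        ("recursive first", recursive_first)):
        for atomic in (False, True):
            print(name, "atomic" if atomic else "backtracking",
                  matches(group, "abcba", atomic))
```

In this model, the first ordering fails on "abcba" only when calls are atomic, while the second ordering succeeds either way, which agrees with the explanation above.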
4874    
4875           To change the pattern so that it matches all palindromic strings, not just
4876           those with an odd number of characters, it is tempting  to  change  the
4877           pattern to this:
4878    
4879             ^((.)(?1)\2|.?)$
4880    
4881           Again,  this  works  in Perl, but not in PCRE, and for the same reason.
4882           When a deeper recursion has matched a single character,  it  cannot  be
4883           entered  again  in  order  to match an empty string. The solution is to
4884           separate the two cases, and write out the odd and even cases as  alter-
4885           natives at the higher level:
4886    
4887             ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
4888    
4889           If  you  want  to match typical palindromic phrases, the pattern has to
4890           ignore all non-word characters, which can be done like this:
4891    
4892             ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4893    
4894           If run with the PCRE_CASELESS option, this pattern matches phrases such
4895           as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
4896           Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
4897           ing  into  sequences of non-word characters. Without this, PCRE takes a
4898           great deal longer (ten times or more) to  match  typical  phrases,  and
4899           Perl takes so long that you think it has gone into a loop.
4900    
4901    
4902  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
4903    
4904         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
4905         by  name)  is used outside the parentheses to which it refers, it oper-         by name) is used outside the parentheses to which it refers,  it  oper-
4906         ates like a subroutine in a programming language. The "called"  subpat-         ates  like a subroutine in a programming language. The "called" subpat-
4907         tern may be defined before or after the reference. A numbered reference         tern may be defined before or after the reference. A numbered reference
4908         can be absolute or relative, as in these examples:         can be absolute or relative, as in these examples:
4909    
# Line 4809  SUBPATTERNS AS SUBROUTINES Line 4915  SUBPATTERNS AS SUBROUTINES
4915    
4916           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4917    
4918         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
4919         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
4920    
4921           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
4922    
4923         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
4924         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
4925         above.         above.
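
The contrast between a back reference and a subroutine call can be tried out in Python, with one caveat: Python's re module has no (?1) syntax, so this sketch assumes that writing the subpattern out a second time is an acceptable stand-in for the call. It does not model the atomic treatment of subroutine calls, and the variable names are illustrative only.

```python
import re

# Back reference: \1 must repeat the exact text that group 1 captured.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (sens|respons)e and (?1)ibility: the subpattern itself is
# applied again, so either alternative may match the second time.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

for subject in subjects:
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subroutine_like.fullmatch(subject)))
```

Only the third subject distinguishes the two forms: the back reference rejects it, whereas the re-applied subpattern accepts it.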
4926    
4927         Like recursive subpatterns, a "subroutine" call is always treated as an         Like recursive subpatterns, a "subroutine" call is always treated as an
4928         atomic group. That is, once it has matched some of the subject  string,         atomic  group. That is, once it has matched some of the subject string,
4929         it  is  never  re-entered, even if it contains untried alternatives and         it is never re-entered, even if it contains  untried  alternatives  and
4930         there is a subsequent matching failure.         there is a subsequent matching failure.
4931    
4932         When a subpattern is used as a subroutine, processing options  such  as         When  a  subpattern is used as a subroutine, processing options such as
4933         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
4934         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
4935    
4936           (abc)(?i:(?-1))           (abc)(?i:(?-1))
4937    
4938         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
4939         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
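
The effect can be approximated in Python, again assuming that a textual copy of the subpattern stands in for the call, since the re module has no subroutine syntax. In this sketch the copy is wrapped in (?-i:...) to mimic the options being fixed where the subpattern is defined; a plain copy inside (?i:...) is shown for contrast. Scoped inline flags of this kind need Python 3.6 or later.

```python
import re

# Mimics (abc)(?i:(?-1)): the called subpattern keeps the case-sensitive
# setting in force where it was defined, despite the enclosing (?i:...).
fixed_options = re.compile(r"(abc)(?i:(?-i:abc))")

# For contrast: an ordinary copy of the text inside (?i:...) does become
# caseless, which is NOT how the subroutine call behaves.
plain_copy = re.compile(r"(abc)(?i:abc)")

for subject in ("abcabc", "abcABC"):
    print(subject,
          bool(fixed_options.fullmatch(subject)),
          bool(plain_copy.fullmatch(subject)))
```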
4940    
4941    
4942  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
4943    
4944         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
4945         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
4946         an  alternative  syntax  for  referencing a subpattern as a subroutine,         an alternative syntax for referencing a  subpattern  as  a  subroutine,
4947         possibly recursively. Here are two of the examples used above,  rewrit-         possibly  recursively. Here are two of the examples used above, rewrit-
4948         ten using this syntax:         ten using this syntax:
4949    
4950           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4951           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
4952    
4953         PCRE  supports  an extension to Oniguruma: if a number is preceded by a         PCRE supports an extension to Oniguruma: if a number is preceded  by  a
4954         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
4955    
4956           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
4957    
4958         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
4959         synonymous.  The former is a back reference; the latter is a subroutine         synonymous. The former is a back reference; the latter is a  subroutine
4960         call.         call.
4961    
4962    
4963  CALLOUTS  CALLOUTS
4964    
4965         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
4966         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
4967         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
4968         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
4969         tion.         tion.
4970    
4971         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
4972         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
4973         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
4974         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
4975         all calling out.         all calling out.
4976    
4977         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
4978         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
4979         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
4980         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
4981         points:         points:
4982    
4983           (?C1)abc(?C2)def           (?C1)abc(?C2)def
4984    
4985         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
4986         automatically  installed  before each item in the pattern. They are all         automatically installed before each item in the pattern. They  are  all
4987         numbered 255.         numbered 255.
4988    
4989         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
4990         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
4991         of the callout, the position in the pattern, and, optionally, one  item         of  the callout, the position in the pattern, and, optionally, one item
4992         of  data  originally supplied by the caller of pcre_exec(). The callout         of data originally supplied by the caller of pcre_exec().  The  callout
4993         function may cause matching to proceed, to backtrack, or to fail  alto-         function  may cause matching to proceed, to backtrack, or to fail alto-
4994         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
4995         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
4996    
4997    
4998  BACKTRACKING CONTROL  BACKTRACKING CONTROL
4999    
5000         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
5001         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
5002         ject to change or removal in a future version of Perl". It goes  on  to         ject  to  change or removal in a future version of Perl". It goes on to
5003         say:  "Their usage in production code should be noted to avoid problems         say: "Their usage in production code should be noted to avoid  problems
5004         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
5005         in this section.         in this section.
5006    
5007         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
5008         them can be  used  only  when  the  pattern  is  to  be  matched  using         them  can  be  used  only  when  the  pattern  is  to  be matched using
5009         pcre_exec(), which uses a backtracking algorithm. With the exception of         pcre_exec(), which uses a backtracking algorithm. With the exception of
5010         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5011         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5012    
5013           If any of these verbs are used in an assertion subpattern, their effect
5014           is  confined  to that subpattern; it does not extend to the surrounding
5015           pattern.  Note that assertion subpatterns are processed as anchored  at
5016           the point where they are tested.
5017    
5018         The  new verbs make use of what was previously invalid syntax: an open-         The  new verbs make use of what was previously invalid syntax: an open-
5019         ing parenthesis followed by an asterisk. In Perl, they are generally of         ing parenthesis followed by an asterisk. In Perl, they are generally of
5020         the form (*VERB:ARG) but PCRE does not support the use of arguments, so         the form (*VERB:ARG) but PCRE does not support the use of arguments, so
# Line 4918  BACKTRACKING CONTROL Line 5029  BACKTRACKING CONTROL
5029    
5030         This  verb causes the match to end successfully, skipping the remainder         This  verb causes the match to end successfully, skipping the remainder
5031         of the pattern. When inside a recursion, only the innermost pattern  is         of the pattern. When inside a recursion, only the innermost pattern  is
5032         ended  immediately.  PCRE  differs  from  Perl  in  what happens if the         ended  immediately.  If  the (*ACCEPT) is inside capturing parentheses,
5033         (*ACCEPT) is inside capturing parentheses. In Perl, the data so far  is         the data so far is captured. (This feature was added to PCRE at release
5034         captured: in PCRE no data is captured. For example:         8.00.) For example:
5035    
5036           A(A|B(*ACCEPT)|C)D           A((?:A|B(*ACCEPT)|C)D)
5037    
5038         This  matches  "AB", "AAD", or "ACD", but when it matches "AB", no data         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
5039         is captured.         tured by the outer parentheses.
5040    
5041           (*FAIL) or (*F)           (*FAIL) or (*F)
5042    
# Line 5021  AUTHOR Line 5132  AUTHOR
5132    
5133  REVISION  REVISION
5134    
5135         Last updated: 18 March 2009         Last updated: 18 September 2009
5136         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5137  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5138    
5139    
5140  PCRESYNTAX(3)                                                    PCRESYNTAX(3)  PCRESYNTAX(3)                                                    PCRESYNTAX(3)
5141    
5142    
# Line 5134  GENERAL CATEGORY PROPERTY CODES FOR \p a Line 5245  GENERAL CATEGORY PROPERTY CODES FOR \p a
5245  SCRIPT NAMES FOR \p AND \P  SCRIPT NAMES FOR \p AND \P
5246    
5247         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
5248         Buhid,  Canadian_Aboriginal,  Cherokee,  Common,   Coptic,   Cuneiform,         Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cu-
5249         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,         neiform,  Cypriot,  Cyrillic,  Deseret, Devanagari, Ethiopic, Georgian,
5250         Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-         Glagolitic, Gothic, Greek, Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,
5251         gana,  Inherited,  Kannada,  Katakana,  Kharoshthi,  Khmer, Lao, Latin,         Hebrew,  Hiragana,  Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi,
5252         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,         Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian,  Malayalam,
5253         Ogham,  Old_Italic,  Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,         Mongolian,  Myanmar,  New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian,
5254         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,         Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurash-
5255         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.         tra,  Shavian,  Sinhala,  Sundanese, Syloti_Nagri, Syriac, Tagalog, Tag-
5256           banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
5257           Ugaritic, Vai, Yi.
5258    
5259    
5260  CHARACTER CLASSES  CHARACTER CLASSES
# Line 5193  QUANTIFIERS Line 5306  QUANTIFIERS
5306    
5307  ANCHORS AND SIMPLE ASSERTIONS  ANCHORS AND SIMPLE ASSERTIONS
5308    
5309           \b          word boundary           \b          word boundary (only ASCII letters recognized)
5310           \B          not a word boundary           \B          not a word boundary
5311           ^           start of subject           ^           start of subject
5312                        also after internal newline in multiline mode                        also after internal newline in multiline mode
# Line 5219  ALTERNATION Line 5332  ALTERNATION
5332    
5333  CAPTURING  CAPTURING
5334    
5335           (...)          capturing group           (...)           capturing group
5336           (?<name>...)   named capturing group (Perl)           (?<name>...)    named capturing group (Perl)
5337           (?'name'...)   named capturing group (Perl)           (?'name'...)    named capturing group (Perl)
5338           (?P<name>...)  named capturing group (Python)           (?P<name>...)   named capturing group (Python)
5339           (?:...)        non-capturing group           (?:...)         non-capturing group
5340           (?|...)        non-capturing group; reset group numbers for           (?|...)         non-capturing group; reset group numbers for
5341                           capturing groups in each alternative                            capturing groups in each alternative
5342    
5343    
5344  ATOMIC GROUPS  ATOMIC GROUPS
5345    
5346           (?>...)        atomic, non-capturing group           (?>...)         atomic, non-capturing group
5347    
5348    
5349  COMMENT  COMMENT
5350    
5351           (?#....)       comment (not nestable)           (?#....)        comment (not nestable)
5352    
5353    
5354  OPTION SETTING  OPTION SETTING
5355    
5356           (?i)           caseless           (?i)            caseless
5357           (?J)           allow duplicate names           (?J)            allow duplicate names
5358           (?m)           multiline           (?m)            multiline
5359           (?s)           single line (dotall)           (?s)            single line (dotall)
5360           (?U)           default ungreedy (lazy)           (?U)            default ungreedy (lazy)
5361           (?x)           extended (ignore white space)           (?x)            extended (ignore white space)
5362           (?-...)        unset option(s)           (?-...)         unset option(s)
5363    
5364           The following is recognized only at the start of a pattern or after one
5365           of the newline-setting options with similar syntax:
5366    
5367             (*UTF8)         set UTF-8 mode
5368    
5369    
5370  LOOKAHEAD AND LOOKBEHIND ASSERTIONS  LOOKAHEAD AND LOOKBEHIND ASSERTIONS
5371    
5372           (?=...)        positive look ahead           (?=...)         positive look ahead
5373           (?!...)        negative look ahead           (?!...)         negative look ahead
5374           (?<=...)       positive look behind           (?<=...)        positive look behind
5375           (?<!...)       negative look behind           (?<!...)        negative look behind
5376    
5377         Each top-level branch of a look behind must be of a fixed length.         Each top-level branch of a look behind must be of a fixed length.
5378    
5379    
5380  BACKREFERENCES  BACKREFERENCES
5381    
5382           \n             reference by number (can be ambiguous)           \n              reference by number (can be ambiguous)
5383           \gn            reference by number           \gn             reference by number
5384           \g{n}          reference by number           \g{n}           reference by number
5385           \g{-n}         relative reference by number           \g{-n}          relative reference by number
5386           \k<name>       reference by name (Perl)           \k<name>        reference by name (Perl)
5387           \k'name'       reference by name (Perl)           \k'name'        reference by name (Perl)
5388           \g{name}       reference by name (Perl)           \g{name}        reference by name (Perl)
5389           \k{name}       reference by name (.NET)           \k{name}        reference by name (.NET)
5390           (?P=name)      reference by name (Python)           (?P=name)       reference by name (Python)
5391    
5392    
5393  SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)  SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
5394    
5395           (?R)           recurse whole pattern           (?R)            recurse whole pattern
5396           (?n)           call subpattern by absolute number           (?n)            call subpattern by absolute number
5397           (?+n)          call subpattern by relative number           (?+n)           call subpattern by relative number
5398           (?-n)          call subpattern by relative number           (?-n)           call subpattern by relative number
5399           (?&name)       call subpattern by name (Perl)           (?&name)        call subpattern by name (Perl)
5400           (?P>name)      call subpattern by name (Python)           (?P>name)       call subpattern by name (Python)
5401           \g<name>       call subpattern by name (Oniguruma)           \g<name>        call subpattern by name (Oniguruma)
5402           \g'name'       call subpattern by name (Oniguruma)           \g'name'        call subpattern by name (Oniguruma)
5403           \g<n>          call subpattern by absolute number (Oniguruma)           \g<n>           call subpattern by absolute number (Oniguruma)
5404           \g'n'          call subpattern by absolute number (Oniguruma)           \g'n'           call subpattern by absolute number (Oniguruma)
5405           \g<+n>         call subpattern by relative number (PCRE extension)           \g<+n>          call subpattern by relative number (PCRE extension)
5406           \g'+n'         call subpattern by relative number (PCRE extension)           \g'+n'          call subpattern by relative number (PCRE extension)
5407           \g<-n>         call subpattern by relative number (PCRE extension)           \g<-n>          call subpattern by relative number (PCRE extension)
5408           \g'-n'         call subpattern by relative number (PCRE extension)           \g'-n'          call subpattern by relative number (PCRE extension)
5409    
5410    
5411  CONDITIONAL PATTERNS  CONDITIONAL PATTERNS
# Line 5295  CONDITIONAL PATTERNS Line 5413  CONDITIONAL PATTERNS
5413           (?(condition)yes-pattern)           (?(condition)yes-pattern)
5414           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
5415    
5416           (?(n)...       absolute reference condition           (?(n)...        absolute reference condition
5417           (?(+n)...      relative reference condition           (?(+n)...       relative reference condition
5418           (?(-n)...      relative reference condition           (?(-n)...       relative reference condition
5419           (?(<name>)...  named reference condition (Perl)           (?(<name>)...   named reference condition (Perl)
5420           (?('name')...  named reference condition (Perl)           (?('name')...   named reference condition (Perl)
5421           (?(name)...    named reference condition (PCRE)           (?(name)...     named reference condition (PCRE)
5422           (?(R)...       overall recursion condition           (?(R)...        overall recursion condition
5423           (?(Rn)...      specific group recursion condition           (?(Rn)...       specific group recursion condition
5424           (?(R&name)...  specific recursion condition           (?(R&name)...   specific recursion condition
5425           (?(DEFINE)...  define subpattern for reference           (?(DEFINE)...   define subpattern for reference
5426           (?(assert)...  assertion condition           (?(assert)...   assertion condition
5427    
5428    
5429  BACKTRACKING CONTROL  BACKTRACKING CONTROL
5430    
5431         The following act immediately they are reached:         The following act immediately they are reached:
5432    
5433           (*ACCEPT)      force successful match           (*ACCEPT)       force successful match
5434           (*FAIL)        force backtrack; synonym (*F)           (*FAIL)         force backtrack; synonym (*F)
5435    
5436         The following act only when a subsequent match failure causes  a  back-         The  following  act only when a subsequent match failure causes a back-
5437         track to reach them. They all force a match failure, but they differ in         track to reach them. They all force a match failure, but they differ in
5438         what happens afterwards. Those that advance the start-of-match point do         what happens afterwards. Those that advance the start-of-match point do
5439         so only if the pattern is not anchored.         so only if the pattern is not anchored.
5440    
5441           (*COMMIT)      overall failure, no advance of starting point           (*COMMIT)       overall failure, no advance of starting point
5442           (*PRUNE)       advance to next starting character           (*PRUNE)        advance to next starting character
5443           (*SKIP)        advance start to current matching position           (*SKIP)         advance start to current matching position
5444           (*THEN)        local failure, backtrack to next alternation           (*THEN)         local failure, backtrack to next alternation
5445    
5446    
5447  NEWLINE CONVENTIONS  NEWLINE CONVENTIONS
5448    
5449         These  are  recognized only at the very start of the pattern or after a         These are recognized only at the very start of the pattern or  after  a
5450         (*BSR_...) option.         (*BSR_...) or (*UTF8) option.
5451    
5452           (*CR)           (*CR)           carriage return only
5453           (*LF)           (*LF)           linefeed only
5454           (*CRLF)           (*CRLF)         carriage return followed by linefeed
5455           (*ANYCRLF)           (*ANYCRLF)      all three of the above
5456           (*ANY)           (*ANY)          any Unicode newline sequence
5457    
5458    
5459  WHAT \R MATCHES  WHAT \R MATCHES
5460    
5461         These are recognized only at the very start of the pattern or  after  a         These  are  recognized only at the very start of the pattern or after a
5462         (*...) option that sets the newline convention.         (*...) option that sets the newline convention or UTF-8 mode.
5463    
5464           (*BSR_ANYCRLF)           (*BSR_ANYCRLF)  CR, LF, or CRLF
5465           (*BSR_UNICODE)           (*BSR_UNICODE)  any Unicode newline sequence
5466    
5467    
5468  CALLOUTS  CALLOUTS
# Line 5367  AUTHOR Line 5485  AUTHOR
5485    
5486  REVISION  REVISION
5487    
5488         Last updated: 09 April 2008         Last updated: 11 April 2009
5489         Copyright (c) 1997-2008 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5490  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5491    
5492    
5493  PCREPARTIAL(3)                                                  PCREPARTIAL(3)  PCREPARTIAL(3)                                                  PCREPARTIAL(3)
5494    
5495    
# Line 5395  PARTIAL MATCHING IN PCRE Line 5513  PARTIAL MATCHING IN PCRE
5513    
5514         If the application sees the user's keystrokes one by one, and can check         If the application sees the user's keystrokes one by one, and can check
5515         that what has been typed so far is potentially valid,  it  is  able  to         that what has been typed so far is potentially valid,  it  is  able  to
5516         raise  an  error as soon as a mistake is made, possibly beeping and not         raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
5517         reflecting the character that has been typed. This  immediate  feedback         reflecting the character that has been typed, for example. This immedi-
5518         is  likely  to  be a better user interface than a check that is delayed         ate  feedback is likely to be a better user interface than a check that
5519         until the entire string has been entered.         is delayed until the entire string has been entered.  Partial  matching
5520           can  also  sometimes be useful when the subject string is very long and
5521         PCRE supports the concept of partial matching by means of the PCRE_PAR-         is not all available at once.
5522         TIAL   option,   which   can   be   set  when  calling  pcre_exec()  or  
5523         pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code         PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
5524         PCRE_ERROR_NOMATCH  is converted into PCRE_ERROR_PARTIAL if at any time         PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
5525         during the matching process the last part of the subject string matched         pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
5526         part  of  the  pattern. Unfortunately, for non-anchored matching, it is         for PCRE_PARTIAL_SOFT. The essential difference between the two options
5527         not possible to obtain the position of the start of the partial  match.         is whether or not a partial match is preferred to an  alternative  com-
5528         No captured data is set when PCRE_ERROR_PARTIAL is returned.         plete  match,  though the details differ between the two matching func-
5529           tions. If both options are set, PCRE_PARTIAL_HARD takes precedence.
5530         When   PCRE_PARTIAL   is  set  for  pcre_dfa_exec(),  the  return  code  
5531         PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the  end  of         Setting a partial matching option disables one of PCRE's optimizations.
5532         the  subject is reached, there have been no complete matches, but there         PCRE  remembers the last literal byte in a pattern, and abandons match-
5533         is still at least one matching possibility. The portion of  the  string         ing immediately if such a byte is not present in  the  subject  string.
5534         that provided the partial match is set as the first matching string.         This  optimization cannot be used for a subject string that might match
5535           only partially.
5536         Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers  
5537         the last literal byte in a pattern, and abandons  matching  immediately  
5538         if  such a byte is not present in the subject string. This optimization  PARTIAL MATCHING USING pcre_exec()
5539         cannot be used for a subject string that might match only partially.  
5540           A partial match occurs during a call to pcre_exec() whenever the end of
5541           the  subject  string  is reached successfully, but matching cannot con-
5542  RESTRICTED PATTERNS FOR PCRE_PARTIAL         tinue because more characters are needed. However, at least one charac-
5543           ter  must have been matched. (In other words, a partial match can never
5544         Because of the way certain internal optimizations  are  implemented  in         be an empty string.)
5545         the  pcre_exec()  function, the PCRE_PARTIAL option cannot be used with  
5546         all patterns. These restrictions do not apply when  pcre_dfa_exec()  is         If PCRE_PARTIAL_SOFT is set,  the  partial  match  is  remembered,  but
5547         used.  For pcre_exec(), repeated single characters such as         matching continues as normal, and other alternatives in the pattern are
5548           tried.  If  no  complete  match  can  be  found,  pcre_exec()   returns
5549           a{2,4}         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
5550           two slots in the offsets vector, the first of them is set to the offset
5551         and repeated single metasequences such as         of the earliest character that was inspected when the partial match was
5552           found. For convenience, the second offset points  to  the  end  of  the
5553           \d+         string so that a substring can easily be extracted.
5554    
5555         are  not permitted if the maximum number of occurrences is greater than         For  the majority of patterns, the first offset identifies the start of
5556         one.  Optional items such as \d? (where the maximum is one) are permit-         the partially matched string. However, for patterns that contain  look-
5557         ted.   Quantifiers  with any values are permitted after parentheses, so         behind  assertions,  or  \K, or begin with \b or \B, earlier characters
5558         the invalid examples above can be coded thus:         have been inspected while carrying out the match. For example:
5559    
5560           (a){2,4}           /(?<=abc)123/
5561           (\d)+  
5562           This pattern matches "123", but only if it is preceded by "abc". If the
5563         These constructions run more slowly, but for the kinds  of  application         subject string is "xyzabc12", the offsets after a partial match are for
5564         that  are  envisaged  for this facility, this is not felt to be a major         the substring "abc12", because  all  these  characters  are  needed  if
5565         restriction.         another match is tried with extra characters added.
5566    
5567         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the         If  there  is more than one partial match, the first one that was found
5568         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL         provides the data that is returned. Consider this pattern:
5569         (-13).  You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo()  to  
5570         find out if a compiled pattern can be used for partial matching.           /123\w+X|dogY/
5571    
5572           If this is matched against the subject string "abc123dog", both  alter-
5573           natives  fail  to  match,  but the end of the subject is reached during
5574           matching,   so    PCRE_ERROR_PARTIAL    is    returned    instead    of
5575           PCRE_ERROR_NOMATCH.  The  offsets  are  set  to  3  and  9, identifying
5576           "123dog" as the first partial match that was found. (In  this  example,
5577           there  are  two  partial  matches,  because  "dog" on its own partially
5578           matches the second alternative.)
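
The offsets described above can be turned into a substring with a few lines of C. The following is only a sketch: copy_partial() is a hypothetical helper, not a PCRE function, and it assumes that the caller has already received PCRE_ERROR_PARTIAL and that the first two slots of the offsets vector were filled in.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): copy the partially matched text,
   subject[ovector[0]] up to but not including subject[ovector[1]], into out
   and terminate it with a binary zero. Returns the number of bytes copied,
   or -1 if the offsets are invalid or out is too small. */
int copy_partial(const char *subject, const int *ovector,
                 char *out, size_t outsize)
{
    int start = ovector[0];
    int end = ovector[1];
    size_t len;

    if (start < 0 || end < start) return -1;
    len = (size_t)(end - start);
    if (len + 1 > outsize) return -1;   /* no room for text plus the zero */
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

With the subject "abc123dog" and the offsets 3 and 9 from the example, the helper copies "123dog".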
5579    
5580           If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-
5581           TIAL  as soon as a partial match is found, without continuing to search
5582           for possible complete matches. The difference between the  two  options
5583           can be illustrated by a pattern such as:
5584    
5585             /dog(sbody)?/
5586    
5587           This  matches either "dog" or "dogsbody", greedily (that is, it prefers
5588           the longer string if possible). If it is  matched  against  the  string
5589           "dog"  with  PCRE_PARTIAL_SOFT,  it  yields a complete match for "dog".
5590           However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
5591           On  the  other hand, if the pattern is made ungreedy the result is dif-
5592           ferent:
5593    
5594             /dog(sbody)??/
5595    
5596           In this case the result is always a complete match because  pcre_exec()
5597           finds  that  first,  and  it  never continues after finding a match. It
5598           might be easier to follow this explanation by thinking of the two  pat-
5599           terns like this:
5600    
5601             /dog(sbody)?/    is the same as  /dogsbody|dog/
5602             /dog(sbody)??/   is the same as  /dog|dogsbody/
5603    
5604           The  second  pattern  will  never  match "dogsbody" when pcre_exec() is
5605           used, because it will always find the shorter match first.
5606    
5607    
5608    PARTIAL MATCHING USING pcre_dfa_exec()
5609    
5610           The pcre_dfa_exec() function moves along the subject  string  character
5611           by  character, without backtracking, searching for all possible matches
5612           simultaneously. If the end of the subject is reached before the end  of
5613           the  pattern,  there  is the possibility of a partial match, again pro-
5614           vided that at least one character has matched.
5615    
5616           When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned  only  if
5617           there  have  been  no complete matches. Otherwise, the complete matches
5618           are returned.  However, if PCRE_PARTIAL_HARD is set,  a  partial  match
5619           takes  precedence  over any complete matches. The portion of the string
5620           that was inspected when the longest partial match was found is  set  as
5621           the first matching string, provided there are at least two slots in the
5622           offsets vector.
5623    
5624           Because pcre_dfa_exec() always searches for all possible  matches,  and
5625           there  is no difference between greedy and ungreedy repetition, its be-
5626           haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-
5627           sider  the  string  "dog"  matched  against  the ungreedy pattern shown
5628           above:
5629    
5630             /dog(sbody)??/
5631    
5632           Whereas pcre_exec() stops as soon as it finds the  complete  match  for
5633           "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
5634           so returns that when PCRE_PARTIAL_HARD is set.
5635    
5636    
5637    PARTIAL MATCHING AND WORD BOUNDARIES
5638    
5639           If a pattern ends with one of the sequences \b or \B, which test for word
5640           boundaries,  partial  matching with PCRE_PARTIAL_SOFT can give counter-
5641           intuitive results. Consider this pattern:
5642    
5643             /\bcat\b/
5644    
5645           This matches "cat", provided there is a word boundary at either end. If
5646           the subject string is "the cat", the comparison of the final "t" with a
5647           following character cannot take place, so a  partial  match  is  found.
5648           However,  pcre_exec() carries on with normal matching, which matches \b
5649           at the end of the subject when the last character  is  a  letter,  thus
5650           finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
5651           TIAL. The same thing happens  with  pcre_dfa_exec(),  because  it  also
5652           finds the complete match.
5653    
5654           Using  PCRE_PARTIAL_HARD  in  this  case does yield PCRE_ERROR_PARTIAL,
5655           because then the partial match takes precedence.
5656    
5657    
5658    FORMERLY RESTRICTED PATTERNS
5659    
5660           For releases of PCRE prior to 8.00, because of the way certain internal
5661           optimizations   were  implemented  in  the  pcre_exec()  function,  the
5662           PCRE_PARTIAL option (predecessor of  PCRE_PARTIAL_SOFT)  could  not  be
5663           used  with all patterns. From release 8.00 onwards, the restrictions no
5664           longer apply, and partial matching with pcre_exec()  can  be  requested
5665           for any pattern.
5666    
5667           Items that were formerly restricted were repeated single characters and
5668           repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
5669           not  conform  to  the restrictions, pcre_exec() returned the error code
5670           PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
5671           PCRE_INFO_OKPARTIAL call to pcre_fullinfo(), which reports whether a compiled
5672           pattern can be used for partial matching, now always returns 1.
5673    
5674    
5675  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
5676    
5677         If  the  escape  sequence  \P  is  present in a pcretest data line, the         If the escape sequence \P is present  in  a  pcretest  data  line,  the
5678         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that         PCRE_PARTIAL_SOFT  option  is  used  for  the  match.  Here is a run of
5679         uses the date example quoted above:         pcretest that uses the date example quoted above:
5680    
5681             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5682           data> 25jun04\P           data> 25jun04\P
5683            0: 25jun04            0: 25jun04
5684            1: jun            1: jun
5685           data> 25dec3\P           data> 25dec3\P
5686           Partial match           Partial match: 25dec3
5687           data> 3ju\P           data> 3ju\P
5688           Partial match           Partial match: 3ju
5689           data> 3juj\P           data> 3juj\P
5690           No match           No match
5691           data> j\P           data> j\P
5692           No match           No match
5693    
5694         The  first  data  string  is  matched completely, so pcretest shows the         The first data string is matched  completely,  so  pcretest  shows  the
5695         matched substrings. The remaining four strings do not  match  the  com-         matched  substrings.  The  remaining four strings do not match the com-
5696         plete  pattern,  but  the first two are partial matches. The same test,         plete pattern, but the first two are partial matches. Similar output is
5697         using pcre_dfa_exec() matching (by means of the  \D  escape  sequence),         obtained when pcre_dfa_exec() is used.
        produces the following output:  
   
            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/  
          data> 25jun04\P\D  
           0: 25jun04  
          data> 23dec3\P\D  
          Partial match: 23dec3  
          data> 3ju\P\D  
          Partial match: 3ju  
          data> 3juj\P\D  
          No match  
          data> j\P\D  
          No match  
5698    
5699         Notice  that in this case the portion of the string that was matched is         If  the escape sequence \P is present more than once in a pcretest data
5700         made available.         line, the PCRE_PARTIAL_HARD option is set for the match.
5701    
5702    
5703  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
# Line 5498  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 5705  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
5705         When a partial match has been found using pcre_dfa_exec(), it is possi-         When a partial match has been found using pcre_dfa_exec(), it is possi-
5706         ble  to  continue  the  match  by providing additional subject data and         ble  to  continue  the  match  by providing additional subject data and
5707         calling pcre_dfa_exec() again with the same  compiled  regular  expres-         calling pcre_dfa_exec() again with the same  compiled  regular  expres-
5708         sion, this time setting the PCRE_DFA_RESTART option. You must also pass         sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
5709         the same working space as before, because this is where details of  the         same working space as before, because this is where details of the pre-
5710         previous  partial  match are stored. Here is an example using pcretest,         vious  partial  match  are  stored.  Here is an example using pcretest,
5711         using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and         using the \R escape sequence to set  the  PCRE_DFA_RESTART  option  (\D
5712         \D are as above):         specifies the use of pcre_dfa_exec()):
5713    
5714             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5715           data> 23ja\P\D           data> 23ja\P\D
# Line 5517  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 5724  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
5724         matched  string. It is up to the calling program to do that if it needs         matched  string. It is up to the calling program to do that if it needs
5725         to.         to.
5726    
5727         You can set PCRE_PARTIAL  with  PCRE_DFA_RESTART  to  continue  partial         You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD  options  with
5728         matching over multiple segments. This facility can be used to pass very         PCRE_DFA_RESTART  to  continue partial matching over multiple segments.
5729         long subject strings to pcre_dfa_exec(). However, some care  is  needed         This facility can  be  used  to  pass  very  long  subject  strings  to
5730         for certain types of pattern.         pcre_dfa_exec().
5731    
5732    
5733    MULTI-SEGMENT MATCHING WITH pcre_exec()
5734    
5735           From  release  8.00,  pcre_exec()  can also be used to do multi-segment
5736           matching. Unlike pcre_dfa_exec(), it is not  possible  to  restart  the
5737           previous  match  with  a new segment of data. Instead, new data must be
5738           added to the previous subject string,  and  the  entire  match  re-run,
5739           starting  from the point where the partial match occurred. Earlier data
5740           can be discarded.  Consider an unanchored pattern that matches dates:
5741    
5742               re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
5743             data> The date is 23ja\P
5744             Partial match: 23ja
5745    
5746           At this stage, an application could discard the text preceding "23ja",
5747           add  on  text from the next segment, and call pcre_exec() again. Unlike
5748           pcre_dfa_exec(), the entire matching string must always  be  available,
5749           and  the complete matching process occurs for each call, so more memory
5750           and more processing time are needed.
5751    
5752           Note: If the pattern contains lookbehind assertions, or \K,  or  starts
5753           with  \b  or  \B,  the string that is returned for a partial match will
5754           include characters that precede the partially  matched  string  itself,
5755           because  these  must  be  retained when adding on more characters for a
5756           subsequent matching attempt.
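
The buffer handling for multi-segment matching with pcre_exec() might be sketched as follows. This is an illustration under stated assumptions, not code from PCRE: carry_over() is a hypothetical helper, and keep_from is assumed to be the first offset that pcre_exec() returned with PCRE_ERROR_PARTIAL. Because that offset already allows for lookbehind assertions, \K, and a leading \b or \B, everything before it can be dropped.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE). buf holds len bytes of subject
   text and has room for cap bytes. The bytes before keep_from are
   discarded, the retained tail is moved to the front of buf, and the next
   segment is appended. Returns the new length, or (size_t)-1 if the
   arguments are invalid or the result would not fit in buf. */
size_t carry_over(char *buf, size_t len, size_t cap, size_t keep_from,
                  const char *next, size_t next_len)
{
    size_t kept;

    if (keep_from > len) return (size_t)-1;
    kept = len - keep_from;
    if (next_len > cap || kept > cap - next_len) return (size_t)-1;
    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, next, next_len);
    return kept + next_len;
}
```

For the subject "The date is 23ja" the partial match starts at offset 12, so only "23ja" is kept, and the next segment is appended to it before pcre_exec() is called again on the combined text.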
5757    
5758    
5759    ISSUES WITH MULTI-SEGMENT MATCHING
5760    
5761           Certain types of pattern may give problems with multi-segment matching,
5762           whichever matching function is used.
5763    
5764         1.  If  the  pattern contains tests for the beginning or end of a line,         1.  If  the  pattern contains tests for the beginning or end of a line,
5765         you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-         you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-
5766         ate,  when  the subject string for any call does not contain the begin-         ate,  when  the subject string for any call does not contain the begin-
5767         ning or end of a line.         ning or end of a line.
5768    
5769         2. If the pattern contains backward assertions (including  \b  or  \B),         2. Lookbehind assertions at the start of a pattern are catered  for  in
5770         you  need  to  arrange for some overlap in the subject strings to allow         the  offsets that are returned for a partial match. However, in theory,
5771         for this. For example, you could pass the subject in  chunks  that  are         a lookbehind assertion later in the pattern could require even  earlier
5772         500  bytes long, but in a buffer of 700 bytes, with the starting offset         characters  to  be inspected, and it might not have been reached when a
5773         set to 200 and the previous 200 bytes at the start of the buffer.         partial match occurs. This is probably an extremely unlikely case;  you
5774           could  guard  against  it to a certain extent by always including extra
5775           characters at the start.
5776    
5777         3. Matching a subject string that is split into multiple segments  does         3. Matching a subject string that is split into multiple  segments  may
5778         not  always produce exactly the same result as matching over one single         not  always produce exactly the same result as matching over one single
5779         long string.  The difference arises when there  are  multiple  matching         long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
5780         possibilities,  because a partial match result is given only when there         "Partial  Matching  and  Word Boundaries" above describes an issue that
5781         are no completed matches in a call to pcre_dfa_exec(). This means  that         arises if the pattern ends with \b or \B. Another  kind  of  difference
5782         as  soon  as  the  shortest match has been found, continuation to a new         may  occur  when  there  are multiple matching possibilities, because a
5783         subject segment is no longer possible.  Consider this pcretest example:         partial match result is given only when there are no completed matches.
5784           This means that as soon as the shortest match has been found, continua-
5785           tion to a new subject segment is no longer  possible.   Consider  again
5786           this pcretest example:
5787    
5788             re> /dog(sbody)?/             re> /dog(sbody)?/
5789             data> dogsb\P
5790              0: dog
5791           data> do\P\D           data> do\P\D
5792           Partial match: do           Partial match: do
5793           data> gsb\R\P\D           data> gsb\R\P\D
# Line 5550  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 5796  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
5796            0: dogsbody            0: dogsbody
5797            1: dog            1: dog
5798    
5799         The pattern matches the words "dog" or "dogsbody". When the subject  is         The  first  data line passes the string "dogsb" to pcre_exec(), setting
5800         presented  in  several  parts  ("do" and "gsb" being the first two) the         the PCRE_PARTIAL_SOFT option. Although the string is  a  partial  match
5801         match stops when "dog" has been found, and it is not possible  to  con-         for  "dogsbody",  the  result  is  not  PCRE_ERROR_PARTIAL, because the
5802         tinue.  On  the  other  hand,  if  "dogsbody"  is presented as a single         shorter string "dog" is a complete match. Similarly, when  the  subject
5803         string, both matches are found.         is  presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
5804           the first two) the match stops when "dog" has been found, and it is not
5805           possible  to continue. On the other hand, if "dogsbody" is presented as
5806           a single string, pcre_dfa_exec() finds both matches.
5807    
5808           Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
5809           when  matching  multi-segment data. The example above then behaves dif-
5810           ferently:
5811    
5812               re> /dog(sbody)?/
5813             data> dogsb\P\P
5814             Partial match: dogsb
5815             data> do\P\D
5816             Partial match: do
5817             data> gsb\R\P\P\D
5818             Partial match: gsb
5819    
        Because of this phenomenon, it does not usually make  sense  to  end  a  
        pattern that is going to be matched in this way with a variable repeat.  
5820    
5821         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
5822         start with the same pattern item may not work as expected. For example,         start  with  the  same  pattern  item  may  not  work  as expected when
5823         consider this pattern:         pcre_dfa_exec() is used. For example, consider this pattern:
5824    
5825           1234|3789           1234|3789
5826    
5827         If  the  first  part of the subject is "ABC123", a partial match of the         If the first part of the subject is "ABC123", a partial  match  of  the
5828         first alternative is found at offset 3. There is no partial  match  for         first  alternative  is found at offset 3. There is no partial match for
5829         the second alternative, because such a match does not start at the same         the second alternative, because such a match does not start at the same
5830         point in the subject string. Attempting to  continue  with  the  string         point  in  the  subject  string. Attempting to continue with the string
5831         "789" does not yield a match because only those alternatives that match         "7890" does not yield a match  because  only  those  alternatives  that
5832         at one point in the subject are remembered. The problem arises  because         match  at  one  point in the subject are remembered. The problem arises
5833         the  start  of the second alternative matches within the first alterna-         because the start of the second alternative matches  within  the  first
5834         tive. There is no problem with anchored patterns or patterns such as:         alternative.  There  is  no  problem with anchored patterns or patterns
5835           such as:
5836    
5837           1234|ABCD           1234|ABCD
5838    
5839         where no string can be a partial match for both alternatives.         where no string can be a partial match for both alternatives.  This  is
5840           not  a  problem if pcre_exec() is used, because the entire match has to
5841           be rerun each time:
5842    
5843               re> /1234|3789/
5844             data> ABC123\P
5845             Partial match: 123
5846             data> 1237890
5847              0: 3789
5848    
5849    
5850  AUTHOR  AUTHOR
# Line 5588  AUTHOR Line 5856  AUTHOR
5856    
5857  REVISION  REVISION
5858    
5859         Last updated: 04 June 2007         Last updated: 05 September 2009
5860         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5861  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5862    
5863    
5864  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
5865    
5866    
# Line 5715  REVISION Line 5983  REVISION
5983         Last updated: 13 June 2007         Last updated: 13 June 2007
5984         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
5985  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5986    
5987    
5988  PCREPERFORM(3)                                                  PCREPERFORM(3)  PCREPERFORM(3)                                                  PCREPERFORM(3)
5989    
5990    
# Line 5865  REVISION Line 6133  REVISION
6133         Last updated: 06 March 2007         Last updated: 06 March 2007
6134         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
6135  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6136    
6137    
6138  PCREPOSIX(3)                                                      PCREPOSIX(3)  PCREPOSIX(3)                                                      PCREPOSIX(3)
6139    
6140    
# Line 5910  DESCRIPTION Line 6178  DESCRIPTION
6178         easier to slot in PCRE as a replacement library.  Other  POSIX  options         easier to slot in PCRE as a replacement library.  Other  POSIX  options
6179         are not even defined.         are not even defined.
6180    
6181           There  are also some other options that are not defined by POSIX. These
6182           have been added at the request of users who want to make use of certain
6183           PCRE-specific features via the POSIX calling interface.
6184    
6185         When  PCRE  is  called  via these functions, it is only the API that is         When  PCRE  is  called  via these functions, it is only the API that is
6186         POSIX-like in style. The syntax and semantics of  the  regular  expres-         POSIX-like in style. The syntax and semantics of  the  regular  expres-
6187         sions  themselves  are  still  those of Perl, subject to the setting of         sions  themselves  are  still  those of Perl, subject to the setting of
# Line 5964  COMPILING A PATTERN Line 6236  COMPILING A PATTERN
6236         ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured         ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
6237         strings are returned.         strings are returned.
6238    
6239             REG_UNGREEDY
6240    
6241           The PCRE_UNGREEDY option is set when the regular expression  is  passed
6242           for  compilation  to the native function. Note that REG_UNGREEDY is not
6243           part of the POSIX standard.
6244    
6245           REG_UTF8           REG_UTF8
6246    
6247         The PCRE_UTF8 option is set when the regular expression is  passed  for         The PCRE_UTF8 option is set when the regular expression is  passed  for
# Line 5976  COMPILING A PATTERN Line 6254  COMPILING A PATTERN
6254         semantics.  In particular, the way it handles newline characters in the         semantics.  In particular, the way it handles newline characters in the
6255         subject string is the Perl way, not the POSIX way.  Note  that  setting         subject string is the Perl way, not the POSIX way.  Note  that  setting
6256         PCRE_MULTILINE  has only some of the effects specified for REG_NEWLINE.         PCRE_MULTILINE  has only some of the effects specified for REG_NEWLINE.
6257         It does not affect the way newlines are matched by . (they  aren't)  or         It does not affect the way newlines are matched by . (they are not)  or
6258         by a negative class such as [^a] (they are).         by a negative class such as [^a] (they are).
6259    
6260         The  yield of regcomp() is zero on success, and non-zero otherwise. The         The  yield of regcomp() is zero on success, and non-zero otherwise. The
# Line 5984  COMPILING A PATTERN Line 6262  COMPILING A PATTERN
6262         is  public: re_nsub contains the number of capturing subpatterns in the         is  public: re_nsub contains the number of capturing subpatterns in the
6263         regular expression. Various error codes are defined in the header file.         regular expression. Various error codes are defined in the header file.
6264    
6265           NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
6266           use the contents of the preg structure. If, for example, you pass it to
6267           regexec(), the result is undefined and your program is likely to crash.
6268    
6269    
6270  MATCHING NEWLINE CHARACTERS  MATCHING NEWLINE CHARACTERS
6271    
# Line 6059  MATCHING A PATTERN Line 6341  MATCHING A PATTERN
6341         matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of         matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
6342         regexec() are ignored.         regexec() are ignored.
6343    
6344           If the value of nmatch is zero, or if the value of pmatch is NULL, no data
6345           about any matched strings is returned.
6346    
6347         Otherwise, the portion of the string that was matched, and also any         Otherwise, the portion of the string that was matched, and also any
6348         captured substrings, are returned via the pmatch argument, which points to         captured substrings, are returned via the pmatch argument, which points to
6349         an  array  of nmatch structures of type regmatch_t, containing the mem-         an array of nmatch structures of type regmatch_t, containing  the  mem-
6350         bers rm_so and rm_eo. These contain the offset to the  first  character         bers  rm_so  and rm_eo. These contain the offset to the first character
6351         of  each  substring and the offset to the first character after the end         of each substring and the offset to the first character after  the  end
6352         of each substring, respectively. The 0th element of the vector  relates         of  each substring, respectively. The 0th element of the vector relates
6353         to  the  entire portion of string that was matched; subsequent elements         to the entire portion of string that was matched;  subsequent  elements
6354         relate to the capturing subpatterns of the regular  expression.  Unused         relate  to  the capturing subpatterns of the regular expression. Unused
6355         entries in the array have both structure members set to -1.         entries in the array have both structure members set to -1.
6356    
6357         A  successful  match  yields  a  zero  return;  various error codes are         A successful match yields  a  zero  return;  various  error  codes  are
6358         defined in the header file, of  which  REG_NOMATCH  is  the  "expected"         defined  in  the  header  file,  of which REG_NOMATCH is the "expected"
6359         failure code.         failure code.
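
The use of rm_so and rm_eo can be sketched as follows. As an assumption made only to keep the example self-contained, it includes the system <regex.h> header in place of pcreposix.h, so the patterns are POSIX extended expressions and not Perl ones. The calling sequence is the same. first_capture() is a hypothetical helper.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Matches pattern against subject and copies the text
   of substring number `group` (0 is the whole match) into out. Returns 0 on
   success, 1 if there is no match or the group is unset, and -1 for a
   compile error, any other matching error, or an output buffer that is too
   small. */
int first_capture(const char *pattern, const char *subject, size_t group,
                  char *out, size_t outsize)
{
    regex_t preg;
    regmatch_t pmatch[10];
    int rc;
    size_t len;

    if (group >= 10) return -1;
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0) return -1;
    rc = regexec(&preg, subject, 10, pmatch, 0);
    regfree(&preg);                  /* pmatch remains valid after this */
    if (rc == REG_NOMATCH) return 1; /* the "expected" failure code */
    if (rc != 0) return -1;
    if (pmatch[group].rm_so == -1) return 1;   /* unused entry */
    len = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
    if (len + 1 > outsize) return -1;
    memcpy(out, subject + pmatch[group].rm_so, len);
    out[len] = '\0';
    return 0;
}
```

Because rm_eo is the offset of the first character after the substring, the length is simply rm_eo minus rm_so.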
6360    
6361    
6362  ERROR MESSAGES  ERROR MESSAGES
6363    
6364         The regerror() function maps a non-zero errorcode from either regcomp()         The regerror() function maps a non-zero errorcode from either regcomp()
6365         or regexec() to a printable message. If preg is  not  NULL,  the  error         or  regexec()  to  a  printable message. If preg is not NULL, the error
6366         should have arisen from the use of that structure. A message terminated         should have arisen from the use of that structure. A message terminated
6367         by a binary zero is placed  in  errbuf.  The  length  of  the  message,         by  a  binary  zero  is  placed  in  errbuf. The length of the message,
6368         including  the  zero, is limited to errbuf_size. The yield of the func-         including the zero, is limited to errbuf_size. The yield of  the  func-
6369         tion is the size of buffer needed to hold the whole message.         tion is the size of buffer needed to hold the whole message.
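
A minimal sketch of the regerror() behaviour described above, assuming the system <regex.h> (with PCRE's wrapper, include pcreposix.h instead; the message text will differ between implementations). The helper name compile_error_text() is hypothetical. It compiles a pattern purely to obtain an error code, then asks regerror() for the message.

```c
#include <assert.h>
#include <regex.h>   /* with PCRE's wrapper, include <pcreposix.h> instead */
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: try to compile `pattern`. On
   failure, place the zero-terminated message in errbuf (truncated to
   errbuf_size bytes including the zero) and return the buffer size that
   the whole message needs. Returns 0 if the pattern compiled cleanly. */
size_t compile_error_text(const char *pattern, char *errbuf,
                          size_t errbuf_size)
{
    regex_t preg;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&preg);          /* compiled successfully: release it */
        if (errbuf_size > 0)
            errbuf[0] = '\0';
        return 0;
    }
    /* The error arose from the use of preg, so pass it along. */
    return regerror(rc, &preg, errbuf, errbuf_size);
}
```

Because the yield is the size needed for the whole message, a caller can compare it with errbuf_size to detect truncation and retry with a larger buffer.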
6370    
6371    
6372  MEMORY USAGE  MEMORY USAGE
6373    
6374         Compiling a regular expression causes memory to be allocated and  asso-         Compiling  a regular expression causes memory to be allocated and asso-
6375         ciated  with  the preg structure. The function regfree() frees all such         ciated with the preg structure. The function regfree() frees  all  such
6376         memory, after which preg may no longer be used as  a  compiled  expres-         memory,  after  which  preg may no longer be used as a compiled expres-
6377         sion.         sion.
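
The lifetime rule above (compile once, match as often as needed, free once) might look like this in practice. This is a sketch under the same assumptions as any portable POSIX code: it includes the system <regex.h> (pcreposix.h when using PCRE's wrapper), and count_matching() is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>   /* with PCRE's wrapper, include <pcreposix.h> instead */
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: count how many of the n subject
   strings match `pattern`. The pattern is compiled once, reused for every
   subject, and freed exactly once. Returns 0 if compilation fails. */
size_t count_matching(const char *pattern, const char *const *subjects,
                      size_t n)
{
    regex_t preg;
    size_t i, hits = 0;

    assert(pattern != NULL && (subjects != NULL || n == 0));
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return 0;                /* nothing was allocated that we own */

    for (i = 0; i < n; i++) {
        /* nmatch of 0: only the match/no-match result is wanted. */
        if (regexec(&preg, subjects[i], 0, NULL, 0) == 0)
            hits++;
    }

    regfree(&preg);              /* preg must not be used after this */
    return hits;
}
```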
6378    
6379    
# Line 6101  AUTHOR Line 6386  AUTHOR
6386    
6387  REVISION  REVISION
6388    
6389         Last updated: 11 March 2009         Last updated: 02 September 2009
6390         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
6391  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6392    
6393    
6394  PCRECPP(3)                                                          PCRECPP(3)  PCRECPP(3)                                                          PCRECPP(3)
6395    
6396    
# Line 6445  REVISION Line 6730  REVISION
6730    
6731         Last updated: 17 March 2009         Last updated: 17 March 2009
6732  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6733    
6734    
6735  PCRESAMPLE(3)                                                    PCRESAMPLE(3)  PCRESAMPLE(3)                                                    PCRESAMPLE(3)
6736    
6737    
# Line 6457  NAME Line 6742  NAME
6742  PCRE SAMPLE PROGRAM  PCRE SAMPLE PROGRAM
6743    
6744         A simple, complete demonstration program, to get you started with using         A simple, complete demonstration program, to get you started with using
6745         PCRE, is supplied in the file pcredemo.c in the PCRE distribution.         PCRE, is supplied in the file pcredemo.c in the  PCRE  distribution.  A
6746           listing  of this program is given in the pcredemo documentation. If you
6747           do not have a copy of the PCRE distribution, you can save this  listing
6748           to re-create pcredemo.c.
6749    
6750         The program compiles the regular expression that is its first argument,         The program compiles the regular expression that is its first argument,
6751         and  matches  it  against the subject string in its second argument. No         and matches it against the subject string in its  second  argument.  No
6752         PCRE options are set, and default character tables are used. If  match-         PCRE  options are set, and default character tables are used. If match-
6753         ing  succeeds,  the  program  outputs  the  portion of the subject that         ing succeeds, the program outputs  the  portion  of  the  subject  that
6754         matched, together with the contents of any captured substrings.         matched, together with the contents of any captured substrings.
6755    
6756         If the -g option is given on the command line, the program then goes on         If the -g option is given on the command line, the program then goes on
6757         to check for further matches of the same regular expression in the same         to check for further matches of the same regular expression in the same
6758         subject string. The logic is a little bit tricky because of the  possi-         subject  string. The logic is a little bit tricky because of the possi-
6759         bility  of  matching an empty string. Comments in the code explain what         bility of matching an empty string. Comments in the code  explain  what
6760         is going on.         is going on.
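
To show why empty matches make repeated matching tricky, here is a deliberately simplified sketch. It is not the logic used in pcredemo.c (see the comments in that file for the real thing); it uses the POSIX functions from the system <regex.h> rather than the native PCRE API, steps by single bytes, and count_all_matches() is a hypothetical name. The point is only that after an empty match the search position must be moved forward by hand, otherwise the loop would find the same empty match forever.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: count successive non-overlapping
   matches of `pattern` in `subject`. Returns 0 if compilation fails. */
size_t count_all_matches(const char *pattern, const char *subject)
{
    regex_t preg;
    regmatch_t m;
    const char *p = subject;
    size_t count = 0;
    int eflags = 0;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return 0;

    while (regexec(&preg, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: stop at the end of the subject, otherwise
               step over one byte so the next search makes progress. */
            if (p[m.rm_eo] == '\0')
                break;
            p += m.rm_eo + 1;
        } else {
            p += m.rm_eo;        /* resume just after this match */
        }
        eflags = REG_NOTBOL;     /* p no longer points at a line start */
    }

    regfree(&preg);
    return count;
}
```

With the pattern "cat|dog" and the subject "the dog sat on the cat" this should count two matches, mirroring the second test command shown for pcredemo.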
6761    
6762         If PCRE is installed in the standard include  and  library  directories         If  PCRE  is  installed in the standard include and library directories
6763         for  your  system, you should be able to compile the demonstration pro-         for your system, you should be able to compile the  demonstration  pro-
6764         gram using this command:         gram using this command:
6765    
6766           gcc -o pcredemo pcredemo.c -lpcre           gcc -o pcredemo pcredemo.c -lpcre
6767    
6768         If PCRE is installed elsewhere, you may need to add additional  options         If  PCRE is installed elsewhere, you may need to add additional options
6769         to  the  command line. For example, on a Unix-like system that has PCRE         to the command line. For example, on a Unix-like system that  has  PCRE
6770         installed in /usr/local, you  can  compile  the  demonstration  program         installed  in  /usr/local,  you  can  compile the demonstration program
6771         using a command like this:         using a command like this:
6772    
6773           gcc -o pcredemo -I/usr/local/include pcredemo.c \           gcc -o pcredemo -I/usr/local/include pcredemo.c \
6774               -L/usr/local/lib -lpcre               -L/usr/local/lib -lpcre
6775    
6776         Once  you  have  compiled the demonstration program, you can run simple         Once you have compiled the demonstration program, you  can  run  simple
6777         tests like this:         tests like this:
6778    
6779           ./pcredemo 'cat|dog' 'the cat sat on the mat'           ./pcredemo 'cat|dog' 'the cat sat on the mat'
6780           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
6781    
6782         Note that there is a  much  more  comprehensive  test  program,  called         Note  that  there  is  a  much  more comprehensive test program, called
6783         pcretest,  which  supports  many  more  facilities  for testing regular         pcretest, which supports  many  more  facilities  for  testing  regular
6784         expressions and the PCRE library. The pcredemo program is provided as a         expressions and the PCRE library. The pcredemo program is provided as a
6785         simple coding example.         simple coding example.
6786    
6787         On some operating systems (e.g. Solaris), when PCRE is not installed in         When you try to run pcredemo when PCRE is not installed in the standard
6788         the standard library directory, you may get an error like this when you         library  directory,  you  may  get an error like this on some operating
6789         try to run pcredemo:         systems (e.g. Solaris):
6790    
6791           ld.so.1:  a.out:  fatal:  libpcre.so.0:  open failed: No such file or           ld.so.1: a.out: fatal: libpcre.so.0: open failed:  No  such  file  or
6792         directory         directory
6793    
6794         This is caused by the way shared library support works  on  those  sys-         This  is  caused  by the way shared library support works on those sys-
6795         tems. You need to add         tems. You need to add
6796    
6797           -R/usr/local/lib           -R/usr/local/lib
# Line 6520  AUTHOR Line 6808  AUTHOR
6808    
6809  REVISION  REVISION
6810    
6811         Last updated: 23 January 2008         Last updated: 01 September 2009
6812         Copyright (c) 1997-2008 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
6813  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6814  PCRESTACK(3)                                                      PCRESTACK(3)  PCRESTACK(3)                                                      PCRESTACK(3)
6815    
# Line 6659  REVISION Line 6947  REVISION
6947         Last updated: 09 July 2008         Last updated: 09 July 2008
6948         Copyright (c) 1997-2008 University of Cambridge.         Copyright (c) 1997-2008 University of Cambridge.
6949  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6950    
6951    

