Diff of /code/trunk/doc/pcre.txt

revision 429 by ph10, Tue Sep 1 16:10:16 2009 UTC revision 453 by ph10, Fri Sep 18 19:12:35 2009 UTC
# Line 282  PCRE BUILD-TIME OPTIONS Line 282  PCRE BUILD-TIME OPTIONS
282         script,  where the optional features are selected or deselected by pro-         script,  where the optional features are selected or deselected by pro-
283         viding options to configure before running the make  command.  However,         viding options to configure before running the make  command.  However,
284         the  same  options  can be selected in both Unix-like and non-Unix-like         the  same  options  can be selected in both Unix-like and non-Unix-like
285         environments using the GUI facility of  CMakeSetup  if  you  are  using         environments using the GUI facility of cmake-gui if you are using CMake
286         CMake instead of configure to build PCRE.         instead of configure to build PCRE.
287    
288           There  is  a  lot more information about building PCRE in non-Unix-like
289           environments in the file called NON_UNIX_USE, which is part of the PCRE
290           distribution.  You  should consult this file as well as the README file
291           if you are building in a non-Unix-like environment.
292    
293         The complete list of options for configure (which includes the standard         The complete list of options for configure (which includes the standard
294         ones such as the  selection  of  the  installation  directory)  can  be         ones  such  as  the  selection  of  the  installation directory) can be
295         obtained by running         obtained by running
296    
297           ./configure --help           ./configure --help
298    
299         The  following  sections  include  descriptions  of options whose names         The following sections include  descriptions  of  options  whose  names
300         begin with --enable or --disable. These settings specify changes to the         begin with --enable or --disable. These settings specify changes to the
301         defaults  for  the configure command. Because of the way that configure         defaults for the configure command. Because of the way  that  configure
302         works, --enable and --disable always come in pairs, so  the  complemen-         works,  --enable  and --disable always come in pairs, so the complemen-
303         tary  option always exists as well, but as it specifies the default, it         tary option always exists as well, but as it specifies the default,  it
304         is not described.         is not described.
305    
306    
# Line 316  UTF-8 SUPPORT Line 321  UTF-8 SUPPORT
321    
322           --enable-utf8           --enable-utf8
323    
324         to the configure command. Of itself, this  does  not  make  PCRE  treat         to  the  configure  command.  Of  itself, this does not make PCRE treat
325         strings  as UTF-8. As well as compiling PCRE with this option, you also         strings as UTF-8. As well as compiling PCRE with this option, you  also
326         have to set the PCRE_UTF8 option when you call the  pcre_compile()         have to set the PCRE_UTF8 option when you call the pcre_compile()
327         function.         function.
328    
329         If  you set --enable-utf8 when compiling in an EBCDIC environment, PCRE         If you set --enable-utf8 when compiling in an EBCDIC environment,  PCRE
330         expects its input to be either ASCII or UTF-8 (depending on the runtime         expects its input to be either ASCII or UTF-8 (depending on the runtime
331         option).  It  is not possible to support both EBCDIC and UTF-8 codes in         option). It is not possible to support both EBCDIC and UTF-8  codes  in
332         the same  version  of  the  library.  Consequently,  --enable-utf8  and         the  same  version  of  the  library.  Consequently,  --enable-utf8 and
333         --enable-ebcdic are mutually exclusive.         --enable-ebcdic are mutually exclusive.
334    
335    
336  UNICODE CHARACTER PROPERTY SUPPORT  UNICODE CHARACTER PROPERTY SUPPORT
337    
338         UTF-8  support allows PCRE to process character values greater than 255         UTF-8 support allows PCRE to process character values greater than  255
339         in the strings that it handles. On its own, however, it does  not  pro-         in  the  strings that it handles. On its own, however, it does not pro-
340         vide any facilities for accessing the properties of such characters. If         vide any facilities for accessing the properties of such characters. If
341         you want to be able to use the pattern escapes \P, \p,  and  \X,  which         you  want  to  be able to use the pattern escapes \P, \p, and \X, which
342         refer to Unicode character properties, you must add         refer to Unicode character properties, you must add
343    
344           --enable-unicode-properties           --enable-unicode-properties
345    
346         to  the configure command. This implies UTF-8 support, even if you have         to the configure command. This implies UTF-8 support, even if you  have
347         not explicitly requested it.         not explicitly requested it.
348    
349         Including Unicode property support adds around 30K  of  tables  to  the         Including  Unicode  property  support  adds around 30K of tables to the
350         PCRE  library.  Only  the general category properties such as Lu and Nd         PCRE library. Only the general category properties such as  Lu  and  Nd
351         are supported. Details are given in the pcrepattern documentation.         are supported. Details are given in the pcrepattern documentation.
352    
353    
354  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
355    
356         By default, PCRE interprets the linefeed (LF) character  as  indicating         By  default,  PCRE interprets the linefeed (LF) character as indicating
357         the  end  of  a line. This is the normal newline character on Unix-like         the end of a line. This is the normal newline  character  on  Unix-like
358         systems. You can compile PCRE to use carriage return (CR)  instead,  by         systems.  You  can compile PCRE to use carriage return (CR) instead, by
359         adding         adding
360    
361           --enable-newline-is-cr           --enable-newline-is-cr
362    
363         to  the  configure  command.  There  is  also  a --enable-newline-is-lf         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
364         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
365    
366         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 367  CODE VALUE OF NEWLINE Line 372  CODE VALUE OF NEWLINE
372    
373           --enable-newline-is-anycrlf           --enable-newline-is-anycrlf
374    
375         which  causes  PCRE  to recognize any of the three sequences CR, LF, or         which causes PCRE to recognize any of the three sequences  CR,  LF,  or
376         CRLF as indicating a line ending. Finally, a fifth option, specified by         CRLF as indicating a line ending. Finally, a fifth option, specified by
377    
378           --enable-newline-is-any           --enable-newline-is-any
379    
380         causes PCRE to recognize any Unicode newline sequence.         causes PCRE to recognize any Unicode newline sequence.
381    
382         Whatever line ending convention is selected when PCRE is built  can  be         Whatever  line  ending convention is selected when PCRE is built can be
383         overridden  when  the library functions are called. At build time it is         overridden when the library functions are called. At build time  it  is
384         conventional to use the standard for your operating system.         conventional to use the standard for your operating system.
385    
386    
387  WHAT \R MATCHES  WHAT \R MATCHES
388    
389         By default, the sequence \R in a pattern matches  any  Unicode  newline         By  default,  the  sequence \R in a pattern matches any Unicode newline
390         sequence,  whatever  has  been selected as the line ending sequence. If         sequence, whatever has been selected as the line  ending  sequence.  If
391         you specify         you specify
392    
393           --enable-bsr-anycrlf           --enable-bsr-anycrlf
394    
395         the default is changed so that \R matches only CR, LF, or  CRLF.  What-         the  default  is changed so that \R matches only CR, LF, or CRLF. What-
396         ever  is selected when PCRE is built can be overridden when the library         ever is selected when PCRE is built can be overridden when the  library
397         functions are called.         functions are called.
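To make the two \R settings concrete, here is a minimal sketch in C of a recognizer for the newline sequence at the start of a string. It is not PCRE code: the names bsr_mode, BSR_ANYCRLF_ONLY, BSR_ANY_UNICODE, and bsr_length are invented for illustration, and it assumes a NUL-terminated UTF-8 subject and the usual Unicode newline set (CR, LF, CRLF, VT, FF, NEL, LS, PS).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode names for this sketch; they are not PCRE identifiers. */
enum bsr_mode { BSR_ANYCRLF_ONLY, BSR_ANY_UNICODE };

/* Return the length in bytes of the newline sequence at the start of the
   NUL-terminated UTF-8 string s, or 0 if s does not start with one. */
size_t bsr_length(const unsigned char *s, enum bsr_mode mode)
{
    assert(s != NULL);

    /* CR, LF, and CRLF are recognized in both modes. */
    if (s[0] == '\r') return (s[1] == '\n') ? 2 : 1;
    if (s[0] == '\n') return 1;
    if (mode == BSR_ANYCRLF_ONLY) return 0;

    /* The remaining sequences count only in the "any Unicode" mode. */
    if (s[0] == 0x0B || s[0] == 0x0C) return 1;            /* VT, FF */
    if (s[0] == 0xC2 && s[1] == 0x85) return 2;            /* NEL, U+0085 */
    if (s[0] == 0xE2 && s[1] == 0x80 &&
        (s[2] == 0xA8 || s[2] == 0xA9)) return 3;          /* LS, PS */
    return 0;
}
```

The same distinction applies to the line ending convention described in the previous section; only the default differs.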
398    
399    
400  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
401    
402         The PCRE building process uses libtool to build both shared and  static         The  PCRE building process uses libtool to build both shared and static
403         Unix  libraries by default. You can suppress one of these by adding one         Unix libraries by default. You can suppress one of these by adding  one
404         of         of
405    
406           --disable-shared           --disable-shared
# Line 407  BUILDING SHARED AND STATIC LIBRARIES Line 412  BUILDING SHARED AND STATIC LIBRARIES
412  POSIX MALLOC USAGE  POSIX MALLOC USAGE
413    
414         When PCRE is called through the POSIX interface (see the pcreposix doc-         When PCRE is called through the POSIX interface (see the pcreposix doc-
415         umentation),  additional  working  storage  is required for holding the         umentation), additional working storage is  required  for  holding  the
416         pointers to capturing substrings, because PCRE requires three  integers         pointers  to capturing substrings, because PCRE requires three integers
417         per  substring,  whereas  the POSIX interface provides only two. If the         per substring, whereas the POSIX interface provides only  two.  If  the
418         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
419         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
420         The default threshold above which the stack is no longer used is 10; it         The default threshold above which the stack is no longer used is 10; it
# Line 422  POSIX MALLOC USAGE Line 427  POSIX MALLOC USAGE
427    
428  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
429    
430         Within  a  compiled  pattern,  offset values are used to point from one         Within a compiled pattern, offset values are used  to  point  from  one
431         part to another (for example, from an opening parenthesis to an  alter-         part  to another (for example, from an opening parenthesis to an alter-
432         nation  metacharacter).  By default, two-byte values are used for these         nation metacharacter). By default, two-byte values are used  for  these
433         offsets, leading to a maximum size for a  compiled  pattern  of  around         offsets,  leading  to  a  maximum size for a compiled pattern of around
434         64K.  This  is sufficient to handle all but the most gigantic patterns.         64K. This is sufficient to handle all but the most  gigantic  patterns.
435         Nevertheless, some people do want to process enormous patterns,  so  it         Nevertheless,  some  people do want to process enormous patterns, so it
436         is  possible  to compile PCRE to use three-byte or four-byte offsets by         is possible to compile PCRE to use three-byte or four-byte  offsets  by
437         adding a setting such as         adding a setting such as
438    
439           --with-link-size=3           --with-link-size=3
440    
441         to the configure command. The value given must be 2,  3,  or  4.  Using         to  the  configure  command.  The value given must be 2, 3, or 4. Using
442         longer  offsets slows down the operation of PCRE because it has to load         longer offsets slows down the operation of PCRE because it has to  load
443         additional bytes when handling them.         additional bytes when handling them.
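The relationship between link size and maximum pattern size is simple arithmetic, sketched below in C. The helper names max_link_offset, put_link, and get_link are hypothetical, and the most-significant-byte-first layout is an assumption made for the example rather than a statement about PCRE's internal format.

```c
#include <assert.h>
#include <stddef.h>

/* Largest offset that fits in link_size bytes (2, 3, or 4). */
unsigned long max_link_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (link_size == 4) ? 0xFFFFFFFFul
                            : (1ul << (8 * link_size)) - 1ul;
}

/* Store an offset in link_size bytes, most significant byte first. */
void put_link(unsigned char *code, int link_size, unsigned long offset)
{
    assert(code != NULL && offset <= max_link_offset(link_size));
    for (int i = link_size - 1; i >= 0; i--) {
        code[i] = (unsigned char)(offset & 0xFFu);
        offset >>= 8;
    }
}

/* Load an offset back; each extra byte costs one more load and shift. */
unsigned long get_link(const unsigned char *code, int link_size)
{
    unsigned long offset = 0;
    assert(code != NULL);
    for (int i = 0; i < link_size; i++)
        offset = (offset << 8) | code[i];
    return offset;
}
```

With a link size of 2 the largest offset is 65535, which is where the limit of around 64K comes from; 3 gives 16777215 and 4 gives 4294967295.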
444    
445    
446  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
447    
448         When matching with the pcre_exec() function, PCRE implements backtrack-         When matching with the pcre_exec() function, PCRE implements backtrack-
449         ing  by  making recursive calls to an internal function called match().         ing by making recursive calls to an internal function  called  match().
450         In environments where the size of the stack is limited,  this  can  se-         In  environments  where  the size of the stack is limited, this can se-
451         verely  limit  PCRE's operation. (The Unix environment does not usually         verely limit PCRE's operation. (The Unix environment does  not  usually
452         suffer from this problem, but it may sometimes be necessary to increase         suffer from this problem, but it may sometimes be necessary to increase
453         the  maximum  stack size.  There is a discussion in the pcrestack docu-         the maximum stack size.  There is a discussion in the  pcrestack  docu-
454         mentation.) An alternative approach to recursion that uses memory  from         mentation.)  An alternative approach to recursion that uses memory from
455         the  heap  to remember data, instead of using recursive function calls,         the heap to remember data, instead of using recursive  function  calls,
456         has been implemented to work round the problem of limited  stack  size.         has  been  implemented to work round the problem of limited stack size.
457         If you want to build a version of PCRE that works this way, add         If you want to build a version of PCRE that works this way, add
458    
459           --disable-stack-for-recursion           --disable-stack-for-recursion
460    
461         to  the  configure  command. With this configuration, PCRE will use the         to the configure command. With this configuration, PCRE  will  use  the
462         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-         pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
463         ment  functions. By default these point to malloc() and free(), but you         ment functions. By default these point to malloc() and free(), but  you
464         can replace the pointers so that your own functions are used.         can replace the pointers so that your own functions are used.
465    
466         Separate functions are  provided  rather  than  using  pcre_malloc  and         Separate  functions  are  provided  rather  than  using pcre_malloc and
467         pcre_free  because  the  usage  is  very  predictable:  the block sizes         pcre_free because the  usage  is  very  predictable:  the  block  sizes
468         requested are always the same, and  the  blocks  are  always  freed  in         requested  are  always  the  same,  and  the blocks are always freed in
469         reverse  order.  A calling program might be able to implement optimized         reverse order. A calling program might be able to  implement  optimized
470         functions that perform better  than  malloc()  and  free().  PCRE  runs         functions  that  perform  better  than  malloc()  and free(). PCRE runs
471         noticeably more slowly when built in this way. This option affects only         noticeably more slowly when built in this way. This option affects only
472         the  pcre_exec()  function;  it   is   not   relevant   for   the         the   pcre_exec()   function;   it   is   not   relevant  for  the
473         pcre_dfa_exec() function.         pcre_dfa_exec() function.
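Because the blocks are all the same size and are freed in reverse order, a calling program could keep freed blocks on a list and hand them straight back. The following is a sketch only: frame_alloc and frame_free are hypothetical names, the code assumes every request has the same size, and it is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A freed block is reused as a list node, so no extra memory is needed. */
typedef struct free_block { struct free_block *next; } free_block;

static free_block *free_list = NULL;   /* most recently freed block first */
static size_t block_size = 0;          /* size seen on the first request */

void *frame_alloc(size_t size)
{
    if (size < sizeof(free_block)) size = sizeof(free_block);
    if (block_size == 0) block_size = size;
    assert(size == block_size);        /* assumption: one fixed block size */

    if (free_list != NULL) {           /* reuse a cached block */
        free_block *b = free_list;
        free_list = b->next;
        return b;
    }
    return malloc(size);               /* cache empty: fall back to malloc() */
}

void frame_free(void *p)
{
    free_block *b = p;
    if (b == NULL) return;
    b->next = free_list;               /* keep the block for the next request */
    free_list = b;
}

/* In a program linked with a PCRE built this way, the pointers named in
   the text would be redirected before matching, for example:
       pcre_stack_malloc = frame_alloc;
       pcre_stack_free   = frame_free;                                      */
```

This sketch never returns memory to the system; a real replacement would probably cap the cache or release it when matching is finished.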
474    
475    
476  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
477    
478         Internally,  PCRE has a function called match(), which it calls repeat-         Internally, PCRE has a function called match(), which it calls  repeat-
479         edly  (sometimes  recursively)  when  matching  a  pattern   with   the         edly   (sometimes   recursively)  when  matching  a  pattern  with  the
480         pcre_exec()  function.  By controlling the maximum number of times this         pcre_exec() function. By controlling the maximum number of  times  this
481         function may be called during a single matching operation, a limit  can         function  may be called during a single matching operation, a limit can
482         be  placed  on  the resources used by a single call to pcre_exec(). The         be placed on the resources used by a single call  to  pcre_exec().  The
483         limit can be changed at run time, as described in the pcreapi  documen-         limit  can be changed at run time, as described in the pcreapi documen-
484         tation.  The default is 10 million, but this can be changed by adding a         tation. The default is 10 million, but this can be changed by adding  a
485         setting such as         setting such as
486    
487           --with-match-limit=500000           --with-match-limit=500000
488    
489         to  the  configure  command.  This  setting  has  no  effect   on   the         to   the   configure  command.  This  setting  has  no  effect  on  the
490         pcre_dfa_exec() matching function.         pcre_dfa_exec() matching function.
491    
492         In  some  environments  it is desirable to limit the depth of recursive         In some environments it is desirable to limit the  depth  of  recursive
493         calls of match() more strictly than the total number of calls, in order         calls of match() more strictly than the total number of calls, in order
494         to  restrict  the maximum amount of stack (or heap, if --disable-stack-         to restrict the maximum amount of stack (or heap,  if  --disable-stack-
495         for-recursion is specified) that is used. A second limit controls this;         for-recursion is specified) that is used. A second limit controls this;
496         it  defaults  to  the  value  that is set for --with-match-limit, which         it defaults to the value that  is  set  for  --with-match-limit,  which
497         imposes no additional constraints. However, you can set a  lower  limit         imposes  no  additional constraints. However, you can set a lower limit
498         by adding, for example,         by adding, for example,
499    
500           --with-match-limit-recursion=10000           --with-match-limit-recursion=10000
501    
502         to  the  configure  command.  This  value can also be overridden at run         to the configure command. This value can  also  be  overridden  at  run
503         time.         time.
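The difference between the two limits can be illustrated with a toy backtracking matcher, sketched below in C. This is not PCRE's match() function: toy_match, limited_match, match_limits, and the return codes are invented for the example, and the pattern language is restricted to literals, '.', and '*', matched against the whole subject.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes for this sketch. */
enum { NOMATCH = 0, MATCHED = 1, HIT_CALL_LIMIT = -1, HIT_DEPTH_LIMIT = -2 };

typedef struct {
    unsigned long calls;        /* total calls made so far */
    unsigned long call_limit;   /* plays the role of the match limit */
    unsigned long depth_limit;  /* plays the role of the recursion limit */
} match_limits;

static int toy_match(const char *pat, const char *s,
                     match_limits *lim, unsigned long depth)
{
    if (++lim->calls > lim->call_limit) return HIT_CALL_LIMIT;
    if (depth > lim->depth_limit) return HIT_DEPTH_LIMIT;

    if (*pat == '\0') return (*s == '\0') ? MATCHED : NOMATCH;

    if (pat[1] == '*') {
        /* Greedy: try to consume one more character, then backtrack. */
        if (*s != '\0' && (pat[0] == '.' || pat[0] == *s)) {
            int rc = toy_match(pat, s + 1, lim, depth + 1);
            if (rc != NOMATCH) return rc;   /* match found or limit hit */
        }
        return toy_match(pat + 2, s, lim, depth + 1);
    }

    if (*s != '\0' && (pat[0] == '.' || pat[0] == *s))
        return toy_match(pat + 1, s + 1, lim, depth + 1);
    return NOMATCH;
}

int limited_match(const char *pat, const char *s,
                  unsigned long call_limit, unsigned long depth_limit)
{
    match_limits lim = { 0, call_limit, depth_limit };
    assert(pat != NULL && s != NULL);
    return toy_match(pat, s, &lim, 0);
}
```

A pattern such as "a*a*a*b" against a long run of "a" characters makes many calls at modest depth, so the call limit stops it first. A pattern such as "a*" against the same subject makes few calls but recurses once per character, so the depth limit is the one that matters.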
504    
505    
506  CREATING CHARACTER TABLES AT BUILD TIME  CREATING CHARACTER TABLES AT BUILD TIME
507    
508         PCRE uses fixed tables for processing characters whose code values  are         PCRE  uses fixed tables for processing characters whose code values are
509         less  than 256. By default, PCRE is built with a set of tables that are         less than 256. By default, PCRE is built with a set of tables that  are
510         distributed in the file pcre_chartables.c.dist. These  tables  are  for         distributed  in  the  file pcre_chartables.c.dist. These tables are for
511         ASCII codes only. If you add         ASCII codes only. If you add
512    
513           --enable-rebuild-chartables           --enable-rebuild-chartables
514    
515         to  the  configure  command, the distributed tables are no longer used.         to the configure command, the distributed tables are  no  longer  used.
516         Instead, a program called dftables is compiled and  run.  This  outputs         Instead,  a  program  called dftables is compiled and run. This outputs
517         the source for a new set of tables, created in the default locale of your         the source for a new set of tables, created in the default locale of your
518         C runtime system. (This method of replacing the tables does not work if         C runtime system. (This method of replacing the tables does not work if
519         you  are cross compiling, because dftables is run on the local host. If         you are cross compiling, because dftables is run on the local host.  If
520         you need to create alternative tables when cross  compiling,  you  will         you  need  to  create alternative tables when cross compiling, you will
521         have to do so "by hand".)         have to do so "by hand".)
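The kind of work involved in generating locale-dependent tables can be sketched with the standard <ctype.h> functions. The code below is an illustration under stated assumptions, not the dftables source: build_case_tables is a hypothetical name, and only a lower-casing table and a case-flipping table are built.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Fill two 256-entry tables using the current C locale:
   lower[c] is the lower-case form of c, flip[c] is c with its case swapped. */
void build_case_tables(unsigned char lower[256], unsigned char flip[256])
{
    assert(lower != NULL && flip != NULL);
    for (int c = 0; c < 256; c++) {
        lower[c] = (unsigned char)tolower(c);
        flip[c]  = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
    }
}
```

A program that never calls setlocale() runs in the "C" locale, so the tables cover ASCII letters only. Because the results come from the C library of the machine that runs the generator, tables built this way reflect the build host, which is why the approach does not carry over to cross compiling.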
522    
523    
524  USING EBCDIC CODE  USING EBCDIC CODE
525    
526         PCRE  assumes  by  default that it will run in an environment where the         PCRE assumes by default that it will run in an  environment  where  the
527         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).
528         This  is  the  case for most computer operating systems. PCRE can, how-         This is the case for most computer operating systems.  PCRE  can,  how-
529         ever, be compiled to run in an EBCDIC environment by adding         ever, be compiled to run in an EBCDIC environment by adding
530    
531           --enable-ebcdic           --enable-ebcdic
532    
533         to the configure command. This setting implies --enable-rebuild-charta-         to the configure command. This setting implies --enable-rebuild-charta-
534         bles.  You  should  only  use  it if you know that you are in an EBCDIC         bles. You should only use it if you know that  you  are  in  an  EBCDIC
535         environment (for example,  an  IBM  mainframe  operating  system).  The         environment  (for  example,  an  IBM  mainframe  operating system). The
536         --enable-ebcdic option is incompatible with --enable-utf8.         --enable-ebcdic option is incompatible with --enable-utf8.
537    
538    
# Line 541  PCREGREP OPTIONS FOR COMPRESSED FILE SUP Line 546  PCREGREP OPTIONS FOR COMPRESSED FILE SUP
546           --enable-pcregrep-libbz2           --enable-pcregrep-libbz2
547    
548         to the configure command. These options naturally require that the rel-         to the configure command. These options naturally require that the rel-
549         evant libraries are installed on your system. Configuration  will  fail         evant  libraries  are installed on your system. Configuration will fail
550         if they are not.         if they are not.
551    
552    
# Line 551  PCRETEST OPTION FOR LIBREADLINE SUPPORT Line 556  PCRETEST OPTION FOR LIBREADLINE SUPPORT
556    
557           --enable-pcretest-libreadline           --enable-pcretest-libreadline
558    
559         to  the  configure  command,  pcretest  is  linked with the libreadline         to the configure command,  pcretest  is  linked  with  the  libreadline
560         library, and when its input is from a terminal, it reads it  using  the         library,  and  when its input is from a terminal, it reads it using the
561         readline() function. This provides line-editing and history facilities.         readline() function. This provides line-editing and history facilities.
562         Note that libreadline is GPL-licenced, so if you distribute a binary of         Note that libreadline is GPL-licenced, so if you distribute a binary of
563         pcretest linked in this way, there may be licensing issues.         pcretest linked in this way, there may be licensing issues.
564    
565         Setting  this  option  causes  the -lreadline option to be added to the         Setting this option causes the -lreadline option to  be  added  to  the
566         pcretest build. In many operating environments with  a  system-installed         pcretest  build.  In many operating environments with a system-installed
567         libreadline this is sufficient. However, in some environments (e.g.  if         libreadline this is sufficient. However, in some environments (e.g.  if
568         an unmodified distribution version of readline is in use),  some  extra         an  unmodified  distribution version of readline is in use), some extra
569         configuration  may  be necessary. The INSTALL file for libreadline says         configuration may be necessary. The INSTALL file for  libreadline  says
570         this:         this:
571    
572           "Readline uses the termcap functions, but does not link with the           "Readline uses the termcap functions, but does not link with the
573           termcap or curses library itself, allowing applications which link           termcap or curses library itself, allowing applications which link
574           with readline the to choose an appropriate library."           with readline the to choose an appropriate library."
575    
576         If your environment has not been set up so that an appropriate  library         If  your environment has not been set up so that an appropriate library
577         is automatically included, you may need to add something like         is automatically included, you may need to add something like
578    
579           LIBS="-lncurses"           LIBS="-lncurses"
# Line 590  AUTHOR Line 595  AUTHOR
595    
596  REVISION  REVISION
597    
598         Last updated: 17 March 2009         Last updated: 06 September 2009
599         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
600  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
601    
# Line 696  THE ALTERNATIVE MATCHING ALGORITHM Line 701  THE ALTERNATIVE MATCHING ALGORITHM
701         at the fourth character of the subject. The algorithm does not automat-         at the fourth character of the subject. The algorithm does not automat-
702         ically move on to find matches that start at later positions.         ically move on to find matches that start at later positions.
703    
704           Although the general principle of this matching algorithm  is  that  it
705           scans  the subject string only once, without backtracking, there is one
706           exception: when a lookbehind assertion is  encountered,  the  preceding
707           characters have to be re-inspected.
708    
709         There are a number of features of PCRE regular expressions that are not         There are a number of features of PCRE regular expressions that are not
710         supported by the alternative matching algorithm. They are as follows:         supported by the alternative matching algorithm. They are as follows:
711    
712         1.  Because  the  algorithm  finds  all possible matches, the greedy or         1. Because the algorithm finds all  possible  matches,  the  greedy  or
713         ungreedy nature of repetition quantifiers is not relevant.  Greedy  and         ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
714         ungreedy quantifiers are treated in exactly the same way. However, pos-         ungreedy quantifiers are treated in exactly the same way. However, pos-
715         sessive quantifiers can make a difference when what follows could  also         sessive  quantifiers can make a difference when what follows could also
716         match what is quantified, for example in a pattern like this:         match what is quantified, for example in a pattern like this:
717    
718           ^a++\w!           ^a++\w!
719    
720         This  pattern matches "aaab!" but not "aaa!", which would be matched by         This pattern matches "aaab!" but not "aaa!", which would be matched  by
721         a non-possessive quantifier. Similarly, if an atomic group is  present,         a  non-possessive quantifier. Similarly, if an atomic group is present,
722         it  is matched as if it were a standalone pattern at the current point,         it is matched as if it were a standalone pattern at the current  point,
723         and the longest match is then "locked in" for the rest of  the  overall         and  the  longest match is then "locked in" for the rest of the overall
724         pattern.         pattern.
725    
726         2. When dealing with multiple paths through the tree simultaneously, it         2. When dealing with multiple paths through the tree simultaneously, it
727         is not straightforward to keep track of  captured  substrings  for  the         is  not  straightforward  to  keep track of captured substrings for the
728         different  matching  possibilities,  and  PCRE's implementation of this         different matching possibilities, and  PCRE's  implementation  of  this
729         algorithm does not attempt to do this. This means that no captured sub-         algorithm does not attempt to do this. This means that no captured sub-
730         strings are available.         strings are available.
731    
732         3.  Because no substrings are captured, back references within the pat-         3. Because no substrings are captured, back references within the  pat-
733         tern are not supported, and cause errors if encountered.         tern are not supported, and cause errors if encountered.
734    
735         4. For the same reason, conditional expressions that use  a  backrefer-         4.  For  the same reason, conditional expressions that use a backrefer-
736         ence  as  the  condition or test for a specific group recursion are not         ence as the condition or test for a specific group  recursion  are  not
737         supported.         supported.
738    
739         5. Because many paths through the tree may be  active,  the  \K  escape         5.  Because  many  paths  through the tree may be active, the \K escape
740         sequence, which resets the start of the match when encountered (but may         sequence, which resets the start of the match when encountered (but may
741         be on some paths and not on others), is not  supported.  It  causes  an         be  on  some  paths  and not on others), is not supported. It causes an
742         error if encountered.         error if encountered.
743    
744         6.  Callouts  are  supported, but the value of the capture_top field is         6. Callouts are supported, but the value of the  capture_top  field  is
745         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
746    
747         7. The \C escape sequence, which (in the standard algorithm) matches  a         7.  The \C escape sequence, which (in the standard algorithm) matches a
748         single  byte, even in UTF-8 mode, is not supported because the alterna-         single byte, even in UTF-8 mode, is not supported because the  alterna-
749         tive algorithm moves through the subject  string  one  character  at  a         tive  algorithm  moves  through  the  subject string one character at a
750         time, for all active paths through the tree.         time, for all active paths through the tree.
751    
752         8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
753         are not supported. (*FAIL) is supported, and  behaves  like  a  failing         are  not  supported.  (*FAIL)  is supported, and behaves like a failing
754         negative assertion.         negative assertion.
755    
756    
757  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
758    
759         Using  the alternative matching algorithm provides the following advan-         Using the alternative matching algorithm provides the following  advan-
760         tages:         tages:
761    
762         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
763         ically  found,  and  in particular, the longest match is found. To find         ically found, and in particular, the longest match is  found.  To  find
764         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
765         things with callouts.         things with callouts.
766    
767         2.  Because  the  alternative  algorithm  scans the subject string just         2. Because the alternative algorithm  scans  the  subject  string  just
768         once, and never needs to backtrack, it is possible to  pass  very  long         once,  and  never  needs to backtrack, it is possible to pass very long
769         subject  strings  to  the matching function in several pieces, checking         subject strings to the matching function in  several  pieces,  checking
770         for partial matching each time.         for partial matching each time.
771    
772    
# Line 764  DISADVANTAGES OF THE ALTERNATIVE ALGORIT Line 774  DISADVANTAGES OF THE ALTERNATIVE ALGORIT
774    
775         The alternative algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
776    
777         1. It is substantially slower than  the  standard  algorithm.  This  is         1.  It  is  substantially  slower  than the standard algorithm. This is
778         partly  because  it has to search for all possible matches, but is also         partly because it has to search for all possible matches, but  is  also
779         because it is less susceptible to optimization.         because it is less susceptible to optimization.
780    
781         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
# Line 783  AUTHOR Line 793  AUTHOR
793    
794  REVISION  REVISION
795    
796         Last updated: 25 August 2009         Last updated: 05 September 2009
797         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
798  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
799    
# Line 902  PCRE API OVERVIEW Line 912  PCRE API OVERVIEW
912         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
913         ble, is also provided. This uses a different algorithm for  the  match-         ble, is also provided. This uses a different algorithm for  the  match-
914         ing.  The  alternative algorithm finds all possible matches (at a given         ing.  The  alternative algorithm finds all possible matches (at a given
915         point in the subject), and scans the subject just once.  However,  this         point in the subject), and scans the subject just  once  (unless  there
916         algorithm does not return captured substrings. A description of the two         are  lookbehind  assertions).  However,  this algorithm does not return
917         matching algorithms and their advantages and disadvantages is given  in         captured substrings. A description of the two matching  algorithms  and
918         the pcrematching documentation.         their  advantages  and disadvantages is given in the pcrematching docu-
919           mentation.
920    
921         In  addition  to  the  main compiling and matching functions, there are         In addition to the main compiling and  matching  functions,  there  are
922         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
923         string that is matched by pcre_exec(). They are:         string that is matched by pcre_exec(). They are:
924    
# Line 922  PCRE API OVERVIEW Line 933  PCRE API OVERVIEW
933         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
934         to free the memory used for extracted strings.         to free the memory used for extracted strings.
935    
936         The function pcre_maketables() is used to  build  a  set  of  character         The  function  pcre_maketables()  is  used  to build a set of character
937         tables   in   the   current   locale  for  passing  to  pcre_compile(),         tables  in  the  current  locale   for   passing   to   pcre_compile(),
938         pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is         pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
939         provided  for  specialist  use.  Most  commonly,  no special tables are         provided for specialist use.  Most  commonly,  no  special  tables  are
940         passed, in which case internal tables that are generated when  PCRE  is         passed,  in  which case internal tables that are generated when PCRE is
941         built are used.         built are used.
942    
943         The  function  pcre_fullinfo()  is used to find out information about a         The function pcre_fullinfo() is used to find out  information  about  a
944         compiled pattern; pcre_info() is an obsolete version that returns  only         compiled  pattern; pcre_info() is an obsolete version that returns only
945         some  of  the available information, but is retained for backwards com-         some of the available information, but is retained for  backwards  com-
946         patibility.  The function pcre_version() returns a pointer to a  string         patibility.   The function pcre_version() returns a pointer to a string
947         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
948    
949         The  function  pcre_refcount()  maintains  a  reference count in a data         The function pcre_refcount() maintains a  reference  count  in  a  data
950         block containing a compiled pattern. This is provided for  the  benefit         block  containing  a compiled pattern. This is provided for the benefit
951         of object-oriented applications.         of object-oriented applications.
952    
953         The  global  variables  pcre_malloc and pcre_free initially contain the         The global variables pcre_malloc and pcre_free  initially  contain  the
954         entry points of the standard malloc()  and  free()  functions,  respec-         entry  points  of  the  standard malloc() and free() functions, respec-
955         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
956         so a calling program can replace them if it  wishes  to  intercept  the         so  a  calling  program  can replace them if it wishes to intercept the
957         calls. This should be done before calling any PCRE functions.         calls. This should be done before calling any PCRE functions.
958    
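The indirection described above can be sketched as follows. This is only an illustration under stated assumptions: lib_malloc and lib_free are stand-ins with the same function-pointer types as pcre_malloc and pcre_free, so that the fragment compiles without the library, and the counting functions are hypothetical names that are not part of PCRE.

```c
#include <stdlib.h>

/* Stand-ins for the library's globals. In PCRE the real variables are
   pcre_malloc and pcre_free; other names are used here only so that the
   sketch compiles without pcre.h. */
void *(*lib_malloc)(size_t) = malloc;
void (*lib_free)(void *) = free;

/* Number of blocks obtained through the hooks and not yet freed. */
long live_blocks = 0;

/* Hypothetical replacement allocator: counts each successful request. */
void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Hypothetical replacement for free(): counts each release. */
void counting_free(void *p)
{
    if (p != NULL)
        live_blocks--;
    free(p);
}

/* Install the replacements. With the real library this would have to be
   done before calling any PCRE functions. */
void install_counting_allocator(void)
{
    lib_malloc = counting_malloc;
    lib_free = counting_free;
}
```

After install_counting_allocator() has run, every request made through lib_malloc and lib_free passes through the counting functions, so live_blocks returns to zero once everything has been released.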
959         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
960         indirections to memory management functions.  These  special  functions         indirections  to  memory  management functions. These special functions
961         are  used  only  when  PCRE is compiled to use the heap for remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
962         data, instead of recursive function calls, when running the pcre_exec()         data, instead of recursive function calls, when running the pcre_exec()
963         function.  See  the  pcrebuild  documentation  for details of how to do         function. See the pcrebuild documentation for  details  of  how  to  do
964         this. It is a non-standard way of building PCRE, for  use  in  environ-         this.  It  is  a non-standard way of building PCRE, for use in environ-
965         ments  that  have  limited stacks. Because of the greater use of memory         ments that have limited stacks. Because of the greater  use  of  memory
966         management, it runs more slowly. Separate  functions  are  provided  so         management,  it  runs  more  slowly. Separate functions are provided so
967         that  special-purpose  external  code  can  be used for this case. When         that special-purpose external code can be  used  for  this  case.  When
968         used, these functions are always called in a  stack-like  manner  (last         used,  these  functions  are always called in a stack-like manner (last
969         obtained,  first freed), and always for memory blocks of the same size.         obtained, first freed), and always for memory blocks of the same  size.
970         There is a discussion about PCRE's stack usage in the  pcrestack  docu-         There  is  a discussion about PCRE's stack usage in the pcrestack docu-
971         mentation.         mentation.
972    
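Because the blocks are always the same size and are released in a stack-like manner, special-purpose code can cache released blocks instead of returning them to the system. The following is a minimal sketch of that idea, not the method used by PCRE itself; frame_alloc and frame_free are hypothetical names, with signatures chosen to be suitable for assignment to pcre_stack_malloc and pcre_stack_free.

```c
#include <assert.h>
#include <stdlib.h>

/* A released block is reused as a node in a singly linked free list. */
typedef struct frame_node {
    struct frame_node *next;
} frame_node;

frame_node *free_list = NULL;   /* most recently released block first */
size_t frame_size = 0;          /* size of every block; set on first use */

/* Hypothetical allocator: reuse the most recently released block if there
   is one, otherwise obtain a new block from malloc(). */
void *frame_alloc(size_t size)
{
    if (size < sizeof(frame_node))
        size = sizeof(frame_node);  /* room for the list link */
    if (frame_size == 0)
        frame_size = size;          /* remember the first size seen */

    /* Assumption: all requests are for blocks of the same size. */
    assert(size == frame_size);

    if (free_list != NULL) {
        frame_node *n = free_list;
        free_list = n->next;
        return n;
    }
    return malloc(size);
}

/* Hypothetical release function: keep the block for later reuse. Cached
   blocks are never handed back to free() in this sketch. */
void frame_free(void *block)
{
    frame_node *n = block;
    if (n == NULL)
        return;
    n->next = free_list;
    free_list = n;
}
```

The cache is not safe for use by several threads at once; a real implementation would need a separate list for each thread, or locking.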
973         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
974         by the caller to a "callout" function, which PCRE  will  then  call  at         by  the  caller  to  a "callout" function, which PCRE will then call at
975         specified  points during a matching operation. Details are given in the         specified points during a matching operation. Details are given in  the
976         pcrecallout documentation.         pcrecallout documentation.
977    
978    
979  NEWLINES  NEWLINES
980    
981         PCRE supports five different conventions for indicating line breaks  in         PCRE  supports five different conventions for indicating line breaks in
982         strings:  a  single  CR (carriage return) character, a single LF (line-         strings: a single CR (carriage return) character, a  single  LF  (line-
983         feed) character, the two-character sequence CRLF, any of the three pre-         feed) character, the two-character sequence CRLF, any of the three pre-
984         ceding,  or any Unicode newline sequence. The Unicode newline sequences         ceding, or any Unicode newline sequence. The Unicode newline  sequences
985         are the three just mentioned, plus the single characters  VT  (vertical         are  the  three just mentioned, plus the single characters VT (vertical
986         tab,  U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
987         separator, U+2028), and PS (paragraph separator, U+2029).         separator, U+2028), and PS (paragraph separator, U+2029).
988    
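The first four conventions can be illustrated by a function that reports how many bytes of newline, if any, start at a given offset. This is a sketch for illustration only: the enum and the function name are hypothetical and are not part of the PCRE API, and the Unicode convention is left out because it requires decoding multibyte characters.

```c
#include <stddef.h>

/* Hypothetical names for four of the five conventions. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Return the length in bytes of the newline that starts at s[pos], or 0
   if no newline (under the given convention) starts there. */
size_t newline_length(enum nl_convention nl, const char *s,
                      size_t len, size_t pos)
{
    int is_cr, is_lf, is_crlf;

    if (pos >= len)
        return 0;

    is_cr = (s[pos] == '\r');
    is_lf = (s[pos] == '\n');
    is_crlf = is_cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (nl) {
    case NL_CR:
        return is_cr ? 1 : 0;
    case NL_LF:
        return is_lf ? 1 : 0;
    case NL_CRLF:
        return is_crlf ? 2 : 0;     /* a lone CR or LF does not count */
    case NL_ANYCRLF:
        if (is_crlf)
            return 2;               /* prefer the two-character form */
        return (is_cr || is_lf) ? 1 : 0;
    }
    return 0;
}
```

Under NL_CRLF a lone LF is not a line break, whereas under NL_ANYCRLF the same subject is treated as having one.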
989         Each of the first three conventions is used by at least  one  operating         Each  of  the first three conventions is used by at least one operating
990         system  as its standard newline sequence. When PCRE is built, a default         system as its standard newline sequence. When PCRE is built, a  default
991         can be specified.  The default default is LF, which is the  Unix  stan-         can  be  specified.  The default default is LF, which is the Unix stan-
992         dard.  When  PCRE  is run, the default can be overridden, either when a         dard. When PCRE is run, the default can be overridden,  either  when  a
993         pattern is compiled, or when it is matched.         pattern is compiled, or when it is matched.
994    
995         At compile time, the newline convention can be specified by the options         At compile time, the newline convention can be specified by the options
996         argument  of  pcre_compile(), or it can be specified by special text at         argument of pcre_compile(), or it can be specified by special  text  at
997         the start of the pattern itself; this overrides any other settings. See         the start of the pattern itself; this overrides any other settings. See
998         the pcrepattern page for details of the special character sequences.         the pcrepattern page for details of the special character sequences.
999    
1000         In the PCRE documentation the word "newline" is used to mean "the char-         In the PCRE documentation the word "newline" is used to mean "the char-
1001         acter or pair of characters that indicate a line break". The choice  of         acter  or pair of characters that indicate a line break". The choice of
1002         newline  convention  affects  the  handling of the dot, circumflex, and         newline convention affects the handling of  the  dot,  circumflex,  and
1003         dollar metacharacters, the handling of #-comments in /x mode, and, when         dollar metacharacters, the handling of #-comments in /x mode, and, when
1004         CRLF  is a recognized line ending sequence, the match position advance-         CRLF is a recognized line ending sequence, the match position  advance-
1005         ment for a non-anchored pattern. There is more detail about this in the         ment for a non-anchored pattern. There is more detail about this in the
1006         section on pcre_exec() options below.         section on pcre_exec() options below.
1007    
1008         The  choice of newline convention does not affect the interpretation of         The choice of newline convention does not affect the interpretation  of
1009         the \n or \r escape sequences, nor does  it  affect  what  \R  matches,         the  \n  or  \r  escape  sequences, nor does it affect what \R matches,
1010         which is controlled in a similar way, but by separate options.         which is controlled in a similar way, but by separate options.
1011    
1012    
1013  MULTITHREADING  MULTITHREADING
1014    
1015         The  PCRE  functions  can be used in multi-threading applications, with         The PCRE functions can be used in  multi-threading  applications,  with
1016         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
1017         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1018         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
1019    
1020         The compiled form of a regular expression is not altered during  match-         The  compiled form of a regular expression is not altered during match-
1021         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
1022         at once.         at once.
1023    
# Line 1014  MULTITHREADING Line 1025  MULTITHREADING
1025  SAVING PRECOMPILED PATTERNS FOR LATER USE  SAVING PRECOMPILED PATTERNS FOR LATER USE
1026    
1027         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
1028         later  time,  possibly by a different program, and even on a host other         later time, possibly by a different program, and even on a  host  other
1029         than the one on which  it  was  compiled.  Details  are  given  in  the         than  the  one  on  which  it  was  compiled.  Details are given in the
1030         pcreprecompile  documentation.  However, compiling a regular expression         pcreprecompile documentation. However, compiling a  regular  expression
1031         with one version of PCRE for use with a different version is not  guar-         with  one version of PCRE for use with a different version is not guar-
1032         anteed to work and may cause crashes.         anteed to work and may cause crashes.
1033    
1034    
# Line 1025  CHECKING BUILD-TIME OPTIONS Line 1036  CHECKING BUILD-TIME OPTIONS
1036    
1037         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
1038    
1039         The  function pcre_config() makes it possible for a PCRE client to dis-         The function pcre_config() makes it possible for a PCRE client to  dis-
1040         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
1041         The  pcrebuild documentation has more details about these optional fea-         The pcrebuild documentation has more details about these optional  fea-
1042         tures.         tures.
1043    
1044         The first argument for pcre_config() is an  integer,  specifying  which         The  first  argument  for pcre_config() is an integer, specifying which
1045         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
1046         into which the information is  placed.  The  following  information  is         into  which  the  information  is  placed. The following information is
1047         available:         available:
1048    
1049           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
1050    
1051         The  output is an integer that is set to one if UTF-8 support is avail-         The output is an integer that is set to one if UTF-8 support is  avail-
1052         able; otherwise it is set to zero.         able; otherwise it is set to zero.
1053    
1054           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
1055    
1056         The output is an integer that is set to  one  if  support  for  Unicode         The  output  is  an  integer  that is set to one if support for Unicode
1057         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
1058    
1059           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
1060    
1061         The  output  is  an integer whose value specifies the default character         The output is an integer whose value specifies  the  default  character
1062         sequence that is recognized as meaning "newline". The five values  that         sequence  that is recognized as meaning "newline". The five values that
1063         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
1064         and -1 for ANY.  Though they are derived from ASCII,  the  same  values         and  -1  for  ANY.  Though they are derived from ASCII, the same values
1065         are returned in EBCDIC environments. The default should normally corre-         are returned in EBCDIC environments. The default should normally corre-
1066         spond to the standard sequence for your operating system.         spond to the standard sequence for your operating system.
1067    
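The values listed above can be turned into readable names with a small helper. The function below is a hypothetical illustration, not part of PCRE; it assumes that its argument is the integer stored by a PCRE_CONFIG_NEWLINE query. Note that 3338 is 13 * 256 + 10, that is, the code for CR followed by the code for LF.

```c
/* Hypothetical helper (not part of PCRE): map a PCRE_CONFIG_NEWLINE
   value to the name of the newline convention. */
const char *newline_name(int value)
{
    switch (value) {
    case 10:
        return "LF";
    case 13:
        return "CR";
    case 3338:                  /* 13 * 256 + 10 */
        return "CRLF";
    case -2:
        return "ANYCRLF";
    case -1:
        return "ANY";
    default:
        return "unknown";       /* not one of the documented values */
    }
}
```

A caller would typically pass the result to printf() when reporting how the library was built.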
1068           PCRE_CONFIG_BSR           PCRE_CONFIG_BSR
1069    
1070         The output is an integer whose value indicates what character sequences         The output is an integer whose value indicates what character sequences
1071         the  \R  escape sequence matches by default. A value of 0 means that \R         the \R escape sequence matches by default. A value of 0 means  that  \R
1072         matches any Unicode line ending sequence; a value of 1  means  that  \R         matches  any  Unicode  line ending sequence; a value of 1 means that \R
1073         matches only CR, LF, or CRLF. The default can be overridden when a pat-         matches only CR, LF, or CRLF. The default can be overridden when a pat-
1074         tern is compiled or matched.         tern is compiled or matched.
1075    
1076           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
1077    
1078         The output is an integer that contains the number  of  bytes  used  for         The  output  is  an  integer that contains the number of bytes used for
1079         internal linkage in compiled regular expressions. The value is 2, 3, or         internal linkage in compiled regular expressions. The value is 2, 3, or
1080         4. Larger values allow larger regular expressions to  be  compiled,  at         4.  Larger  values  allow larger regular expressions to be compiled, at
1081         the  expense  of  slower matching. The default value of 2 is sufficient         the expense of slower matching. The default value of  2  is  sufficient
1082         for all but the most massive patterns, since  it  allows  the  compiled         for  all  but  the  most massive patterns, since it allows the compiled
1083         pattern to be up to 64K in size.         pattern to be up to 64K in size.
1084    
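The 64K figure follows from the range of a two-byte offset. The function below is a hypothetical illustration of that arithmetic only; the results for link sizes of 3 and 4 are the raw ranges of three-byte and four-byte values, and the limits actually enforced by the library may differ.

```c
/* Hypothetical helper (not part of PCRE): the number of distinct offsets
   that fit in link_size bytes, or 0 if link_size is not 2, 3, or 4. */
unsigned long long link_offset_range(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return 1ULL << (8 * link_size);
}
```

For a link size of 2 the result is 65536, which is the 64K mentioned above.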
1085           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1086    
1087         The  output  is  an integer that contains the threshold above which the         The output is an integer that contains the threshold  above  which  the
1088         POSIX interface uses malloc() for output vectors. Further  details  are         POSIX  interface  uses malloc() for output vectors. Further details are
1089         given in the pcreposix documentation.         given in the pcreposix documentation.
1090    
1091           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
1092    
1093         The  output is a long integer that gives the default limit for the num-         The output is a long integer that gives the default limit for the  num-
1094         ber of internal matching function calls  in  a  pcre_exec()  execution.         ber  of  internal  matching  function calls in a pcre_exec() execution.
1095         Further details are given with pcre_exec() below.         Further details are given with pcre_exec() below.
1096    
1097           PCRE_CONFIG_MATCH_LIMIT_RECURSION           PCRE_CONFIG_MATCH_LIMIT_RECURSION
1098    
1099         The output is a long integer that gives the default limit for the depth         The output is a long integer that gives the default limit for the depth
1100         of  recursion  when  calling  the  internal  matching  function  in   a         of   recursion  when  calling  the  internal  matching  function  in  a
1101         pcre_exec()  execution.  Further  details  are  given  with pcre_exec()         pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
1102         below.         below.
1103    
1104           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
1105    
1106         The output is an integer that is set to one if internal recursion  when         The  output is an integer that is set to one if internal recursion when
1107         running pcre_exec() is implemented by recursive function calls that use         running pcre_exec() is implemented by recursive function calls that use
1108         the stack to remember their state. This is the usual way that  PCRE  is         the  stack  to remember their state. This is the usual way that PCRE is
1109         compiled. The output is zero if PCRE was compiled to use blocks of data         compiled. The output is zero if PCRE was compiled to use blocks of data
1110         on the  heap  instead  of  recursive  function  calls.  In  this  case,         on  the  heap  instead  of  recursive  function  calls.  In  this case,
1111         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
1112         blocks on the heap, thus avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
1113    
1114    
# Line 1114  COMPILING A PATTERN Line 1125  COMPILING A PATTERN
1125    
1126         Either of the functions pcre_compile() or pcre_compile2() can be called         Either of the functions pcre_compile() or pcre_compile2() can be called
1127         to compile a pattern into an internal form. The only difference between         to compile a pattern into an internal form. The only difference between
1128         the two interfaces is that pcre_compile2() has an additional  argument,         the  two interfaces is that pcre_compile2() has an additional argument,
1129         errorcodeptr, via which a numerical error code can be returned.         errorcodeptr, via which a numerical error code can be returned.
1130    
1131         The pattern is a C string terminated by a binary zero, and is passed in         The pattern is a C string terminated by a binary zero, and is passed in
1132         the pattern argument. A pointer to a single block  of  memory  that  is         the  pattern  argument.  A  pointer to a single block of memory that is
1133         obtained  via  pcre_malloc is returned. This contains the compiled code         obtained via pcre_malloc is returned. This contains the  compiled  code
1134         and related data. The pcre type is defined for the returned block; this         and related data. The pcre type is defined for the returned block; this
1135         is a typedef for a structure whose contents are not externally defined.         is a typedef for a structure whose contents are not externally defined.
1136         It is up to the caller to free the memory (via pcre_free) when it is no         It is up to the caller to free the memory (via pcre_free) when it is no
1137         longer required.         longer required.
1138    
1139         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although the compiled code of a PCRE regex is relocatable, that is,  it
1140         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
1141         fully  relocatable, because it may contain a copy of the tableptr argu-         fully relocatable, because it may contain a copy of the tableptr  argu-
1142         ment, which is an address (see below).         ment, which is an address (see below).
1143    
1144         The options argument contains various bit settings that affect the com-         The options argument contains various bit settings that affect the com-
1145         pilation.  It  should be zero if no options are required. The available         pilation. It should be zero if no options are required.  The  available
1146         options are described below. Some of them (in  particular,  those  that         options  are  described  below. Some of them (in particular, those that
1147         are  compatible  with  Perl,  but also some others) can also be set and         are compatible with Perl, but also some others) can  also  be  set  and
1148         unset from within the pattern (see  the  detailed  description  in  the         unset  from  within  the  pattern  (see the detailed description in the
1149         pcrepattern  documentation). For those options that can be different in         pcrepattern documentation). For those options that can be different  in
1150         different parts of the pattern, the contents of  the  options  argument         different  parts  of  the pattern, the contents of the options argument
1151         specifies their initial settings at the start of compilation and execu-         specifies their initial settings at the start of compilation and execu-
1152         tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the         tion.  The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
1153         time of matching as well as at compile time.         time of matching as well as at compile time.
1154    
1155         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1156         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1157         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
1158         sage. This is a static string that is part of the library. You must not         sage. This is a static string that is part of the library. You must not
1159         try to free it. The offset from the start of the pattern to the charac-         try to free it. The offset from the start of the pattern to the charac-
1160         ter where the error was discovered is placed in the variable pointed to         ter where the error was discovered is placed in the variable pointed to
1161         by  erroffset,  which must not be NULL. If it is, an immediate error is         by erroffset, which must not be NULL. If it is, an immediate  error  is
1162         given.         given.
1163    
1164         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
1165         codeptr  argument is not NULL, a non-zero error code number is returned         codeptr argument is not NULL, a non-zero error code number is  returned
1166         via this argument in the event of an error. This is in addition to  the         via  this argument in the event of an error. This is in addition to the
1167         textual error message. Error codes and messages are listed below.         textual error message. Error codes and messages are listed below.
1168    
1169         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of         If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
1170         character tables that are  built  when  PCRE  is  compiled,  using  the         character  tables  that  are  built  when  PCRE  is compiled, using the
1171         default  C  locale.  Otherwise, tableptr must be an address that is the         default C locale. Otherwise, tableptr must be an address  that  is  the
1172         result of a call to pcre_maketables(). This value is  stored  with  the         result  of  a  call to pcre_maketables(). This value is stored with the
1173         compiled  pattern,  and used again by pcre_exec(), unless another table         compiled pattern, and used again by pcre_exec(), unless  another  table
1174         pointer is passed to it. For more discussion, see the section on locale         pointer is passed to it. For more discussion, see the section on locale
1175         support below.         support below.
1176    
1177         This  code  fragment  shows a typical straightforward call to pcre_com-         This code fragment shows a typical straightforward  call  to  pcre_com-
1178         pile():         pile():
1179    
1180           pcre *re;           pcre *re;
# Line 1176  COMPILING A PATTERN Line 1187  COMPILING A PATTERN
1187             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
1188             NULL);            /* use default character tables */             NULL);            /* use default character tables */
1189    
1190         The following names for option bits are defined in  the  pcre.h  header         The  following  names  for option bits are defined in the pcre.h header
1191         file:         file:
1192    
1193           PCRE_ANCHORED           PCRE_ANCHORED
1194    
1195         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
1196         is constrained to match only at the first matching point in the  string         is  constrained to match only at the first matching point in the string
1197         that  is being searched (the "subject string"). This effect can also be         that is being searched (the "subject string"). This effect can also  be
1198         achieved by appropriate constructs in the pattern itself, which is  the         achieved  by appropriate constructs in the pattern itself, which is the
1199         only way to do it in Perl.         only way to do it in Perl.
1200    
1201           PCRE_AUTO_CALLOUT           PCRE_AUTO_CALLOUT
1202    
1203         If this bit is set, pcre_compile() automatically inserts callout items,         If this bit is set, pcre_compile() automatically inserts callout items,
1204         all with number 255, before each pattern item. For  discussion  of  the         all  with  number  255, before each pattern item. For discussion of the
1205         callout facility, see the pcrecallout documentation.         callout facility, see the pcrecallout documentation.
1206    
1207           PCRE_BSR_ANYCRLF           PCRE_BSR_ANYCRLF
1208           PCRE_BSR_UNICODE           PCRE_BSR_UNICODE
1209    
1210         These options (which are mutually exclusive) control what the \R escape         These options (which are mutually exclusive) control what the \R escape
1211         sequence matches. The choice is either to match only CR, LF,  or  CRLF,         sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1212         or to match any Unicode newline sequence. The default is specified when         or to match any Unicode newline sequence. The default is specified when
1213         PCRE is built. It can be overridden from within the pattern, or by set-         PCRE is built. It can be overridden from within the pattern, or by set-
1214         ting an option when a compiled pattern is matched.         ting an option when a compiled pattern is matched.
1215    
1216           PCRE_CASELESS           PCRE_CASELESS
1217    
1218         If  this  bit is set, letters in the pattern match both upper and lower         If this bit is set, letters in the pattern match both upper  and  lower
1219         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be         case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
1220         changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE         changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
1221         always understands the concept of case for characters whose values  are         always  understands the concept of case for characters whose values are
1222         less  than 128, so caseless matching is always possible. For characters         less than 128, so caseless matching is always possible. For  characters
1223         with higher values, the concept of case is supported if  PCRE  is  com-         with  higher  values,  the concept of case is supported if PCRE is com-
1224         piled  with Unicode property support, but not otherwise. If you want to         piled with Unicode property support, but not otherwise. If you want  to
1225         use caseless matching for characters 128 and  above,  you  must  ensure         use  caseless  matching  for  characters 128 and above, you must ensure
1226         that  PCRE  is  compiled  with Unicode property support as well as with         that PCRE is compiled with Unicode property support  as  well  as  with
1227         UTF-8 support.         UTF-8 support.
1228    
1229           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
1230    
1231         If this bit is set, a dollar metacharacter in the pattern matches  only         If  this bit is set, a dollar metacharacter in the pattern matches only
1232         at  the  end  of the subject string. Without this option, a dollar also         at the end of the subject string. Without this option,  a  dollar  also
1233         matches immediately before a newline at the end of the string (but  not         matches  immediately before a newline at the end of the string (but not
1234         before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored         before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
1235         if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in         if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
1236         Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
1237    
1238           PCRE_DOTALL           PCRE_DOTALL
1239    
1240         If this bit is set, a dot metacharacter in the pattern matches all char-         If this bit is set, a dot metacharacter in the pattern matches all char-
1241         acters, including those that indicate newline. Without it, a  dot  does         acters,  including  those that indicate newline. Without it, a dot does
1242         not  match  when  the  current position is at a newline. This option is         not match when the current position is at a  newline.  This  option  is
1243         equivalent to Perl's /s option, and it can be changed within a  pattern         equivalent  to Perl's /s option, and it can be changed within a pattern
1244         by  a (?s) option setting. A negative class such as [^a] always matches         by a (?s) option setting. A negative class such as [^a] always  matches
1245         newline characters, independent of the setting of this option.         newline characters, independent of the setting of this option.
1246    
1247           PCRE_DUPNAMES           PCRE_DUPNAMES
1248    
1249         If this bit is set, names used to identify capturing  subpatterns  need         If  this  bit is set, names used to identify capturing subpatterns need
1250         not be unique. This can be helpful for certain types of pattern when it         not be unique. This can be helpful for certain types of pattern when it
1251         is known that only one instance of the named  subpattern  can  ever  be         is  known  that  only  one instance of the named subpattern can ever be
1252         matched.  There  are  more details of named subpatterns below; see also         matched. There are more details of named subpatterns  below;  see  also
1253         the pcrepattern documentation.         the pcrepattern documentation.
1254    
1255           PCRE_EXTENDED           PCRE_EXTENDED
1256    
1257         If this bit is set, whitespace  data  characters  in  the  pattern  are         If  this  bit  is  set,  whitespace  data characters in the pattern are
1258         totally ignored except when escaped or inside a character class. White-         totally ignored except when escaped or inside a character class. White-
1259         space does not include the VT character (code 11). In addition, charac-         space does not include the VT character (code 11). In addition, charac-
1260         ters between an unescaped # outside a character class and the next new-         ters between an unescaped # outside a character class and the next new-
1261         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x         line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
1262         option,  and  it  can be changed within a pattern by a (?x) option set-         option, and it can be changed within a pattern by a  (?x)  option  set-
1263         ting.         ting.
1264    
1265         This option makes it possible to include  comments  inside  complicated         This  option  makes  it possible to include comments inside complicated
1266         patterns.   Note,  however,  that this applies only to data characters.         patterns.  Note, however, that this applies only  to  data  characters.
1267         Whitespace  characters  may  never  appear  within  special   character         Whitespace   characters  may  never  appear  within  special  character
1268         sequences  in  a  pattern,  for  example  within the sequence (?( which         sequences in a pattern, for  example  within  the  sequence  (?(  which
1269         introduces a conditional subpattern.         introduces a conditional subpattern.
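
The PCRE_EXTENDED rules above can be modelled by a small standalone routine. The following is only a sketch for illustration, not code taken from PCRE: the function name strip_extended() is hypothetical, it treats every ] inside a class as the end of that class, and it knows nothing about \Q...\E quoting or the special sequences mentioned above.

```c
#include <stddef.h>

/* Whitespace as described for PCRE_EXTENDED: VT (code 11) is excluded. */
int is_pattern_space(char c)
{
  return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Hypothetical helper: copy pat into out (capacity cap, including the
   terminating zero), dropping what PCRE_EXTENDED would ignore. Returns
   the number of characters written, or (size_t)-1 if out is too small. */
size_t strip_extended(const char *pat, char *out, size_t cap)
{
  size_t n = 0;
  int in_class = 0;
  if (cap == 0) return (size_t)-1;
  for (size_t i = 0; pat[i] != '\0'; i++) {
    char c = pat[i];
    if (c == '\\' && pat[i + 1] != '\0') {
      /* An escaped character is always kept, together with its backslash. */
      if (n + 2 >= cap) return (size_t)-1;
      out[n++] = c;
      out[n++] = pat[++i];
      continue;
    }
    if (in_class) {
      if (c == ']') in_class = 0;      /* simplification: first ] closes */
    } else if (c == '[') {
      in_class = 1;
    } else if (is_pattern_space(c)) {
      continue;                        /* ignored whitespace */
    } else if (c == '#') {
      /* Comment: skip up to and including the next newline. */
      while (pat[i] != '\0' && pat[i] != '\n') i++;
      if (pat[i] == '\0') break;
      continue;
    }
    if (n + 1 >= cap) return (size_t)-1;
    out[n++] = c;
  }
  out[n] = '\0';
  return n;
}
```

With this sketch, the pattern text "a b # note" followed by a newline and "[ #] c" reduces to "ab[ #]c": the space and # inside the class survive, while the comment and the outer spaces do not.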
1270    
1271           PCRE_EXTRA           PCRE_EXTRA
1272    
1273         This option was invented in order to turn on  additional  functionality         This  option  was invented in order to turn on additional functionality
1274         of  PCRE  that  is  incompatible with Perl, but it is currently of very         of PCRE that is incompatible with Perl, but it  is  currently  of  very
1275         little use. When set, any backslash in a pattern that is followed by  a         little  use. When set, any backslash in a pattern that is followed by a
1276         letter  that  has  no  special  meaning causes an error, thus reserving         letter that has no special meaning  causes  an  error,  thus  reserving
1277         these combinations for future expansion. By  default,  as  in  Perl,  a         these  combinations  for  future  expansion.  By default, as in Perl, a
1278         backslash  followed by a letter with no special meaning is treated as a         backslash followed by a letter with no special meaning is treated as  a
1279         literal. (Perl can, however, be persuaded to give a warning for  this.)         literal.  (Perl can, however, be persuaded to give a warning for this.)
1280         There  are  at  present no other features controlled by this option. It         There are at present no other features controlled by  this  option.  It
1281         can also be set by a (?X) option setting within a pattern.         can also be set by a (?X) option setting within a pattern.
1282    
1283           PCRE_FIRSTLINE           PCRE_FIRSTLINE
1284    
1285         If this option is set, an  unanchored  pattern  is  required  to  match         If  this  option  is  set,  an  unanchored pattern is required to match
1286         before  or  at  the  first  newline  in  the subject string, though the         before or at the first  newline  in  the  subject  string,  though  the
1287         matched text may continue over the newline.         matched text may continue over the newline.
1288    
1289           PCRE_JAVASCRIPT_COMPAT           PCRE_JAVASCRIPT_COMPAT
1290    
1291         If this option is set, PCRE's behaviour is changed in some ways so that         If this option is set, PCRE's behaviour is changed in some ways so that
1292         it  is  compatible with JavaScript rather than Perl. The changes are as         it is compatible with JavaScript rather than Perl. The changes  are  as
1293         follows:         follows:
1294    
1295         (1) A lone closing square bracket in a pattern  causes  a  compile-time         (1)  A  lone  closing square bracket in a pattern causes a compile-time
1296         error,  because this is illegal in JavaScript (by default it is treated         error, because this is illegal in JavaScript (by default it is  treated
1297         as a data character). Thus, the pattern AB]CD becomes illegal when this         as a data character). Thus, the pattern AB]CD becomes illegal when this
1298         option is set.         option is set.
1299    
1300         (2)  At run time, a back reference to an unset subpattern group matches         (2) At run time, a back reference to an unset subpattern group  matches
1301         an empty string (by default this causes the current  matching  alterna-         an  empty  string (by default this causes the current matching alterna-
1302         tive  to  fail). A pattern such as (\1)(a) succeeds when this option is         tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
1303         set (assuming it can find an "a" in the subject), whereas it  fails  by         set  (assuming  it can find an "a" in the subject), whereas it fails by
1304         default, for Perl compatibility.         default, for Perl compatibility.
1305    
1306           PCRE_MULTILINE           PCRE_MULTILINE
1307    
1308         By  default,  PCRE  treats the subject string as consisting of a single         By default, PCRE treats the subject string as consisting  of  a  single
1309         line of characters (even if it actually contains newlines). The  "start         line  of characters (even if it actually contains newlines). The "start
1310         of  line"  metacharacter  (^)  matches only at the start of the string,         of line" metacharacter (^) matches only at the  start  of  the  string,
1311         while the "end of line" metacharacter ($) matches only at  the  end  of         while  the  "end  of line" metacharacter ($) matches only at the end of
1312         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1313         is set). This is the same as Perl.         is set). This is the same as Perl.
1314    
1315         When PCRE_MULTILINE is set, the "start of line" and  "end  of  line"         When  PCRE_MULTILINE  is set, the "start of line" and "end of line"
1316         constructs  match  immediately following or immediately before internal         constructs match immediately following or immediately  before  internal
1317         newlines in the subject string, respectively, as well as  at  the  very         newlines  in  the  subject string, respectively, as well as at the very
1318         start  and  end.  This is equivalent to Perl's /m option, and it can be         start and end. This is equivalent to Perl's /m option, and  it  can  be
1319         changed within a pattern by a (?m) option setting. If there are no new-         changed within a pattern by a (?m) option setting. If there are no new-
1320         lines  in  a  subject string, or no occurrences of ^ or $ in a pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1321         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
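
The interaction of PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY with ^ and $ can be summarized by the following model. It is a sketch only: the flag names OPT_MULTILINE and OPT_DOLLAR_ENDONLY and both functions are hypothetical, newline is assumed to be a single LF, and "internal newline" is read as a newline that is not the last character of the subject.

```c
#include <stddef.h>

/* Hypothetical flags for this model; these are not PCRE's option values. */
enum { OPT_MULTILINE = 1, OPT_DOLLAR_ENDONLY = 2 };

/* Would $ match at offset pos in the subject s of length len? */
int dollar_matches(const char *s, size_t len, size_t pos, int opts)
{
  if (pos > len) return 0;
  if (pos == len) return 1;                /* the very end always matches */
  if (s[pos] != '\n') return 0;
  if (opts & OPT_MULTILINE) return 1;      /* DOLLAR_ENDONLY is ignored */
  if (opts & OPT_DOLLAR_ENDONLY) return 0; /* only the very end is allowed */
  return pos + 1 == len;                   /* before a terminating newline */
}

/* Would ^ match at offset pos in the subject s of length len? */
int caret_matches(const char *s, size_t len, size_t pos, int opts)
{
  if (pos > len) return 0;
  if (pos == 0) return 1;                  /* start of the subject */
  if (!(opts & OPT_MULTILINE)) return 0;
  return pos < len && s[pos - 1] == '\n';  /* after an internal newline */
}
```

For the six-character subject "ab", LF, "cd", LF, the model reports that $ matches at offset 2 only with OPT_MULTILINE, and at offset 5 unless OPT_DOLLAR_ENDONLY is set on its own.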
1322    
1323           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 1315  COMPILING A PATTERN Line 1326  COMPILING A PATTERN
1326           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
1327           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1328    
1329         These options override the default newline definition that  was  chosen         These  options  override the default newline definition that was chosen
1330         when  PCRE  was built. Setting the first or the second specifies that a         when PCRE was built. Setting the first or the second specifies  that  a
1331         newline is indicated by a single character (CR  or  LF,  respectively).         newline  is  indicated  by a single character (CR or LF, respectively).
1332         Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1333         two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1334         that any of the three preceding sequences should be recognized. Setting         that any of the three preceding sequences should be recognized. Setting
1335         PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1336         recognized. The Unicode newline sequences are the three just mentioned,         recognized. The Unicode newline sequences are the three just mentioned,
1337         plus the single characters VT (vertical  tab,  U+000B),  FF  (formfeed,         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1338         U+000C),  NEL  (next line, U+0085), LS (line separator, U+2028), and PS         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1339         (paragraph separator, U+2029). The last  two  are  recognized  only  in         (paragraph  separator,  U+2029).  The  last  two are recognized only in
1340         UTF-8 mode.         UTF-8 mode.
1341    
1342         The  newline  setting  in  the  options  word  uses three bits that are         The newline setting in the  options  word  uses  three  bits  that  are
1343         treated as a number, giving eight possibilities. Currently only six are         treated as a number, giving eight possibilities. Currently only six are
1344         used  (default  plus the five values above). This means that if you set         used (default plus the five values above). This means that if  you  set
1345         more than one newline option, the combination may or may not be  sensi-         more  than one newline option, the combination may or may not be sensi-
1346         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1347         PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1348         cause an error.         cause an error.
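
The three-bit arithmetic described above can be checked with a short standalone sketch. The PCRE_NEWLINE_xxx values below are copied from the pcre.h of this period as an assumption, so that the fragment compiles without the library; in a real program, include pcre.h instead. The names NEWLINE_MASK, NEWLINE_SHIFT, newline_field(), newline_is_defined(), and newline_name() are hypothetical.

```c
/* Assumed values, as found in pcre.h; use the header in real code. */
#define PCRE_NEWLINE_CR       0x00100000
#define PCRE_NEWLINE_LF       0x00200000
#define PCRE_NEWLINE_CRLF     0x00300000
#define PCRE_NEWLINE_ANY      0x00400000
#define PCRE_NEWLINE_ANYCRLF  0x00500000

/* Hypothetical helpers for this sketch. */
#define NEWLINE_MASK   0x00700000   /* the three newline bits */
#define NEWLINE_SHIFT  20

/* Extract the newline setting as a number from 0 to 7. */
int newline_field(int options)
{
  return (options & NEWLINE_MASK) >> NEWLINE_SHIFT;
}

/* Only 0 (build-time default) and the five named values are in use. */
int newline_is_defined(int options)
{
  return newline_field(options) <= 5;
}

const char *newline_name(int options)
{
  switch (newline_field(options)) {
    case 0:  return "build-time default";
    case 1:  return "CR";
    case 2:  return "LF";
    case 3:  return "CRLF";
    case 4:  return "ANY";
    case 5:  return "ANYCRLF";
    default: return "unused";
  }
}
```

Under these assumed values, PCRE_NEWLINE_CR | PCRE_NEWLINE_LF gives the number 3, which is PCRE_NEWLINE_CRLF, whereas PCRE_NEWLINE_LF | PCRE_NEWLINE_ANY gives 6, one of the unused numbers.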
1349    
1350         The  only time that a line break is specially recognized when compiling         The only time that a line break is specially recognized when  compiling
1351         a pattern is if PCRE_EXTENDED is set, and  an  unescaped  #  outside  a         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
1352         character  class  is  encountered.  This indicates a comment that lasts         character class is encountered. This indicates  a  comment  that  lasts
1353         until after the next line break sequence. In other circumstances,  line         until  after the next line break sequence. In other circumstances, line
1354         break   sequences   are   treated  as  literal  data,  except  that  in         break  sequences  are  treated  as  literal  data,   except   that   in
1355         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1356         and are therefore ignored.         and are therefore ignored.
1357    
# Line 1350  COMPILING A PATTERN Line 1361  COMPILING A PATTERN
1361           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1362    
1363         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
1364         theses  in the pattern. Any opening parenthesis that is not followed by         theses in the pattern. Any opening parenthesis that is not followed  by
1365         ? behaves as if it were followed by ?: but named parentheses can  still         ?  behaves as if it were followed by ?: but named parentheses can still
1366         be  used  for  capturing  (and  they acquire numbers in the usual way).         be used for capturing (and they acquire  numbers  in  the  usual  way).
1367         There is no equivalent of this option in Perl.         There is no equivalent of this option in Perl.
1368    
1369           PCRE_UNGREEDY           PCRE_UNGREEDY
1370    
1371         This option inverts the "greediness" of the quantifiers  so  that  they         This  option  inverts  the "greediness" of the quantifiers so that they
1372         are  not greedy by default, but become greedy if followed by "?". It is         are not greedy by default, but become greedy if followed by "?". It  is
1373         not compatible with Perl. It can also be set by a (?U)  option  setting         not  compatible  with Perl. It can also be set by a (?U) option setting
1374         within the pattern.         within the pattern.
1375    
1376           PCRE_UTF8           PCRE_UTF8
1377    
1378         This  option  causes PCRE to regard both the pattern and the subject as         This option causes PCRE to regard both the pattern and the  subject  as
1379         strings of UTF-8 characters instead of single-byte  character  strings.         strings  of  UTF-8 characters instead of single-byte character strings.
1380         However,  it is available only when PCRE is built to include UTF-8 sup-         However, it is available only when PCRE is built to include UTF-8  sup-
1381         port. If not, the use of this option provokes an error. Details of  how         port.  If not, the use of this option provokes an error. Details of how
1382         this  option  changes the behaviour of PCRE are given in the section on         this option changes the behaviour of PCRE are given in the  section  on
1383         UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
1384    
1385           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1386    
1387         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1388         automatically  checked.  There  is  a  discussion about the validity of         automatically checked. There is a  discussion  about  the  validity  of
1389         UTF-8 strings in the main pcre page. If an invalid  UTF-8  sequence  of         UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
1390         bytes  is  found,  pcre_compile() returns an error. If you already know         bytes is found, pcre_compile() returns an error. If  you  already  know
1391         that your pattern is valid, and you want to skip this check for perfor-         that your pattern is valid, and you want to skip this check for perfor-
1392         mance  reasons,  you  can set the PCRE_NO_UTF8_CHECK option. When it is         mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
1393         set, the effect of passing an invalid UTF-8  string  as  a  pattern  is         set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
1394         undefined.  It  may  cause your program to crash. Note that this option         undefined. It may cause your program to crash. Note  that  this  option
1395         can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress  the         can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
1396         UTF-8 validity checking of subject strings.         UTF-8 validity checking of subject strings.
1397    
1398    
1399  COMPILATION ERROR CODES  COMPILATION ERROR CODES
1400    
1401         The  following  table  lists  the  error  codes that may be returned by         The following table lists the error  codes  that  may  be  returned  by
1402         pcre_compile2(), along with the error messages that may be returned  by         pcre_compile2(),  along with the error messages that may be returned by
1403         both  compiling functions. As PCRE has developed, some error codes have         both compiling functions. As PCRE has developed, some error codes  have
1404         fallen out of use. To avoid confusion, they have not been re-used.         fallen out of use. To avoid confusion, they have not been re-used.
1405    
1406            0  no error            0  no error
# Line 1445  COMPILATION ERROR CODES Line 1456  COMPILATION ERROR CODES
1456           50  [this code is not in use]           50  [this code is not in use]
1457           51  octal value is greater than \377 (not in UTF-8 mode)           51  octal value is greater than \377 (not in UTF-8 mode)
1458           52  internal error: overran compiling workspace           52  internal error: overran compiling workspace
1459           53  internal  error:  previously-checked  referenced  subpattern  not           53   internal  error:  previously-checked  referenced  subpattern not
1460         found         found
1461           54  DEFINE group contains more than one branch           54  DEFINE group contains more than one branch
1462           55  repeating a DEFINE group is not allowed           55  repeating a DEFINE group is not allowed
# Line 1460  COMPILATION ERROR CODES Line 1471  COMPILATION ERROR CODES
1471           63  digit expected after (?+           63  digit expected after (?+
1472           64  ] is an invalid data character in JavaScript compatibility mode           64  ] is an invalid data character in JavaScript compatibility mode
1473    
1474         The  numbers  32  and 10000 in errors 48 and 49 are defaults; different         The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1475         values may be used if the limits were changed when PCRE was built.         values may be used if the limits were changed when PCRE was built.
1476    
1477    
# Line 1469  STUDYING A PATTERN Line 1480  STUDYING A PATTERN
1480         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1481              const char **errptr);              const char **errptr);
1482    
1483         If a compiled pattern is going to be used several times,  it  is  worth         If  a  compiled  pattern is going to be used several times, it is worth
1484         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
1485         matching. The function pcre_study() takes a pointer to a compiled  pat-         matching.  The function pcre_study() takes a pointer to a compiled pat-
1486         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
1487         information that will help speed up matching,  pcre_study()  returns  a         information  that  will  help speed up matching, pcre_study() returns a
1488         pointer  to a pcre_extra block, in which the study_data field points to         pointer to a pcre_extra block, in which the study_data field points  to
1489         the results of the study.         the results of the study.
1490    
1491         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1492         pcre_exec().  However,  a  pcre_extra  block also contains other fields         pcre_exec(). However, a pcre_extra block  also  contains  other  fields
1493         that can be set by the caller before the block  is  passed;  these  are         that  can  be  set  by the caller before the block is passed; these are
1494         described below in the section on matching a pattern.         described below in the section on matching a pattern.
1495    
1496         If  studying  the  pattern  does not produce any additional information,         If studying the pattern does not  produce  any  additional  information,
1497         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study() returns NULL. In that circumstance, if the calling program
1498         wants  to  pass  any of the other fields to pcre_exec(), it must set up         wants to pass any of the other fields to pcre_exec(), it  must  set  up
1499         its own pcre_extra block.         its own pcre_extra block.
1500    
1501         The second argument of pcre_study() contains option bits.  At  present,         The  second  argument of pcre_study() contains option bits. At present,
1502         no options are defined, and this argument should always be zero.         no options are defined, and this argument should always be zero.
1503    
1504         The  third argument for pcre_study() is a pointer for an error message.         The third argument for pcre_study() is a pointer for an error  message.
1505         If studying succeeds (even if no data is  returned),  the  variable  it         If  studying  succeeds  (even  if no data is returned), the variable it
1506         points  to  is  set  to NULL. Otherwise it is set to point to a textual         points to is set to NULL. Otherwise it is set to  point  to  a  textual
1507         error message. This is a static string that is part of the library. You         error message. This is a static string that is part of the library. You
1508         must  not  try  to  free it. You should test the error pointer for NULL         must not try to free it. You should test the  error  pointer  for  NULL
1509         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
1510    
1511         This is a typical call to pcre_study():         This is a typical call to pcre_study():
# Line 1506  STUDYING A PATTERN Line 1517  STUDYING A PATTERN
1517             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1518    
1519         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1520         that  do not have a single fixed starting character. A bitmap of possi-         that do not have a single fixed starting character. A bitmap of  possi-
1521         ble starting bytes is created.         ble starting bytes is created.
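
To show why a bitmap of possible starting bytes helps, here is a standalone sketch of such a bitmap and of the scan it makes possible. The type start_bitmap and all four functions are hypothetical; the layout (256 bits held in 32 bytes) is an assumption made for the illustration and is not a description of the pcre_extra study data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: one bit for each possible byte value. */
typedef struct { unsigned char bits[32]; } start_bitmap;

void bitmap_clear(start_bitmap *m)
{
  memset(m->bits, 0, sizeof m->bits);
}

/* Record that a match may start with byte c. */
void bitmap_add(start_bitmap *m, unsigned char c)
{
  m->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

int bitmap_has(const start_bitmap *m, unsigned char c)
{
  return (m->bits[c >> 3] >> (c & 7)) & 1;
}

/* Return the first offset at or after start whose byte could begin a
   match, or len if there is none. Offsets that are skipped need never
   be handed to the full matching engine. */
size_t first_candidate(const start_bitmap *m, const char *subject,
                       size_t len, size_t start)
{
  size_t i = start;
  while (i < len && !bitmap_has(m, (unsigned char)subject[i])) i++;
  return i;
}
```

For a pattern such as (cat|dog), the bitmap would hold only c and d, so in the subject "a big dog" the scan moves straight to offset 6 without trying a match at the earlier positions.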
1522    
1523    
1524  LOCALE SUPPORT  LOCALE SUPPORT
1525    
1526         PCRE handles caseless matching, and determines whether  characters  are         PCRE  handles  caseless matching, and determines whether characters are
1527         letters,  digits, or whatever, by reference to a set of tables, indexed         letters, digits, or whatever, by reference to a set of tables,  indexed
1528         by character value. When running in UTF-8 mode, this  applies  only  to         by  character  value.  When running in UTF-8 mode, this applies only to
1529         characters  with  codes  less than 128. Higher-valued codes never match         characters with codes less than 128. Higher-valued  codes  never  match
1530         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built
1531         with  Unicode  character property support. The use of locales with Uni-         with Unicode character property support. The use of locales  with  Uni-
1532         code is discouraged. If you are handling characters with codes  greater         code  is discouraged. If you are handling characters with codes greater
1533         than  128, you should either use UTF-8 and Unicode, or use locales, but         than 128, you should either use UTF-8 and Unicode, or use locales,  but
1534         not try to mix the two.         not try to mix the two.
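
The idea of tables indexed by character value can be illustrated as follows. This sketch builds two 256-entry tables from the C library's ctype functions in whatever locale is current; it is an assumption-laden illustration of the technique, not the layout that pcre_maketables() produces, and the function names are hypothetical.

```c
#include <ctype.h>

/* Hypothetical helper: fill a lower-casing table and a case-flipping
   table, each indexed by character value, using the current locale. */
void build_case_tables(unsigned char lower[256], unsigned char flip[256])
{
  for (int c = 0; c < 256; c++) {
    lower[c] = (unsigned char)tolower(c);
    flip[c] = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
  }
}

/* Caseless comparison becomes two table lookups. */
int caseless_equal(const unsigned char lower[256],
                   unsigned char a, unsigned char b)
{
  return lower[a] == lower[b];
}
```

Because the tables are filled from tolower(), toupper(), and islower(), calling setlocale() before build_case_tables() changes which characters above 127 are treated as letters with case.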
1535    
1536         PCRE contains an internal set of tables that are used  when  the  final         PCRE  contains  an  internal set of tables that are used when the final
1537         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
1538         applications.  Normally, the internal tables recognize only ASCII char-         applications.  Normally, the internal tables recognize only ASCII char-
1539         acters. However, when PCRE is built, it is possible to cause the inter-         acters. However, when PCRE is built, it is possible to cause the inter-
1540         nal tables to be rebuilt in the default "C" locale of the local system,         nal tables to be rebuilt in the default "C" locale of the local system,
1541         which may cause them to be different.         which may cause them to be different.
1542    
1543         The  internal tables can always be overridden by tables supplied by the         The internal tables can always be overridden by tables supplied by  the
1544         application that calls PCRE. These may be created in a different locale         application that calls PCRE. These may be created in a different locale
1545         from  the  default.  As more and more applications change to using Uni-         from the default. As more and more applications change  to  using  Uni-
1546         code, the need for this locale support is expected to die away.         code, the need for this locale support is expected to die away.
1547    
1548         External tables are built by calling  the  pcre_maketables()  function,         External  tables  are  built by calling the pcre_maketables() function,
1549         which  has no arguments, in the relevant locale. The result can then be         which has no arguments, in the relevant locale. The result can then  be
1550         passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1551         example,  to  build  and use tables that are appropriate for the French         example, to build and use tables that are appropriate  for  the  French
1552         locale (where accented characters with  values  greater  than  128  are         locale  (where  accented  characters  with  values greater than 128 are
1553         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1554    
1555           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1556           tables = pcre_maketables();           tables = pcre_maketables();
1557           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1558    
1559         The  locale  name "fr_FR" is used on Linux and other Unix-like systems;         The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1560         if you are using Windows, the name for the French locale is "french".         if you are using Windows, the name for the French locale is "french".
1561    
1562         When pcre_maketables() runs, the tables are built  in  memory  that  is         When  pcre_maketables()  runs,  the  tables are built in memory that is
1563         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained via pcre_malloc. It is the caller's responsibility  to  ensure
1564         that the memory containing the tables remains available for as long  as         that  the memory containing the tables remains available for as long as
1565         it is needed.         it is needed.
1566    
1567         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
1568         pattern, and the same tables are used via this pointer by  pcre_study()         pattern,  and the same tables are used via this pointer by pcre_study()
1569         and normally also by pcre_exec(). Thus, by default, for any single pat-         and normally also by pcre_exec(). Thus, by default, for any single pat-
1570         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
1571         but different patterns can be compiled in different locales.         but different patterns can be compiled in different locales.
1572    
1573         It  is  possible to pass a table pointer or NULL (indicating the use of         It is possible to pass a table pointer or NULL (indicating the  use  of
1574         the internal tables) to pcre_exec(). Although  not  intended  for  this         the  internal  tables)  to  pcre_exec(). Although not intended for this
1575         purpose,  this facility could be used to match a pattern in a different         purpose, this facility could be used to match a pattern in a  different
1576         locale from the one in which it was compiled. Passing table pointers at         locale from the one in which it was compiled. Passing table pointers at
1577         run time is discussed below in the section on matching a pattern.         run time is discussed below in the section on matching a pattern.
1578    
# Line 1571  INFORMATION ABOUT A PATTERN Line 1582  INFORMATION ABOUT A PATTERN
1582         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1583              int what, void *where);              int what, void *where);
1584    
1585         The  pcre_fullinfo() function returns information about a compiled pat-         The pcre_fullinfo() function returns information about a compiled  pat-
1586         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1587         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1588    
1589         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
1590         pattern. The second argument is the result of pcre_study(), or NULL  if         pattern.  The second argument is the result of pcre_study(), or NULL if
1591         the  pattern  was not studied. The third argument specifies which piece         the pattern was not studied. The third argument specifies  which  piece
1592         of information is required, and the fourth argument is a pointer  to  a         of  information  is required, and the fourth argument is a pointer to a
1593         variable  to  receive  the  data. The yield of the function is zero for         variable to receive the data. The yield of the  function  is  zero  for
1594         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1595    
1596           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 1587  INFORMATION ABOUT A PATTERN Line 1598  INFORMATION ABOUT A PATTERN
1598           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1599           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1600    
1601         The "magic number" is placed at the start of each compiled  pattern  as         The  "magic  number" is placed at the start of each compiled pattern as
1602         a  simple check against passing an arbitrary memory pointer. Here is a         a simple check against passing an arbitrary memory pointer. Here is  a
1603         typical call of pcre_fullinfo(), to obtain the length of  the  compiled         typical  call  of pcre_fullinfo(), to obtain the length of the compiled
1604         pattern:         pattern:
1605    
1606           int rc;           int rc;
# Line 1600  INFORMATION ABOUT A PATTERN Line 1611  INFORMATION ABOUT A PATTERN
1611             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1612             &length);         /* where to put the data */             &length);         /* where to put the data */
1613    
1614         The  possible  values for the third argument are defined in pcre.h, and         The possible values for the third argument are defined in  pcre.h,  and
1615         are as follows:         are as follows:
1616    
1617           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1618    
1619         Return the number of the highest back reference  in  the  pattern.  The         Return  the  number  of  the highest back reference in the pattern. The
1620         fourth  argument  should  point to an int variable. Zero is returned if         fourth argument should point to an int variable. Zero  is  returned  if
1621         there are no back references.         there are no back references.
1622    
1623           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1624    
1625         Return the number of capturing subpatterns in the pattern.  The  fourth         Return  the  number of capturing subpatterns in the pattern. The fourth
1626         argument should point to an int variable.         argument should point to an int variable.
1627    
1628           PCRE_INFO_DEFAULT_TABLES           PCRE_INFO_DEFAULT_TABLES
1629    
1630         Return  a pointer to the internal default character tables within PCRE.         Return a pointer to the internal default character tables within  PCRE.
1631         The fourth argument should point to an unsigned char *  variable.  This         The  fourth  argument should point to an unsigned char * variable. This
1632         information call is provided for internal use by the pcre_study() func-         information call is provided for internal use by the pcre_study() func-
1633         tion. External callers can cause PCRE to use  its  internal  tables  by         tion.  External  callers  can  cause PCRE to use its internal tables by
1634         passing a NULL table pointer.         passing a NULL table pointer.
1635    
1636           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1637    
1638         Return  information  about  the first byte of any matched string, for a         Return information about the first byte of any matched  string,  for  a
1639         non-anchored pattern. The fourth argument should point to an int  vari-         non-anchored  pattern. The fourth argument should point to an int vari-
1640         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name         able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name
1641         is still recognized for backwards compatibility.)         is still recognized for backwards compatibility.)
1642    
1643         If there is a fixed first byte, for example, from  a  pattern  such  as         If  there  is  a  fixed first byte, for example, from a pattern such as
1644         (cat|cow|coyote), its value is returned. Otherwise, if either         (cat|cow|coyote), its value is returned. Otherwise, if either
1645    
1646         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1647         branch starts with "^", or         branch starts with "^", or
1648    
1649         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1650         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1651    
1652         -1  is  returned, indicating that the pattern matches only at the start         -1 is returned, indicating that the pattern matches only at  the  start
1653         of a subject string or after any newline within the  string.  Otherwise         of  a  subject string or after any newline within the string. Otherwise
1654         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
1655    
1656           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1657    
1658         If  the pattern was studied, and this resulted in the construction of a         If the pattern was studied, and this resulted in the construction of  a
1659         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1660         matching  string, a pointer to the table is returned. Otherwise NULL is         matching string, a pointer to the table is returned. Otherwise NULL  is
1661         returned. The fourth argument should point to an unsigned char *  vari-         returned.  The fourth argument should point to an unsigned char * vari-
1662         able.         able.
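
A minimal sketch of how a caller might consult such a 256-bit table, once a non-NULL pointer has been obtained via PCRE_INFO_FIRSTTABLE. The bit layout used here (bit c & 7 of byte c / 8 stands for byte value c) is an assumption, not something this text states; the helper names first_byte_allowed() and first_byte_mark() are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if byte value c is marked in a 256-bit (32-byte) table,
   otherwise 0. Assumed layout: bit (c & 7) of byte (c / 8) represents
   the byte value c. */
static int first_byte_allowed(const unsigned char *table, unsigned char c)
{
  assert(table != NULL);
  return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Hypothetical helper, used only to build a table for experiments:
   marks byte value c as a possible first byte. */
static void first_byte_mark(unsigned char *table, unsigned char c)
{
  assert(table != NULL);
  table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```

A subject position whose byte is not marked in the table cannot start a match, so a scanner could skip it without attempting the pattern there.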
1663    
1664           PCRE_INFO_HASCRORLF           PCRE_INFO_HASCRORLF
1665    
1666         Return  1  if  the  pattern  contains any explicit matches for CR or LF         Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
1667         characters, otherwise 0. The fourth argument should  point  to  an  int         characters,  otherwise  0.  The  fourth argument should point to an int
1668         variable.  An explicit match is either a literal CR or LF character, or         variable. An explicit match is either a literal CR or LF character,  or
1669         \r or \n.         \r or \n.
1670    
1671           PCRE_INFO_JCHANGED           PCRE_INFO_JCHANGED
1672    
1673         Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,         Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
1674         otherwise  0. The fourth argument should point to an int variable. (?J)         otherwise 0. The fourth argument should point to an int variable.  (?J)
1675         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1676    
1677           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1678    
1679         Return the value of the rightmost literal byte that must exist  in  any         Return  the  value of the rightmost literal byte that must exist in any
1680         matched  string,  other  than  at  its  start,  if such a byte has been         matched string, other than at its  start,  if  such  a  byte  has  been
1681         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
1682         is  no such byte, -1 is returned. For anchored patterns, a last literal         is no such byte, -1 is returned. For anchored patterns, a last  literal
1683         byte is recorded only if it follows something of variable  length.  For         byte  is  recorded only if it follows something of variable length. For
1684         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1685         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
1686    
# Line 1677  INFORMATION ABOUT A PATTERN Line 1688  INFORMATION ABOUT A PATTERN
1688           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
1689           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
1690    
1691         PCRE supports the use of named as well as numbered capturing  parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
1692         ses.  The names are just an additional way of identifying the parenthe-         ses. The names are just an additional way of identifying the  parenthe-
1693         ses, which still acquire numbers. Several convenience functions such as         ses, which still acquire numbers. Several convenience functions such as
1694         pcre_get_named_substring()  are  provided  for extracting captured sub-         pcre_get_named_substring() are provided for  extracting  captured  sub-
1695         strings by name. It is also possible to extract the data  directly,  by         strings  by  name. It is also possible to extract the data directly, by
1696         first  converting  the  name to a number in order to access the correct         first converting the name to a number in order to  access  the  correct
1697         pointers in the output vector (described with pcre_exec() below). To do         pointers in the output vector (described with pcre_exec() below). To do
1698         the  conversion,  you  need  to  use  the  name-to-number map, which is         the conversion, you need  to  use  the  name-to-number  map,  which  is
1699         described by these three values.         described by these three values.
1700    
1701         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1702         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1703         of each entry; both of these  return  an  int  value.  The  entry  size         of  each  entry;  both  of  these  return  an int value. The entry size
1704         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
1705         a pointer to the first entry of the table  (a  pointer  to  char).  The         a  pointer  to  the  first  entry of the table (a pointer to char). The
1706         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1707         sis, most significant byte first. The rest of the entry is  the  corre-         sis,  most  significant byte first. The rest of the entry is the corre-
1708         sponding  name,  zero  terminated. The names are in alphabetical order.         sponding name, zero terminated. The names are  in  alphabetical  order.
1709         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1710         theses  numbers.  For  example,  consider the following pattern (assume         theses numbers. For example, consider  the  following  pattern  (assume
1711         PCRE_EXTENDED is  set,  so  white  space  -  including  newlines  -  is         PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1712         ignored):         ignored):
1713    
1714           (?<date> (?<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
1715           (?<month>\d\d) - (?<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
1716    
1717         There  are  four  named subpatterns, so the table has four entries, and         There are four named subpatterns, so the table has  four  entries,  and
1718         each entry in the table is eight bytes long. The table is  as  follows,         each  entry  in the table is eight bytes long. The table is as follows,
1719         with non-printing bytes shown in hexadecimal, and undefined bytes shown         with non-printing bytes shown in hexadecimal, and undefined bytes shown
1720         as ??:         as ??:
1721    
# Line 1713  INFORMATION ABOUT A PATTERN Line 1724  INFORMATION ABOUT A PATTERN
1724           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1725           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1726    
1727         When writing code to extract data  from  named  subpatterns  using  the         When  writing  code  to  extract  data from named subpatterns using the
1728         name-to-number  map,  remember that the length of the entries is likely         name-to-number map, remember that the length of the entries  is  likely
1729         to be different for each compiled pattern.         to be different for each compiled pattern.
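
The name-to-number map layout described above can be walked with a short loop. The following is a sketch only: lookup_group_number() is a hypothetical helper, not a PCRE function, and it assumes that the caller has already fetched the table pointer, entry count, and entry size with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE. It uses a linear scan for simplicity, although the alphabetical ordering would also allow a binary search.

```c
#include <assert.h>
#include <string.h>

/* Searches a name-to-number map for name. Each entry is entry_size bytes:
   two bytes holding the parenthesis number (most significant byte first),
   followed by the zero-terminated name. Returns the number of the
   capturing parenthesis, or -1 if the name is not in the table. */
static int lookup_group_number(const char *name_table, int name_count,
                               int entry_size, const char *name)
{
  int i;
  assert(name_table != NULL && name != NULL && entry_size > 2);
  for (i = 0; i < name_count; i++)
    {
    const unsigned char *entry =
      (const unsigned char *)name_table + (size_t)i * (size_t)entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
    }
  return -1;
}
```

Because the entry size is passed in rather than hard-coded, the same helper works for any compiled pattern, whatever the length of its longest name.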
1730    
1731           PCRE_INFO_OKPARTIAL           PCRE_INFO_OKPARTIAL
1732    
1733         Return 1 if the pattern can be used for partial matching, otherwise  0.         Return  1  if  the  pattern  can  be  used  for  partial  matching with
1734         The fourth argument should point to an int variable. From release 8.00,         pcre_exec(), otherwise 0. The fourth argument should point  to  an  int
1735         this always returns 1, because the restrictions that previously applied         variable.  From  release  8.00,  this  always  returns  1,  because the
1736         to  partial  matching  have  been lifted. The pcrepartial documentation         restrictions that previously applied  to  partial  matching  have  been
1737         gives details of partial matching.         lifted.  The  pcrepartial documentation gives details of partial match-
1738           ing.
1739    
1740           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1741    
# Line 1909  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1921  MATCHING A PATTERN: THE TRADITIONAL FUNC
1921         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1922         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1923    
1924         The  pcre_callout  field is used in conjunction with the "callout" fea-         The  callout_data  field is used in conjunction with the "callout" fea-
1925         ture, which is described in the pcrecallout documentation.         ture, and is described in the pcrecallout documentation.
1926    
1927         The tables field  is  used  to  pass  a  character  tables  pointer  to         The tables field  is  used  to  pass  a  character  tables  pointer  to
1928         pcre_exec();  this overrides the value that is stored with the compiled         pcre_exec();  this overrides the value that is stored with the compiled
# Line 1927  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1939  MATCHING A PATTERN: THE TRADITIONAL FUNC
1939    
1940         The  unused  bits of the options argument for pcre_exec() must be zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
1941         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1942         PCRE_NOTBOL,    PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_START_OPTIMIZE,         PCRE_NOTBOL,    PCRE_NOTEOL,    PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,
1943         PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.         PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_SOFT,   and
1944           PCRE_PARTIAL_HARD.
1945    
1946           PCRE_ANCHORED           PCRE_ANCHORED
1947    
1948         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first
1949         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or
1950         turned out to be anchored by virtue of its contents, it cannot be  made         turned  out to be anchored by virtue of its contents, it cannot be made
1951         unanchored at matching time.         unanchored at matching time.
1952    
1953           PCRE_BSR_ANYCRLF           PCRE_BSR_ANYCRLF
1954           PCRE_BSR_UNICODE           PCRE_BSR_UNICODE
1955    
1956         These options (which are mutually exclusive) control what the \R escape         These options (which are mutually exclusive) control what the \R escape
1957         sequence matches. The choice is either to match only CR, LF,  or  CRLF,         sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1958         or  to  match  any Unicode newline sequence. These options override the         or to match any Unicode newline sequence. These  options  override  the
1959         choice that was made or defaulted when the pattern was compiled.         choice that was made or defaulted when the pattern was compiled.
1960    
1961           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 1951  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1964  MATCHING A PATTERN: THE TRADITIONAL FUNC
1964           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
1965           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1966    
1967         These options override  the  newline  definition  that  was  chosen  or         These  options  override  the  newline  definition  that  was chosen or
1968         defaulted  when the pattern was compiled. For details, see the descrip-         defaulted when the pattern was compiled. For details, see the  descrip-
1969         tion of pcre_compile()  above.  During  matching,  the  newline  choice         tion  of  pcre_compile()  above.  During  matching,  the newline choice
1970         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
1971         ters. It may also alter the way the match position is advanced after  a         ters.  It may also alter the way the match position is advanced after a
1972         match failure for an unanchored pattern.         match failure for an unanchored pattern.
1973    
1974         When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is         When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY  is
1975         set, and a match attempt for an unanchored pattern fails when the  cur-         set,  and a match attempt for an unanchored pattern fails when the cur-
1976         rent  position  is  at  a  CRLF  sequence,  and the pattern contains no         rent position is at a  CRLF  sequence,  and  the  pattern  contains  no
1977         explicit matches for  CR  or  LF  characters,  the  match  position  is         explicit  matches  for  CR  or  LF  characters,  the  match position is
1978         advanced by two characters instead of one, in other words, to after the         advanced by two characters instead of one, in other words, to after the
1979         CRLF.         CRLF.
1980    
1981         The above rule is a compromise that makes the most common cases work as         The above rule is a compromise that makes the most common cases work as
1982         expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL         expected. For example, if the  pattern  is  .+A  (and  the  PCRE_DOTALL
1983         option is not set), it does not match the string "\r\nA" because, after         option is not set), it does not match the string "\r\nA" because, after
1984         failing  at the start, it skips both the CR and the LF before retrying.         failing at the start, it skips both the CR and the LF before  retrying.
1985         However, the pattern [\r\n]A does match that string,  because  it  con-         However,  the  pattern  [\r\n]A does match that string, because it con-
1986         tains an explicit CR or LF reference, and so advances only by one char-         tains an explicit CR or LF reference, and so advances only by one char-
1987         acter after the first failure.         acter after the first failure.
1988    
1989         An explicit match for CR or LF is either a literal appearance of one of         An explicit match for CR or LF is either a literal appearance of one of
1990         those  characters,  or  one  of the \r or \n escape sequences. Implicit         those characters, or one of the \r or  \n  escape  sequences.  Implicit
1991         matches such as [^X] do not count, nor does \s (which includes  CR  and         matches  such  as [^X] do not count, nor does \s (which includes CR and
1992         LF in the characters that it matches).         LF in the characters that it matches).
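
The advance rule described above can be illustrated with a small stand-alone function. This is a sketch of the rule as stated in the text, not PCRE's actual implementation: next_start_offset() and its parameters are hypothetical names. The crlf_is_newline flag stands for "PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is in force", has_cr_or_lf stands for the value reported by PCRE_INFO_HASCRORLF, and the subject is treated as single-byte characters (no UTF-8).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset at which the next match attempt should start after
   an unanchored match attempt has failed at offset pos. */
static size_t next_start_offset(const char *subject, size_t length,
                                size_t pos, int crlf_is_newline,
                                int has_cr_or_lf)
{
  assert(subject != NULL && pos < length);
  /* Skip the whole CRLF pair only when CRLF is a valid newline and the
     pattern contains no explicit matches for CR or LF. */
  if (crlf_is_newline && !has_cr_or_lf &&
      pos + 1 < length && subject[pos] == '\r' && subject[pos + 1] == '\n')
    return pos + 2;
  return pos + 1;
}
```

With the subject "\r\nA", a pattern such as .+A (no explicit CR or LF) moves from offset 0 directly to offset 2, whereas [\r\n]A moves only to offset 1 and can therefore match starting at the LF.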
1993    
1994         Notwithstanding  the above, anomalous effects may still occur when CRLF         Notwithstanding the above, anomalous effects may still occur when  CRLF
1995         is a valid newline sequence and explicit \r or \n escapes appear in the         is a valid newline sequence and explicit \r or \n escapes appear in the
1996         pattern.         pattern.
1997    
1998           PCRE_NOTBOL           PCRE_NOTBOL
1999    
2000         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
2001         the beginning of a line, so the  circumflex  metacharacter  should  not         the  beginning  of  a  line, so the circumflex metacharacter should not
2002         match  before it. Setting this without PCRE_MULTILINE (at compile time)         match before it. Setting this without PCRE_MULTILINE (at compile  time)
2003         causes circumflex never to match. This option affects only  the  behav-         causes  circumflex  never to match. This option affects only the behav-
2004         iour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
2005    
2006           PCRE_NOTEOL           PCRE_NOTEOL
2007    
2008         This option specifies that the end of the subject string is not the end         This option specifies that the end of the subject string is not the end
2009         of a line, so the dollar metacharacter should not match it nor  (except         of  a line, so the dollar metacharacter should not match it nor (except
2010         in  multiline mode) a newline immediately before it. Setting this with-         in multiline mode) a newline immediately before it. Setting this  with-
2011         out PCRE_MULTILINE (at compile time) causes dollar never to match. This         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
2012         option  affects only the behaviour of the dollar metacharacter. It does         option affects only the behaviour of the dollar metacharacter. It  does
2013         not affect \Z or \z.         not affect \Z or \z.
2014    
2015           PCRE_NOTEMPTY           PCRE_NOTEMPTY
2016    
2017         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
2018         set.  If  there are alternatives in the pattern, they are tried. If all         set. If there are alternatives in the pattern, they are tried.  If  all
2019         the alternatives match the empty string, the entire  match  fails.  For         the  alternatives  match  the empty string, the entire match fails. For
2020         example, if the pattern         example, if the pattern
2021    
2022           a?b?           a?b?
2023    
2024         is  applied  to  a string not beginning with "a" or "b", it matches the         is applied to a string not beginning with "a" or  "b",  it  matches  an
2025         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
2026         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
2027         rences of "a" or "b".         rences of "a" or "b".
2028    
2029         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-           PCRE_NOTEMPTY_ATSTART
2030         cial  case  of  a  pattern match of the empty string within its split()  
2031         function, and when using the /g modifier. It  is  possible  to  emulate         This  is  like PCRE_NOTEMPTY, except that an empty string match that is
2032         Perl's behaviour after matching a null string by first trying the match         not at the start of  the  subject  is  permitted.  If  the  pattern  is
2033         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         anchored, such a match can occur only if the pattern contains \K.
2034         if  that  fails by advancing the starting offset (see below) and trying  
2035         an ordinary match again. There is some code that demonstrates how to do         Perl     has    no    direct    equivalent    of    PCRE_NOTEMPTY    or
2036         this in the pcredemo sample program.         PCRE_NOTEMPTY_ATSTART, but it does make a special  case  of  a  pattern
2037           match  of  the empty string within its split() function, and when using
2038           the /g modifier. It is  possible  to  emulate  Perl's  behaviour  after
2039           matching a null string by first trying the match again at the same off-
2040           set with PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED,  and  then  if  that
2041           fails, by advancing the starting offset (see below) and trying an ordi-
2042           nary match again. There is some code that demonstrates how to  do  this
2043           in the pcredemo sample program.
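The retry loop described above can be sketched in plain C. This is a minimal illustration, not the pcredemo code: toy_exec() is a hypothetical stand-in for pcre_exec() with the pattern fixed as a*, and TOY_NOTEMPTY_ATSTART and TOY_ANCHORED are assumed names that model only the behaviour of the two real option bits. A real caller would also need to step over a whole CRLF pair or UTF-8 character when advancing, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED. */
enum { TOY_NOTEMPTY_ATSTART = 1, TOY_ANCHORED = 2 };

/* Toy stand-in for pcre_exec() with the pattern fixed as a* (assumption:
   only the start offset, two option bits and one match span are modelled).
   Returns 1 and sets *ms and *me on a match, 0 otherwise. */
static int toy_exec(const char *s, size_t len, size_t start, int opts,
                    size_t *ms, size_t *me)
{
    size_t pos;
    for (pos = start; pos <= len; pos++) {
        size_t end = pos;
        while (end < len && s[end] == 'a') end++;
        /* Reject only an empty match located at the starting offset. */
        if (!(end == pos && pos == start && (opts & TOY_NOTEMPTY_ATSTART))) {
            *ms = pos;
            *me = end;
            return 1;
        }
        if (opts & TOY_ANCHORED) break;   /* may not move the start point */
    }
    return 0;
}

/* Collect successive matches the way Perl's /g would: after an empty
   match, retry at the same offset with both options set; if that fails,
   advance the offset by one and make an ordinary attempt again. */
size_t find_all(const char *s, size_t *starts, size_t *ends, size_t max)
{
    size_t len = strlen(s), offset = 0, n = 0;
    int opts = 0;
    assert(starts != NULL && ends != NULL);
    while (n < max) {
        size_t ms, me;
        if (!toy_exec(s, len, offset, opts, &ms, &me)) {
            if (opts == 0 || offset >= len) break;  /* no more matches */
            offset++;                               /* step past one char */
            opts = 0;
            continue;
        }
        starts[n] = ms;
        ends[n] = me;
        n++;
        offset = me;
        opts = (ms == me) ? (TOY_NOTEMPTY_ATSTART | TOY_ANCHORED) : 0;
    }
    return n;
}
```

For the subject "baac" this loop reports four matches: an empty string at offset 0, "aa" at offsets 1 to 3, and empty strings at offsets 3 and 4.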
2044    
2045           PCRE_NO_START_OPTIMIZE           PCRE_NO_START_OPTIMIZE
2046    
# Line 2066  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2086  MATCHING A PATTERN: THE TRADITIONAL FUNC
2086         returns  PCRE_ERROR_PARTIAL.  Otherwise,  if  PCRE_PARTIAL_SOFT is set,         returns  PCRE_ERROR_PARTIAL.  Otherwise,  if  PCRE_PARTIAL_SOFT is set,
2087         matching continues by testing any other alternatives. Only if they  all         matching continues by testing any other alternatives. Only if they  all
2088         fail  is  PCRE_ERROR_PARTIAL  returned (instead of PCRE_ERROR_NOMATCH).         fail  is  PCRE_ERROR_PARTIAL  returned (instead of PCRE_ERROR_NOMATCH).
2089         The portion of the string that provided the partial match is set as the         The portion of the string that was inspected when the partial match was
2090         first  matching  string.  There  is  a  more detailed discussion in the         found  is  set  as  the first matching string. There is a more detailed
2091         pcrepartial documentation.         discussion in the pcrepartial documentation.
2092    
2093     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2094    
# Line 2484  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2504  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2504         characteristics to the normal algorithm, and  is  not  compatible  with         characteristics to the normal algorithm, and  is  not  compatible  with
2505         Perl.  Some  of the features of PCRE patterns are not supported. Never-         Perl.  Some  of the features of PCRE patterns are not supported. Never-
2506         theless, there are times when this kind of matching can be useful.  For         theless, there are times when this kind of matching can be useful.  For
2507         a discussion of the two matching algorithms, see the pcrematching docu-         a  discussion  of  the  two matching algorithms, and a list of features
2508         mentation.         that pcre_dfa_exec() does not support, see the pcrematching  documenta-
2509           tion.
2510    
2511         The arguments for the pcre_dfa_exec() function  are  the  same  as  for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2512         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
2513         ent way, and this is described below. The other  common  arguments  are         ent  way,  and  this is described below. The other common arguments are
2514         used  in  the  same way as for pcre_exec(), so their description is not         used in the same way as for pcre_exec(), so their  description  is  not
2515         repeated here.         repeated here.
2516    
2517         The two additional arguments provide workspace for  the  function.  The         The  two  additional  arguments provide workspace for the function. The
2518         workspace  vector  should  contain at least 20 elements. It is used for         workspace vector should contain at least 20 elements. It  is  used  for
2519         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2520         workspace  will  be  needed for patterns and subjects where there are a         workspace will be needed for patterns and subjects where  there  are  a
2521         lot of potential matches.         lot of potential matches.
2522    
2523         Here is an example of a simple call to pcre_dfa_exec():         Here is an example of a simple call to pcre_dfa_exec():
# Line 2518  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2539  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2539    
2540     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2541    
2542         The unused bits of the options argument  for  pcre_dfa_exec()  must  be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2543         zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2544         LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,  PCRE_NO_UTF8_CHECK,         LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
2545         PCRE_PARTIAL_HARD,     PCRE_PARTIAL_SOFT,     PCRE_DFA_SHORTEST,    and         PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR-
2546         PCRE_DFA_RESTART. All but the last four of these are exactly  the  same         TIAL_SOFT,  PCRE_DFA_SHORTEST,  and  PCRE_DFA_RESTART. All but the last
2547         as for pcre_exec(), so their description is not repeated here.         four of these are  exactly  the  same  as  for  pcre_exec(),  so  their
2548           description is not repeated here.
2549    
2550           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2551           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
# Line 2537  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2559  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2559         code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end         code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
2560         of the subject is reached, there have been  no  complete  matches,  but         of the subject is reached, there have been  no  complete  matches,  but
2561         there  is  still  at least one matching possibility. The portion of the         there  is  still  at least one matching possibility. The portion of the
2562         string that provided the longest partial match  is  set  as  the  first         string that was inspected when the longest partial match was  found  is
2563         matching string in both cases.         set as the first matching string in both cases.
2564    
2565           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2566    
# Line 2644  AUTHOR Line 2666  AUTHOR
2666    
2667  REVISION  REVISION
2668    
2669         Last updated: 01 September 2009         Last updated: 11 September 2009
2670         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
2671  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2672    
# Line 2869  DIFFERENCES BETWEEN PCRE AND PERL Line 2891  DIFFERENCES BETWEEN PCRE AND PERL
2891         is  built  with Unicode character property support. The properties that         is  built  with Unicode character property support. The properties that
2892         can be tested with \p and \P are limited to the general category  prop-         can be tested with \p and \P are limited to the general category  prop-
2893         erties  such  as  Lu and Nd, script names such as Greek or Han, and the         erties  such  as  Lu and Nd, script names such as Greek or Han, and the
2894         derived properties Any and L&.         derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
2895           property,  which  Perl  does  not; the Perl documentation says "Because
2896           Perl hides the need for the user to understand the internal representa-
2897           tion  of Unicode characters, there is no need to implement the somewhat
2898           messy concept of surrogates."
2899    
2900         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2901         ters  in  between  are  treated as literals. This is slightly different         ters  in  between  are  treated as literals. This is slightly different
# Line 2889  DIFFERENCES BETWEEN PCRE AND PERL Line 2915  DIFFERENCES BETWEEN PCRE AND PERL
2915    
2916         8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})         8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
2917         constructions. However, there is support for recursive  patterns.  This         constructions. However, there is support for recursive  patterns.  This
2918         is  not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE         is  not  available  in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
2919         "callout" feature allows an external function to be called during  pat-         "callout" feature allows an external function to be called during  pat-
2920         tern matching. See the pcrecallout documentation for details.         tern matching. See the pcrecallout documentation for details.
2921    
2922         9.  Subpatterns  that  are  called  recursively or as "subroutines" are         9.  Subpatterns  that  are  called  recursively or as "subroutines" are
2923         always treated as atomic groups in  PCRE.  This  is  like  Python,  but         always treated as atomic groups in  PCRE.  This  is  like  Python,  but
2924         unlike Perl.         unlike  Perl. There is a discussion of an example that explains this in
2925           more detail in the section on recursion differences from  Perl  in  the
2926           pcrepattern page.
2927    
2928         10.  There are some differences that are concerned with the settings of         10.  There are some differences that are concerned with the settings of
2929         captured strings when part of  a  pattern  is  repeated.  For  example,         captured strings when part of  a  pattern  is  repeated.  For  example,
# Line 2904  DIFFERENCES BETWEEN PCRE AND PERL Line 2932  DIFFERENCES BETWEEN PCRE AND PERL
2932    
2933         11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),         11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),
2934         (*FAIL),  (*F),  (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in         (*FAIL),  (*F),  (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
2935         the forms without an  argument.  PCRE  does  not  support  (*MARK).  If         the forms without an argument. PCRE does not support (*MARK).
        (*ACCEPT)  is within capturing parentheses, PCRE does not set that cap-  
        ture group; this is different to Perl.  
2936    
2937         12. PCRE provides some extensions to the Perl regular expression facil-         12. PCRE provides some extensions to the Perl regular expression facil-
2938         ities.   Perl  5.10  will  include new features that are not in earlier         ities.   Perl  5.10  will  include new features that are not in earlier
# Line 2931  DIFFERENCES BETWEEN PCRE AND PERL Line 2957  DIFFERENCES BETWEEN PCRE AND PERL
2957         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2958         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
2959    
2960         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
2961         TURE options for pcre_exec() have no Perl equivalents.         and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no  Perl  equiva-
2962           lents.
2963    
2964         (g) The \R escape sequence can be restricted to match only CR,  LF,  or         (g)  The  \R escape sequence can be restricted to match only CR, LF, or
2965         CRLF by the PCRE_BSR_ANYCRLF option.         CRLF by the PCRE_BSR_ANYCRLF option.
2966    
2967         (h) The callout facility is PCRE-specific.         (h) The callout facility is PCRE-specific.
# Line 2944  DIFFERENCES BETWEEN PCRE AND PERL Line 2971  DIFFERENCES BETWEEN PCRE AND PERL
2971         (j) Patterns compiled by PCRE can be saved and re-used at a later time,         (j) Patterns compiled by PCRE can be saved and re-used at a later time,
2972         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
2973    
2974         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a         (k)  The  alternative  matching function (pcre_dfa_exec()) matches in a
2975         different way and is not Perl-compatible.         different way and is not Perl-compatible.
2976    
2977         (l)  PCRE  recognizes some special sequences such as (*CR) at the start         (l) PCRE recognizes some special sequences such as (*CR) at  the  start
2978         of a pattern that set overall options that cannot be changed within the         of a pattern that set overall options that cannot be changed within the
2979         pattern.         pattern.
2980    
# Line 2961  AUTHOR Line 2988  AUTHOR
2988    
2989  REVISION  REVISION
2990    
2991         Last updated: 25 August 2009         Last updated: 18 September 2009
2992         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
2993  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2994    
# Line 3480  BACKSLASH Line 3507  BACKSLASH
3507         U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see         U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see
3508         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
3509         ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in         ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in
3510         the pcreapi page).         the pcreapi page). Perl does not support the Cs property.
3511    
3512         The  long  synonyms  for  these  properties that Perl supports (such as         The  long  synonyms  for  property  names  that  Perl supports (such as
3513         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
3514         any of these properties with "Is".         any of these properties with "Is".
3515    
# Line 4707  RECURSIVE PATTERNS Line 4734  RECURSIVE PATTERNS
4734         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
4735         it  supports  special  syntax  for recursion of the entire pattern, and         it  supports  special  syntax  for recursion of the entire pattern, and
4736         also for individual subpattern recursion.  After  its  introduction  in         also for individual subpattern recursion.  After  its  introduction  in
4737         PCRE  and  Python,  this  kind of recursion was introduced into Perl at         PCRE  and  Python,  this  kind of recursion was subsequently introduced
4738         release 5.10.         into Perl at release 5.10.
4739    
4740         A special item that consists of (? followed by a  number  greater  than         A special item that consists of (? followed by a  number  greater  than
4741         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
# Line 4717  RECURSIVE PATTERNS Line 4744  RECURSIVE PATTERNS
4744         tion.) The special item (?R) or (?0) is a recursive call of the  entire         tion.) The special item (?R) or (?0) is a recursive call of the  entire
4745         regular expression.         regular expression.
4746    
4747         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is         This  PCRE  pattern  solves  the nested parentheses problem (assume the
        always treated as an atomic group. That is, once it has matched some of  
        the subject string, it is never re-entered, even if it contains untried  
        alternatives and there is a subsequent matching failure.  
   
        This PCRE pattern solves the nested  parentheses  problem  (assume  the  
4748         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
4749    
4750           \( ( (?>[^()]+) | (?R) )* \)           \( ( (?>[^()]+) | (?R) )* \)
4751    
4752         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
4753         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
4754         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
4755         sized substring).  Finally there is a closing parenthesis.         sized substring).  Finally there is a closing parenthesis.
4756    
4757         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
4758         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
4759    
4760           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
4761    
4762         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
4763         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
4764    
4765         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
4766         tricky.  This is made easier by the use of relative references. (A Perl         tricky. This is made easier by the use of relative references. (A  Perl
4767         5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write         5.10  feature.)   Instead  of  (?1)  in the pattern above you can write
4768         (?-2) to refer to the second most recently opened parentheses preceding         (?-2) to refer to the second most recently opened parentheses preceding
4769         the recursion. In other  words,  a  negative  number  counts  capturing         the  recursion.  In  other  words,  a  negative number counts capturing
4770         parentheses leftwards from the point at which it is encountered.         parentheses leftwards from the point at which it is encountered.
4771    
4772         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
4773         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
4774         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
4775         enced. They are always "subroutine" calls, as  described  in  the  next         enced.  They  are  always  "subroutine" calls, as described in the next
4776         section.         section.
4777    
4778         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
4779         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
4780         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
4781    
4782           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
4783    
4784         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
4785         one is used.         one is used.
4786    
4787         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
4788         nested  unlimited repeats, and so the use of atomic grouping for match-         nested unlimited repeats, and so the use of atomic grouping for  match-
4789         ing strings of non-parentheses is important when applying  the  pattern         ing  strings  of non-parentheses is important when applying the pattern
4790         to strings that do not match. For example, when this pattern is applied         to strings that do not match. For example, when this pattern is applied
4791         to         to
4792    
4793           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4794    
4795         it yields "no match" quickly. However, if atomic grouping is not  used,         it  yields "no match" quickly. However, if atomic grouping is not used,
4796         the  match  runs  for a very long time indeed because there are so many         the match runs for a very long time indeed because there  are  so  many
4797         different ways the + and * repeats can carve up the  subject,  and  all         different  ways  the  + and * repeats can carve up the subject, and all
4798         have to be tested before failure can be reported.         have to be tested before failure can be reported.
4799    
4800         At the end of a match, the values set for any capturing subpatterns are         At the end of a match, the values set for any capturing subpatterns are
4801         those from the outermost level of the recursion at which the subpattern         those from the outermost level of the recursion at which the subpattern
4802         value  is  set.   If  you want to obtain intermediate values, a callout         value is set.  If you want to obtain  intermediate  values,  a  callout
4803         function can be used (see below and the pcrecallout documentation).  If         function  can be used (see below and the pcrecallout documentation). If
4804         the pattern above is matched against         the pattern above is matched against
4805    
4806           (ab(cd)ef)           (ab(cd)ef)
4807    
4808         the  value  for  the  capturing  parentheses is "ef", which is the last         the value for the capturing parentheses is  "ef",  which  is  the  last
4809         value taken on at the top level. If additional parentheses  are  added,         value  taken  on at the top level. If additional parentheses are added,
4810         giving         giving
4811    
4812           \( ( ( (?>[^()]+) | (?R) )* ) \)           \( ( ( (?>[^()]+) | (?R) )* ) \)
4813              ^                        ^              ^                        ^
4814              ^                        ^              ^                        ^
4815    
4816         the  string  they  capture is "ab(cd)ef", the contents of the top level         the string they capture is "ab(cd)ef", the contents of  the  top  level
4817         parentheses. If there are more than 15 capturing parentheses in a  pat-         parentheses.  If there are more than 15 capturing parentheses in a pat-
4818         tern, PCRE has to obtain extra memory to store data during a recursion,         tern, PCRE has to obtain extra memory to store data during a recursion,
4819         which it does by using pcre_malloc, freeing  it  via  pcre_free  after-         which  it  does  by  using pcre_malloc, freeing it via pcre_free after-
4820         wards.  If  no  memory  can  be  obtained,  the  match  fails  with the         wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
4821         PCRE_ERROR_NOMEMORY error.         PCRE_ERROR_NOMEMORY error.
4822    
4823         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do  not  confuse  the (?R) item with the condition (R), which tests for
4824         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.  Consider this pattern, which matches text in  angle  brack-
4825         ets, allowing for arbitrary nesting. Only digits are allowed in  nested         ets,  allowing for arbitrary nesting. Only digits are allowed in nested
4826         brackets  (that is, when recursing), whereas any characters are permit-         brackets (that is, when recursing), whereas any characters are  permit-
4827         ted at the outer level.         ted at the outer level.
4828    
4829           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
4830    
4831         In this pattern, (?(R) is the start of a conditional  subpattern,  with         In  this  pattern, (?(R) is the start of a conditional subpattern, with
4832         two  different  alternatives for the recursive and non-recursive cases.         two different alternatives for the recursive and  non-recursive  cases.
4833         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
4834    
4835       Recursion difference from Perl
4836    
4837           In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
4838           always treated as an atomic group. That is, once it has matched some of
4839           the subject string, it is never re-entered, even if it contains untried
4840           alternatives and there is a subsequent matching failure.  This  can  be
4841           illustrated  by the following pattern, which purports to match a palin-
4842           dromic string that contains an odd number of characters  (for  example,
4843           "a", "aba", "abcba", "abcdcba"):
4844    
4845             ^(.|(.)(?1)\2)$
4846    
4847           The idea is that it either matches a single character, or two identical
4848           characters surrounding a sub-palindrome. In Perl, this  pattern  works;
4849           in PCRE it does not if the subject string is longer than three
4850           characters. Consider the subject string "abcba":
4851    
4852           At the top level, the first character is matched, but as it is  not  at
4853           the end of the string, the first alternative fails; the second alterna-
4854           tive is taken and the recursion kicks in. The recursive call to subpat-
4855           tern  1  successfully  matches the next character ("b"). (Note that the
4856           beginning and end of line tests are not part of the recursion.)
4857    
4858           Back at the top level, the next character ("c") is compared  with  what
4859           subpattern  2 matched, which was "a". This fails. Because the recursion
4860           is treated as an atomic group, there are now  no  backtracking  points,
4861           and  so  the  entire  match fails. (Perl is able, at this point, to re-
4862           enter the recursion and try the second alternative.)  However,  if  the
4863           pattern is written with the alternatives in the other order, things are
4864           different:
4865    
4866             ^((.)(?1)\2|.)$
4867    
4868           This time, the recursing alternative is tried first, and  continues  to
4869           recurse  until  it runs out of characters, at which point the recursion
4870           fails. But this time we do have  another  alternative  to  try  at  the
4871           higher  level.  That  is  the  big difference: in the previous case the
4872           remaining alternative is at a deeper recursion level, which PCRE cannot
4873           use.
4874    
4875           To change the pattern so that it matches all palindromic strings,
4876           not just those with an odd number of characters, it is tempting to
4877           change the pattern to this:
4878    
4879             ^((.)(?1)\2|.?)$
4880    
4881           Again,  this  works  in Perl, but not in PCRE, and for the same reason.
4882           When a deeper recursion has matched a single character,  it  cannot  be
4883           entered  again  in  order  to match an empty string. The solution is to
4884           separate the two cases, and write out the odd and even cases as  alter-
4885           natives at the higher level:
4886    
4887             ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
4888    
4889           If  you  want  to match typical palindromic phrases, the pattern has to
4890           ignore all non-word characters, which can be done like this:
4891    
4892             ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
4893    
4894           If run with the PCRE_CASELESS option, this pattern matches phrases such
4895           as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
4896           Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
4897           ing  into  sequences of non-word characters. Without this, PCRE takes a
4898           great deal longer (ten times or more) to  match  typical  phrases,  and
4899           Perl takes so long that you think it has gone into a loop.
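When testing a pattern like this, it can help to have an independent check of which phrases it is intended to accept. The following is a minimal sketch in plain C that does not use PCRE at all: is_palindromic_phrase() is a hypothetical helper that skips non-word characters and compares the rest caselessly, on the assumption that a word character is an ASCII letter, digit, or underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumed definition of a word character: letter, digit or underscore. */
static int is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Return 1 if the word characters of s read the same in both directions,
   ignoring case and all non-word characters; return 0 otherwise. */
int is_palindromic_phrase(const char *s)
{
    size_t i, j;
    assert(s != NULL);
    i = 0;
    j = strlen(s);                /* j is one past the last candidate */
    for (;;) {
        while (i < j && !is_word((unsigned char)s[i])) i++;
        while (j > i && !is_word((unsigned char)s[j - 1])) j--;
        if (j - i < 2) return 1;  /* zero or one word character left */
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)s[j - 1]))
            return 0;
        i++;
        j--;
    }
}
```

Subjects for which this helper and the pattern disagree are worth examining by hand, since either the pattern or the assumptions stated here may be at fault.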
4900    
4901    
4902  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
4903    
4904         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
4905         by  name)  is used outside the parentheses to which it refers, it oper-         by name) is used outside the parentheses to which it refers,  it  oper-
4906         ates like a subroutine in a programming language. The "called"  subpat-         ates  like a subroutine in a programming language. The "called" subpat-
4907         tern may be defined before or after the reference. A numbered reference         tern may be defined before or after the reference. A numbered reference
4908         can be absolute or relative, as in these examples:         can be absolute or relative, as in these examples:
4909    
# Line 4827  SUBPATTERNS AS SUBROUTINES Line 4915  SUBPATTERNS AS SUBROUTINES
4915    
4916           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4917    
4918         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
4919         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
4920    
4921           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
4922    
4923         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
4924         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
4925         above.         above.
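
         The difference between the back reference and the subroutine call  can
         be  sketched  in  Python.  This is an illustration only: Python's re
         module has no (?1) syntax, so the call is approximated by writing the
         group's  pattern  out a second time, which is what a non-recursive call
         amounts to. The names backref, inlined and accepts are hypothetical.

```python
import re

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for "(sens|respons)e and (?1)ibility": the group's pattern is
# written out again, so either alternative may be chosen the second time.
inlined = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

def accepts(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None

for subject in ("sense and sensibility", "sense and responsibility"):
    print(subject, accepts(backref, subject), accepts(inlined, subject))
```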
4926    
4927         Like recursive subpatterns, a "subroutine" call is always treated as an         Like recursive subpatterns, a "subroutine" call is always treated as an
4928         atomic group. That is, once it has matched some of the subject  string,         atomic  group. That is, once it has matched some of the subject string,
4929         it  is  never  re-entered, even if it contains untried alternatives and         it is never re-entered, even if it contains  untried  alternatives  and
4930         there is a subsequent matching failure.         there is a subsequent matching failure.
4931    
4932         When a subpattern is used as a subroutine, processing options  such  as         When  a  subpattern is used as a subroutine, processing options such as
4933         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
4934         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
4935    
4936           (abc)(?i:(?-1))           (abc)(?i:(?-1))
4937    
4938         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
4939         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
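
         The effect can be imitated in Python, again only as a sketch, because
         Python's re module has no subroutine calls. Copying the group's  text
         into  the  caseless  group  gives the wrong answer; to imitate PCRE the
         copy has to carry its own option setting with it.  Scoped  flags  such
         as (?-i:...) are assumed to be available (Python 3.6 or later). The
         names naive_copy and faithful_copy are hypothetical.

```python
import re

# Copying the text of group 1 into the caseless group: the copy picks up
# the caseless option, which is NOT how the PCRE subroutine call behaves.
naive_copy = re.compile(r"(abc)(?i:abc)")

# Imitating PCRE: the copy keeps the case-sensitive setting that was in
# force where group 1 was defined, whatever the surrounding group says.
faithful_copy = re.compile(r"(abc)(?i:(?-i:abc))")

for subject in ("abcabc", "abcABC"):
    print(subject,
          naive_copy.fullmatch(subject) is not None,
          faithful_copy.fullmatch(subject) is not None)
```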
4940    
4941    
4942  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
4943    
4944         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
4945         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
4946         an  alternative  syntax  for  referencing a subpattern as a subroutine,         an alternative syntax for referencing a  subpattern  as  a  subroutine,
4947         possibly recursively. Here are two of the examples used above,  rewrit-         possibly  recursively. Here are two of the examples used above, rewrit-
4948         ten using this syntax:         ten using this syntax:
4949    
4950           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
4951           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
4952    
4953         PCRE  supports  an extension to Oniguruma: if a number is preceded by a         PCRE supports an extension to Oniguruma: if a number is preceded  by  a
4954         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
4955    
4956           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
4957    
4958         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
4959         synonymous.  The former is a back reference; the latter is a subroutine         synonymous. The former is a back reference; the latter is a  subroutine
4960         call.         call.
4961    
4962    
4963  CALLOUTS  CALLOUTS
4964    
4965         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
4966         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
4967         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
4968         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
4969         tion.         tion.
4970    
4971         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
4972         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
4973         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
4974         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
4975         all calling out.         all calling out.
4976    
4977         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
4978         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
4979         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
4980         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
4981         points:         points:
4982    
4983           (?C1)abc(?C2)def           (?C1)abc(?C2)def
4984    
4985         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
4986         automatically  installed  before each item in the pattern. They are all         automatically installed before each item in the pattern. They  are  all
4987         numbered 255.         numbered 255.
4988    
4989         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
4990         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
4991         of the callout, the position in the pattern, and, optionally, one  item         of  the callout, the position in the pattern, and, optionally, one item
4992         of  data  originally supplied by the caller of pcre_exec(). The callout         of data originally supplied by the caller of pcre_exec().  The  callout
4993         function may cause matching to proceed, to backtrack, or to fail  alto-         function  may cause matching to proceed, to backtrack, or to fail alto-
4994         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
4995         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
4996    
4997    
4998  BACKTRACKING CONTROL  BACKTRACKING CONTROL
4999    
5000         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
5001         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
5002         ject to change or removal in a future version of Perl". It goes  on  to         ject  to  change or removal in a future version of Perl". It goes on to
5003         say:  "Their usage in production code should be noted to avoid problems         say: "Their usage in production code should be noted to avoid  problems
5004         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
5005         in this section.         in this section.
5006    
5007         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
5008         them can be  used  only  when  the  pattern  is  to  be  matched  using         them  can  be  used  only  when  the  pattern  is  to  be matched using
5009         pcre_exec(), which uses a backtracking algorithm. With the exception of         pcre_exec(), which uses a backtracking algorithm. With the exception of
5010         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5011         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5012    
5013           If any of these verbs are used in an assertion subpattern, their effect
5014           is  confined  to that subpattern; it does not extend to the surrounding
5015           pattern.  Note that assertion subpatterns are processed as anchored  at
5016           the point where they are tested.
5017    
5018         The  new verbs make use of what was previously invalid syntax: an open-         The  new verbs make use of what was previously invalid syntax: an open-
5019         ing parenthesis followed by an asterisk. In Perl, they are generally of         ing parenthesis followed by an asterisk. In Perl, they are generally of
5020         the form (*VERB:ARG) but PCRE does not support the use of arguments, so         the form (*VERB:ARG) but PCRE does not support the use of arguments, so
# Line 4936  BACKTRACKING CONTROL Line 5029  BACKTRACKING CONTROL
5029    
5030         This  verb causes the match to end successfully, skipping the remainder         This  verb causes the match to end successfully, skipping the remainder
5031         of the pattern. When inside a recursion, only the innermost pattern  is         of the pattern. When inside a recursion, only the innermost pattern  is
5032         ended  immediately.  PCRE  differs  from  Perl  in  what happens if the         ended  immediately.  If  the (*ACCEPT) is inside capturing parentheses,
5033         (*ACCEPT) is inside capturing parentheses. In Perl, the data so far  is         the data so far is captured. (This feature was added to PCRE at release
5034         captured: in PCRE no data is captured. For example:         8.00.) For example:
5035    
5036           A(A|B(*ACCEPT)|C)D           A((?:A|B(*ACCEPT)|C)D)
5037    
5038         This  matches  "AB", "AAD", or "ACD", but when it matches "AB", no data         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
5039         is captured.         tured by the outer parentheses.
5040    
5041           (*FAIL) or (*F)           (*FAIL) or (*F)
5042    
# Line 5039  AUTHOR Line 5132  AUTHOR
5132    
5133  REVISION  REVISION
5134    
5135         Last updated: 11 April 2009         Last updated: 18 September 2009
5136         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5137  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5138    
# Line 5453  PARTIAL MATCHING USING pcre_exec() Line 5546  PARTIAL MATCHING USING pcre_exec()
5546         If PCRE_PARTIAL_SOFT is set,  the  partial  match  is  remembered,  but         If PCRE_PARTIAL_SOFT is set,  the  partial  match  is  remembered,  but
5547         matching continues as normal, and other alternatives in the pattern are         matching continues as normal, and other alternatives in the pattern are
5548         tried.  If  no  complete  match  can  be  found,  pcre_exec()   returns         tried.  If  no  complete  match  can  be  found,  pcre_exec()   returns
5549         PCRE_ERROR_PARTIAL  instead  of PCRE_ERROR_NOMATCH, and if there are at         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
5550         least two slots in the offsets vector, they are filled in with the off-         two slots in the offsets vector, the first of them is set to the offset
5551         sets  of  the longest string that partially matched. Consider this pat-         of the earliest character that was inspected when the partial match was
5552         tern:         found. For convenience, the second offset points  to  the  end  of  the
5553           string so that a substring can easily be extracted.
5554    
5555           For  the majority of patterns, the first offset identifies the start of
5556           the partially matched string. However, for patterns that contain  look-
5557           behind  assertions,  or  \K, or begin with \b or \B, earlier characters
5558           have been inspected while carrying out the match. For example:
5559    
5560             /(?<=abc)123/
5561    
5562           This pattern matches "123", but only if it is preceded by "abc". If the
5563           subject string is "xyzabc12", the offsets after a partial match are for
5564           the substring "abc12", because  all  these  characters  are  needed  if
5565           another match is tried with extra characters added.
5566    
5567           If  there  is more than one partial match, the first one that was found
5568           provides the data that is returned. Consider this pattern:
5569    
5570           /123\w+X|dogY/           /123\w+X|dogY/
5571    
# Line 5464  PARTIAL MATCHING USING pcre_exec() Line 5573  PARTIAL MATCHING USING pcre_exec()
5573         natives  fail  to  match,  but the end of the subject is reached during         natives  fail  to  match,  but the end of the subject is reached during
5574         matching,   so    PCRE_ERROR_PARTIAL    is    returned    instead    of         matching,   so    PCRE_ERROR_PARTIAL    is    returned    instead    of
5575         PCRE_ERROR_NOMATCH.  The  offsets  are  set  to  3  and  9, identifying         PCRE_ERROR_NOMATCH.  The  offsets  are  set  to  3  and  9, identifying
5576         "123dog" as the longest partial match that was found. (In this example,         "123dog" as the first partial match that was found. (In  this  example,
5577         there  are  two  partial  matches,  because  "dog" on its own partially         there  are  two  partial  matches,  because  "dog" on its own partially
5578         matches the second alternative.)         matches the second alternative.)
5579    
# Line 5508  PARTIAL MATCHING USING pcre_dfa_exec() Line 5617  PARTIAL MATCHING USING pcre_dfa_exec()
5617         there  have  been  no complete matches. Otherwise, the complete matches         there  have  been  no complete matches. Otherwise, the complete matches
5618         are returned.  However, if PCRE_PARTIAL_HARD is set,  a  partial  match         are returned.  However, if PCRE_PARTIAL_HARD is set,  a  partial  match
5619         takes  precedence  over any complete matches. The portion of the string         takes  precedence  over any complete matches. The portion of the string
5620         that provided the longest partial match is set as  the  first  matching         that was inspected when the longest partial match was found is  set  as
5621         string, provided there are at least two slots in the offsets vector.         the first matching string, provided there are at least two slots in the
5622           offsets vector.
5623    
5624         Because  pcre_dfa_exec()  always searches for all possible matches, and         Because pcre_dfa_exec() always searches for all possible  matches,  and
5625         there is no difference between greedy and ungreedy repetition, its  be-         there  is no difference between greedy and ungreedy repetition, its be-
5626         haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-         haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-
5627         sider the string "dog"  matched  against  the  ungreedy  pattern  shown         sider  the  string  "dog"  matched  against  the ungreedy pattern shown
5628         above:         above:
5629    
5630           /dog(sbody)??/           /dog(sbody)??/
5631    
5632         Whereas  pcre_exec()  stops  as soon as it finds the complete match for         Whereas pcre_exec() stops as soon as it finds the  complete  match  for
5633         "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and         "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
5634         so returns that when PCRE_PARTIAL_HARD is set.         so returns that when PCRE_PARTIAL_HARD is set.
5635    
5636    
5637  PARTIAL MATCHING AND WORD BOUNDARIES  PARTIAL MATCHING AND WORD BOUNDARIES
5638    
5639         If  a  pattern ends with one of sequences \b or \B, which test for word         If a pattern ends with one of sequences \b or \B, which test  for  word
5640         boundaries, partial matching with PCRE_PARTIAL_SOFT can  give  counter-         boundaries,  partial  matching with PCRE_PARTIAL_SOFT can give counter-
5641         intuitive results. Consider this pattern:         intuitive results. Consider this pattern:
5642    
5643           /\bcat\b/           /\bcat\b/
5644    
5645         This matches "cat", provided there is a word boundary at either end. If         This matches "cat", provided there is a word boundary at either end. If
5646         the subject string is "the cat", the comparison of the final "t" with a         the subject string is "the cat", the comparison of the final "t" with a
5647         following  character  cannot  take  place, so a partial match is found.         following character cannot take place, so a  partial  match  is  found.
5648         However, pcre_exec() carries on with normal matching, which matches  \b         However,  pcre_exec() carries on with normal matching, which matches \b
5649         at  the  end  of  the subject when the last character is a letter, thus         at the end of the subject when the last character  is  a  letter,  thus
5650         finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-         finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
5651         TIAL.  The  same  thing  happens  with pcre_dfa_exec(), because it also         TIAL. The same thing happens  with  pcre_dfa_exec(),  because  it  also
5652         finds the complete match.         finds the complete match.
5653    
5654         Using PCRE_PARTIAL_HARD in this  case  does  yield  PCRE_ERROR_PARTIAL,         Using  PCRE_PARTIAL_HARD  in  this  case does yield PCRE_ERROR_PARTIAL,
5655         because then the partial match takes precedence.         because then the partial match takes precedence.
5656    
5657    
5658  FORMERLY RESTRICTED PATTERNS  FORMERLY RESTRICTED PATTERNS
5659    
5660         For releases of PCRE prior to 8.00, because of the way certain internal         For releases of PCRE prior to 8.00, because of the way certain internal
5661         optimizations  were  implemented  in  the  pcre_exec()  function,   the         optimizations   were  implemented  in  the  pcre_exec()  function,  the
5662         PCRE_PARTIAL  option  (predecessor  of  PCRE_PARTIAL_SOFT) could not be         PCRE_PARTIAL option (predecessor of  PCRE_PARTIAL_SOFT)  could  not  be
5663         used with all patterns. From release 8.00 onwards, the restrictions  no         used  with all patterns. From release 8.00 onwards, the restrictions no
5664         longer  apply,  and  partial matching with pcre_exec() can be requested         longer apply, and partial matching with pcre_exec()  can  be  requested
5665         for any pattern.         for any pattern.
5666    
5667         Items that were formerly restricted were repeated single characters and         Items that were formerly restricted were repeated single characters and
5668         repeated  metasequences. If PCRE_PARTIAL was set for a pattern that did         repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
5669         not conform to the restrictions, pcre_exec() returned  the  error  code         not  conform  to  the restrictions, pcre_exec() returned the error code
5670         PCRE_ERROR_BADPARTIAL  (-13).  This error code is no longer in use. The         PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
5671         PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if  a  compiled         PCRE_INFO_OKPARTIAL  call  to pcre_fullinfo() to find out if a compiled
5672         pattern can be used for partial matching now always returns 1.         pattern can be used for partial matching now always returns 1.
5673    
5674    
5675  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
5676    
5677         If  the  escape  sequence  \P  is  present in a pcretest data line, the         If the escape sequence \P is present  in  a  pcretest  data  line,  the
5678         PCRE_PARTIAL_SOFT option is used for  the  match.  Here  is  a  run  of         PCRE_PARTIAL_SOFT  option  is  used  for  the  match.  Here is a run of
5679         pcretest that uses the date example quoted above:         pcretest that uses the date example quoted above:
5680    
5681             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
# Line 5581  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 5691  EXAMPLE OF PARTIAL MATCHING USING PCRETE
5691           data> j\P           data> j\P
5692           No match           No match
5693    
5694         The  first  data  string  is  matched completely, so pcretest shows the         The first data string is matched  completely,  so  pcretest  shows  the
5695         matched substrings. The remaining four strings do not  match  the  com-         matched  substrings.  The  remaining four strings do not match the com-
5696         plete pattern, but the first two are partial matches. Similar output is         plete pattern, but the first two are partial matches. Similar output is
5697         obtained when pcre_dfa_exec() is used.         obtained when pcre_dfa_exec() is used.
5698    
5699         If the escape sequence \P is present more than once in a pcretest  data         If  the escape sequence \P is present more than once in a pcretest data
5700         line, the PCRE_PARTIAL_HARD option is set for the match.         line, the PCRE_PARTIAL_HARD option is set for the match.
5701    
5702    
5703  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
5704    
5705         When a partial match has been found using pcre_dfa_exec(), it is possi-         When a partial match has been found using pcre_dfa_exec(), it is possi-
5706         ble to continue the match by  providing  additional  subject  data  and         ble  to  continue  the  match  by providing additional subject data and
5707         calling  pcre_dfa_exec()  again  with the same compiled regular expres-         calling pcre_dfa_exec() again with the same  compiled  regular  expres-
5708         sion, this time setting the PCRE_DFA_RESTART option. You must pass  the         sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
5709         same working space as before, because this is where details of the pre-         same working space as before, because this is where details of the pre-
5710         vious partial match are stored. Here  is  an  example  using  pcretest,         vious  partial  match  are  stored.  Here is an example using pcretest,
5711         using  the  \R  escape  sequence to set the PCRE_DFA_RESTART option (\D         using the \R escape sequence to set  the  PCRE_DFA_RESTART  option  (\D
5712         specifies the use of pcre_dfa_exec()):         specifies the use of pcre_dfa_exec()):
5713    
5714             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
# Line 5607  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 5717  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
5717           data> n05\R\D           data> n05\R\D
5718            0: n05            0: n05
5719    
5720         The first call has "23ja" as the subject, and requests  partial  match-         The  first  call has "23ja" as the subject, and requests partial match-
5721         ing;  the  second  call  has  "n05"  as  the  subject for the continued         ing; the second call  has  "n05"  as  the  subject  for  the  continued
5722         (restarted) match.  Notice that when the match is  complete,  only  the         (restarted)  match.   Notice  that when the match is complete, only the
5723         last  part  is  shown;  PCRE  does not retain the previously partially-         last part is shown; PCRE does  not  retain  the  previously  partially-
5724         matched string. It is up to the calling program to do that if it  needs         matched  string. It is up to the calling program to do that if it needs
5725         to.         to.
5726    
5727         You  can  set  the  PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with         You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD  options  with
5728         PCRE_DFA_RESTART to continue partial matching over  multiple  segments.         PCRE_DFA_RESTART  to  continue partial matching over multiple segments.
5729         This  facility  can  be  used  to  pass  very  long  subject strings to         This facility can  be  used  to  pass  very  long  subject  strings  to
5730         pcre_dfa_exec().         pcre_dfa_exec().
5731    
5732    
5733  MULTI-SEGMENT MATCHING WITH pcre_exec()  MULTI-SEGMENT MATCHING WITH pcre_exec()
5734    
5735         From release 8.00, pcre_exec() can also be  used  to  do  multi-segment         From  release  8.00,  pcre_exec()  can also be used to do multi-segment
5736         matching.  Unlike  pcre_dfa_exec(),  it  is not possible to restart the         matching. Unlike pcre_dfa_exec(), it is not  possible  to  restart  the
5737         previous match with a new segment of data. Instead, new  data  must  be         previous  match  with  a new segment of data. Instead, new data must be
5738         added  to  the  previous  subject  string, and the entire match re-run,         added to the previous subject string,  and  the  entire  match  re-run,
5739         starting from the point where the partial match occurred. Earlier  data         starting  from the point where the partial match occurred. Earlier data
5740         can be discarded.  Consider an unanchored pattern that matches dates:         can be discarded.  Consider an unanchored pattern that matches dates:
5741    
5742             re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/             re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
# Line 5634  MULTI-SEGMENT MATCHING WITH pcre_exec() Line 5744  MULTI-SEGMENT MATCHING WITH pcre_exec()
5744           Partial match: 23ja           Partial match: 23ja
5745    
5746         At  this stage, an application could discard the text preceding "23ja",         At  this stage, an application could discard the text preceding "23ja",
5747         add on text from the next segment, and call pcre_exec()  again.  Unlike         add  on  text from the next segment, and call pcre_exec() again. Unlike
5748         pcre_dfa_exec(),  the  entire matching string must always be available,         pcre_dfa_exec(), the entire matching string must always  be  available,
5749         and the complete matching process occurs for each call, so more  memory         and  the complete matching process occurs for each call, so more memory
5750         and more processing time are needed.        and more processing time are needed.
5751    
5752           Note: If the pattern contains lookbehind assertions, or \K,  or  starts
5753           with  \b  or  \B,  the string that is returned for a partial match will
5754           include characters that precede the partially  matched  string  itself,
5755           because  these  must  be  retained when adding on more characters for a
5756           subsequent matching attempt.
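
         The buffer handling described in this section can be  sketched  as  a
         loop.  The  following  Python is an illustration under stated assump-
         tions, not PCRE code: try_literal() is a hypothetical stand-in for  a
         call  to  pcre_exec()  with PCRE_PARTIAL_SOFT, it handles only a fixed
         literal string, and its return values are invented for  the  example.
         What the sketch shows is the strategy: keep the text from the start of
         the partial match onwards, add the next segment, and re-run the  whole
         match.

```python
def try_literal(literal, subject):
    """Toy stand-in for pcre_exec() with PCRE_PARTIAL_SOFT (hypothetical).

    Returns ("match", start, end), ("partial", start, end) or ("nomatch",).
    """
    pos = subject.find(literal)
    if pos != -1:
        return ("match", pos, pos + len(literal))
    # No complete match: report the earliest tail of the subject that is a
    # prefix of the literal, i.e. a match cut short by the end of the data.
    for start in range(len(subject)):
        if literal.startswith(subject[start:]):
            return ("partial", start, len(subject))
    return ("nomatch",)

def match_across_segments(literal, segments):
    """Feed segments one at a time; return the matched text or None."""
    buffer = ""
    for segment in segments:
        buffer += segment                 # add new data to what was kept
        result = try_literal(literal, buffer)
        if result[0] == "match":
            return buffer[result[1]:result[2]]
        if result[0] == "partial":
            buffer = buffer[result[1]:]   # discard text before the partial match
        else:
            buffer = ""                   # nothing needs to be carried over
    return None

print(match_across_segments("23jan05", ["The date is 23ja", "n05 today"]))
```

         With a real pattern, the kept text must start at  the  first  offset
         returned  for  the  partial  match, which may lie before the partially
         matched string itself, as the note above explains.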
5757    
5758    
5759  ISSUES WITH MULTI-SEGMENT MATCHING  ISSUES WITH MULTI-SEGMENT MATCHING
5760    
5761         Certain types of pattern may give problems with multi-segment matching,         Certain types of pattern may give problems with multi-segment matching,
5762         whichever matching function is used.         whichever matching function is used.
5763    
5764         1. If the pattern contains tests for the beginning or end  of  a  line,         1.  If  the  pattern contains tests for the beginning or end of a line,
5765         you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-         you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-
5766         ate, when the subject string for any call does not contain  the  begin-         ate,  when  the subject string for any call does not contain the begin-
5767         ning or end of a line.         ning or end of a line.
5768    
5769         2.  If  the  pattern contains backward assertions (including \b or \B),         2. Lookbehind assertions at the start of a pattern are catered  for  in
5770         you need to arrange for some overlap in the subject  strings  to  allow         the  offsets that are returned for a partial match. However, in theory,
5771         for  them  to  be  correctly tested at the start of each substring. For         a lookbehind assertion later in the pattern could require even  earlier
5772         example, using pcre_dfa_exec(), you could pass the  subject  in  chunks         characters  to  be inspected, and it might not have been reached when a
5773         that  are 500 bytes long, but in a buffer of 700 bytes, with the start-         partial match occurs. This is probably an extremely unlikely case;  you
5774         ing offset set to 200 and the previous 200 bytes at the  start  of  the         could  guard  against  it to a certain extent by always including extra
5775         buffer.         characters at the start.
5776    
5777         3.  Matching  a subject string that is split into multiple segments may         3. Matching a subject string that is split into multiple  segments  may
5778         not always produce exactly the same result as matching over one  single         not  always produce exactly the same result as matching over one single
5779         long  string,  especially  when  PCRE_PARTIAL_SOFT is used. The section         long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
5780         "Partial Matching and Word Boundaries" above describes  an  issue  that         "Partial  Matching  and  Word Boundaries" above describes an issue that
5781         arises  if  the  pattern ends with \b or \B. Another kind of difference         arises if the pattern ends with \b or \B. Another  kind  of  difference
5782         may occur when there are multiple  matching  possibilities,  because  a         may  occur  when  there  are multiple matching possibilities, because a
5783         partial match result is given only when there are no completed matches.         partial match result is given only when there are no completed matches.
5784         This means that as soon as the shortest match has been found, continua-         This means that as soon as the shortest match has been found, continua-
5785         tion  to  a  new subject segment is no longer possible.  Consider again         tion to a new subject segment is no longer  possible.   Consider  again
5786         this pcretest example:         this pcretest example:
5787    
5788             re> /dog(sbody)?/             re> /dog(sbody)?/
# Line 5680  ISSUES WITH MULTI-SEGMENT MATCHING Line 5796  ISSUES WITH MULTI-SEGMENT MATCHING
5796            0: dogsbody            0: dogsbody
5797            1: dog            1: dog
5798    
5799         The first data line passes the string "dogsb" to  pcre_exec(),  setting         The  first  data line passes the string "dogsb" to pcre_exec(), setting
5800         the  PCRE_PARTIAL_SOFT  option.  Although the string is a partial match         the PCRE_PARTIAL_SOFT option. Although the string is  a  partial  match
5801         for "dogsbody", the  result  is  not  PCRE_ERROR_PARTIAL,  because  the         for  "dogsbody",  the  result  is  not  PCRE_ERROR_PARTIAL, because the
5802         shorter  string  "dog" is a complete match. Similarly, when the subject         shorter string "dog" is a complete match. Similarly, when  the  subject
5803         is presented to pcre_dfa_exec() in several parts ("do" and "gsb"  being         is  presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
5804         the first two) the match stops when "dog" has been found, and it is not         the first two) the match stops when "dog" has been found, and it is not
5805         possible to continue. On the other hand, if "dogsbody" is presented  as         possible  to continue. On the other hand, if "dogsbody" is presented as
5806         a single string, pcre_dfa_exec() finds both matches.         a single string, pcre_dfa_exec() finds both matches.
5807    
5808         Because of these problems, it is probably best to use PCRE_PARTIAL_HARD         Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
5809         when matching multi-segment data. The example above then  behaves  dif-         when  matching  multi-segment data. The example above then behaves dif-
5810         ferently:         ferently:
5811    
5812             re> /dog(sbody)?/             re> /dog(sbody)?/
# Line 5703  ISSUES WITH MULTI-SEGMENT MATCHING Line 5819  ISSUES WITH MULTI-SEGMENT MATCHING
5819    
5820    
5821         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
5822         start with the  same  pattern  item  may  not  work  as  expected  when         start  with  the  same  pattern  item  may  not  work  as expected when
5823         pcre_dfa_exec() is used. For example, consider this pattern:         pcre_dfa_exec() is used. For example, consider this pattern:
5824    
5825           1234|3789           1234|3789
5826    
5827         If  the  first  part of the subject is "ABC123", a partial match of the         If the first part of the subject is "ABC123", a partial  match  of  the
5828         first alternative is found at offset 3. There is no partial  match  for         first  alternative  is found at offset 3. There is no partial match for
5829         the second alternative, because such a match does not start at the same         the second alternative, because such a match does not start at the same
5830         point in the subject string. Attempting to  continue  with  the  string         point  in  the  subject  string. Attempting to continue with the string
5831         "7890"  does  not  yield  a  match because only those alternatives that         "7890" does not yield a match  because  only  those  alternatives  that
5832         match at one point in the subject are remembered.  The  problem  arises         match  at  one  point in the subject are remembered. The problem arises
5833         because  the  start  of the second alternative matches within the first         because the start of the second alternative matches  within  the  first
5834         alternative. There is no problem with  anchored  patterns  or  patterns         alternative.  There  is  no  problem with anchored patterns or patterns
5835         such as:         such as:
5836    
5837           1234|ABCD           1234|ABCD
5838    
5839         where  no  string can be a partial match for both alternatives. This is         where no string can be a partial match for both alternatives.  This  is
5840         not a problem if pcre_exec() is used, because the entire match  has  to         not  a  problem if pcre_exec() is used, because the entire match has to
5841         be rerun each time:         be rerun each time:
5842    
5843             re> /1234|3789/             re> /1234|3789/
# Line 5740  AUTHOR Line 5856  AUTHOR
5856    
5857  REVISION  REVISION
5858    
5859         Last updated: 31 August 2009         Last updated: 05 September 2009
5860         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5861  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5862    
# Line 6062  DESCRIPTION Line 6178  DESCRIPTION
6178         easier to slot in PCRE as a replacement library.  Other  POSIX  options         easier to slot in PCRE as a replacement library.  Other  POSIX  options
6179         are not even defined.         are not even defined.
6180    
6181           There  are also some other options that are not defined by POSIX. These
6182           have been added at the request of users who want to make use of certain
6183           PCRE-specific features via the POSIX calling interface.
6184    
6185         When  PCRE  is  called  via these functions, it is only the API that is         When  PCRE  is  called  via these functions, it is only the API that is
6186         POSIX-like in style. The syntax and semantics of  the  regular  expres-         POSIX-like in style. The syntax and semantics of  the  regular  expres-
6187         sions  themselves  are  still  those of Perl, subject to the setting of         sions  themselves  are  still  those of Perl, subject to the setting of
# Line 6116  COMPILING A PATTERN Line 6236  COMPILING A PATTERN
6236         ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured         ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
6237         strings are returned.         strings are returned.
6238    
6239             REG_UNGREEDY
6240    
6241           The PCRE_UNGREEDY option is set when the regular expression  is  passed
6242           for  compilation  to the native function. Note that REG_UNGREEDY is not
6243           part of the POSIX standard.
6244    
6245           REG_UTF8           REG_UTF8
6246    
6247         The PCRE_UTF8 option is set when the regular expression is  passed  for         The PCRE_UTF8 option is set when the regular expression is  passed  for
# Line 6128  COMPILING A PATTERN Line 6254  COMPILING A PATTERN
6254         semantics.  In particular, the way it handles newline characters in the         semantics.  In particular, the way it handles newline characters in the
6255         subject string is the Perl way, not the POSIX way.  Note  that  setting         subject string is the Perl way, not the POSIX way.  Note  that  setting
6256         PCRE_MULTILINE  has only some of the effects specified for REG_NEWLINE.         PCRE_MULTILINE  has only some of the effects specified for REG_NEWLINE.
6257         It does not affect the way newlines are matched by . (they  aren't)  or         It does not affect the way newlines are matched by . (they are not)  or
6258         by a negative class such as [^a] (they are).         by a negative class such as [^a] (they are).
6259    
6260         The  yield of regcomp() is zero on success, and non-zero otherwise. The         The  yield of regcomp() is zero on success, and non-zero otherwise. The
# Line 6215  MATCHING A PATTERN Line 6341  MATCHING A PATTERN
6341         matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of         matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
6342         regexec() are ignored.         regexec() are ignored.
6343    
6344           If the value of nmatch is zero, or if the value of pmatch is NULL, no
6345           data about any matched strings is returned.
6346    
6347         Otherwise, the portion of the string that was matched, and also any         Otherwise, the portion of the string that was matched, and also any
6348         captured substrings, are returned via the pmatch argument, which points to         captured substrings, are returned via the pmatch argument, which points to
6349         an  array  of nmatch structures of type regmatch_t, containing the mem-         an array of nmatch structures of type regmatch_t, containing  the  mem-
6350         bers rm_so and rm_eo. These contain the offset to the  first  character         bers  rm_so  and rm_eo. These contain the offset to the first character
6351         of  each  substring and the offset to the first character after the end         of each substring and the offset to the first character after  the  end
6352         of each substring, respectively. The 0th element of the vector  relates         of  each substring, respectively. The 0th element of the vector relates
6353         to  the  entire portion of string that was matched; subsequent elements         to the entire portion of string that was matched;  subsequent  elements
6354         relate to the capturing subpatterns of the regular  expression.  Unused         relate  to  the capturing subpatterns of the regular expression. Unused
6355         entries in the array have both structure members set to -1.         entries in the array have both structure members set to -1.
6356    
6357         A  successful  match  yields  a  zero  return;  various error codes are         A successful match yields  a  zero  return;  various  error  codes  are
6358         defined in the header file, of  which  REG_NOMATCH  is  the  "expected"         defined  in  the  header  file,  of which REG_NOMATCH is the "expected"
6359         failure code.         failure code.
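
The handling of nmatch and pmatch described above can be sketched in C as follows. This is only an illustration, not part of the PCRE distribution: the helper names matches() and capture_group() are hypothetical, the limit of 10 entries in the pmatch vector is an arbitrary choice, and the code includes the standard <regex.h> header so that it builds without PCRE. When the PCRE wrapper is wanted, the assumption is that including pcreposix.h instead and linking with the pcreposix library gives the same calls.

```c
#include <regex.h>   /* assumption: swap for pcreposix.h to use the PCRE wrapper */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if subject matches, 0 if it does not,
   and -1 if the pattern fails to compile. No match data is requested:
   nmatch is zero and pmatch is NULL. */
int matches(const char *pattern, const char *subject)
{
    regex_t preg;
    int rc;

    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&preg, subject, 0, NULL, 0);
    regfree(&preg);              /* release memory tied to preg */
    return rc == 0 ? 1 : 0;
}

/* Hypothetical helper: copies the text of substring number `group`
   (0 is the whole match) into out as a zero-terminated string.
   Returns the number of bytes copied, or -1 on a compile error, on
   no match, when the group is unset, or when out is too small. */
int capture_group(const char *pattern, const char *subject,
                  size_t group, char *out, size_t out_size)
{
    enum { NMATCH = 10 };        /* arbitrary vector size for this sketch */
    regex_t preg;
    regmatch_t pmatch[NMATCH];
    int len = -1;

    if (group >= NMATCH || out_size == 0)
        return -1;
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&preg, subject, NMATCH, pmatch, 0) == 0 &&
        pmatch[group].rm_so != -1) {     /* -1 marks an unused entry */
        /* rm_eo is the offset of the first character after the end */
        size_t n = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
        if (n < out_size) {
            memcpy(out, subject + pmatch[group].rm_so, n);
            out[n] = '\0';
            len = (int)n;
        }
    }
    regfree(&preg);
    return len;
}
```

For instance, calling capture_group() with the pattern "dog(sbody)?" and the subject "my dogsbody" copies "dogsbody" for group 0 and "sbody" for group 1. With the subject "dog", group 1 is unset, so both of its offsets are -1 and the helper returns -1.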
6360    
6361    
6362  ERROR MESSAGES  ERROR MESSAGES
6363    
6364         The regerror() function maps a non-zero errorcode from either regcomp()         The regerror() function maps a non-zero errorcode from either regcomp()
6365         or regexec() to a printable message. If preg is  not  NULL,  the  error         or  regexec()  to  a  printable message. If preg is not NULL, the error
6366         should have arisen from the use of that structure. A message terminated         should have arisen from the use of that structure. A message terminated
6367         by a binary zero is placed  in  errbuf.  The  length  of  the  message,         by  a  binary  zero  is  placed  in  errbuf. The length of the message,
6368         including  the  zero, is limited to errbuf_size. The yield of the func-         including the zero, is limited to errbuf_size. The yield of  the  func-
6369         tion is the size of buffer needed to hold the whole message.         tion is the size of buffer needed to hold the whole message.
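
Because the yield of regerror() is the buffer size needed for the whole message, a caller can ask for the size first and then allocate exactly that much. The sketch below assumes the standard POSIX behaviour in which a call with errbuf_size set to zero stores nothing and only returns the size. The helper name compile_error_message() is hypothetical, and the standard <regex.h> header stands in for pcreposix.h.

```c
#include <regex.h>   /* assumption: swap for pcreposix.h to use the PCRE wrapper */
#include <stdlib.h>

/* Hypothetical helper: returns NULL if the pattern compiles (or if
   malloc fails); otherwise returns a malloc'ed, zero-terminated error
   message that the caller must free(). */
char *compile_error_message(const char *pattern)
{
    regex_t preg;
    size_t needed;
    char *msg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&preg);          /* compiled successfully: nothing to report */
        return NULL;
    }

    /* First call: errbuf_size is 0, so only the required size
       (including the terminating zero) is returned. */
    needed = regerror(rc, &preg, NULL, 0);

    msg = malloc(needed);
    if (msg != NULL)
        regerror(rc, &preg, msg, needed);   /* second call fills the buffer */
    return msg;
}
```

A fixed-size buffer also works, since the message is truncated to errbuf_size; the two-call form simply avoids guessing a size.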
6370    
6371    
6372  MEMORY USAGE  MEMORY USAGE
6373    
6374         Compiling a regular expression causes memory to be allocated and  asso-         Compiling  a regular expression causes memory to be allocated and asso-
6375         ciated  with  the preg structure. The function regfree() frees all such         ciated with the preg structure. The function regfree() frees  all  such
6376         memory, after which preg may no longer be used as  a  compiled  expres-         memory,  after  which  preg may no longer be used as a compiled expres-
6377         sion.         sion.
6378    
6379    
# Line 6257  AUTHOR Line 6386  AUTHOR
6386    
6387  REVISION  REVISION
6388    
6389         Last updated: 15 August 2009         Last updated: 02 September 2009
6390         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
6391  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6392    

