Diff of /code/trunk/doc/pcre.txt

revision 123 by ph10, Mon Mar 12 15:19:06 2007 UTC revision 150 by ph10, Tue Apr 17 08:22:40 2007 UTC
# Line 244  PCRE BUILD-TIME OPTIONS Line 244  PCRE BUILD-TIME OPTIONS
244    
245           ./configure --help           ./configure --help
246    
247         The following sections describe certain options whose names begin  with         The following sections include  descriptions  of  options  whose  names
248         --enable  or  --disable. These settings specify changes to the defaults         begin with --enable or --disable. These settings specify changes to the
249         for the configure command. Because of the  way  that  configure  works,         defaults for the configure command. Because of the way  that  configure
250         --enable  and  --disable  always  come  in  pairs, so the complementary         works,  --enable  and --disable always come in pairs, so the complemen-
251         option always exists as well, but as it specifies the  default,  it  is         tary option always exists as well, but as it specifies the default,  it
252         not described.         is not described.
253    
254    
255  C++ SUPPORT  C++ SUPPORT
# Line 288  UNICODE CHARACTER PROPERTY SUPPORT Line 288  UNICODE CHARACTER PROPERTY SUPPORT
288         to the configure command. This implies UTF-8 support, even if you  have         to the configure command. This implies UTF-8 support, even if you  have
289         not explicitly requested it.         not explicitly requested it.
290    
291         Including  Unicode  property  support  adds around 90K of tables to the         Including  Unicode  property  support  adds around 30K of tables to the
292         PCRE library, approximately doubling its size. Only the  general  cate-         PCRE library. Only the general category properties such as  Lu  and  Nd
293         gory  properties  such as Lu and Nd are supported. Details are given in         are supported. Details are given in the pcrepattern documentation.
        the pcrepattern documentation.  
294    
295    
296  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
297    
298         By default, PCRE interprets character 10 (linefeed, LF)  as  indicating         By  default,  PCRE interprets character 10 (linefeed, LF) as indicating
299         the  end  of  a line. This is the normal newline character on Unix-like         the end of a line. This is the normal newline  character  on  Unix-like
300         systems. You can compile PCRE to use character 13 (carriage return, CR)         systems. You can compile PCRE to use character 13 (carriage return, CR)
301         instead, by adding         instead, by adding
302    
303           --enable-newline-is-cr           --enable-newline-is-cr
304    
305         to  the  configure  command.  There  is  also  a --enable-newline-is-lf         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
306         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
307    
308         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 313  CODE VALUE OF NEWLINE Line 312  CODE VALUE OF NEWLINE
312    
313         to the configure command. There is a fourth option, specified by         to the configure command. There is a fourth option, specified by
314    
315             --enable-newline-is-anycrlf
316    
317           which causes PCRE to recognize any of the three sequences  CR,  LF,  or
318           CRLF as indicating a line ending. Finally, a fifth option, specified by
319    
320           --enable-newline-is-any           --enable-newline-is-any
321    
322         which causes PCRE to recognize any Unicode newline sequence.         causes PCRE to recognize any Unicode newline sequence.
323    
324         Whatever  line  ending convention is selected when PCRE is built can be         Whatever line ending convention is selected when PCRE is built  can  be
325         overridden when the library functions are called. At build time  it  is         overridden  when  the library functions are called. At build time it is
326         conventional to use the standard for your operating system.         conventional to use the standard for your operating system.
327    
328    
329  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
330    
331         The  PCRE building process uses libtool to build both shared and static         The PCRE building process uses libtool to build both shared and  static
332         Unix libraries by default. You can suppress one of these by adding  one         Unix  libraries by default. You can suppress one of these by adding one
333         of         of
334    
335           --disable-shared           --disable-shared
# Line 337  BUILDING SHARED AND STATIC LIBRARIES Line 341  BUILDING SHARED AND STATIC LIBRARIES
341  POSIX MALLOC USAGE  POSIX MALLOC USAGE
342    
343         When PCRE is called through the POSIX interface (see the pcreposix doc-         When PCRE is called through the POSIX interface (see the pcreposix doc-
344         umentation), additional working storage is  required  for  holding  the         umentation),  additional  working  storage  is required for holding the
345         pointers  to capturing substrings, because PCRE requires three integers         pointers to capturing substrings, because PCRE requires three  integers
346         per substring, whereas the POSIX interface provides only  two.  If  the         per  substring,  whereas  the POSIX interface provides only two. If the
347         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
348         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
349         The default threshold above which the stack is no longer used is 10; it         The default threshold above which the stack is no longer used is 10; it
# Line 352  POSIX MALLOC USAGE Line 356  POSIX MALLOC USAGE
356    
357  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
358    
359         Within a compiled pattern, offset values are used  to  point  from  one         Within  a  compiled  pattern,  offset values are used to point from one
360         part  to another (for example, from an opening parenthesis to an alter-         part to another (for example, from an opening parenthesis to an  alter-
361         nation metacharacter). By default, two-byte values are used  for  these         nation  metacharacter).  By default, two-byte values are used for these
362         offsets,  leading  to  a  maximum size for a compiled pattern of around         offsets, leading to a maximum size for a  compiled  pattern  of  around
363         64K. This is sufficient to handle all but the most  gigantic  patterns.         64K.  This  is sufficient to handle all but the most gigantic patterns.
364         Nevertheless,  some  people do want to process enormous patterns, so it         Nevertheless, some people do want to process enormous patterns,  so  it
365         is possible to compile PCRE to use three-byte or four-byte  offsets  by         is  possible  to compile PCRE to use three-byte or four-byte offsets by
366         adding a setting such as         adding a setting such as
367    
368           --with-link-size=3           --with-link-size=3
369    
370         to  the  configure  command.  The value given must be 2, 3, or 4. Using         to the configure command. The value given must be 2,  3,  or  4.  Using
371         longer offsets slows down the operation of PCRE because it has to  load         longer  offsets slows down the operation of PCRE because it has to load
372         additional bytes when handling them.         additional bytes when handling them.
373    
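Editorial illustration (not part of either revision): the link-size arithmetic described above can be sketched in C. This is only a minimal sketch of the byte arithmetic. The function names max_link_offset, put_link, and get_link are hypothetical, and the big-endian byte order is an assumption made for the example, not a statement about PCRE's internal layout.

```c
#include <assert.h>
#include <stdint.h>

/* Largest offset value that fits in link_size bytes (2, 3, or 4).
   Pure arithmetic; hypothetical helper, not a PCRE function. */
uint32_t max_link_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    if (link_size == 4)
        return UINT32_MAX;                 /* avoid shifting by 32 bits */
    return ((uint32_t)1 << (8 * link_size)) - 1;
}

/* Store an offset in link_size bytes, most significant byte first
   (byte order is an assumption for this sketch). */
void put_link(unsigned char *p, int link_size, uint32_t value)
{
    for (int i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xff);
        value >>= 8;
    }
}

/* Fetch an offset back: one byte load per byte of link size, which is
   why longer links mean more work each time an offset is followed. */
uint32_t get_link(const unsigned char *p, int link_size)
{
    uint32_t value = 0;
    for (int i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}
```

With two bytes the largest representable offset is 65535, which is where the "around 64K" figure comes from; an offset such as 70000 needs a link size of at least 3.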
        If  you  build  PCRE with an increased link size, test 2 (and test 5 if  
        you are using UTF-8) will fail. Part of the output of these tests is  a  
        representation  of the compiled pattern, and this changes with the link  
        size.  
   
374    
375  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
376    
# Line 429  LIMITING PCRE RESOURCE USAGE Line 428  LIMITING PCRE RESOURCE USAGE
428         time.         time.
429    
430    
431    CREATING CHARACTER TABLES AT BUILD TIME
432    
433           PCRE uses fixed tables for processing characters whose code values  are
434           less  than 256. By default, PCRE is built with a set of tables that are
435           distributed in the file pcre_chartables.c.dist. These  tables  are  for
436           ASCII codes only. If you add
437    
438             --enable-rebuild-chartables
439    
440           to  the  configure  command, the distributed tables are no longer used.
441           Instead, a program called dftables is compiled and  run.  This  outputs
442           the source for new set of tables, created in the default locale of your
443           C runtime system. (This method of replacing the tables does not work if
444           you  are cross compiling, because dftables is run on the local host. If
445           you need to create alternative tables when cross  compiling,  you  will
446           have to do so "by hand".)
447    
448    
449  USING EBCDIC CODE  USING EBCDIC CODE
450    
451         PCRE assumes by default that it will run in an  environment  where  the         PCRE  assumes  by  default that it will run in an environment where the
452         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
453         PCRE can, however, be compiled to  run  in  an  EBCDIC  environment  by         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by
454         adding         adding
455    
456           --enable-ebcdic           --enable-ebcdic
457    
458         to the configure command.         to the configure command. This setting implies --enable-rebuild-charta-
459           bles.
460    
461    
462  SEE ALSO  SEE ALSO
# Line 455  AUTHOR Line 473  AUTHOR
473    
474  REVISION  REVISION
475    
476         Last updated: 06 March 2007         Last updated: 16 April 2007
477         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
478  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
479    
# Line 508  REGULAR EXPRESSIONS AS TREES Line 526  REGULAR EXPRESSIONS AS TREES
526    
527  THE STANDARD MATCHING ALGORITHM  THE STANDARD MATCHING ALGORITHM
528    
529         In the terminology of Jeffrey Friedl's book Mastering  Regular  Expres-         In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
530         sions,  the  standard  algorithm  is  an "NFA algorithm". It conducts a         sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
531         depth-first search of the pattern tree. That is, it  proceeds  along  a         depth-first search of the pattern tree. That is, it  proceeds  along  a
532         single path through the tree, checking that the subject matches what is         single path through the tree, checking that the subject matches what is
533         required. When there is a mismatch, the algorithm  tries  any  alterna-         required. When there is a mismatch, the algorithm  tries  any  alterna-
# Line 828  PCRE API OVERVIEW Line 846  PCRE API OVERVIEW
846    
847  NEWLINES  NEWLINES
848    
849         PCRE  supports four different conventions for indicating line breaks in         PCRE  supports five different conventions for indicating line breaks in
850         strings: a single CR (carriage return) character, a  single  LF  (line-         strings: a single CR (carriage return) character, a  single  LF  (line-
851         feed)  character,  the two-character sequence CRLF, or any Unicode new-         feed) character, the two-character sequence CRLF, any of the three pre-
852         line sequence.  The Unicode newline sequences are the three  just  men-         ceding, or any Unicode newline sequence. The Unicode newline  sequences
853         tioned, plus the single characters VT (vertical tab, U+000B), FF (form-         are  the  three just mentioned, plus the single characters VT (vertical
854         feed, U+000C), NEL (next line, U+0085), LS  (line  separator,  U+2028),         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
855         and PS (paragraph separator, U+2029).         separator, U+2028), and PS (paragraph separator, U+2029).
856    
857         Each  of  the first three conventions is used by at least one operating         Each  of  the first three conventions is used by at least one operating
858         system as its standard newline sequence. When PCRE is built, a  default         system as its standard newline sequence. When PCRE is built, a  default
# Line 899  CHECKING BUILD-TIME OPTIONS Line 917  CHECKING BUILD-TIME OPTIONS
917    
918         The output is an integer whose value specifies  the  default  character         The output is an integer whose value specifies  the  default  character
919         sequence  that is recognized as meaning "newline". The four values that         sequence  that is recognized as meaning "newline". The four values that
920         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, and -1 for ANY.         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
921         The default should normally be the standard sequence for your operating         and  -1  for  ANY. The default should normally be the standard sequence
922         system.         for your operating system.
923    
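Editorial illustration (not part of either revision): the newline values listed above can be decoded with a small lookup. This is a sketch only; newline_name and pack_two_char_newline are hypothetical helpers, not PCRE functions. Note that the newer column lists five values, including -2 for ANYCRLF, and that 3338 equals 13 * 256 + 10, which suggests (as an assumption here) that the CR and LF codes are packed into one integer.

```c
#include <assert.h>
#include <string.h>

/* Pack two character codes into one integer: (first << 8) | second.
   Hypothetical helper; it only shows that CR (13) and LF (10) packed
   this way give the documented value 3338. */
int pack_two_char_newline(int first, int second)
{
    return (first << 8) | second;
}

/* Map a newline value, as reported for PCRE_CONFIG_NEWLINE, to a
   readable name. Values are those listed in the text above. */
const char *newline_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";
    }
}
```

In a program linked against PCRE, the integer would come from a call such as pcre_config(PCRE_CONFIG_NEWLINE, &value); the sketch leaves that call out so that it compiles without the library.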
924           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
925    
# Line 1125  COMPILING A PATTERN Line 1143  COMPILING A PATTERN
1143           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1144           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1145           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1146             PCRE_NEWLINE_ANYCRLF
1147           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1148    
1149         These  options  override the default newline definition that was chosen         These  options  override the default newline definition that was chosen
1150         when PCRE was built. Setting the first or the second specifies  that  a         when PCRE was built. Setting the first or the second specifies  that  a
1151         newline  is  indicated  by a single character (CR or LF, respectively).         newline  is  indicated  by a single character (CR or LF, respectively).
1152         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1153         two-character  CRLF  sequence.  Setting PCRE_NEWLINE_ANY specifies that         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1154         any Unicode newline sequence should be recognized. The Unicode  newline         that any of the three preceding sequences should be recognized. Setting
1155         sequences  are  the three just mentioned, plus the single characters VT         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1156         (vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085),         recognized. The Unicode newline sequences are the three just mentioned,
1157         LS  (line separator, U+2028), and PS (paragraph separator, U+2029). The         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1158         last two are recognized only in UTF-8 mode.         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1159           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1160           UTF-8 mode.
1161    
1162         The newline setting in the  options  word  uses  three  bits  that  are         The newline setting in the  options  word  uses  three  bits  that  are
1163         treated  as  a  number, giving eight possibilities. Currently only five         treated as a number, giving eight possibilities. Currently only six are
1164         are used (default plus the four values above). This means that  if  you         used (default plus the five values above). This means that if  you  set
1165         set  more  than  one  newline option, the combination may or may not be         more  than one newline option, the combination may or may not be sensi-
1166         sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is  equiva-         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1167         lent  to PCRE_NEWLINE_CRLF, but other combinations yield unused numbers         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1168         and cause an error.         cause an error.
1169    
1170         The only time that a line break is specially recognized when  compiling         The only time that a line break is specially recognized when  compiling
1171         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
# Line 1310  STUDYING A PATTERN Line 1331  STUDYING A PATTERN
1331  LOCALE SUPPORT  LOCALE SUPPORT
1332    
1333         PCRE handles caseless matching, and determines whether  characters  are         PCRE handles caseless matching, and determines whether  characters  are
1334         letters  digits,  or whatever, by reference to a set of tables, indexed         letters,  digits, or whatever, by reference to a set of tables, indexed
1335         by character value. When running in UTF-8 mode, this  applies  only  to         by character value. When running in UTF-8 mode, this  applies  only  to
1336         characters  with  codes  less than 128. Higher-valued codes never match         characters  with  codes  less than 128. Higher-valued codes never match
1337         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1338         with  Unicode  character property support. The use of locales with Uni-         with  Unicode  character property support. The use of locales with Uni-
1339         code is discouraged.         code is discouraged. If you are handling characters with codes  greater
1340           than  128, you should either use UTF-8 and Unicode, or use locales, but
1341         An internal set of tables is created in the default C locale when  PCRE         not try to mix the two.
1342         is  built.  This  is  used when the final argument of pcre_compile() is  
1343         NULL, and is sufficient for many applications. An  alternative  set  of         PCRE contains an internal set of tables that are used  when  the  final
1344         tables  can,  however, be supplied. These may be created in a different         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1345         locale from the default. As more and more applications change to  using         applications.  Normally, the internal tables recognize only ASCII char-
1346         Unicode, the need for this locale support is expected to die away.         acters. However, when PCRE is built, it is possible to cause the inter-
1347           nal tables to be rebuilt in the default "C" locale of the local system,
1348         External  tables  are  built by calling the pcre_maketables() function,         which may cause them to be different.
1349         which has no arguments, in the relevant locale. The result can then  be  
1350         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For         The  internal tables can always be overridden by tables supplied by the
1351         example, to build and use tables that are appropriate  for  the  French         application that calls PCRE. These may be created in a different locale
1352         locale  (where  accented  characters  with  values greater than 128 are         from  the  default.  As more and more applications change to using Uni-
1353           code, the need for this locale support is expected to die away.
1354    
1355           External tables are built by calling  the  pcre_maketables()  function,
1356           which  has no arguments, in the relevant locale. The result can then be
1357           passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
1358           example,  to  build  and use tables that are appropriate for the French
1359           locale (where accented characters with  values  greater  than  128  are
1360         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1361    
1362           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1363           tables = pcre_maketables();           tables = pcre_maketables();
1364           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1365    
1366           The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1367           if you are using Windows, the name for the French locale is "french".
1368    
1369         When pcre_maketables() runs, the tables are built  in  memory  that  is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1370         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1371         that the memory containing the tables remains available for as long  as         that the memory containing the tables remains available for as long  as
# Line 1702  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1733  MATCHING A PATTERN: THE TRADITIONAL FUNC
1733           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1734           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1735           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1736             PCRE_NEWLINE_ANYCRLF
1737           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1738    
1739         These options override  the  newline  definition  that  was  chosen  or         These options override  the  newline  definition  that  was  chosen  or
# Line 1709  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1741  MATCHING A PATTERN: THE TRADITIONAL FUNC
1741         tion of pcre_compile()  above.  During  matching,  the  newline  choice         tion of pcre_compile()  above.  During  matching,  the  newline  choice
1742         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
1743         ters. It may also alter the way the match position is advanced after  a         ters. It may also alter the way the match position is advanced after  a
1744         match  failure  for  an  unanchored  pattern. When PCRE_NEWLINE_CRLF or         match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,
1745         PCRE_NEWLINE_ANY is set, and a match attempt  fails  when  the  current         PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a  match  attempt
1746         position  is  at a CRLF sequence, the match position is advanced by two         fails  when the current position is at a CRLF sequence, the match posi-
1747         characters instead of one, in other words, to after the CRLF.         tion is advanced by two characters instead of one, in other  words,  to
1748           after the CRLF.
1749    
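Editorial illustration (not part of either revision): the rule for advancing the match position after a failed attempt can be sketched as follows. The enum nl_mode and the function next_start are hypothetical names invented for this example; PCRE's own matching loop is not shown here and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the five newline conventions. */
enum nl_mode { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the next starting position to try after a match attempt at
   pos has failed. If the convention treats CRLF as a newline and pos
   is at a CR immediately followed by LF, skip both characters;
   otherwise advance by one. */
size_t next_start(const char *subject, size_t length, size_t pos,
                  enum nl_mode mode)
{
    int crlf_aware =
        (mode == NL_CRLF || mode == NL_ANYCRLF || mode == NL_ANY);

    if (crlf_aware && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* move to just after the CRLF */

    return pos + 1;       /* ordinary single-character advance */
}
```

For the subject "a\r\nb", a failure at offset 1 moves the next attempt to offset 3 under the CRLF, ANYCRLF, and ANY conventions, but only to offset 2 under CR or LF.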
1750           PCRE_NOTBOL           PCRE_NOTBOL
1751    
1752         This option specifies that first character of the subject string is not         This option specifies that first character of the subject string is not
1753         the  beginning  of  a  line, so the circumflex metacharacter should not         the beginning of a line, so the  circumflex  metacharacter  should  not
1754         match before it. Setting this without PCRE_MULTILINE (at compile  time)         match  before it. Setting this without PCRE_MULTILINE (at compile time)
1755         causes  circumflex  never to match. This option affects only the behav-         causes circumflex never to match. This option affects only  the  behav-
1756         iour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
1757    
1758           PCRE_NOTEOL           PCRE_NOTEOL
1759    
1760         This option specifies that the end of the subject string is not the end         This option specifies that the end of the subject string is not the end
1761         of  a line, so the dollar metacharacter should not match it nor (except         of a line, so the dollar metacharacter should not match it nor  (except
1762         in multiline mode) a newline immediately before it. Setting this  with-         in  multiline mode) a newline immediately before it. Setting this with-
1763         out PCRE_MULTILINE (at compile time) causes dollar never to match. This         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1764         option affects only the behaviour of the dollar metacharacter. It  does         option  affects only the behaviour of the dollar metacharacter. It does
1765         not affect \Z or \z.         not affect \Z or \z.
1766    
1767           PCRE_NOTEMPTY           PCRE_NOTEMPTY
1768    
1769         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
1770         set. If there are alternatives in the pattern, they are tried.  If  all         set.  If  there are alternatives in the pattern, they are tried. If all
1771         the  alternatives  match  the empty string, the entire match fails. For         the alternatives match the empty string, the entire  match  fails.  For
1772         example, if the pattern         example, if the pattern
1773    
1774           a?b?           a?b?
1775    
1776         is applied to a string not beginning with "a" or "b",  it  matches  the         is  applied  to  a string not beginning with "a" or "b", it matches the
1777         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
1778         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
1779         rences of "a" or "b".         rences of "a" or "b".
1780    
1781         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1782         cial case of a pattern match of the empty  string  within  its  split()         cial  case  of  a  pattern match of the empty string within its split()
1783         function,  and  when  using  the /g modifier. It is possible to emulate         function, and when using the /g modifier. It  is  possible  to  emulate
1784         Perl's behaviour after matching a null string by first trying the match         Perl's behaviour after matching a null string by first trying the match
1785         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1786         if that fails by advancing the starting offset (see below)  and  trying         if  that  fails by advancing the starting offset (see below) and trying
1787         an ordinary match again. There is some code that demonstrates how to do         an ordinary match again. There is some code that demonstrates how to do
1788         this in the pcredemo.c sample program.         this in the pcredemo.c sample program.
1789    
1790           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1791    
1792         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
1793         UTF-8  string is automatically checked when pcre_exec() is subsequently         UTF-8 string is automatically checked when pcre_exec() is  subsequently
1794         called.  The value of startoffset is also checked  to  ensure  that  it         called.   The  value  of  startoffset is also checked to ensure that it
1795         points  to the start of a UTF-8 character. If an invalid UTF-8 sequence         points to the start of a UTF-8 character. If an invalid UTF-8  sequence
1796         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1797         startoffset  contains  an  invalid  value, PCRE_ERROR_BADUTF8_OFFSET is         startoffset contains an  invalid  value,  PCRE_ERROR_BADUTF8_OFFSET  is
1798         returned.         returned.
1799    
1800         If you already know that your subject is valid, and you  want  to  skip         If  you  already  know that your subject is valid, and you want to skip
1801         these    checks    for   performance   reasons,   you   can   set   the         these   checks   for   performance   reasons,   you   can    set    the
1802         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
1803         do  this  for the second and subsequent calls to pcre_exec() if you are         do this for the second and subsequent calls to pcre_exec() if  you  are
1804         making repeated calls to find all  the  matches  in  a  single  subject         making  repeated  calls  to  find  all  the matches in a single subject
1805         string.  However,  you  should  be  sure  that the value of startoffset         string. However, you should be  sure  that  the  value  of  startoffset
1806         points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1807         set,  the  effect of passing an invalid UTF-8 string as a subject, or a         set, the effect of passing an invalid UTF-8 string as a subject,  or  a
1808         value of startoffset that does not point to the start of a UTF-8  char-         value  of startoffset that does not point to the start of a UTF-8 char-
1809         acter, is undefined. Your program may crash.         acter, is undefined. Your program may crash.
1810    
1811           PCRE_PARTIAL           PCRE_PARTIAL
1812    
1813         This  option  turns  on  the  partial  matching feature. If the subject         This option turns on the  partial  matching  feature.  If  the  subject
1814         string fails to match the pattern, but at some point during the  match-         string  fails to match the pattern, but at some point during the match-
1815         ing  process  the  end of the subject was reached (that is, the subject         ing process the end of the subject was reached (that  is,  the  subject
1816         partially matches the pattern and the failure to  match  occurred  only         partially  matches  the  pattern and the failure to match occurred only
1817         because  there were not enough subject characters), pcre_exec() returns         because there were not enough subject characters), pcre_exec()  returns
1818         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is         PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
1819         used,  there  are restrictions on what may appear in the pattern. These         used, there are restrictions on what may appear in the  pattern.  These
1820         are discussed in the pcrepartial documentation.         are discussed in the pcrepartial documentation.
1821    
1822     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
1823    
1824         The subject string is passed to pcre_exec() as a pointer in subject,  a         The  subject string is passed to pcre_exec() as a pointer in subject, a
1825         length  in  length, and a starting byte offset in startoffset. In UTF-8         length in length, and a starting byte offset in startoffset.  In  UTF-8
1826         mode, the byte offset must point to the start  of  a  UTF-8  character.         mode,  the  byte  offset  must point to the start of a UTF-8 character.
1827         Unlike  the  pattern string, the subject may contain binary zero bytes.         Unlike the pattern string, the subject may contain binary  zero  bytes.
1828         When the starting offset is zero, the search for a match starts at  the         When  the starting offset is zero, the search for a match starts at the
1829         beginning of the subject, and this is by far the most common case.         beginning of the subject, and this is by far the most common case.
1830    
1831         A  non-zero  starting offset is useful when searching for another match         A non-zero starting offset is useful when searching for  another  match
1832         in the same subject by calling pcre_exec() again after a previous  suc-         in  the same subject by calling pcre_exec() again after a previous suc-
1833         cess.   Setting  startoffset differs from just passing over a shortened         cess.  Setting startoffset differs from just passing over  a  shortened
1834         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
1835         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
1836    
1837           \Biss\B           \Biss\B
1838    
1839         which  finds  occurrences  of "iss" in the middle of words. (\B matches         which finds occurrences of "iss" in the middle of  words.  (\B  matches
1840         only if the current position in the subject is not  a  word  boundary.)         only  if  the  current position in the subject is not a word boundary.)
1841         When applied  to the string "Mississippi" the first call to pcre_exec()         When applied to the string "Mississippi" the first call to  pcre_exec()
1842         finds the first occurrence. If pcre_exec() is called  again  with  just         finds  the  first  occurrence. If pcre_exec() is called again with just
1843         the  remainder  of  the  subject,  namely "issippi", it does not match,         the remainder of the subject,  namely  "issippi",  it  does not  match,
1844         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
1845         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to be a word boundary. However, if pcre_exec()  is  passed  the  entire
1846         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
1847         rence  of "iss" because it is able to look behind the starting point to         rence of "iss" because it is able to look behind the starting point  to
1848         discover that it is preceded by a letter.         discover that it is preceded by a letter.
1849    
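The difference between passing a shortened string and passing the whole subject with a non-zero startoffset can be illustrated without PCRE itself. The sketch below hand-codes the single pattern \Biss\B; the function find_inner_iss and its helpers are hypothetical names, and the code is an illustration of the lookbehind effect under those assumptions, not PCRE's matching algorithm.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* 1 if subject[pos] exists and is a word character (letter, digit or
   underscore), else 0. Positions outside the subject count as non-word. */
static int is_word_char(const char *s, int length, int pos)
{
    if (pos < 0 || pos >= length)
        return 0;
    unsigned char c = (unsigned char)s[pos];
    return isalnum(c) || c == '_';
}

/* \B: true when the characters on both sides of `pos` are of the same kind,
   that is, when `pos` is not a word boundary. */
static int not_word_boundary(const char *s, int length, int pos)
{
    return is_word_char(s, length, pos - 1) == is_word_char(s, length, pos);
}

/* Hypothetical stand-in for matching \Biss\B: returns the offset of the first
   match at or after startoffset, or -1. The whole subject stays visible, so
   the character before startoffset can still be inspected. */
int find_inner_iss(const char *subject, int length, int startoffset)
{
    assert(subject != NULL && startoffset >= 0);
    for (int pos = startoffset; pos + 3 <= length; pos++) {
        if (memcmp(subject + pos, "iss", 3) == 0 &&
            not_word_boundary(subject, length, pos) &&
            not_word_boundary(subject, length, pos + 3))
            return pos;
    }
    return -1;
}
```

With the full subject, find_inner_iss("Mississippi", 11, 4) finds the second occurrence at offset 4, whereas find_inner_iss("issippi", 7, 0) fails because offset 0 of the shortened string is a word boundary.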
1850         If a non-zero starting offset is passed when the pattern  is  anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
1851         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
1852         if the pattern does not require the match to be at  the  start  of  the         if  the  pattern  does  not require the match to be at the start of the
1853         subject.         subject.
1854    
1855     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
1856    
1857         In  general, a pattern matches a certain portion of the subject, and in         In general, a pattern matches a certain portion of the subject, and  in
1858         addition, further substrings from the subject  may  be  picked  out  by         addition,  further  substrings  from  the  subject may be picked out by
1859         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
1860         this is called "capturing" in what follows, and the  phrase  "capturing         this  is  called "capturing" in what follows, and the phrase "capturing
1861         subpattern"  is  used for a fragment of a pattern that picks out a sub-         subpattern" is used for a fragment of a pattern that picks out  a  sub-
1862         string. PCRE supports several other kinds of  parenthesized  subpattern         string.  PCRE  supports several other kinds of parenthesized subpattern
1863         that do not cause substrings to be captured.         that do not cause substrings to be captured.
1864    
1865         Captured  substrings are returned to the caller via a vector of integer         Captured substrings are returned to the caller via a vector of  integer
1866         offsets whose address is passed in ovector. The number of  elements  in         offsets  whose  address is passed in ovector. The number of elements in
1867         the  vector is passed in ovecsize, which must be a non-negative number.         the vector is passed in ovecsize, which must be a non-negative  number.
1868         Note: this argument is NOT the size of ovector in bytes.         Note: this argument is NOT the size of ovector in bytes.
1869    
1870         The first two-thirds of the vector is used to pass back  captured  sub-         The  first  two-thirds of the vector is used to pass back captured sub-
1871         strings,  each  substring using a pair of integers. The remaining third         strings, each substring using a pair of integers. The  remaining  third
1872         of the vector is used as workspace by pcre_exec() while  matching  cap-         of  the  vector is used as workspace by pcre_exec() while matching cap-
1873         turing  subpatterns, and is not available for passing back information.         turing subpatterns, and is not available for passing back  information.
1874         The length passed in ovecsize should always be a multiple of three.  If         The  length passed in ovecsize should always be a multiple of three. If
1875         it is not, it is rounded down.         it is not, it is rounded down.
1876    
1877         When  a  match  is successful, information about captured substrings is         When a match is successful, information about  captured  substrings  is
1878         returned in pairs of integers, starting at the  beginning  of  ovector,         returned  in  pairs  of integers, starting at the beginning of ovector,
1879         and  continuing  up  to two-thirds of its length at the most. The first         and continuing up to two-thirds of its length at the  most.  The  first
1880         element of a pair is set to the offset of the first character in a sub-         element of a pair is set to the offset of the first character in a sub-
1881         string,  and  the  second  is  set to the offset of the first character         string, and the second is set to the  offset  of  the  first  character
1882         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
1883         tor[1],  identify  the  portion  of  the  subject string matched by the         tor[1], identify the portion of  the  subject  string  matched  by  the
1884         entire pattern. The next pair is used for the first  capturing  subpat-         entire  pattern.  The next pair is used for the first capturing subpat-
1885         tern, and so on. The value returned by pcre_exec() is one more than the         tern, and so on. The value returned by pcre_exec() is one more than the
1886         highest numbered pair that has been set. For example, if two substrings         highest numbered pair that has been set. For example, if two substrings
1887         have  been captured, the returned value is 3. If there are no capturing         have been captured, the returned value is 3. If there are no  capturing
1888         subpatterns, the return value from a successful match is 1,  indicating         subpatterns,  the return value from a successful match is 1, indicating
1889         that just the first pair of offsets has been set.         that just the first pair of offsets has been set.
1890    
1891         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1892         of the string that it matched that is returned.         of the string that it matched that is returned.
1893    
1894         If the vector is too small to hold all the captured substring  offsets,         If  the vector is too small to hold all the captured substring offsets,
1895         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
1896         function returns a value of zero. In particular, if the substring  off-         function  returns a value of zero. In particular, if the substring off-
1897         sets are not of interest, pcre_exec() may be called with ovector passed         sets are not of interest, pcre_exec() may be called with ovector passed
1898         as NULL and ovecsize as zero. However, if  the  pattern  contains  back         as  NULL  and  ovecsize  as zero. However, if the pattern contains back
1899         references  and  the  ovector is not big enough to remember the related         references and the ovector is not big enough to  remember  the  related
1900         substrings, PCRE has to get additional memory for use during  matching.         substrings,  PCRE has to get additional memory for use during matching.
1901         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1902    
1903         The  pcre_info()  function  can  be used to find out how many capturing         The pcre_info() function can be used to find  out  how  many  capturing
1904         subpatterns there are in a compiled  pattern.  The  smallest  size  for         subpatterns  there  are  in  a  compiled pattern. The smallest size for
1905         ovector  that  will allow for n captured substrings, in addition to the         ovector that will allow for n captured substrings, in addition  to  the
1906         offsets of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
1907    
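The sizing rules above reduce to two small calculations. This is a sketch with hypothetical helper names (min_ovecsize and usable_pairs are not PCRE functions); it only restates the (n+1)*3 formula and the rounding down of ovecsize to a multiple of three.

```c
#include <assert.h>

/* Smallest ovecsize that can hold the whole-pattern offsets plus
   `ncaptures` captured substrings: (n+1)*3 elements. */
int min_ovecsize(int ncaptures)
{
    assert(ncaptures >= 0);
    return (ncaptures + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given ovecsize.
   Only the first two-thirds of the vector carries results, and a size
   that is not a multiple of three is rounded down. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For a pattern with two capturing subpatterns, min_ovecsize(2) gives 9, so a declaration such as int ovector[9] would be the minimum; an ovecsize of 10 still yields only three usable pairs.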
1908         It is possible for capturing subpattern number n+1 to match  some  part         It  is  possible for capturing subpattern number n+1 to match some part
1909         of the subject when subpattern n has not been used at all. For example,         of the subject when subpattern n has not been used at all. For example,
1910         if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the         if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
1911         return from the function is 4, and subpatterns 1 and 3 are matched, but         return from the function is 4, and subpatterns 1 and 3 are matched, but
1912         2 is not. When this happens, both values in  the  offset  pairs  corre-         2  is  not.  When  this happens, both values in the offset pairs corre-
1913         sponding to unused subpatterns are set to -1.         sponding to unused subpatterns are set to -1.
1914    
1915         Offset  values  that correspond to unused subpatterns at the end of the         Offset values that correspond to unused subpatterns at the end  of  the
1916         expression are also set to -1. For example,  if  the  string  "abc"  is         expression  are  also  set  to  -1. For example, if the string "abc" is
1917         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not         matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
1918         matched. The return from the function is 2, because  the  highest  used         matched.  The  return  from the function is 2, because the highest used
1919         capturing subpattern number is 1. However, you can refer to the offsets         capturing subpattern number is 1. However, you can refer to the offsets
1920         for the second and third capturing subpatterns if  you  wish  (assuming         for  the  second  and third capturing subpatterns if you wish (assuming
1921         the vector is large enough, of course).         the vector is large enough, of course).
1922    
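Because unused subpatterns are reported as a pair of -1 values, a caller should test an offset pair before using it. The following sketch uses hypothetical helper names (group_is_set and group_length are not PCRE functions) and operates on an ovector that is assumed to have been filled in by a successful pcre_exec() call.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if subpattern `n` (0 = whole match) has a usable offset pair
   in `ovector`, whose total element count is `ovecsize`; 0 otherwise. */
int group_is_set(const int *ovector, int ovecsize, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= ovecsize / 3)
        return 0;                       /* no pair exists for this number */
    /* Unused subpatterns have both offsets set to -1. */
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}

/* Length in bytes of the substring captured by subpattern `n`,
   or -1 if that subpattern is unset or out of range. */
int group_length(const int *ovector, int ovecsize, int n)
{
    if (!group_is_set(ovector, ovecsize, n))
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

For "abc" matched against (a|(z))(bc), the first eight elements of the vector would be 0,3, 0,1, -1,-1, 1,3, so group_is_set() reports subpattern 2 as unset while group_length() gives 2 for subpattern 3.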
1923         Some  convenience  functions  are  provided for extracting the captured         Some convenience functions are provided  for  extracting  the  captured
1924         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
1925    
1926     Error return values from pcre_exec()     Error return values from pcre_exec()
1927    
1928         If pcre_exec() fails, it returns a negative number. The  following  are         If  pcre_exec()  fails, it returns a negative number. The following are
1929         defined in the header file:         defined in the header file:
1930    
1931           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 1901  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1934  MATCHING A PATTERN: THE TRADITIONAL FUNC
1934    
1935           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
1936    
1937         Either  code  or  subject  was  passed as NULL, or ovector was NULL and         Either code or subject was passed as NULL,  or  ovector  was  NULL  and
1938         ovecsize was not zero.         ovecsize was not zero.
1939    
1940           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 1910  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1943  MATCHING A PATTERN: THE TRADITIONAL FUNC
1943    
1944           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1945    
1946         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE  stores a 4-byte "magic number" at the start of the compiled code,
1947         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
1948         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
1949         an  environment  with the other endianness. This is the error that PCRE         an environment with the other endianness. This is the error  that  PCRE
1950         gives when the magic number is not present.         gives when the magic number is not present.
1951    
1952           PCRE_ERROR_UNKNOWN_OPCODE (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
1953    
1954         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
1955         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled pattern. This error could be caused by a bug  in  PCRE  or  by
1956         overwriting of the compiled pattern.         overwriting of the compiled pattern.
1957    
1958           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1959    
1960         If a pattern contains back references, but the ovector that  is  passed         If  a  pattern contains back references, but the ovector that is passed
1961         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
1962         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE  gets  a  block of memory at the start of matching to use for this
1963         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose. If the call via pcre_malloc() fails, this error is given.  The
1964         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
1965    
1966           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1967    
1968         This error is used by the pcre_copy_substring(),  pcre_get_substring(),         This  error is used by the pcre_copy_substring(), pcre_get_substring(),
1969         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
1970         returned by pcre_exec().         returned by pcre_exec().
1971    
1972           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
1973    
1974         The backtracking limit, as specified by  the  match_limit  field  in  a         The  backtracking  limit,  as  specified  by the match_limit field in a
1975         pcre_extra  structure  (or  defaulted) was reached. See the description         pcre_extra structure (or defaulted) was reached.  See  the  description
1976         above.         above.
1977    
1978           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
1979    
1980         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
1981         use  by  callout functions that want to yield a distinctive error code.         use by callout functions that want to yield a distinctive  error  code.
1982         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
1983    
1984           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
1985    
1986         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A  string  that contains an invalid UTF-8 byte sequence was passed as a
1987         subject.         subject.
1988    
1989           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
1990    
1991         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
1992         value of startoffset did not point to the beginning of a UTF-8  charac-         value  of startoffset did not point to the beginning of a UTF-8 charac-
1993         ter.         ter.
1994    
1995           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
1996    
1997         The  subject  string did not match, but it did match partially. See the         The subject string did not match, but it did match partially.  See  the
1998         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
1999    
2000           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2001    
2002         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing         The  PCRE_PARTIAL  option  was  used with a compiled pattern containing
2003         items  that are not supported for partial matching. See the pcrepartial         items that are not supported for partial matching. See the  pcrepartial
2004         documentation for details of partial matching.         documentation for details of partial matching.
2005    
2006           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2007    
2008         An unexpected internal error has occurred. This error could  be  caused         An  unexpected  internal error has occurred. This error could be caused
2009         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2010    
2011           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
2012    
2013         This  error is given if the value of the ovecsize argument is negative.         This error is given if the value of the ovecsize argument is  negative.
2014    
2015           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
2016    
2017         The internal recursion limit, as specified by the match_limit_recursion         The internal recursion limit, as specified by the match_limit_recursion
2018         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field in a pcre_extra structure (or defaulted)  was  reached.  See  the
2019         description above.         description above.
2020    
2021           PCRE_ERROR_NULLWSLIMIT    (-22)           PCRE_ERROR_NULLWSLIMIT    (-22)
2022    
2023         When a group that can match an empty  substring  is  repeated  with  an         When  a  group  that  can  match an empty substring is repeated with an
2024         unbounded  upper  limit, the subject position at the start of the group         unbounded upper limit, the subject position at the start of  the  group
2025         must be remembered, so that a test for an empty string can be made when         must be remembered, so that a test for an empty string can be made when
2026         the  end  of the group is reached. Some workspace is required for this;         the end of the group is reached. Some workspace is required  for  this;
2027         if it runs out, this error is given.         if it runs out, this error is given.
2028    
2029           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
# Line 2013  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2046  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2046         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2047              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2048    
2049         Captured substrings can be  accessed  directly  by  using  the  offsets         Captured  substrings  can  be  accessed  directly  by using the offsets
2050         returned  by  pcre_exec()  in  ovector.  For convenience, the functions         returned by pcre_exec() in  ovector.  For  convenience,  the  functions
2051         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2052         string_list()  are  provided for extracting captured substrings as new,         string_list() are provided for extracting captured substrings  as  new,
2053         separate, zero-terminated strings. These functions identify  substrings         separate,  zero-terminated strings. These functions identify substrings
2054         by  number.  The  next section describes functions for extracting named         by number. The next section describes functions  for  extracting  named
2055         substrings.         substrings.
2056    
2057         A substring that contains a binary zero is correctly extracted and  has         A  substring that contains a binary zero is correctly extracted and has
2058         a  further zero added on the end, but the result is not, of course, a C         a further zero added on the end, but the result is not, of course, a  C
2059         string.  However, you can process such a string  by  referring  to  the         string.   However,  you  can  process such a string by referring to the
2060         length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-         length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
2061         string().  Unfortunately, the interface to pcre_get_substring_list() is         string().  Unfortunately, the interface to pcre_get_substring_list() is
2062         not  adequate for handling strings containing binary zeros, because the         not adequate for handling strings containing binary zeros, because  the
2063         end of the final string is not independently indicated.         end of the final string is not independently indicated.
2064    
2065         The first three arguments are the same for all  three  of  these  func-         The  first  three  arguments  are the same for all three of these func-
2066         tions:  subject  is  the subject string that has just been successfully         tions: subject is the subject string that has  just  been  successfully
2067         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2068         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2069         were captured by the match, including the substring  that  matched  the         were  captured  by  the match, including the substring that matched the
2070         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
2071         it is greater than zero. If pcre_exec() returned zero, indicating  that         it  is greater than zero. If pcre_exec() returned zero, indicating that
2072         it  ran out of space in ovector, the value passed as stringcount should         it ran out of space in ovector, the value passed as stringcount  should
2073         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
2074    
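The rule for choosing stringcount can be written as a small helper. This is a sketch only: stringcount_for is a hypothetical name, and returning -1 for a negative pcre_exec() result is an assumption of the sketch, since the extraction functions are meant to be used only after a successful match.

```c
#include <assert.h>

/* Derives the stringcount argument for the substring functions from the
   value `rc` returned by pcre_exec() and the `ovecsize` that was passed. */
int stringcount_for(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;              /* normal case: use the return value as is */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small: use its capacity */
    return -1;                  /* assumption: no match, nothing to extract */
}
```

The result would then be passed as the third argument to pcre_copy_substring(), pcre_get_substring(), or pcre_get_substring_list().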
2075         The functions pcre_copy_substring() and pcre_get_substring() extract  a         The  functions pcre_copy_substring() and pcre_get_substring() extract a
2076         single  substring,  whose  number  is given as stringnumber. A value of         single substring, whose number is given as  stringnumber.  A  value  of
2077         zero extracts the substring that matched the  entire  pattern,  whereas         zero  extracts  the  substring that matched the entire pattern, whereas
2078         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
2079         string(), the string is placed in buffer,  whose  length  is  given  by         string(),  the  string  is  placed  in buffer, whose length is given by
2080         buffersize,  while  for  pcre_get_substring()  a new block of memory is         buffersize, while for pcre_get_substring() a new  block  of  memory  is
2081         obtained via pcre_malloc, and its address is  returned  via  stringptr.         obtained  via  pcre_malloc,  and its address is returned via stringptr.
2082         The  yield  of  the function is the length of the string, not including         The yield of the function is the length of the  string,  not  including
2083         the terminating zero, or one of these error codes:         the terminating zero, or one of these error codes:
2084    
2085           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2086    
2087         The buffer was too small for pcre_copy_substring(), or the  attempt  to         The  buffer  was too small for pcre_copy_substring(), or the attempt to
2088         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2089    
2090           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2091    
2092         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
2093    
2094         The  pcre_get_substring_list()  function  extracts  all  available sub-         The pcre_get_substring_list()  function  extracts  all  available  sub-
2095         strings and builds a list of pointers to them. All this is  done  in  a         strings  and  builds  a list of pointers to them. All this is done in a
2096         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
2097         the memory block is returned via listptr, which is also  the  start  of         the  memory  block  is returned via listptr, which is also the start of
2098         the  list  of  string pointers. The end of the list is marked by a NULL         the list of string pointers. The end of the list is marked  by  a  NULL
2099         pointer. The yield of the function is zero if all  went  well,  or  the         pointer.  The  yield  of  the function is zero if all went well, or the
2100         error code         error code
2101    
2102           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2103    
2104         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
2105    
2106         When any of these functions encounters a substring that is unset, which         When any of these functions encounters a substring that is unset, which
2107         can happen when capturing subpattern number n+1 matches  some  part  of         can  happen  when  capturing subpattern number n+1 matches some part of
2108         the  subject, but subpattern n has not been used at all, they return an         the subject, but subpattern n has not been used at all, they return  an
2109         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
2110         string  by inspecting the appropriate offset in ovector, which is nega-         string by inspecting the appropriate offset in ovector, which is  nega-
2111         tive for unset substrings.         tive for unset substrings.
2112    
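The documented behaviour of pcre_copy_substring(), including the empty string that is produced for an unset substring, can be mimicked in a few lines. The sketch below is an illustration and not the library's source: copy_substring_sketch and the SKETCH_ macros are hypothetical names, with the error values -6 and -7 taken from the list above.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_ERROR_NOMEMORY    (-6)   /* mirrors PCRE_ERROR_NOMEMORY */
#define SKETCH_ERROR_NOSUBSTRING (-7)   /* mirrors PCRE_ERROR_NOSUBSTRING */

/* Copies substring `stringnumber` into `buffer` as a zero-terminated string.
   Returns its length (excluding the terminating zero) or a negative error. */
int copy_substring_sketch(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    int start = ovector[2 * stringnumber];
    int end = ovector[2 * stringnumber + 1];
    int len = end - start;      /* an unset pair (-1,-1) gives length 0 */

    if (buffersize < len + 1)
        return SKETCH_ERROR_NOMEMORY;   /* no room for the text plus the zero */
    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);  /* binary zeros survive */
    buffer[len] = '\0';
    return len;
}
```

Since an unset substring and a genuine zero-length substring both yield 0 here, a caller that needs to tell them apart would inspect ovector[2*stringnumber], which is negative only in the unset case.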
2113         The two convenience functions pcre_free_substring() and  pcre_free_sub-         The  two convenience functions pcre_free_substring() and pcre_free_sub-
2114         string_list()  can  be  used  to free the memory returned by a previous         string_list() can be used to free the memory  returned  by  a  previous
2115         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2116         tively.  They  do  nothing  more  than  call the function pointed to by         tively. They do nothing more than  call  the  function  pointed  to  by
2117         pcre_free, which of course could be called directly from a  C  program.         pcre_free,  which  of course could be called directly from a C program.
2118         However,  PCRE is used in some situations where it is linked via a spe-         However, PCRE is used in some situations where it is linked via a  spe-
2119         cial  interface  to  another  programming  language  that  cannot   use         cial   interface  to  another  programming  language  that  cannot  use
2120         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free directly; it is for these cases that the functions  are  pro-
2121         vided.         vided.
2122    
2123    
# Line 2103  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2136  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2136              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2137              const char **stringptr);              const char **stringptr);
2138    
2139         To extract a substring by name, you first have to find associated  num-         To  extract a substring by name, you first have to find associated num-
2140         ber.  For example, for this pattern         ber.  For example, for this pattern
2141    
2142           (a+)b(?<xxx>\d+)...           (a+)b(?<xxx>\d+)...
# Line 2112  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2145  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2145         be unique (PCRE_DUPNAMES was not set), you can find the number from the         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2146         name by calling pcre_get_stringnumber(). The first argument is the com-         name by calling pcre_get_stringnumber(). The first argument is the com-
2147         piled pattern, and the second is the name. The yield of the function is         piled pattern, and the second is the name. The yield of the function is
2148         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no         the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no
2149         subpattern of that name.         subpattern of that name.
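
The name-to-number step can be pictured with a stand-alone sketch of a lookup over a name table. It is not PCRE's implementation. It assumes the table layout that the pcreapi documentation gives for PCRE_INFO_NAMETABLE (fixed-size entries, two bytes of subpattern number with the most significant byte first, then the zero-terminated name), and it uses a simple linear scan. The names lookup_group_number and SKETCH_ERROR_NOSUBSTRING are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constant; same value as PCRE_ERROR_NOSUBSTRING (-7). */
#define SKETCH_ERROR_NOSUBSTRING (-7)

/*
 * Find the subpattern number for a name in a table of fixed-size
 * entries. Assumed layout of each entry: two bytes holding the number,
 * most significant byte first, followed by the zero-terminated name.
 * Returns the number, or SKETCH_ERROR_NOSUBSTRING if no entry matches.
 */
static int lookup_group_number(const unsigned char *table, int count,
                               int entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size > 2);

    for (int i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        /* The name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return SKETCH_ERROR_NOSUBSTRING;
}
```

For the pattern shown above, where xxx is the second capturing subpattern, a one-entry table would hold the bytes 0, 2, 'x', 'x', 'x', 0, and looking up "xxx" would give 2. In a real program the supported route is to call pcre_get_stringnumber() rather than read the table by hand.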
2150    
2151         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2152         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
2153         are also two functions that do the whole job.         are also two functions that do the whole job.
2154    
2155         Most   of   the   arguments    of    pcre_copy_named_substring()    and         Most    of    the    arguments   of   pcre_copy_named_substring()   and
2156         pcre_get_named_substring()  are  the  same  as  those for the similarly         pcre_get_named_substring() are the same  as  those  for  the  similarly
2157         named functions that extract by number. As these are described  in  the         named  functions  that extract by number. As these are described in the
2158         previous  section,  they  are not re-described here. There are just two         previous section, they are not re-described here. There  are  just  two
2159         differences:         differences:
2160    
2161         First, instead of a substring number, a substring name is  given.  Sec-         First,  instead  of a substring number, a substring name is given. Sec-
2162         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2163         to the compiled pattern. This is needed in order to gain access to  the         to  the compiled pattern. This is needed in order to gain access to the
2164         name-to-number translation table.         name-to-number translation table.
2165    
2166         These  functions call pcre_get_stringnumber(), and if it succeeds, they         These functions call pcre_get_stringnumber(), and if it succeeds,  they
2167         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
2168         ate.         ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the
2169           behaviour may not be what you want (see the next section).
2170    
2171    
2172  DUPLICATE SUBPATTERN NAMES  DUPLICATE SUBPATTERN NAMES
# Line 2351  AUTHOR Line 2385  AUTHOR
2385    
2386  REVISION  REVISION
2387    
2388         Last updated: 06 March 2007         Last updated: 16 April 2007
2389         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2390  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2391    
# Line 2904  BACKSLASH Line 2938  BACKSLASH
2938         is a letter or digit. The definition of  letters  and  digits  is  con-         is a letter or digit. The definition of  letters  and  digits  is  con-
2939         trolled  by PCRE's low-valued character tables, and may vary if locale-         trolled  by PCRE's low-valued character tables, and may vary if locale-
2940         specific matching is taking place (see "Locale support" in the  pcreapi         specific matching is taking place (see "Locale support" in the  pcreapi
2941         page).  For  example,  in  the  "fr_FR" (French) locale, some character         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
2942         codes greater than 128 are used for accented  letters,  and  these  are         systems, or "french" in Windows, some character codes greater than  128
2943         matched by \w.         are used for accented letters, and these are matched by \w.
2944    
2945         In  UTF-8 mode, characters with values greater than 128 never match \d,         In  UTF-8 mode, characters with values greater than 128 never match \d,
2946         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
# Line 3275  SQUARE BRACKETS AND CHARACTER CLASSES Line 3309  SQUARE BRACKETS AND CHARACTER CLASSES
3309         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
3310         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
3311         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
3312         character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches         character tables for a French locale are in  use,  [\xc8-\xcb]  matches
3313         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
3314         concept of case for characters with values greater than 128  only  when         concept of case for characters with values greater than 128  only  when
3315         it is compiled with Unicode property support.         it is compiled with Unicode property support.
# Line 4503  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4537  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4537         not always produce exactly the same result as matching over one  single         not always produce exactly the same result as matching over one  single
4538         long  string.   The  difference arises when there are multiple matching         long  string.   The  difference arises when there are multiple matching
4539         possibilities, because a partial match result is given only when  there         possibilities, because a partial match result is given only when  there
4540         are  no  completed  matches  in a call to fBpcre_dfa_exec(). This means         are  no completed matches in a call to pcre_dfa_exec(). This means that
4541         that as soon as the shortest match has been found,  continuation  to  a         as soon as the shortest match has been found,  continuation  to  a  new
4542         new  subject  segment  is  no  longer possible.  Consider this pcretest         subject segment is no longer possible.  Consider this pcretest example:
        example:  
4543    
4544             re> /dog(sbody)?/             re> /dog(sbody)?/
4545           data> do\P\D           data> do\P\D

