Diff of /code/trunk/doc/pcre.txt

revision 123 by ph10, Mon Mar 12 15:19:06 2007 UTC revision 153 by ph10, Wed Apr 18 09:12:14 2007 UTC
# Line 72  USER DOCUMENTATION Line 72  USER DOCUMENTATION
72         of searching. The sections are as follows:         of searching. The sections are as follows:
73    
74           pcre              this document           pcre              this document
75             pcre-config       show PCRE installation configuration information
76           pcreapi           details of PCRE's native C API           pcreapi           details of PCRE's native C API
77           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
78           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
# Line 215  AUTHOR Line 216  AUTHOR
216         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
217    
218         Putting an actual email address here seems to have been a spam  magnet,         Putting an actual email address here seems to have been a spam  magnet,
219         so I've taken it away. If you want to email me, use my initial and sur-         so  I've  taken  it away. If you want to email me, use my two initials,
220         name, separated by a dot, at the domain ucs.cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
221    
222    
223  REVISION  REVISION
224    
225         Last updated: 06 March 2007         Last updated: 18 April 2007
226         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
227  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
228    
# Line 244  PCRE BUILD-TIME OPTIONS Line 245  PCRE BUILD-TIME OPTIONS
245    
246           ./configure --help           ./configure --help
247    
248         The following sections describe certain options whose names begin  with         The following sections include  descriptions  of  options  whose  names
249         --enable  or  --disable. These settings specify changes to the defaults         begin with --enable or --disable. These settings specify changes to the
250         for the configure command. Because of the  way  that  configure  works,         defaults for the configure command. Because of the way  that  configure
251         --enable  and  --disable  always  come  in  pairs, so the complementary         works,  --enable  and --disable always come in pairs, so the complemen-
252         option always exists as well, but as it specifies the  default,  it  is         tary option always exists as well, but as it specifies the default,  it
253         not described.         is not described.
254    
255    
256  C++ SUPPORT  C++ SUPPORT
# Line 288  UNICODE CHARACTER PROPERTY SUPPORT Line 289  UNICODE CHARACTER PROPERTY SUPPORT
289         to the configure command. This implies UTF-8 support, even if you  have         to the configure command. This implies UTF-8 support, even if you  have
290         not explicitly requested it.         not explicitly requested it.
291    
292         Including  Unicode  property  support  adds around 90K of tables to the         Including  Unicode  property  support  adds around 30K of tables to the
293         PCRE library, approximately doubling its size. Only the  general  cate-         PCRE library. Only the general category properties such as  Lu  and  Nd
294         gory  properties  such as Lu and Nd are supported. Details are given in         are supported. Details are given in the pcrepattern documentation.
        the pcrepattern documentation.  
295    
296    
297  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
298    
299         By default, PCRE interprets character 10 (linefeed, LF)  as  indicating         By  default,  PCRE interprets character 10 (linefeed, LF) as indicating
300         the  end  of  a line. This is the normal newline character on Unix-like         the end of a line. This is the normal newline  character  on  Unix-like
301         systems. You can compile PCRE to use character 13 (carriage return, CR)         systems. You can compile PCRE to use character 13 (carriage return, CR)
302         instead, by adding         instead, by adding
303    
304           --enable-newline-is-cr           --enable-newline-is-cr
305    
306         to  the  configure  command.  There  is  also  a --enable-newline-is-lf         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
307         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
308    
309         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 313  CODE VALUE OF NEWLINE Line 313  CODE VALUE OF NEWLINE
313    
314         to the configure command. There is a fourth option, specified by         to the configure command. There is a fourth option, specified by
315    
316             --enable-newline-is-anycrlf
317    
318           which causes PCRE to recognize any of the three sequences  CR,  LF,  or
319           CRLF as indicating a line ending. Finally, a fifth option, specified by
320    
321           --enable-newline-is-any           --enable-newline-is-any
322    
323         which causes PCRE to recognize any Unicode newline sequence.         causes PCRE to recognize any Unicode newline sequence.
324    
325         Whatever  line  ending convention is selected when PCRE is built can be         Whatever line ending convention is selected when PCRE is built  can  be
326         overridden when the library functions are called. At build time  it  is         overridden  when  the library functions are called. At build time it is
327         conventional to use the standard for your operating system.         conventional to use the standard for your operating system.
328    
329    
330  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
331    
332         The  PCRE building process uses libtool to build both shared and static         The PCRE building process uses libtool to build both shared and  static
333         Unix libraries by default. You can suppress one of these by adding  one         Unix  libraries by default. You can suppress one of these by adding one
334         of         of
335    
336           --disable-shared           --disable-shared
# Line 337  BUILDING SHARED AND STATIC LIBRARIES Line 342  BUILDING SHARED AND STATIC LIBRARIES
342  POSIX MALLOC USAGE  POSIX MALLOC USAGE
343    
344         When PCRE is called through the POSIX interface (see the pcreposix doc-         When PCRE is called through the POSIX interface (see the pcreposix doc-
345         umentation), additional working storage is  required  for  holding  the         umentation),  additional  working  storage  is required for holding the
346         pointers  to capturing substrings, because PCRE requires three integers         pointers to capturing substrings, because PCRE requires three  integers
347         per substring, whereas the POSIX interface provides only  two.  If  the         per  substring,  whereas  the POSIX interface provides only two. If the
348         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
349         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
350         The default threshold above which the stack is no longer used is 10; it         The default threshold above which the stack is no longer used is 10; it
# Line 352  POSIX MALLOC USAGE Line 357  POSIX MALLOC USAGE
357    
358  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
359    
360         Within a compiled pattern, offset values are used  to  point  from  one         Within  a  compiled  pattern,  offset values are used to point from one
361         part  to another (for example, from an opening parenthesis to an alter-         part to another (for example, from an opening parenthesis to an  alter-
362         nation metacharacter). By default, two-byte values are used  for  these         nation  metacharacter).  By default, two-byte values are used for these
363         offsets,  leading  to  a  maximum size for a compiled pattern of around         offsets, leading to a maximum size for a  compiled  pattern  of  around
364         64K. This is sufficient to handle all but the most  gigantic  patterns.         64K.  This  is sufficient to handle all but the most gigantic patterns.
365         Nevertheless,  some  people do want to process enormous patterns, so it         Nevertheless, some people do want to process enormous patterns,  so  it
366         is possible to compile PCRE to use three-byte or four-byte  offsets  by         is  possible  to compile PCRE to use three-byte or four-byte offsets by
367         adding a setting such as         adding a setting such as
368    
369           --with-link-size=3           --with-link-size=3
370    
371         to  the  configure  command.  The value given must be 2, 3, or 4. Using         to the configure command. The value given must be 2,  3,  or  4.  Using
372         longer offsets slows down the operation of PCRE because it has to  load         longer  offsets slows down the operation of PCRE because it has to load
373         additional bytes when handling them.         additional bytes when handling them.
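The relationship between link size and the "around 64K" figure can be sketched as follows. This is a minimal illustration, assuming the limit is simply the largest value that fits in the given number of bytes; the helper name max_link_offset is hypothetical and not part of PCRE, and the real limit on a compiled pattern may be lower because of other internal constraints.

```c
#include <assert.h>

/* Largest offset that fits in link_size bytes (2, 3, or 4).
   Hypothetical helper for illustration; not a PCRE function. */
unsigned long long max_link_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (1ULL << (8 * link_size)) - 1;
}
```

With a link size of 2 this gives 65535, which matches the 64K figure quoted above; 3 gives roughly 16 million and 4 roughly 4 thousand million.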
374    
        If  you  build  PCRE with an increased link size, test 2 (and test 5 if  
        you are using UTF-8) will fail. Part of the output of these tests is  a  
        representation  of the compiled pattern, and this changes with the link  
        size.  
   
375    
376  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
377    
# Line 429  LIMITING PCRE RESOURCE USAGE Line 429  LIMITING PCRE RESOURCE USAGE
429         time.         time.
430    
431    
432    CREATING CHARACTER TABLES AT BUILD TIME
433    
434           PCRE uses fixed tables for processing characters whose code values  are
435           less  than 256. By default, PCRE is built with a set of tables that are
436           distributed in the file pcre_chartables.c.dist. These  tables  are  for
437           ASCII codes only. If you add
438    
439             --enable-rebuild-chartables
440    
441           to  the  configure  command, the distributed tables are no longer used.
442           Instead, a program called dftables is compiled and  run.  This  outputs
443           the source for a new set of tables, created in the default locale of your
444           C runtime system. (This method of replacing the tables does not work if
445           you  are cross compiling, because dftables is run on the local host. If
446           you need to create alternative tables when cross  compiling,  you  will
447           have to do so "by hand".)
448    
449    
450  USING EBCDIC CODE  USING EBCDIC CODE
451    
452         PCRE assumes by default that it will run in an  environment  where  the         PCRE  assumes  by  default that it will run in an environment where the
453         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
454         PCRE can, however, be compiled to  run  in  an  EBCDIC  environment  by         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by
455         adding         adding
456    
457           --enable-ebcdic           --enable-ebcdic
458    
459         to the configure command.         to the configure command. This setting implies --enable-rebuild-charta-
460           bles.
461    
462    
463  SEE ALSO  SEE ALSO
# Line 455  AUTHOR Line 474  AUTHOR
474    
475  REVISION  REVISION
476    
477         Last updated: 06 March 2007         Last updated: 16 April 2007
478         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
479  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
480    
# Line 508  REGULAR EXPRESSIONS AS TREES Line 527  REGULAR EXPRESSIONS AS TREES
527    
528  THE STANDARD MATCHING ALGORITHM  THE STANDARD MATCHING ALGORITHM
529    
530         In the terminology of Jeffrey Friedl's book Mastering  Regular  Expres-         In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
531         sions,  the  standard  algorithm  is  an "NFA algorithm". It conducts a         sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
532         depth-first search of the pattern tree. That is, it  proceeds  along  a         depth-first search of the pattern tree. That is, it  proceeds  along  a
533         single path through the tree, checking that the subject matches what is         single path through the tree, checking that the subject matches what is
534         required. When there is a mismatch, the algorithm  tries  any  alterna-         required. When there is a mismatch, the algorithm  tries  any  alterna-
# Line 828  PCRE API OVERVIEW Line 847  PCRE API OVERVIEW
847    
848  NEWLINES  NEWLINES
849    
850         PCRE  supports four different conventions for indicating line breaks in         PCRE  supports five different conventions for indicating line breaks in
851         strings: a single CR (carriage return) character, a  single  LF  (line-         strings: a single CR (carriage return) character, a  single  LF  (line-
852         feed)  character,  the two-character sequence CRLF, or any Unicode new-         feed) character, the two-character sequence CRLF, any of the three pre-
853         line sequence.  The Unicode newline sequences are the three  just  men-         ceding, or any Unicode newline sequence. The Unicode newline  sequences
854         tioned, plus the single characters VT (vertical tab, U+000B), FF (form-         are  the  three just mentioned, plus the single characters VT (vertical
855         feed, U+000C), NEL (next line, U+0085), LS  (line  separator,  U+2028),         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
856         and PS (paragraph separator, U+2029).         separator, U+2028), and PS (paragraph separator, U+2029).
857    
858         Each  of  the first three conventions is used by at least one operating         Each  of  the first three conventions is used by at least one operating
859         system as its standard newline sequence. When PCRE is built, a  default         system as its standard newline sequence. When PCRE is built, a  default
# Line 899  CHECKING BUILD-TIME OPTIONS Line 918  CHECKING BUILD-TIME OPTIONS
918    
919         The output is an integer whose value specifies  the  default  character         The output is an integer whose value specifies  the  default  character
920         sequence  that is recognized as meaning "newline". The four values that         sequence  that is recognized as meaning "newline". The five values that
921         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, and -1 for ANY.         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
922         The default should normally be the standard sequence for your operating         and  -1  for  ANY. The default should normally be the standard sequence
923         system.         for your operating system.
924    
925           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
926    
# Line 1125  COMPILING A PATTERN Line 1144  COMPILING A PATTERN
1144           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1145           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1146           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1147             PCRE_NEWLINE_ANYCRLF
1148           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1149    
1150         These  options  override the default newline definition that was chosen         These  options  override the default newline definition that was chosen
1151         when PCRE was built. Setting the first or the second specifies  that  a         when PCRE was built. Setting the first or the second specifies  that  a
1152         newline  is  indicated  by a single character (CR or LF, respectively).         newline  is  indicated  by a single character (CR or LF, respectively).
1153         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1154         two-character  CRLF  sequence.  Setting PCRE_NEWLINE_ANY specifies that         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1155         any Unicode newline sequence should be recognized. The Unicode  newline         that any of the three preceding sequences should be recognized. Setting
1156         sequences  are  the three just mentioned, plus the single characters VT         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1157         (vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085),         recognized. The Unicode newline sequences are the three just mentioned,
1158         LS  (line separator, U+2028), and PS (paragraph separator, U+2029). The         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1159         last two are recognized only in UTF-8 mode.         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1160           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1161           UTF-8 mode.
1162    
1163         The newline setting in the  options  word  uses  three  bits  that  are         The newline setting in the  options  word  uses  three  bits  that  are
1164         treated  as  a  number, giving eight possibilities. Currently only five         treated as a number, giving eight possibilities. Currently only six are
1165         are used (default plus the four values above). This means that  if  you         used (default plus the five values above). This means that if  you  set
1166         set  more  than  one  newline option, the combination may or may not be         more  than one newline option, the combination may or may not be sensi-
1167         sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is  equiva-         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1168         lent  to PCRE_NEWLINE_CRLF, but other combinations yield unused numbers         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1169         and cause an error.         cause an error.
1170    
1171         The only time that a line break is specially recognized when  compiling         The only time that a line break is specially recognized when  compiling
1172         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
# Line 1310  STUDYING A PATTERN Line 1332  STUDYING A PATTERN
1332  LOCALE SUPPORT  LOCALE SUPPORT
1333    
1334         PCRE handles caseless matching, and determines whether  characters  are         PCRE handles caseless matching, and determines whether  characters  are
1335         letters  digits,  or whatever, by reference to a set of tables, indexed         letters,  digits, or whatever, by reference to a set of tables, indexed
1336         by character value. When running in UTF-8 mode, this  applies  only  to         by character value. When running in UTF-8 mode, this  applies  only  to
1337         characters  with  codes  less than 128. Higher-valued codes never match         characters  with  codes  less than 128. Higher-valued codes never match
1338         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1339         with  Unicode  character property support. The use of locales with Uni-         with  Unicode  character property support. The use of locales with Uni-
1340         code is discouraged.         code is discouraged. If you are handling characters with codes  greater
1341           than  128, you should either use UTF-8 and Unicode, or use locales, but
1342         An internal set of tables is created in the default C locale when  PCRE         not try to mix the two.
1343         is  built.  This  is  used when the final argument of pcre_compile() is  
1344         NULL, and is sufficient for many applications. An  alternative  set  of         PCRE contains an internal set of tables that are used  when  the  final
1345         tables  can,  however, be supplied. These may be created in a different         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1346         locale from the default. As more and more applications change to  using         applications.  Normally, the internal tables recognize only ASCII char-
1347         Unicode, the need for this locale support is expected to die away.         acters. However, when PCRE is built, it is possible to cause the inter-
1348           nal tables to be rebuilt in the default "C" locale of the local system,
1349         External  tables  are  built by calling the pcre_maketables() function,         which may cause them to be different.
1350         which has no arguments, in the relevant locale. The result can then  be  
1351         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For         The  internal tables can always be overridden by tables supplied by the
1352         example, to build and use tables that are appropriate  for  the  French         application that calls PCRE. These may be created in a different locale
1353         locale  (where  accented  characters  with  values greater than 128 are         from  the  default.  As more and more applications change to using Uni-
1354           code, the need for this locale support is expected to die away.
1355    
1356           External tables are built by calling  the  pcre_maketables()  function,
1357           which  has no arguments, in the relevant locale. The result can then be
1358           passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
1359           example,  to  build  and use tables that are appropriate for the French
1360           locale (where accented characters with  values  greater  than  128  are
1361         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1362    
1363           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1364           tables = pcre_maketables();           tables = pcre_maketables();
1365           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1366    
1367           The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1368           if you are using Windows, the name for the French locale is "french".
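To illustrate what "tables indexed by character value, built in a locale" means, here is a sketch in standard C only. It is not how pcre_maketables() is implemented, and the tables PCRE builds have their own internal layout; the function name build_letter_table is hypothetical.

```c
#include <ctype.h>
#include <locale.h>

/* Fill letter_table[c] with 1 if byte value c is a letter in the
   current LC_CTYPE locale, else 0. Illustrative only; the tables
   built by pcre_maketables() have their own internal layout. */
void build_letter_table(unsigned char letter_table[256])
{
    int c;
    for (c = 0; c < 256; c++)
        letter_table[c] = isalpha(c) ? 1 : 0;
}
```

If setlocale(LC_CTYPE, "fr_FR") succeeds before the call, entries for accented characters above 128 may be set; in the "C" locale only the ASCII letters are.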
1369    
1370         When pcre_maketables() runs, the tables are built  in  memory  that  is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1371         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1372         that the memory containing the tables remains available for as long  as         that the memory containing the tables remains available for as long  as
# Line 1702  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1734  MATCHING A PATTERN: THE TRADITIONAL FUNC
1734           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1735           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1736           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1737             PCRE_NEWLINE_ANYCRLF
1738           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1739    
1740         These options override  the  newline  definition  that  was  chosen  or         These options override  the  newline  definition  that  was  chosen  or
# Line 1709  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1742  MATCHING A PATTERN: THE TRADITIONAL FUNC
1742         tion of pcre_compile()  above.  During  matching,  the  newline  choice         tion of pcre_compile()  above.  During  matching,  the  newline  choice
1743         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
1744         ters. It may also alter the way the match position is advanced after  a         ters. It may also alter the way the match position is advanced after  a
1745         match  failure  for  an  unanchored  pattern. When PCRE_NEWLINE_CRLF or         match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,
1746         PCRE_NEWLINE_ANY is set, and a match attempt  fails  when  the  current         PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a  match  attempt
1747         position  is  at a CRLF sequence, the match position is advanced by two         fails  when the current position is at a CRLF sequence, the match posi-
1748         characters instead of one, in other words, to after the CRLF.         tion is advanced by two characters instead of one, in other  words,  to
1749           after the CRLF.
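The advance rule described above can be sketched as a small helper. This is an illustration of the rule only, not PCRE's code: the name next_start and the crlf_aware flag are hypothetical, and the sketch works on single bytes, ignoring UTF-8 multi-byte characters.

```c
#include <stddef.h>

/* Next start position after a failed match attempt at pos.
   crlf_aware is nonzero when the newline setting is CRLF, ANYCRLF,
   or ANY. Hypothetical helper sketching the rule in the text; it
   ignores UTF-8 multi-byte characters. */
size_t next_start(const char *subject, size_t length, size_t pos,
                  int crlf_aware)
{
    if (crlf_aware && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over the whole CRLF */
    return pos + 1;
}
```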
1750    
1751           PCRE_NOTBOL           PCRE_NOTBOL
1752    
1753         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
1754         the  beginning  of  a  line, so the circumflex metacharacter should not         the beginning of a line, so the  circumflex  metacharacter  should  not
1755         match before it. Setting this without PCRE_MULTILINE (at compile  time)         match  before it. Setting this without PCRE_MULTILINE (at compile time)
1756         causes  circumflex  never to match. This option affects only the behav-         causes circumflex never to match. This option affects only  the  behav-
1757         iour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
1758    
1759           PCRE_NOTEOL           PCRE_NOTEOL
1760    
1761         This option specifies that the end of the subject string is not the end         This option specifies that the end of the subject string is not the end
1762         of  a line, so the dollar metacharacter should not match it nor (except         of a line, so the dollar metacharacter should not match it nor  (except
1763         in multiline mode) a newline immediately before it. Setting this  with-         in  multiline mode) a newline immediately before it. Setting this with-
1764         out PCRE_MULTILINE (at compile time) causes dollar never to match. This         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1765         option affects only the behaviour of the dollar metacharacter. It  does         option  affects only the behaviour of the dollar metacharacter. It does
1766         not affect \Z or \z.         not affect \Z or \z.
1767    
1768           PCRE_NOTEMPTY           PCRE_NOTEMPTY
1769    
1770         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
1771         set. If there are alternatives in the pattern, they are tried.  If  all         set.  If  there are alternatives in the pattern, they are tried. If all
1772         the  alternatives  match  the empty string, the entire match fails. For         the alternatives match the empty string, the entire  match  fails.  For
1773         example, if the pattern         example, if the pattern
1774    
1775           a?b?           a?b?
1776    
1777         is applied to a string not beginning with "a" or "b",  it  matches  the         is  applied  to  a string not beginning with "a" or "b", it matches the
1778         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
1779         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
1780         rences of "a" or "b".         rences of "a" or "b".
1781    
1782         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1783         cial case of a pattern match of the empty  string  within  its  split()         cial  case  of  a  pattern match of the empty string within its split()
1784         function,  and  when  using  the /g modifier. It is possible to emulate         function, and when using the /g modifier. It  is  possible  to  emulate
1785         Perl's behaviour after matching a null string by first trying the match         Perl's behaviour after matching a null string by first trying the match
1786         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1787         if that fails by advancing the starting offset (see below)  and  trying         if  that  fails by advancing the starting offset (see below) and trying
1788         an ordinary match again. There is some code that demonstrates how to do         an ordinary match again. There is some code that demonstrates how to do
1789         this in the pcredemo.c sample program.         this in the pcredemo.c sample program.
1790    
1791           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1792    
1793         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
1794         UTF-8  string is automatically checked when pcre_exec() is subsequently         UTF-8 string is automatically checked when pcre_exec() is  subsequently
1795         called.  The value of startoffset is also checked  to  ensure  that  it         called.   The  value  of  startoffset is also checked to ensure that it
1796         points  to the start of a UTF-8 character. If an invalid UTF-8 sequence         points to the start of a UTF-8 character. If an invalid UTF-8  sequence
1797         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1798         startoffset  contains  an  invalid  value, PCRE_ERROR_BADUTF8_OFFSET is         startoffset contains an  invalid  value,  PCRE_ERROR_BADUTF8_OFFSET  is
1799         returned.         returned.
1800    
1801         If you already know that your subject is valid, and you  want  to  skip         If  you  already  know that your subject is valid, and you want to skip
1802         these    checks    for   performance   reasons,   you   can   set   the         these   checks   for   performance   reasons,   you   can    set    the
1803         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
1804         do  this  for the second and subsequent calls to pcre_exec() if you are         do this for the second and subsequent calls to pcre_exec() if  you  are
1805         making repeated calls to find all  the  matches  in  a  single  subject         making  repeated  calls  to  find  all  the matches in a single subject
1806         string.  However,  you  should  be  sure  that the value of startoffset         string. However, you should be  sure  that  the  value  of  startoffset
1807         points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1808         set,  the  effect of passing an invalid UTF-8 string as a subject, or a         set, the effect of passing an invalid UTF-8 string as a subject,  or  a
1809         value of startoffset that does not point to the start of a UTF-8  char-         value  of startoffset that does not point to the start of a UTF-8 char-
1810         acter, is undefined. Your program may crash.         acter, is undefined. Your program may crash.
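
A caller that sets PCRE_NO_UTF8_CHECK takes over responsibility for the startoffset check. The following is a minimal sketch of such a check, assuming only the standard UTF-8 rule that continuation bytes have the bit pattern 10xxxxxx. The function name utf8_offset_is_char_start is hypothetical and is not part of the PCRE API; it does not validate the rest of the subject.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Returns 1 if offset lies within [0, length] and does not land on a
   UTF-8 continuation byte (10xxxxxx); returns 0 otherwise.
   It does not check that the subject as a whole is valid UTF-8. */
int utf8_offset_is_char_start(const char *subject, int length, int offset)
{
    if (subject == NULL || offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;   /* end of subject: no character can be split here */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

A program might call this once per iteration of a "find all matches" loop before passing PCRE_NO_UTF8_CHECK on the second and later calls.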
1811    
1812           PCRE_PARTIAL           PCRE_PARTIAL
1813    
1814         This  option  turns  on  the  partial  matching feature. If the subject         This option turns on the  partial  matching  feature.  If  the  subject
1815         string fails to match the pattern, but at some point during the  match-         string  fails to match the pattern, but at some point during the match-
1816         ing  process  the  end of the subject was reached (that is, the subject         ing process the end of the subject was reached (that  is,  the  subject
1817         partially matches the pattern and the failure to  match  occurred  only         partially  matches  the  pattern and the failure to match occurred only
1818         because  there were not enough subject characters), pcre_exec() returns         because there were not enough subject characters), pcre_exec()  returns
1819         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is         PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
1820         used,  there  are restrictions on what may appear in the pattern. These         used, there are restrictions on what may appear in the  pattern.  These
1821         are discussed in the pcrepartial documentation.         are discussed in the pcrepartial documentation.
1822    
1823     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
1824    
1825         The subject string is passed to pcre_exec() as a pointer in subject,  a         The  subject string is passed to pcre_exec() as a pointer in subject, a
1826         length  in  length, and a starting byte offset in startoffset. In UTF-8         length in length, and a starting byte offset in startoffset.  In  UTF-8
1827         mode, the byte offset must point to the start  of  a  UTF-8  character.         mode,  the  byte  offset  must point to the start of a UTF-8 character.
1828         Unlike  the  pattern string, the subject may contain binary zero bytes.         Unlike the pattern string, the subject may contain binary  zero  bytes.
1829         When the starting offset is zero, the search for a match starts at  the         When  the starting offset is zero, the search for a match starts at the
1830         beginning of the subject, and this is by far the most common case.         beginning of the subject, and this is by far the most common case.
1831    
1832         A  non-zero  starting offset is useful when searching for another match         A non-zero starting offset is useful when searching for  another  match
1833         in the same subject by calling pcre_exec() again after a previous  suc-         in  the same subject by calling pcre_exec() again after a previous suc-
1834         cess.   Setting  startoffset differs from just passing over a shortened         cess.  Setting startoffset differs from just passing over  a  shortened
1835         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
1836         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
1837    
1838           \Biss\B           \Biss\B
1839    
1840         which  finds  occurrences  of "iss" in the middle of words. (\B matches         which finds occurrences of "iss" in the middle of  words.  (\B  matches
1841         only if the current position in the subject is not  a  word  boundary.)         only  if  the  current position in the subject is not a word boundary.)
1842         When applied  to the string "Mississippi" the first call to pcre_exec()         When applied to the string "Mississippi" the first call to  pcre_exec()
1843         finds the first occurrence. If pcre_exec() is called  again  with  just         finds  the  first  occurrence. If pcre_exec() is called again with just
1844         the remainder  of  the  subject,  namely  "issippi", it does not match,         the remainder of the subject, namely  "issippi",  it  does  not  match,
1845         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
1846         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to be a word boundary. However, if pcre_exec()  is  passed  the  entire
1847         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
1848         rence  of "iss" because it is able to look behind the starting point to         rence of "iss" because it is able to look behind the starting point  to
1849         discover that it is preceded by a letter.         discover that it is preceded by a letter.
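
The difference between the two calls comes down to what \B can see behind the starting point. The sketch below is a hand-written model of the \B test for ASCII subjects, not PCRE's own code. The names is_word_char and not_word_boundary are hypothetical, and the model assumes that a word character is a letter, digit, or underscore.

```c
#include <ctype.h>

/* Hypothetical helper: ASCII letter, digit, or underscore. */
int is_word_char(int c)
{
    return c == '_' || isalnum((unsigned char)c);
}

/* Hand-written model of \B at byte position pos in subject[0..length).
   \B holds when the characters on either side of pos are of the same
   kind. Positions before the start or past the end count as non-word. */
int not_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char(subject[pos - 1]);
    int after = pos < length && is_word_char(subject[pos]);
    return before == after;
}
```

With the whole subject and pos set to 4, the model sees a letter on each side and \B holds. With only the remainder of the subject and pos set to 0, nothing precedes the position, so \B fails.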
1850    
1851         If a non-zero starting offset is passed when the pattern  is  anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
1852         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
1853         if the pattern does not require the match to be at  the  start  of  the         if  the  pattern  does  not require the match to be at the start of the
1854         subject.         subject.
1855    
1856     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
1857    
1858         In  general, a pattern matches a certain portion of the subject, and in         In general, a pattern matches a certain portion of the subject, and  in
1859         addition, further substrings from the subject  may  be  picked  out  by         addition,  further  substrings  from  the  subject may be picked out by
1860         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
1861         this is called "capturing" in what follows, and the  phrase  "capturing         this  is  called "capturing" in what follows, and the phrase "capturing
1862         subpattern"  is  used for a fragment of a pattern that picks out a sub-         subpattern" is used for a fragment of a pattern that picks out  a  sub-
1863         string. PCRE supports several other kinds of  parenthesized  subpattern         string.  PCRE  supports several other kinds of parenthesized subpattern
1864         that do not cause substrings to be captured.         that do not cause substrings to be captured.
1865    
1866         Captured  substrings are returned to the caller via a vector of integer         Captured substrings are returned to the caller via a vector of  integer
1867         offsets whose address is passed in ovector. The number of  elements  in         offsets  whose  address is passed in ovector. The number of elements in
1868         the  vector is passed in ovecsize, which must be a non-negative number.         the vector is passed in ovecsize, which must be a non-negative  number.
1869         Note: this argument is NOT the size of ovector in bytes.         Note: this argument is NOT the size of ovector in bytes.
1870    
1871         The first two-thirds of the vector is used to pass back  captured  sub-         The  first  two-thirds of the vector is used to pass back captured sub-
1872         strings,  each  substring using a pair of integers. The remaining third         strings, each substring using a pair of integers. The  remaining  third
1873         of the vector is used as workspace by pcre_exec() while  matching  cap-         of  the  vector is used as workspace by pcre_exec() while matching cap-
1874         turing  subpatterns, and is not available for passing back information.         turing subpatterns, and is not available for passing back  information.
1875         The length passed in ovecsize should always be a multiple of three.  If         The  length passed in ovecsize should always be a multiple of three. If
1876         it is not, it is rounded down.         it is not, it is rounded down.
1877    
1878         When  a  match  is successful, information about captured substrings is         When a match is successful, information about  captured  substrings  is
1879         returned in pairs of integers, starting at the  beginning  of  ovector,         returned  in  pairs  of integers, starting at the beginning of ovector,
1880         and  continuing  up  to two-thirds of its length at the most. The first         and continuing up to two-thirds of its length at the  most.  The  first
1881         element of a pair is set to the offset of the first character in a sub-         element of a pair is set to the offset of the first character in a sub-
1882         string,  and  the  second  is  set to the offset of the first character         string, and the second is set to the  offset  of  the  first  character
1883         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
1884         tor[1],  identify  the  portion  of  the  subject string matched by the         tor[1], identify the portion of  the  subject  string  matched  by  the
1885         entire pattern. The next pair is used for the first  capturing  subpat-         entire  pattern.  The next pair is used for the first capturing subpat-
1886         tern, and so on. The value returned by pcre_exec() is one more than the         tern, and so on. The value returned by pcre_exec() is one more than the
1887         highest numbered pair that has been set. For example, if two substrings         highest numbered pair that has been set. For example, if two substrings
1888         have  been captured, the returned value is 3. If there are no capturing         have been captured, the returned value is 3. If there are no  capturing
1889         subpatterns, the return value from a successful match is 1,  indicating         subpatterns,  the return value from a successful match is 1, indicating
1890         that just the first pair of offsets has been set.         that just the first pair of offsets has been set.
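
The pair layout described above can be read directly. The following sketch, with the hypothetical name capture_span, turns pair n into a pointer and a length. It assumes rc is the positive value returned by a successful match and treats a negative offset as a pair that was not set.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Locates captured substring n from the offset pairs in ovector.
   Stores the start pointer and the length and returns 1, or returns 0
   if n is outside 0..rc-1 or the pair holds negative offsets. */
int capture_span(const char *subject, const int *ovector, int rc, int n,
                 const char **start, int *len)
{
    if (subject == NULL || ovector == NULL || n < 0 || n >= rc)
        return 0;
    if (ovector[2 * n] < 0 || ovector[2 * n + 1] < 0)
        return 0;
    *start = subject + ovector[2 * n];               /* first character   */
    *len = ovector[2 * n + 1] - ovector[2 * n];      /* end minus start   */
    return 1;
}
```

Because the second offset is that of the first character after the substring, the length is a plain subtraction and no terminating zero is needed.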
1891    
1892         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1893         of the string that it matched that is returned.         of the string that it matched that is returned.
1894    
1895         If the vector is too small to hold all the captured substring  offsets,         If  the vector is too small to hold all the captured substring offsets,
1896         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
1897         function returns a value of zero. In particular, if the substring  off-         function  returns a value of zero. In particular, if the substring off-
1898         sets are not of interest, pcre_exec() may be called with ovector passed         sets are not of interest, pcre_exec() may be called with ovector passed
1899         as NULL and ovecsize as zero. However, if  the  pattern  contains  back         as  NULL  and  ovecsize  as zero. However, if the pattern contains back
1900         references  and  the  ovector is not big enough to remember the related         references and the ovector is not big enough to  remember  the  related
1901         substrings, PCRE has to get additional memory for use during  matching.         substrings,  PCRE has to get additional memory for use during matching.
1902         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1903    
1904         The  pcre_info()  function  can  be used to find out how many capturing         The pcre_info() function can be used to find  out  how  many  capturing
1905         subpatterns there are in a compiled  pattern.  The  smallest  size  for         subpatterns  there  are  in  a  compiled pattern. The smallest size for
1906         ovector  that  will allow for n captured substrings, in addition to the         ovector that will allow for n captured substrings, in addition  to  the
1907         offsets of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
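
The sizing rule is simple arithmetic. As a small sketch (the names min_ovecsize and usable_pairs are hypothetical), the first function gives the smallest ovecsize for n capturing subpatterns and the second gives the number of offset pairs that a vector of a given size can report, reflecting the rounding down to a multiple of three.

```c
/* Hypothetical helpers, not PCRE functions. n must be non-negative. */

/* Smallest ovecsize with room for n captured substrings plus the pair
   for the whole match: two-thirds for offsets, one-third workspace. */
int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that a vector of ovecsize ints can report.
   Integer division mirrors the rounding down to a multiple of three. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For example, a vector of 32 ints reports the same 10 pairs as a vector of 30.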
1908    
1909         It is possible for capturing subpattern number n+1 to match  some  part         It  is  possible for capturing subpattern number n+1 to match some part
1910         of the subject when subpattern n has not been used at all. For example,         of the subject when subpattern n has not been used at all. For example,
1911         if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the         if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
1912         return from the function is 4, and subpatterns 1 and 3 are matched, but         return from the function is 4, and subpatterns 1 and 3 are matched, but
1913         2 is not. When this happens, both values in  the  offset  pairs  corre-         2  is  not.  When  this happens, both values in the offset pairs corre-
1914         sponding to unused subpatterns are set to -1.         sponding to unused subpatterns are set to -1.
1915    
1916         Offset  values  that correspond to unused subpatterns at the end of the         Offset values that correspond to unused subpatterns at the end  of  the
1917         expression are also set to -1. For example,  if  the  string  "abc"  is         expression  are  also  set  to  -1. For example, if the string "abc" is
1918         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not         matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
1919         matched. The return from the function is 2, because  the  highest  used         matched.  The  return  from the function is 2, because the highest used
1920         capturing subpattern number is 1. However, you can refer to the offsets         capturing subpattern number is 1. However, you can refer to the offsets
1921         for the second and third capturing subpatterns if  you  wish  (assuming         for  the  second  and third capturing subpatterns if you wish (assuming
1922         the vector is large enough, of course).         the vector is large enough, of course).
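
Since unused pairs are set to -1 both in the middle and at the end, a caller can test any pair that fits in the vector, whatever the return value was. A minimal sketch, using the hypothetical name group_participated:

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Returns 1 if capturing subpattern n took part in the match, judged
   only by its offset pair; returns 0 if the pair is -1/-1 or does not
   fit in the first two-thirds of the vector. */
int group_participated(const int *ovector, int ovecsize, int n)
{
    int pairs = ovecsize / 3;   /* only this many pairs hold offsets */
    if (ovector == NULL || n < 0 || n >= pairs)
        return 0;
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

For the (a|(z))(bc) example this reports pairs 1 and 3 as set and pair 2 as unset. For (abc)(x(yz)?)? it reports pairs 2 and 3 as unset even though the return value was 2.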
1923    
1924         Some  convenience  functions  are  provided for extracting the captured         Some convenience functions are provided  for  extracting  the  captured
1925         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
1926    
1927     Error return values from pcre_exec()     Error return values from pcre_exec()
1928    
1929         If pcre_exec() fails, it returns a negative number. The  following  are         If  pcre_exec()  fails, it returns a negative number. The following are
1930         defined in the header file:         defined in the header file:
1931    
1932           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 1901  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1935  MATCHING A PATTERN: THE TRADITIONAL FUNC
1935    
1936           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
1937    
1938         Either  code  or  subject  was  passed as NULL, or ovector was NULL and         Either code or subject was passed as NULL,  or  ovector  was  NULL  and
1939         ovecsize was not zero.         ovecsize was not zero.
1940    
1941           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 1910  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1944  MATCHING A PATTERN: THE TRADITIONAL FUNC
1944    
1945           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1946    
1947         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE  stores a 4-byte "magic number" at the start of the compiled code,
1948         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
1949         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
1950         an  environment  with the other endianness. This is the error that PCRE         an environment with the other endianness. This is the error  that  PCRE
1951         gives when the magic number is not present.         gives when the magic number is not present.
1952    
1953           PCRE_ERROR_UNKNOWN_OPCODE (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
1954    
1955         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
1956         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled pattern. This error could be caused by a bug  in  PCRE  or  by
1957         overwriting of the compiled pattern.         overwriting of the compiled pattern.
1958    
1959           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1960    
1961         If a pattern contains back references, but the ovector that  is  passed         If  a  pattern contains back references, but the ovector that is passed
1962         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
1963         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE  gets  a  block of memory at the start of matching to use for this
1964         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose. If the call via pcre_malloc() fails, this error is given.  The
1965         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
1966    
1967           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1968    
1969         This error is used by the pcre_copy_substring(),  pcre_get_substring(),         This  error is used by the pcre_copy_substring(), pcre_get_substring(),
1970         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
1971         returned by pcre_exec().         returned by pcre_exec().
1972    
1973           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
1974    
1975         The backtracking limit, as specified by  the  match_limit  field  in  a         The  backtracking  limit,  as  specified  by the match_limit field in a
1976         pcre_extra  structure  (or  defaulted) was reached. See the description         pcre_extra structure (or defaulted) was reached.  See  the  description
1977         above.         above.
1978    
1979           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
1980    
1981         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
1982         use  by  callout functions that want to yield a distinctive error code.         use by callout functions that want to yield a distinctive  error  code.
1983         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
1984    
1985           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
1986    
1987         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A  string  that contains an invalid UTF-8 byte sequence was passed as a
1988         subject.         subject.
1989    
1990           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
1991    
1992         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
1993         value of startoffset did not point to the beginning of a UTF-8  charac-         value  of startoffset did not point to the beginning of a UTF-8 charac-
1994         ter.         ter.
1995    
1996           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
1997    
1998         The  subject  string did not match, but it did match partially. See the         The subject string did not match, but it did match partially.  See  the
1999         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
2000    
2001           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2002    
2003         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing         The  PCRE_PARTIAL  option  was  used with a compiled pattern containing
2004         items  that are not supported for partial matching. See the pcrepartial         items that are not supported for partial matching. See the  pcrepartial
2005         documentation for details of partial matching.         documentation for details of partial matching.
2006    
2007           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2008    
2009         An unexpected internal error has occurred. This error could  be  caused         An  unexpected  internal error has occurred. This error could be caused
2010         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2011    
2012           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
2013    
2014         This  error is given if the value of the ovecsize argument is negative.         This error is given if the value of the ovecsize argument is  negative.
2015    
2016           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
2017    
2018         The internal recursion limit, as specified by the match_limit_recursion         The internal recursion limit, as specified by the match_limit_recursion
2019         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field in a pcre_extra structure (or defaulted)  was  reached.  See  the
2020         description above.         description above.
2021    
2022           PCRE_ERROR_NULLWSLIMIT    (-22)           PCRE_ERROR_NULLWSLIMIT    (-22)
2023    
2024         When a group that can match an empty  substring  is  repeated  with  an         When  a  group  that  can  match an empty substring is repeated with an
2025         unbounded  upper  limit, the subject position at the start of the group         unbounded upper limit, the subject position at the start of  the  group
2026         must be remembered, so that a test for an empty string can be made when         must be remembered, so that a test for an empty string can be made when
2027         the  end  of the group is reached. Some workspace is required for this;         the end of the group is reached. Some workspace is required  for  this;
2028         if it runs out, this error is given.         if it runs out, this error is given.
2029    
2030           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
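
For diagnostics it can help to turn these numbers back into names. The sketch below covers only the codes and values shown in this section. The function name exec_error_name is hypothetical, and the numeric literals are used so that the sketch does not depend on the header file.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Maps the error values listed in this section to their names.
   Returns NULL for any other value, including non-negative returns. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -22: return "PCRE_ERROR_NULLWSLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    default:  return NULL;
    }
}
```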
# Line 2013  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2047  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2047         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2048              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2049    
2050         Captured substrings can be  accessed  directly  by  using  the  offsets         Captured  substrings  can  be  accessed  directly  by using the offsets
2051         returned  by  pcre_exec()  in  ovector.  For convenience, the functions         returned by pcre_exec() in  ovector.  For  convenience,  the  functions
2052         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2053         string_list()  are  provided for extracting captured substrings as new,         string_list() are provided for extracting captured substrings  as  new,
2054         separate, zero-terminated strings. These functions identify  substrings         separate,  zero-terminated strings. These functions identify substrings
2055         by  number.  The  next section describes functions for extracting named         by number. The next section describes functions  for  extracting  named
2056         substrings.         substrings.
2057    
2058         A substring that contains a binary zero is correctly extracted and  has         A  substring that contains a binary zero is correctly extracted and has
2059         a  further zero added on the end, but the result is not, of course, a C         a further zero added on the end, but the result is not, of course, a  C
2060         string.  However, you can process such a string  by  referring  to  the         string.   However,  you  can  process such a string by referring to the
2061         length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-         length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
2062         string().  Unfortunately, the interface to pcre_get_substring_list() is         string().  Unfortunately, the interface to pcre_get_substring_list() is
2063         not  adequate for handling strings containing binary zeros, because the         not adequate for handling strings containing binary zeros, because  the
2064         end of the final string is not independently indicated.         end of the final string is not independently indicated.
2065    
2066         The first three arguments are the same for all  three  of  these  func-         The  first  three  arguments  are the same for all three of these func-
2067         tions:  subject  is  the subject string that has just been successfully         tions: subject is the subject string that has  just  been  successfully
2068         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2069         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2070         were captured by the match, including the substring  that  matched  the         were  captured  by  the match, including the substring that matched the
2071         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
2072         it is greater than zero. If pcre_exec() returned zero, indicating  that         it  is greater than zero. If pcre_exec() returned zero, indicating that
2073         it  ran out of space in ovector, the value passed as stringcount should         it ran out of space in ovector, the value passed as stringcount  should
2074         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
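
The rule for choosing stringcount can be written as a small helper. This is a sketch under the assumptions stated in the paragraph above; the name effective_stringcount is hypothetical.

```c
/* Hypothetical helper, not a PCRE function.
   rc is the value returned by the match call and ovecsize is the
   number of ints in the vector that was passed to it. */
int effective_stringcount(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;             /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;   /* vector was too small: use what fits */
    return 0;                  /* negative rc: no match, nothing to extract */
}
```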
2075    
2076         The functions pcre_copy_substring() and pcre_get_substring() extract  a         The  functions pcre_copy_substring() and pcre_get_substring() extract a
2077         single  substring,  whose  number  is given as stringnumber. A value of         single substring, whose number is given as  stringnumber.  A  value  of
2078         zero extracts the substring that matched the  entire  pattern,  whereas         zero  extracts  the  substring that matched the entire pattern, whereas
2079         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
2080         string(), the string is placed in buffer,  whose  length  is  given  by         string(),  the  string  is  placed  in buffer, whose length is given by
2081         buffersize,  while  for  pcre_get_substring()  a new block of memory is         buffersize, while for pcre_get_substring() a new  block  of  memory  is
2082         obtained via pcre_malloc, and its address is  returned  via  stringptr.         obtained  via  pcre_malloc,  and its address is returned via stringptr.
2083         The  yield  of  the function is the length of the string, not including         The yield of the function is the length of the  string,  not  including
2084         the terminating zero, or one of these error codes:         the terminating zero, or one of these error codes:
2085    
2086           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2087    
2088         The buffer was too small for pcre_copy_substring(), or the  attempt  to         The  buffer  was too small for pcre_copy_substring(), or the attempt to
2089         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2090    
2091           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2092    
2093         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
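
The return convention described above can be modelled without the library. The sketch below is a hand-written model of the copying behaviour and is not the PCRE source. The names copy_substring_model, MODEL_ERROR_NOMEMORY, and MODEL_ERROR_NOSUBSTRING are hypothetical, with the two constants taking the values -6 and -7 given above.

```c
#include <string.h>

/* Hypothetical names; the values are those listed above. */
enum { MODEL_ERROR_NOMEMORY = -6, MODEL_ERROR_NOSUBSTRING = -7 };

/* Hand-written model, not a PCRE function.
   Copies substring stringnumber into buffer as a zero-terminated
   string and returns its length, or one of the two error values. */
int copy_substring_model(const char *subject, const int *ovector,
                         int stringcount, int stringnumber,
                         char *buffer, int buffersize)
{
    int start, len;
    if (stringnumber < 0 || stringnumber >= stringcount)
        return MODEL_ERROR_NOSUBSTRING;
    start = ovector[2 * stringnumber];
    len = ovector[2 * stringnumber + 1] - start;
    if (buffersize < len + 1)            /* need room for the final zero */
        return MODEL_ERROR_NOMEMORY;
    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

Because the length is returned separately from the data, a substring that contains binary zeros can still be processed by length rather than by scanning for the terminator.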
2094    
2095         The  pcre_get_substring_list()  function  extracts  all  available sub-         The pcre_get_substring_list()  function  extracts  all  available  sub-
2096         strings and builds a list of pointers to them. All this is  done  in  a         strings  and  builds  a list of pointers to them. All this is done in a
2097         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
2098         the memory block is returned via listptr, which is also  the  start  of         the  memory  block  is returned via listptr, which is also the start of
2099         the  list  of  string pointers. The end of the list is marked by a NULL         the list of string pointers. The end of the list is marked  by  a  NULL
2100         pointer. The yield of the function is zero if all  went  well,  or  the         pointer.  The  yield  of  the function is zero if all went well, or the
2101         error code         error code
2102    
2103           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2104    
2105         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
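
Since the list ends with a NULL pointer, it can be walked without knowing stringcount. A minimal sketch, using the hypothetical name count_substring_list and assuming that list is the pointer returned via listptr:

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Counts the entries in a NULL-terminated list of string pointers. */
int count_substring_list(const char **list)
{
    int n = 0;
    if (list == NULL)
        return 0;
    while (list[n] != NULL)
        n++;
    return n;
}
```

An empty string in the list is still a non-NULL pointer, so unset substrings are counted like any other entry.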
2106    
2107         When  any of these functions encounter a substring that is unset, which         When any of these functions encounter a substring that is unset,  which
2108         can happen when capturing subpattern number n+1 matches  some  part  of         can  happen  when  capturing subpattern number n+1 matches some part of
2109         the  subject, but subpattern n has not been used at all, they return an         the subject, but subpattern n has not been used at all, they return  an
2110         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
2111         string  by inspecting the appropriate offset in ovector, which is nega-         string by inspecting the appropriate offset in ovector, which is  nega-
2112         tive for unset substrings.         tive for unset substrings.
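
The distinction between an unset substring and a genuine zero-length one can be sketched as below. Both helper names are hypothetical and are not part of the PCRE API; the sketch assumes the usual ovector layout, in which substring n occupies the pair ovector[2*n] (start offset) and ovector[2*n+1] (end offset).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch only: both helpers are hypothetical, not PCRE functions.
 * Assumption: substring n is described by the pair ovector[2*n] (start)
 * and ovector[2*n+1] (end), and n is within the range that was filled in.
 */

/* Non-zero if substring n is unset, i.e. its offset is negative. */
static int is_unset_substring(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    return ovector[2 * n] < 0;
}

/* Non-zero if substring n is set but matched a zero-length string. */
static int is_empty_but_set(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    return ovector[2 * n] >= 0 && ovector[2 * n] == ovector[2 * n + 1];
}
```

Checking the offset before calling one of the extraction functions lets a caller treat "did not participate in the match" differently from "matched the empty string", which the returned empty string alone cannot convey.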
2113    
2114         The two convenience functions pcre_free_substring() and  pcre_free_sub-         The  two convenience functions pcre_free_substring() and pcre_free_sub-
2115         string_list()  can  be  used  to free the memory returned by a previous         string_list() can be used to free the memory  returned  by  a  previous
2116         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2117         tively.  They  do  nothing  more  than  call the function pointed to by         tively. They do nothing more than  call  the  function  pointed  to  by
2118         pcre_free, which of course could be called directly from a  C  program.         pcre_free,  which  of course could be called directly from a C program.
2119         However,  PCRE is used in some situations where it is linked via a spe-         However, PCRE is used in some situations where it is linked via a  spe-
2120         cial  interface  to  another  programming  language  that  cannot   use         cial   interface  to  another  programming  language  that  cannot  use
2121         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free directly; it is for these cases that the functions  are  pro-
2122         vided.         vided.
2123    
2124    
# Line 2103  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2137  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2137              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2138              const char **stringptr);              const char **stringptr);
2139    
2140         To extract a substring by name, you first have to find the  associated         To  extract a substring by name, you first have to find the associated
2141         number. For example, for this pattern         number. For example, for this pattern
2142    
2143           (a+)b(?<xxx>\d+)...           (a+)b(?<xxx>\d+)...
# Line 2112  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2146  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2146         be unique (PCRE_DUPNAMES was not set), you can find the number from the         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2147         name by calling pcre_get_stringnumber(). The first argument is the com-         name by calling pcre_get_stringnumber(). The first argument is the com-
2148         piled pattern, and the second is the name. The yield of the function is         piled pattern, and the second is the name. The yield of the function is
2149         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no         the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no
2150         subpattern of that name.         subpattern of that name.
2151    
2152         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2153         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
2154         are also two functions that do the whole job.         are also two functions that do the whole job.
2155    
2156         Most   of   the   arguments    of    pcre_copy_named_substring()    and         Most    of    the    arguments   of   pcre_copy_named_substring()   and
2157         pcre_get_named_substring()  are  the  same  as  those for the similarly         pcre_get_named_substring() are the same  as  those  for  the  similarly
2158         named functions that extract by number. As these are described  in  the         named  functions  that extract by number. As these are described in the
2159         previous  section,  they  are not re-described here. There are just two         previous section, they are not re-described here. There  are  just  two
2160         differences:         differences:
2161    
2162         First, instead of a substring number, a substring name is  given.  Sec-         First,  instead  of a substring number, a substring name is given. Sec-
2163         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2164         to the compiled pattern. This is needed in order to gain access to  the         to  the compiled pattern. This is needed in order to gain access to the
2165         name-to-number translation table.         name-to-number translation table.
2166    
2167         These  functions call pcre_get_stringnumber(), and if it succeeds, they         These functions call pcre_get_stringnumber(), and if it succeeds,  they
2168         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
2169         ate.         ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the
2170           behaviour may not be what you want (see the next section).
2171    
2172    
2173  DUPLICATE SUBPATTERN NAMES  DUPLICATE SUBPATTERN NAMES
# Line 2351  AUTHOR Line 2386  AUTHOR
2386    
2387  REVISION  REVISION
2388    
2389         Last updated: 06 March 2007         Last updated: 16 April 2007
2390         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2391  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2392    
# Line 2904  BACKSLASH Line 2939  BACKSLASH
2939         is a letter or digit. The definition of  letters  and  digits  is  con-         is a letter or digit. The definition of  letters  and  digits  is  con-
2940         trolled  by PCRE's low-valued character tables, and may vary if locale-         trolled  by PCRE's low-valued character tables, and may vary if locale-
2941         specific matching is taking place (see "Locale support" in the  pcreapi         specific matching is taking place (see "Locale support" in the  pcreapi
2942         page).  For  example,  in  the  "fr_FR" (French) locale, some character         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
2943         codes greater than 128 are used for accented  letters,  and  these  are         systems, or "french" in Windows, some character codes greater than  128
2944         matched by \w.         are used for accented letters, and these are matched by \w.
2945    
2946         In  UTF-8 mode, characters with values greater than 128 never match \d,         In  UTF-8 mode, characters with values greater than 128 never match \d,
2947         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
# Line 3275  SQUARE BRACKETS AND CHARACTER CLASSES Line 3310  SQUARE BRACKETS AND CHARACTER CLASSES
3310         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
3311         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
3312         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
3313         character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches         character tables for a French locale are in  use,  [\xc8-\xcb]  matches
3314         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
3315         concept of case for characters with values greater than 128  only  when         concept of case for characters with values greater than 128  only  when
3316         it is compiled with Unicode property support.         it is compiled with Unicode property support.
# Line 4503  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4538  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4538         not always produce exactly the same result as matching over one  single         not always produce exactly the same result as matching over one  single
4539         long  string.   The  difference arises when there are multiple matching         long  string.   The  difference arises when there are multiple matching
4540         possibilities, because a partial match result is given only when  there         possibilities, because a partial match result is given only when  there
4541         are  no  completed  matches  in a call to fBpcre_dfa_exec(). This means         are  no completed matches in a call to pcre_dfa_exec(). This means that
4542         that as soon as the shortest match has been found,  continuation  to  a         as soon as the shortest match has been found,  continuation  to  a  new
4543         new  subject  segment  is  no  longer possible.  Consider this pcretest         subject segment is no longer possible.  Consider this pcretest example:
        example:  
4544    
4545             re> /dog(sbody)?/             re> /dog(sbody)?/
4546           data> do\P\D           data> do\P\D
