Diff of /code/trunk/doc/pcre.txt

revision 155 by ph10, Tue Apr 24 13:36:11 2007 UTC revision 172 by ph10, Tue Jun 5 10:40:13 2007 UTC
# Line 610  THE ALTERNATIVE MATCHING ALGORITHM Line 610  THE ALTERNATIVE MATCHING ALGORITHM
610         ence  as  the  condition or test for a specific group recursion are not         ence  as  the  condition or test for a specific group recursion are not
611         supported.         supported.
612    
613         5. Callouts are supported, but the value of the  capture_top  field  is         5. Because many paths through the tree may be  active,  the  \K  escape
614           sequence, which resets the start of the match when encountered (but may
615           be on some paths and not on others), is not  supported.  It  causes  an
616           error if encountered.
617    
618           6.  Callouts  are  supported, but the value of the capture_top field is
619         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
620    
621         6.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The \C escape sequence, which (in the standard algorithm) matches a
622         single byte, even in UTF-8 mode, is not supported because the  alterna-         single  byte, even in UTF-8 mode, is not supported because the alterna-
623         tive  algorithm  moves  through  the  subject string one character at a         tive algorithm moves through the subject  string  one  character  at  a
624         time, for all active paths through the tree.         time, for all active paths through the tree.
625    
626    
627  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
628    
629         Using the alternative matching algorithm provides the following  advan-         Using  the alternative matching algorithm provides the following advan-
630         tages:         tages:
631    
632         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
633         ically found, and in particular, the longest match is  found.  To  find         ically  found,  and  in particular, the longest match is found. To find
634         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
635         things with callouts.         things with callouts.
636    
637         2. There is much better support for partial matching. The  restrictions         2.  There is much better support for partial matching. The restrictions
638         on  the content of the pattern that apply when using the standard algo-         on the content of the pattern that apply when using the standard  algo-
639         rithm for partial matching do not apply to the  alternative  algorithm.         rithm  for  partial matching do not apply to the alternative algorithm.
640         For  non-anchored patterns, the starting position of a partial match is         For non-anchored patterns, the starting position of a partial match  is
641         available.         available.
642    
643         3. Because the alternative algorithm  scans  the  subject  string  just         3.  Because  the  alternative  algorithm  scans the subject string just
644         once,  and  never  needs to backtrack, it is possible to pass very long         once, and never needs to backtrack, it is possible to  pass  very  long
645         subject strings to the matching function in  several  pieces,  checking         subject  strings  to  the matching function in several pieces, checking
646         for partial matching each time.         for partial matching each time.
647    
648    
# Line 645  DISADVANTAGES OF THE ALTERNATIVE ALGORIT Line 650  DISADVANTAGES OF THE ALTERNATIVE ALGORIT
650    
651         The alternative algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
652    
653         1.  It  is  substantially  slower  than the standard algorithm. This is         1. It is substantially slower than  the  standard  algorithm.  This  is
654         partly because it has to search for all possible matches, but  is  also         partly  because  it has to search for all possible matches, but is also
655         because it is less susceptible to optimization.         because it is less susceptible to optimization.
656    
657         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
# Line 664  AUTHOR Line 669  AUTHOR
669    
670  REVISION  REVISION
671    
672         Last updated: 06 March 2007         Last updated: 29 May 2007
673         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
674  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
675    
# Line 1471  INFORMATION ABOUT A PATTERN Line 1476  INFORMATION ABOUT A PATTERN
1476         returned. The fourth argument should point to an unsigned char *  vari-         returned. The fourth argument should point to an unsigned char *  vari-
1477         able.         able.
1478    
1479             PCRE_INFO_JCHANGED
1480    
1481           Return  1  if the (?J) option setting is used in the pattern, otherwise
1482           0. The fourth argument should point to an int variable. The (?J) inter-
1483           nal option setting changes the local PCRE_DUPNAMES value.
1484    
1485           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1486    
1487         Return  the  value of the rightmost literal byte that must exist in any         Return  the  value of the rightmost literal byte that must exist in any
# Line 1525  INFORMATION ABOUT A PATTERN Line 1536  INFORMATION ABOUT A PATTERN
1536         name-to-number map, remember that the length of the entries  is  likely         name-to-number map, remember that the length of the entries  is  likely
1537         to be different for each compiled pattern.         to be different for each compiled pattern.
1538    
1539             PCRE_INFO_OKPARTIAL
1540    
1541           Return  1 if the pattern can be used for partial matching, otherwise 0.
1542           The fourth argument should point to an int  variable.  The  pcrepartial
1543           documentation  lists  the restrictions that apply to patterns when par-
1544           tial matching is used.
1545    
1546           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1547    
1548         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1549         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1550         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1551         by any top-level option settings within the pattern itself.         by any top-level option settings within the pattern itself.
1552    
1553         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1554         alternatives begin with one of the following:         alternatives begin with one of the following:
1555    
1556           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1546  INFORMATION ABOUT A PATTERN Line 1564  INFORMATION ABOUT A PATTERN
1564    
1565           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1566    
1567         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1568         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1569         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1570         size_t variable.         size_t variable.
# Line 1554  INFORMATION ABOUT A PATTERN Line 1572  INFORMATION ABOUT A PATTERN
1572           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1573    
1574         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1575         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1576         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1577         created by pcre_study(). The fourth argument should point to  a  size_t         created  by  pcre_study(). The fourth argument should point to a size_t
1578         variable.         variable.
1579    
1580    
# Line 1564  OBSOLETE INFO FUNCTION Line 1582  OBSOLETE INFO FUNCTION
1582    
1583         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1584    
1585         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1586         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1587         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1588         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1589         lowing negative numbers:         lowing negative numbers:
1590    
1591           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1592           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1593    
1594         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1595         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1596         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1597    
1598         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1599         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1600         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1601    
1602    
# Line 1586  REFERENCE COUNTS Line 1604  REFERENCE COUNTS
1604    
1605         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1606    
1607         The  pcre_refcount()  function is used to maintain a reference count in         The pcre_refcount() function is used to maintain a reference  count  in
1608         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1609         benefit  of  applications  that  operate  in an object-oriented manner,         benefit of applications that  operate  in  an  object-oriented  manner,
1610         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1611         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1612    
1613         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1614         zero.  It is changed only by calling this function, whose action is  to         zero.   It is changed only by calling this function, whose action is to
1615         add  the  adjust  value  (which may be positive or negative) to it. The         add the adjust value (which may be positive or  negative)  to  it.  The
1616         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1617         is  constrained to lie between 0 and 65535, inclusive. If the new value         is constrained to lie between 0 and 65535, inclusive. If the new  value
1618         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
1619    
1620         Except when it is zero, the reference count is not correctly  preserved         Except  when it is zero, the reference count is not correctly preserved
1621         if  a  pattern  is  compiled on one host and then transferred to a host         if a pattern is compiled on one host and then  transferred  to  a  host
1622         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
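The clamping rule described above can be modelled in a few lines of C. This is a sketch of the documented behaviour only, not the PCRE source: toy_refcount() and TOY_REFCOUNT_MAX are hypothetical names, and the count is held in a plain int rather than inside a compiled pattern block.

```c
/* Illustrative model of the documented clamping; not the PCRE source. */
#define TOY_REFCOUNT_MAX 65535

/* Add adjust (positive or negative) to *count, force the result into the
   range 0..65535, store it, and return the new value. */
int toy_refcount(int *count, int adjust)
{
    long long updated = (long long)*count + adjust;

    if (updated < 0)
        updated = 0;                    /* forced to the lower limit */
    if (updated > TOY_REFCOUNT_MAX)
        updated = TOY_REFCOUNT_MAX;     /* forced to the upper limit */

    *count = (int)updated;
    return *count;
}
```

Under this model, a call with an adjust value of zero returns the current count unchanged, so it can serve as a query.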
1623    
1624    
# Line 1610  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1628  MATCHING A PATTERN: THE TRADITIONAL FUNC
1628              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1629              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1630    
1631         The  function pcre_exec() is called to match a subject string against a         The function pcre_exec() is called to match a subject string against  a
1632         compiled pattern, which is passed in the code argument. If the  pattern         compiled  pattern, which is passed in the code argument. If the pattern
1633         has been studied, the result of the study should be passed in the extra         has been studied, the result of the study should be passed in the extra
1634         argument. This function is the main matching facility of  the  library,         argument.  This  function is the main matching facility of the library,
1635         and it operates in a Perl-like manner. For specialist use there is also         and it operates in a Perl-like manner. For specialist use there is also
1636         an alternative matching function, which is described below in the  sec-         an  alternative matching function, which is described below in the sec-
1637         tion about the pcre_dfa_exec() function.         tion about the pcre_dfa_exec() function.
1638    
1639         In  most applications, the pattern will have been compiled (and option-         In most applications, the pattern will have been compiled (and  option-
1640         ally studied) in the same process that calls pcre_exec().  However,  it         ally  studied)  in the same process that calls pcre_exec(). However, it
1641         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1642         later in different processes, possibly even on different hosts.  For  a         later  in  different processes, possibly even on different hosts. For a
1643         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1644    
1645         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1640  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1658  MATCHING A PATTERN: THE TRADITIONAL FUNC
1658    
1659     Extra data for pcre_exec()     Extra data for pcre_exec()
1660    
1661         If  the  extra argument is not NULL, it must point to a pcre_extra data         If the extra argument is not NULL, it must point to a  pcre_extra  data
1662         block. The pcre_study() function returns such a block (when it  doesn't         block.  The pcre_study() function returns such a block (when it doesn't
1663         return  NULL), but you can also create one for yourself, and pass addi-         return NULL), but you can also create one for yourself, and pass  addi-
1664         tional information in it. The pcre_extra block contains  the  following         tional  information  in it. The pcre_extra block contains the following
1665         fields (not necessarily in this order):         fields (not necessarily in this order):
1666    
1667           unsigned long int flags;           unsigned long int flags;
# Line 1653  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1671  MATCHING A PATTERN: THE TRADITIONAL FUNC
1671           void *callout_data;           void *callout_data;
1672           const unsigned char *tables;           const unsigned char *tables;
1673    
1674         The  flags  field  is a bitmap that specifies which of the other fields         The flags field is a bitmap that specifies which of  the  other  fields
1675         are set. The flag bits are:         are set. The flag bits are:
1676    
1677           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
# Line 1662  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1680  MATCHING A PATTERN: THE TRADITIONAL FUNC
1680           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1681           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1682    
1683         Other flag bits should be set to zero. The study_data field is  set  in         Other  flag  bits should be set to zero. The study_data field is set in
1684         the  pcre_extra  block  that is returned by pcre_study(), together with         the pcre_extra block that is returned by  pcre_study(),  together  with
1685         the appropriate flag bit. You should not set this yourself, but you may         the appropriate flag bit. You should not set this yourself, but you may
1686         add  to  the  block by setting the other fields and their corresponding         add to the block by setting the other fields  and  their  corresponding
1687         flag bits.         flag bits.
1688    
1689         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1690         a  vast amount of resources when running patterns that are not going to         a vast amount of resources when running patterns that are not going  to
1691         match, but which have a very large number  of  possibilities  in  their         match,  but  which  have  a very large number of possibilities in their
1692         search  trees.  The  classic  example  is  the  use of nested unlimited         search trees. The classic  example  is  the  use  of  nested  unlimited
1693         repeats.         repeats.
1694    
1695         Internally, PCRE uses a function called match() which it calls  repeat-         Internally,  PCRE uses a function called match() which it calls repeat-
1696         edly  (sometimes  recursively). The limit set by match_limit is imposed         edly (sometimes recursively). The limit set by match_limit  is  imposed
1697         on the number of times this function is called during  a  match,  which         on  the  number  of times this function is called during a match, which
1698         has  the  effect  of  limiting the amount of backtracking that can take         has the effect of limiting the amount of  backtracking  that  can  take
1699         place. For patterns that are not anchored, the count restarts from zero         place. For patterns that are not anchored, the count restarts from zero
1700         for each position in the subject string.         for each position in the subject string.
1701    
1702         The  default  value  for  the  limit can be set when PCRE is built; the         The default value for the limit can be set  when  PCRE  is  built;  the
1703         default default is 10 million, which handles all but the  most  extreme         default  default  is 10 million, which handles all but the most extreme
1704         cases. You  can  override  the  default by supplying pcre_exec() with a         cases. You can override the default  by  supplying pcre_exec()  with  a
1705         pcre_extra    block    in    which    match_limit    is    set,     and         pcre_extra     block    in    which    match_limit    is    set,    and
1706         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1707         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1708    
1709         The match_limit_recursion field is similar to match_limit, but  instead         The  match_limit_recursion field is similar to match_limit, but instead
1710         of limiting the total number of times that match() is called, it limits         of limiting the total number of times that match() is called, it limits
1711         the depth of recursion. The recursion depth is a  smaller  number  than         the  depth  of  recursion. The recursion depth is a smaller number than
1712         the  total number of calls, because not all calls to match() are recur-         the total number of calls, because not all calls to match() are  recur-
1713         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
1714    
1715         Limiting  the  recursion  depth  limits the amount of stack that can be         Limiting the recursion depth limits the amount of  stack  that  can  be
1716         used, or, when PCRE has been compiled to use memory on the heap instead         used, or, when PCRE has been compiled to use memory on the heap instead
1717         of the stack, the amount of heap memory that can be used.         of the stack, the amount of heap memory that can be used.
1718    
1719         The  default  value  for  match_limit_recursion can be set when PCRE is         The default value for match_limit_recursion can be  set  when  PCRE  is
1720         built; the default default  is  the  same  value  as  the  default  for         built;  the  default  default  is  the  same  value  as the default for
1721         match_limit. You can override the default by supplying pcre_exec() with         match_limit. You can override the default by supplying pcre_exec() with
1722         a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and         a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1723         PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1724         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
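The difference between the two limits can be seen in a toy backtracking matcher. The sketch below is not PCRE's match() function; toy_match(), toy_exec(), and the toy_limits structure are hypothetical names, and the matcher handles only literals, dot, and a postfix star, anchored at the start of the subject. It counts every call (the analogue of match_limit) and separately tracks the nesting depth (the analogue of match_limit_recursion).

```c
#include <stddef.h>

/* Toy stand-ins for the two limits; all names here are illustrative. */
typedef struct {
    unsigned long calls;        /* total calls made so far */
    unsigned long call_limit;   /* analogue of match_limit */
    unsigned long depth_limit;  /* analogue of match_limit_recursion */
    int limit_hit;              /* 0 = none, 1 = call limit, 2 = depth limit */
} toy_limits;

/* Try to match pattern p at the start of s. Returns 1 on a match, else 0.
   When a limit is exceeded, lim->limit_hit is set and 0 is returned. */
static int toy_match(const char *p, const char *s, unsigned long depth,
                     toy_limits *lim)
{
    if (++lim->calls > lim->call_limit) { lim->limit_hit = 1; return 0; }
    if (depth > lim->depth_limit) { lim->limit_hit = 2; return 0; }

    if (*p == '\0')
        return 1;                       /* end of pattern: success */

    if (p[1] == '*') {
        /* Greedy repeat: take the longest run, then back off one
           character at a time, retrying the rest of the pattern. */
        const char *t = s;
        while (*t != '\0' && (*p == '.' || *p == *t))
            t++;
        for (;;) {
            if (toy_match(p + 2, t, depth + 1, lim))
                return 1;
            if (lim->limit_hit || t == s)
                return 0;
            t--;
        }
    }

    if (*s != '\0' && (*p == '.' || *p == *s))
        return toy_match(p + 1, s + 1, depth + 1, lim);
    return 0;
}

/* Returns 1 for a match, 0 for no match, -1 if the call limit was
   exceeded, and -2 if the depth limit was exceeded. */
int toy_exec(const char *pattern, const char *subject,
             unsigned long call_limit, unsigned long depth_limit)
{
    toy_limits lim = { 0, call_limit, depth_limit, 0 };
    int rc = toy_match(pattern, subject, 1, &lim);

    if (lim.limit_hit == 1) return -1;
    if (lim.limit_hit == 2) return -2;
    return rc;
}
```

With the pattern a*a*a*a*b and a subject of twenty "a" characters, the nesting depth in this toy never exceeds five, yet the total number of calls runs into the thousands, which is why the depth figure is the smaller of the two.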
1725    
1726         The pcre_callout field is used in conjunction with the  "callout"  fea-         The  pcre_callout  field is used in conjunction with the "callout" fea-
1727         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1728    
1729         The  tables  field  is  used  to  pass  a  character  tables pointer to         The tables field  is  used  to  pass  a  character  tables  pointer  to
1730         pcre_exec(); this overrides the value that is stored with the  compiled         pcre_exec();  this overrides the value that is stored with the compiled
1731         pattern.  A  non-NULL value is stored with the compiled pattern only if         pattern. A non-NULL value is stored with the compiled pattern  only  if
1732         custom tables were supplied to pcre_compile() via  its  tableptr  argu-         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1733         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1734         PCRE's internal tables to be used. This facility is  helpful  when  re-         PCRE's  internal  tables  to be used. This facility is helpful when re-
1735         using  patterns  that  have been saved after compiling with an external         using patterns that have been saved after compiling  with  an  external
1736         set of tables, because the external tables  might  be  at  a  different         set  of  tables,  because  the  external tables might be at a different
1737         address  when  pcre_exec() is called. See the pcreprecompile documenta-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
1738         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
1739    
1740     Option bits for pcre_exec()     Option bits for pcre_exec()
1741    
1742         The unused bits of the options argument for pcre_exec() must  be  zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
1743         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1744         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1745         PCRE_PARTIAL.         PCRE_PARTIAL.
1746    
1747           PCRE_ANCHORED           PCRE_ANCHORED
1748    
1749         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1750         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1751         turned  out to be anchored by virtue of its contents, it cannot be made         turned out to be anchored by virtue of its contents, it cannot be  made
1752         unanchored at matching time.         unanchored at matching time.
1753    
1754           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 1739  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1757  MATCHING A PATTERN: THE TRADITIONAL FUNC
1757           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
1758           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1759    
1760         These options override  the  newline  definition  that  was  chosen  or         These  options  override  the  newline  definition  that  was chosen or
1761         defaulted  when the pattern was compiled. For details, see the descrip-         defaulted when the pattern was compiled. For details, see the  descrip-
1762         tion of pcre_compile()  above.  During  matching,  the  newline  choice         tion  of  pcre_compile()  above.  During  matching,  the newline choice
1763         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
1764         ters. It may also alter the way the match position is advanced after  a         ters.  It may also alter the way the match position is advanced after a
1765         match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,         match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,
1766         PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a  match  attempt         PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY is set, and a match attempt
1767         fails  when the current position is at a CRLF sequence, the match posi-         fails when the current position is at a CRLF sequence, the match  posi-
1768         tion is advanced by two characters instead of one, in other  words,  to         tion  is  advanced by two characters instead of one, in other words, to
1769         after the CRLF.         after the CRLF.
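The advance rule after a failed attempt can be sketched as follows. This is an illustration of the rule as stated above, not PCRE's code: toy_next_start() is a hypothetical helper, the crlf_is_newline flag stands in for any of the three newline settings that recognize CRLF, and the subject is assumed to contain single-byte characters.

```c
#include <stddef.h>

/* Return the next starting position after a failed match attempt at pos.
   If CRLF counts as a newline and pos sits on a CRLF pair, skip both
   characters; otherwise advance by one. */
size_t toy_next_start(const char *subject, size_t length, size_t pos,
                      int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;     /* move to just after the CRLF */
    return pos + 1;         /* ordinary single-character advance */
}
```

A lone CR at the very end of the subject is not a CRLF pair, so it is stepped over as an ordinary character.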
1770    
1771           PCRE_NOTBOL           PCRE_NOTBOL
1772    
1773         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
1774         the beginning of a line, so the  circumflex  metacharacter  should  not         the  beginning  of  a  line, so the circumflex metacharacter should not
1775         match  before it. Setting this without PCRE_MULTILINE (at compile time)         match before it. Setting this without PCRE_MULTILINE (at compile  time)
1776         causes circumflex never to match. This option affects only  the  behav-         causes  circumflex  never to match. This option affects only the behav-
1777         iour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
1778    
1779           PCRE_NOTEOL           PCRE_NOTEOL
1780    
1781         This option specifies that the end of the subject string is not the end         This option specifies that the end of the subject string is not the end
1782         of a line, so the dollar metacharacter should not match it nor  (except         of  a line, so the dollar metacharacter should not match it nor (except
1783         in  multiline mode) a newline immediately before it. Setting this with-         in multiline mode) a newline immediately before it. Setting this  with-
1784         out PCRE_MULTILINE (at compile time) causes dollar never to match. This         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1785         option  affects only the behaviour of the dollar metacharacter. It does         option affects only the behaviour of the dollar metacharacter. It  does
1786         not affect \Z or \z.         not affect \Z or \z.
1787    
1788           PCRE_NOTEMPTY           PCRE_NOTEMPTY
1789    
1790         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
1791         set.  If  there are alternatives in the pattern, they are tried. If all         set. If there are alternatives in the pattern, they are tried.  If  all
1792         the alternatives match the empty string, the entire  match  fails.  For         the  alternatives  match  the empty string, the entire match fails. For
1793         example, if the pattern         example, if the pattern
1794    
1795           a?b?           a?b?
1796    
1797         is  applied  to  a string not beginning with "a" or "b", it matches the         is applied to a string not beginning with "a" or "b",  it  matches  the
1798         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
1799         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
1800         rences of "a" or "b".         rences of "a" or "b".
1801    
1802         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1803         cial  case  of  a  pattern match of the empty string within its split()         cial case of a pattern match of the empty  string  within  its  split()
1804         function, and when using the /g modifier. It  is  possible  to  emulate         function,  and  when  using  the /g modifier. It is possible to emulate
1805         Perl's behaviour after matching a null string by first trying the match         Perl's behaviour after matching a null string by first trying the match
1806         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1807         if  that  fails by advancing the starting offset (see below) and trying         if that fails by advancing the starting offset (see below)  and  trying
1808         an ordinary match again. There is some code that demonstrates how to do         an ordinary match again. There is some code that demonstrates how to do
1809         this in the pcredemo.c sample program.         this in the pcredemo.c sample program.
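
The retry-then-advance procedure described above can be sketched as follows. This is only an illustration under stated assumptions: toy_match() is a hypothetical stand-in written for the single pattern a?b? and is not part of PCRE, and TOY_NOTEMPTY and TOY_ANCHORED merely play the roles of PCRE_NOTEMPTY and PCRE_ANCHORED. The pcredemo.c program remains the reference for doing this with pcre_exec().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for PCRE_NOTEMPTY and PCRE_ANCHORED. */
#define TOY_NOTEMPTY 1
#define TOY_ANCHORED 2

/* Toy matcher for the pattern a?b? only (NOT PCRE). On success it returns 1
   and stores the match start and end offsets; otherwise it returns 0. */
int toy_match(const char *s, size_t len, size_t start, int flags,
              size_t *ms, size_t *me)
{
    for (size_t i = start; i <= len; i++) {
        size_t j = i;
        if (j < len && s[j] == 'a') j++;
        if (j < len && s[j] == 'b') j++;
        /* An empty match is rejected when TOY_NOTEMPTY is set. */
        if (j > i || !(flags & TOY_NOTEMPTY)) { *ms = i; *me = j; return 1; }
        /* An anchored attempt tries the starting offset only. */
        if (flags & TOY_ANCHORED) break;
    }
    return 0;
}

/* Counts all matches in s, handling empty matches as described above. */
int count_all_matches(const char *s)
{
    size_t len = strlen(s), start = 0, ms = 0, me = 0;
    int flags = 0, count = 0;

    for (;;) {
        if (!toy_match(s, len, start, flags, &ms, &me)) {
            if (flags == 0) break;   /* ordinary match failed: finished */
            start++;                 /* retry after empty match failed: advance */
            flags = 0;
            if (start > len) break;
            continue;
        }
        count++;
        start = me;
        /* After an empty match, retry at the same offset, non-empty and anchored. */
        flags = (ms == me) ? (TOY_NOTEMPTY | TOY_ANCHORED) : 0;
    }
    return count;
}
```

For the subject "xab" this loop finds an empty match at offset 0, then "ab", then an empty match at the end of the subject.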
1810    
1811           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1812    
1813         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
1814         UTF-8 string is automatically checked when pcre_exec() is  subsequently         UTF-8  string is automatically checked when pcre_exec() is subsequently
1815         called.   The  value  of  startoffset is also checked to ensure that it         called.  The value of startoffset is also checked  to  ensure  that  it
1816         points to the start of a UTF-8 character. If an invalid UTF-8  sequence         points  to the start of a UTF-8 character. If an invalid UTF-8 sequence
1817         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1818         startoffset contains an  invalid  value,  PCRE_ERROR_BADUTF8_OFFSET  is         startoffset  contains  an  invalid  value, PCRE_ERROR_BADUTF8_OFFSET is
1819         returned.         returned.
1820    
1821         If  you  already  know that your subject is valid, and you want to skip         If you already know that your subject is valid, and you  want  to  skip
1822         these   checks   for   performance   reasons,   you   can    set    the         these    checks    for   performance   reasons,   you   can   set   the
1823         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
1824         do this for the second and subsequent calls to pcre_exec() if  you  are         do  this  for the second and subsequent calls to pcre_exec() if you are
1825         making  repeated  calls  to  find  all  the matches in a single subject         making repeated calls to find all  the  matches  in  a  single  subject
1826         string. However, you should be  sure  that  the  value  of  startoffset         string.  However,  you  should  be  sure  that the value of startoffset
1827         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is         points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is
1828         set, the effect of passing an invalid UTF-8 string as a subject,  or  a         set,  the  effect of passing an invalid UTF-8 string as a subject, or a
1829         value  of startoffset that does not point to the start of a UTF-8 char-         value of startoffset that does not point to the start of a UTF-8  char-
1830         acter, is undefined. Your program may crash.         acter, is undefined. Your program may crash.
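
Before setting PCRE_NO_UTF8_CHECK, a caller may want a cheap test of its own that startoffset lands on a character start. The sketch below is one way to do that; is_utf8_char_start() is a hypothetical helper, not a PCRE function. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx, and it does not validate the subject as a whole.

```c
#include <stddef.h>

/* Returns 1 if offset is within the subject and does not point at a UTF-8
   continuation byte (10xxxxxx); returns 0 otherwise. Hypothetical helper:
   it does NOT check that the whole subject is valid UTF-8. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length) return 0;
    if (offset == length) return 1;            /* the end of the subject */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

For the six-byte subject "h\xC3\xA9llo", offsets 1 and 3 pass this test, while offset 2 (the second byte of the two-byte character) does not.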
1831    
1832           PCRE_PARTIAL           PCRE_PARTIAL
1833    
1834         This option turns on the  partial  matching  feature.  If  the  subject         This  option  turns  on  the  partial  matching feature. If the subject
1835         string  fails to match the pattern, but at some point during the match-         string fails to match the pattern, but at some point during the  match-
1836         ing process the end of the subject was reached (that  is,  the  subject         ing  process  the  end of the subject was reached (that is, the subject
1837         partially  matches  the  pattern and the failure to match occurred only         partially matches the pattern and the failure to  match  occurred  only
1838         because there were not enough subject characters), pcre_exec()  returns         because  there were not enough subject characters), pcre_exec() returns
1839         PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is
1840         used, there are restrictions on what may appear in the  pattern.  These         used,  there  are restrictions on what may appear in the pattern. These
1841         are discussed in the pcrepartial documentation.         are discussed in the pcrepartial documentation.
1842    
1843     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
1844    
1845         The  subject string is passed to pcre_exec() as a pointer in subject, a         The subject string is passed to pcre_exec() as a pointer in subject,  a
1846         length in length, and a starting byte offset in startoffset.  In  UTF-8         length  in  length, and a starting byte offset in startoffset. In UTF-8
1847         mode,  the  byte  offset  must point to the start of a UTF-8 character.         mode, the byte offset must point to the start  of  a  UTF-8  character.
1848         Unlike the pattern string, the subject may contain binary  zero  bytes.         Unlike  the  pattern string, the subject may contain binary zero bytes.
1849         When  the starting offset is zero, the search for a match starts at the         When the starting offset is zero, the search for a match starts at  the
1850         beginning of the subject, and this is by far the most common case.         beginning of the subject, and this is by far the most common case.
1851    
1852         A non-zero starting offset is useful when searching for  another  match         A  non-zero  starting offset is useful when searching for another match
1853         in  the same subject by calling pcre_exec() again after a previous suc-         in the same subject by calling pcre_exec() again after a previous  suc-
1854         cess.  Setting startoffset differs from just passing over  a  shortened         cess.   Setting  startoffset differs from just passing over a shortened
1855         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
1856         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
1857    
1858           \Biss\B           \Biss\B
1859    
1860         which finds occurrences of "iss" in the middle of  words.  (\B  matches         which  finds  occurrences  of "iss" in the middle of words. (\B matches
1861         only  if  the  current position in the subject is not a word boundary.)         only if the current position in the subject is not  a  word  boundary.)
1862         When applied to the string "Mississippi" the first call to  pcre_exec()         When applied  to the string "Mississippi" the first call to pcre_exec()
1863         finds  the  first  occurrence. If pcre_exec() is called again with just         finds the first occurrence. If pcre_exec() is called  again  with  just
1864         the remainder of the subject, namely  "issippi",  it  does  not  match,         the remainder  of  the  subject,  namely  "issippi", it does not match,
1865         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
1866         to be a word boundary. However, if pcre_exec()  is  passed  the  entire         to  be  a  word  boundary. However, if pcre_exec() is passed the entire
1867         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
1868         rence of "iss" because it is able to look behind the starting point  to         rence  of "iss" because it is able to look behind the starting point to
1869         discover that it is preceded by a letter.         discover that it is preceded by a letter.
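
The difference between passing a shortened string and passing a starting offset can be illustrated without PCRE. The following sketch hand-codes a search equivalent to \Biss\B for ASCII subjects; find_inner_iss() and its helpers are hypothetical names, and the word-character test (letters, digits, underscore) is an assumption of the sketch.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* 1 if the character at pos is a word character; positions outside the
   subject count as non-word, so both ends of the subject act as boundaries
   when the adjacent character is a word character. */
int is_word(const char *s, size_t len, long pos)
{
    if (pos < 0 || (size_t)pos >= len) return 0;
    return isalnum((unsigned char)s[pos]) || s[pos] == '_';
}

/* \B holds at pos when the characters on either side are of the same kind. */
int not_word_boundary(const char *s, size_t len, size_t pos)
{
    return is_word(s, len, (long)pos - 1) == is_word(s, len, (long)pos);
}

/* Hand-coded equivalent of \Biss\B: returns the offset of the first match
   at or after startoffset, or -1 if there is none. */
long find_inner_iss(const char *s, size_t len, size_t startoffset)
{
    for (size_t i = startoffset; i + 3 <= len; i++) {
        if (memcmp(s + i, "iss", 3) == 0 &&
            not_word_boundary(s, len, i) &&
            not_word_boundary(s, len, i + 3))
            return (long)i;
    }
    return -1;
}
```

Searching the whole subject from offset 4 can look back at the preceding "s" and succeeds; searching the shortened string from offset 0 cannot, and fails.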
1870    
1871         If  a  non-zero starting offset is passed when the pattern is anchored,         If a non-zero starting offset is passed when the pattern  is  anchored,
1872         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
1873         if  the  pattern  does  not require the match to be at the start of the         if the pattern does not require the match to be at  the  start  of  the
1874         subject.         subject.
1875    
1876     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
1877    
1878         In general, a pattern matches a certain portion of the subject, and  in         In  general, a pattern matches a certain portion of the subject, and in
1879         addition,  further  substrings  from  the  subject may be picked out by         addition, further substrings from the subject  may  be  picked  out  by
1880         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
1881         this  is  called "capturing" in what follows, and the phrase "capturing         this is called "capturing" in what follows, and the  phrase  "capturing
1882         subpattern" is used for a fragment of a pattern that picks out  a  sub-         subpattern"  is  used for a fragment of a pattern that picks out a sub-
1883         string.  PCRE  supports several other kinds of parenthesized subpattern         string. PCRE supports several other kinds of  parenthesized  subpattern
1884         that do not cause substrings to be captured.         that do not cause substrings to be captured.
1885    
1886         Captured substrings are returned to the caller via a vector of  integer         Captured  substrings are returned to the caller via a vector of integer
1887         offsets  whose  address is passed in ovector. The number of elements in         offsets whose address is passed in ovector. The number of  elements  in
1888         the vector is passed in ovecsize, which must be a non-negative  number.         the  vector is passed in ovecsize, which must be a non-negative number.
1889         Note: this argument is NOT the size of ovector in bytes.         Note: this argument is NOT the size of ovector in bytes.
1890    
1891         The  first  two-thirds of the vector is used to pass back captured sub-         The first two-thirds of the vector is used to pass back  captured  sub-
1892         strings, each substring using a pair of integers. The  remaining  third         strings,  each  substring using a pair of integers. The remaining third
1893         of  the  vector is used as workspace by pcre_exec() while matching cap-         of the vector is used as workspace by pcre_exec() while  matching  cap-
1894         turing subpatterns, and is not available for passing back  information.         turing  subpatterns, and is not available for passing back information.
1895         The  length passed in ovecsize should always be a multiple of three. If         The length passed in ovecsize should always be a multiple of three.  If
1896         it is not, it is rounded down.         it is not, it is rounded down.
1897    
1898         When a match is successful, information about  captured  substrings  is         When  a  match  is successful, information about captured substrings is
1899         returned  in  pairs  of integers, starting at the beginning of ovector,         returned in pairs of integers, starting at the  beginning  of  ovector,
1900         and continuing up to two-thirds of its length at the  most.  The  first         and  continuing  up  to two-thirds of its length at the most. The first
1901         element of a pair is set to the offset of the first character in a sub-         element of a pair is set to the offset of the first character in a sub-
1902         string, and the second is set to the  offset  of  the  first  character         string,  and  the  second  is  set to the offset of the first character
1903         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-
1904         tor[1], identify the portion of  the  subject  string  matched  by  the         tor[1],  identify  the  portion  of  the  subject string matched by the
1905         entire  pattern.  The next pair is used for the first capturing subpat-         entire pattern. The next pair is used for the first  capturing  subpat-
1906         tern, and so on. The value returned by pcre_exec() is one more than the         tern, and so on. The value returned by pcre_exec() is one more than the
1907         highest numbered pair that has been set. For example, if two substrings         highest numbered pair that has been set. For example, if two substrings
1908         have been captured, the returned value is 3. If there are no  capturing         have  been captured, the returned value is 3. If there are no capturing
1909         subpatterns,  the return value from a successful match is 1, indicating         subpatterns, the return value from a successful match is 1,  indicating
1910         that just the first pair of offsets has been set.         that just the first pair of offsets has been set.
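
The layout of the offset pairs can be made concrete with a small sketch. The data below is filled in by hand, as pcre_exec() would be expected to fill it for the hypothetical pattern (\w+)=(\w+) applied to "key=value"; pair_length() and print_pairs() are illustrative helpers, not PCRE functions.

```c
#include <stdio.h>

/* Hand-filled example: pair 0 is the whole match, pairs 1 and 2 are the
   two captured substrings; the last third of the vector is workspace. */
const char demo_subject[] = "key=value";
const int  demo_ovector[9] = { 0, 9,   0, 3,   4, 9,   0, 0, 0 };
const int  demo_rc = 3;   /* one more than the highest numbered pair set */

/* Length in bytes of the substring described by pair n. */
int pair_length(const int *ovector, int n)
{
    return ovector[2 * n + 1] - ovector[2 * n];
}

/* Prints each pair that was set, one per line. */
void print_pairs(const char *subject, const int *ovector, int rc)
{
    for (int n = 0; n < rc; n++)
        printf("%d: %.*s\n", n, pair_length(ovector, n),
               subject + ovector[2 * n]);
}
```

Calling print_pairs(demo_subject, demo_ovector, demo_rc) walks the first demo_rc pairs and never touches the workspace third of the vector.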
1911    
1912         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1913         of the string that it matched that is returned.         of the string that it matched that is returned.
1914    
1915         If  the vector is too small to hold all the captured substring offsets,         If the vector is too small to hold all the captured substring  offsets,
1916         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
1917         function  returns a value of zero. In particular, if the substring off-         function returns a value of zero. In particular, if the substring  off-
1918         sets are not of interest, pcre_exec() may be called with ovector passed         sets are not of interest, pcre_exec() may be called with ovector passed
1919         as  NULL  and  ovecsize  as zero. However, if the pattern contains back         as NULL and ovecsize as zero. However, if  the  pattern  contains  back
1920         references and the ovector is not big enough to  remember  the  related         references  and  the  ovector is not big enough to remember the related
1921         substrings,  PCRE has to get additional memory for use during matching.         substrings, PCRE has to get additional memory for use during  matching.
1922         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1923    
1924         The pcre_info() function can be used to find  out  how  many  capturing         The  pcre_info()  function  can  be used to find out how many capturing
1925         subpatterns  there  are  in  a  compiled pattern. The smallest size for         subpatterns there are in a compiled  pattern.  The  smallest  size  for
1926         ovector that will allow for n captured substrings, in addition  to  the         ovector  that  will allow for n captured substrings, in addition to the
1927         offsets of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
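
The sizing rule can be written down directly. In this sketch, min_ovecsize() and usable_pairs() are hypothetical helper names; the arithmetic follows the (n+1)*3 formula and the round-down-to-a-multiple-of-three rule given above.

```c
#include <assert.h>

/* Smallest ovecsize that allows for capture_count captured substrings
   plus the pair for the whole match: (n+1)*3. */
int min_ovecsize(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given ovecsize.
   Two-thirds of the vector (rounded down to a multiple of three) holds
   pairs of integers, which works out to ovecsize / 3 pairs. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For a pattern with two capturing subpatterns this gives an ovecsize of 9; a vector of 10 elements still provides only three pairs.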
1928    
1929         It  is  possible for capturing subpattern number n+1 to match some part         It is possible for capturing subpattern number n+1 to match  some  part
1930         of the subject when subpattern n has not been used at all. For example,         of the subject when subpattern n has not been used at all. For example,
1931         if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the         if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
1932         return from the function is 4, and subpatterns 1 and 3 are matched, but         return from the function is 4, and subpatterns 1 and 3 are matched, but
1933         2  is  not.  When  this happens, both values in the offset pairs corre-         2 is not. When this happens, both values in  the  offset  pairs  corre-
1934         sponding to unused subpatterns are set to -1.         sponding to unused subpatterns are set to -1.
1935    
1936         Offset values that correspond to unused subpatterns at the end  of  the         Offset  values  that correspond to unused subpatterns at the end of the
1937         expression  are  also  set  to  -1. For example, if the string "abc" is         expression are also set to -1. For example,  if  the  string  "abc"  is
1938         matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
1939         matched.  The  return  from the function is 2, because the highest used         matched. The return from the function is 2, because  the  highest  used
1940         capturing subpattern number is 1. However, you can refer to the offsets         capturing subpattern number is 1. However, you can refer to the offsets
1941         for  the  second  and third capturing subpatterns if you wish (assuming         for the second and third capturing subpatterns if  you  wish  (assuming
1942         the vector is large enough, of course).         the vector is large enough, of course).
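
A caller that needs to tell unused subpatterns from matched ones can test for the -1 values. The sketch below uses hand-filled vectors for the two examples above; group_is_set() is a hypothetical helper, and it assumes the vector is large enough to hold the pair being tested.

```c
/* "abc" against (a|(z))(bc): the return value is 4 and pair 2 is unused. */
const int demo_mid_unset[12]  = { 0, 3,   0, 1,   -1, -1,   1, 3,   0, 0, 0, 0 };

/* "abc" against (abc)(x(yz)?)?: the return value is 2 and the trailing
   pairs 2 and 3 are unused. */
const int demo_tail_unset[12] = { 0, 3,   0, 3,   -1, -1,   -1, -1,   0, 0, 0, 0 };

/* Returns 1 if pair n holds real offsets, 0 if it is set to -1.
   Assumes the vector has at least (n+1)*3 elements. */
int group_is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

Testing the pair itself, rather than comparing the subpattern number with the return value, covers both the unused-in-the-middle and the unused-at-the-end cases.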
1943    
1944         Some convenience functions are provided  for  extracting  the  captured         Some  convenience  functions  are  provided for extracting the captured
1945         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
1946    
1947     Error return values from pcre_exec()     Error return values from pcre_exec()
1948    
1949         If  pcre_exec()  fails, it returns a negative number. The following are         If pcre_exec() fails, it returns a negative number. The  following  are
1950         defined in the header file:         defined in the header file:
1951    
1952           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 1937  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1955  MATCHING A PATTERN: THE TRADITIONAL FUNC
1955    
1956           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
1957    
1958         Either code or subject was passed as NULL,  or  ovector  was  NULL  and         Either  code  or  subject  was  passed as NULL, or ovector was NULL and
1959         ovecsize was not zero.         ovecsize was not zero.
1960    
1961           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 1946  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1964  MATCHING A PATTERN: THE TRADITIONAL FUNC
1964    
1965           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1966    
1967         PCRE  stores a 4-byte "magic number" at the start of the compiled code,         PCRE stores a 4-byte "magic number" at the start of the compiled  code,
1968         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
1969         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
1970         an environment with the other endianness. This is the error  that  PCRE         an  environment  with the other endianness. This is the error that PCRE
1971         gives when the magic number is not present.         gives when the magic number is not present.
1972    
1973           PCRE_ERROR_UNKNOWN_OPCODE (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
1974    
1975         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
1976         compiled pattern. This error could be caused by a bug  in  PCRE  or  by         compiled  pattern.  This  error  could be caused by a bug in PCRE or by
1977         overwriting of the compiled pattern.         overwriting of the compiled pattern.
1978    
1979           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1980    
1981         If  a  pattern contains back references, but the ovector that is passed         If a pattern contains back references, but the ovector that  is  passed
1982         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
1983         PCRE  gets  a  block of memory at the start of matching to use for this         PCRE gets a block of memory at the start of matching to  use  for  this
1984         purpose. If the call via pcre_malloc() fails, this error is given.  The         purpose.  If the call via pcre_malloc() fails, this error is given. The
1985         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
1986    
1987           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1988    
1989         This  error is used by the pcre_copy_substring(), pcre_get_substring(),         This error is used by the pcre_copy_substring(),  pcre_get_substring(),
1990         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
1991         returned by pcre_exec().         returned by pcre_exec().
1992    
1993           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
1994    
1995         The  backtracking  limit,  as  specified  by the match_limit field in a         The backtracking limit, as specified by  the  match_limit  field  in  a
1996         pcre_extra structure (or defaulted) was reached.  See  the  description         pcre_extra  structure  (or  defaulted) was reached. See the description
1997         above.         above.
1998    
1999           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
2000    
2001         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
2002         use by callout functions that want to yield a distinctive  error  code.         use  by  callout functions that want to yield a distinctive error code.
2003         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
2004    
2005           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
2006    
2007         A  string  that contains an invalid UTF-8 byte sequence was passed as a         A string that contains an invalid UTF-8 byte sequence was passed  as  a
2008         subject.         subject.
2009    
2010           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
2011    
2012         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
2013         value  of startoffset did not point to the beginning of a UTF-8 charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
2014         ter.         ter.
2015    
2016           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
2017    
2018         The subject string did not match, but it did match partially.  See  the         The  subject  string did not match, but it did match partially. See the
2019         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
2020    
2021           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2022    
2023         The  PCRE_PARTIAL  option  was  used with a compiled pattern containing         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing
2024         items that are not supported for partial matching. See the  pcrepartial         items  that are not supported for partial matching. See the pcrepartial
2025         documentation for details of partial matching.         documentation for details of partial matching.
2026    
2027           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2028    
2029         An  unexpected  internal error has occurred. This error could be caused         An unexpected internal error has occurred. This error could  be  caused
2030         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2031    
2032           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
2033    
2034         This error is given if the value of the ovecsize argument is  negative.         This  error is given if the value of the ovecsize argument is negative.
2035    
2036           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
2037    
2038         The internal recursion limit, as specified by the match_limit_recursion         The internal recursion limit, as specified by the match_limit_recursion
2039         field in a pcre_extra structure (or defaulted)  was  reached.  See  the         field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2040         description above.         description above.
2041    
2042           PCRE_ERROR_NULLWSLIMIT    (-22)           PCRE_ERROR_NULLWSLIMIT    (-22)
2043    
2044         When  a  group  that  can  match an empty substring is repeated with an         When a group that can match an empty  substring  is  repeated  with  an
2045         unbounded upper limit, the subject position at the start of  the  group         unbounded  upper  limit, the subject position at the start of the group
2046         must be remembered, so that a test for an empty string can be made when         must be remembered, so that a test for an empty string can be made when
2047         the end of the group is reached. Some workspace is required  for  this;         the  end  of the group is reached. Some workspace is required for this;
2048         if it runs out, this error is given.         if it runs out, this error is given.
2049    
2050           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
# Line 2049  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2067  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2067         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2068              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2069    
2070         Captured  substrings  can  be  accessed  directly  by using the offsets         Captured substrings can be  accessed  directly  by  using  the  offsets
2071         returned by pcre_exec() in  ovector.  For  convenience,  the  functions         returned  by  pcre_exec()  in  ovector.  For convenience, the functions
2072         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2073         string_list() are provided for extracting captured substrings  as  new,         string_list()  are  provided for extracting captured substrings as new,
2074         separate,  zero-terminated strings. These functions identify substrings         separate, zero-terminated strings. These functions identify  substrings
2075         by number. The next section describes functions  for  extracting  named         by  number.  The  next section describes functions for extracting named
2076         substrings.         substrings.
2077    
2078         A  substring that contains a binary zero is correctly extracted and has         A substring that contains a binary zero is correctly extracted and  has
2079         a further zero added on the end, but the result is not, of course, a  C         a  further zero added on the end, but the result is not, of course, a C
2080         string.   However,  you  can  process such a string by referring to the         string.  However, you can process such a string  by  referring  to  the
2081         length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-         length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
2082         string().  Unfortunately, the interface to pcre_get_substring_list() is         string().  Unfortunately, the interface to pcre_get_substring_list() is
2083         not adequate for handling strings containing binary zeros, because  the         not  adequate for handling strings containing binary zeros, because the
2084         end of the final string is not independently indicated.         end of the final string is not independently indicated.
2085    
2086         The  first  three  arguments  are the same for all three of these func-         The first three arguments are the same for all  three  of  these  func-
2087         tions: subject is the subject string that has  just  been  successfully         tions:  subject  is  the subject string that has just been successfully
2088         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2089         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2090         were  captured  by  the match, including the substring that matched the         were captured by the match, including the substring  that  matched  the
2091         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
2092         it  is greater than zero. If pcre_exec() returned zero, indicating that         it is greater than zero. If pcre_exec() returned zero, indicating  that
2093         it ran out of space in ovector, the value passed as stringcount  should         it  ran out of space in ovector, the value passed as stringcount should
2094         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
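
The rule for choosing stringcount can be sketched as a small helper. The name stringcount_for() is hypothetical, and returning -1 for a negative (failed) result is a convention of this sketch only.

```c
/* Chooses the stringcount argument for the substring functions.
   rc       : the value returned by pcre_exec()
   ovecsize : the number of elements in ovector
   Returns -1 (a convention of this sketch) if rc indicates failure. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0) return rc;               /* normal case: use the return value */
    if (rc == 0) return ovecsize / 3;    /* the vector was too small */
    return -1;                           /* no match, or an error */
}
```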
2095    
2096         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The functions pcre_copy_substring() and pcre_get_substring() extract  a
2097         single substring, whose number is given as  stringnumber.  A  value  of         single  substring,  whose  number  is given as stringnumber. A value of
2098         zero  extracts  the  substring that matched the entire pattern, whereas         zero extracts the substring that matched the  entire  pattern,  whereas
2099         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
2100         string(),  the  string  is  placed  in buffer, whose length is given by         string(), the string is placed in buffer,  whose  length  is  given  by
2101         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize,  while  for  pcre_get_substring()  a new block of memory is
2102         obtained  via  pcre_malloc,  and its address is returned via stringptr.         obtained via pcre_malloc, and its address is  returned  via  stringptr.
2103         The yield of the function is the length of the  string,  not  including         The  yield  of  the function is the length of the string, not including
2104         the terminating zero, or one of these error codes:         the terminating zero, or one of these error codes:
2105    
2106           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2107    
2108         The  buffer  was too small for pcre_copy_substring(), or the attempt to         The buffer was too small for pcre_copy_substring(), or the  attempt  to
2109         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2110    
2111           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2112    
2113         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
2114    
2115         The pcre_get_substring_list()  function  extracts  all  available  sub-         The  pcre_get_substring_list()  function  extracts  all  available sub-
2116         strings  and  builds  a list of pointers to them. All this is done in a         strings and builds a list of pointers to them. All this is  done  in  a
2117         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
2118         the  memory  block  is returned via listptr, which is also the start of         the memory block is returned via listptr, which is also  the  start  of
2119         the list of string pointers. The end of the list is marked  by  a  NULL         the  list  of  string pointers. The end of the list is marked by a NULL
2120         pointer.  The  yield  of  the function is zero if all went well, or the         pointer. The yield of the function is zero if all  went  well,  or  the
2121         error code         error code
2122    
2123           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2124    
2125         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
2126    
2127         When any of these functions encounter a substring that is unset,  which         When  any of these functions encounter a substring that is unset, which
2128         can  happen  when  capturing subpattern number n+1 matches some part of         can happen when capturing subpattern number n+1 matches  some  part  of
2129         the subject, but subpattern n has not been used at all, they return  an         the  subject, but subpattern n has not been used at all, they return an
2130         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
2131         string by inspecting the appropriate offset in ovector, which is  nega-         string  by inspecting the appropriate offset in ovector, which is nega-
2132         tive for unset substrings.         tive for unset substrings.
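
A minimal sketch of that check follows. It looks only at the ovector contents, so it does not call into PCRE; the enum and the function name classify_substring() are hypothetical, and treating an out-of-range substring number as unset is an assumption of this sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of one captured substring. */
enum substring_state { SUBSTRING_UNSET, SUBSTRING_EMPTY, SUBSTRING_NONEMPTY };

/* Inspect the offset pair for substring n in the ovector filled in by
 * pcre_exec(). Negative offsets mean the subpattern was never used. */
static enum substring_state classify_substring(const int *ovector,
                                               int stringcount, int n)
{
    int start, end;

    assert(ovector != NULL);
    if (n < 0 || n >= stringcount)
        return SUBSTRING_UNSET;   /* assumption: out of range counts as unset */

    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start < 0 || end < 0)
        return SUBSTRING_UNSET;
    return (end == start) ? SUBSTRING_EMPTY : SUBSTRING_NONEMPTY;
}
```

A caller that receives an empty string from one of the extraction functions can use a check of this kind to decide whether the group matched nothing or did not take part in the match at all.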
2133    
2134         The  two convenience functions pcre_free_substring() and pcre_free_sub-         The two convenience functions pcre_free_substring() and  pcre_free_sub-
2135         string_list() can be used to free the memory  returned  by  a  previous         string_list()  can  be  used  to free the memory returned by a previous
2136         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2137         tively. They do nothing more than  call  the  function  pointed  to  by         tively.  They  do  nothing  more  than  call the function pointed to by
2138         pcre_free,  which  of course could be called directly from a C program.         pcre_free, which of course could be called directly from a  C  program.
2139         However, PCRE is used in some situations where it is linked via a  spe-         However,  PCRE is used in some situations where it is linked via a spe-
2140         cial   interface  to  another  programming  language  that  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
2141         pcre_free directly; it is for these cases that the functions  are  pro-         pcre_free  directly;  it is for these cases that the functions are pro-
2142         vided.         vided.
2143    
2144    
# Line 2139  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2157  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2157              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2158              const char **stringptr);              const char **stringptr);
2159    
2160         To  extract  a substring by name, you first have to find the associated         To extract a substring by name, you  first  have to find the associated
2161         number. For example, for this pattern         number. For example, for this pattern
2162    
2163           (a+)b(?<xxx>\d+)...           (a+)b(?<xxx>\d+)...
# Line 2148  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2166  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2166         be unique (PCRE_DUPNAMES was not set), you can find the number from the         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2167         name by calling pcre_get_stringnumber(). The first argument is the com-         name by calling pcre_get_stringnumber(). The first argument is the com-
2168         piled pattern, and the second is the name. The yield of the function is         piled pattern, and the second is the name. The yield of the function is
2169         the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2170         subpattern of that name.         subpattern of that name.
2171    
2172         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2173         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
2174         are also two functions that do the whole job.         are also two functions that do the whole job.
2175    
2176         Most    of    the    arguments   of   pcre_copy_named_substring()   and         Most   of   the   arguments    of    pcre_copy_named_substring()    and
2177         pcre_get_named_substring() are the same  as  those  for  the  similarly         pcre_get_named_substring()  are  the  same  as  those for the similarly
2178         named  functions  that extract by number. As these are described in the         named functions that extract by number. As these are described  in  the
2179         previous section, they are not re-described here. There  are  just  two         previous  section,  they  are not re-described here. There are just two
2180         differences:         differences:
2181    
2182         First,  instead  of a substring number, a substring name is given. Sec-         First, instead of a substring number, a substring name is  given.  Sec-
2183         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2184         to  the compiled pattern. This is needed in order to gain access to the         to the compiled pattern. This is needed in order to gain access to  the
2185         name-to-number translation table.         name-to-number translation table.
2186    
2187         These functions call pcre_get_stringnumber(), and if it succeeds,  they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
2188         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2189         ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2190         behaviour may not be what you want (see the next section).         behaviour may not be what you want (see the next section).
2191    
2192    
# Line 2177  DUPLICATE SUBPATTERN NAMES Line 2195  DUPLICATE SUBPATTERN NAMES
2195         int pcre_get_stringtable_entries(const pcre *code,         int pcre_get_stringtable_entries(const pcre *code,
2196              const char *name, char **first, char **last);              const char *name, char **first, char **last);
2197    
2198         When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for         When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
2199         subpatterns are not required to  be  unique.  Normally,  patterns  with         subpatterns  are  not  required  to  be unique. Normally, patterns with
2200         duplicate  names  are such that in any one match, only one of the named         duplicate names are such that in any one match, only one of  the  named
2201         subpatterns participates. An example is shown in the pcrepattern  docu-         subpatterns  participates. An example is shown in the pcrepattern docu-
2202         mentation. When duplicates are present, pcre_copy_named_substring() and         mentation. When duplicates are present, pcre_copy_named_substring() and
2203         pcre_get_named_substring() return the first substring corresponding  to         pcre_get_named_substring()  return the first substring corresponding to
2204         the  given  name  that  is  set.  If  none  are set, an empty string is         the given name that is set.  If  none  are  set,  an  empty  string  is
2205         returned.  The pcre_get_stringnumber() function returns one of the num-         returned.  The pcre_get_stringnumber() function returns one of the num-
2206         bers  that are associated with the name, but it is not defined which it         bers that are associated with the name, but it is not defined which  it
2207         is.         is.
2208    
2209         If you want to get full details of all captured substrings for a  given         If  you want to get full details of all captured substrings for a given
2210         name,  you  must  use  the pcre_get_stringtable_entries() function. The         name, you must use  the  pcre_get_stringtable_entries()  function.  The
2211         first argument is the compiled pattern, and the second is the name. The         first argument is the compiled pattern, and the second is the name. The
2212         third  and  fourth  are  pointers to variables which are updated by the         third and fourth are pointers to variables which  are  updated  by  the
2213         function. After it has run, they point to the first and last entries in         function. After it has run, they point to the first and last entries in
2214         the  name-to-number  table  for  the  given  name.  The function itself         the name-to-number table  for  the  given  name.  The  function  itself
2215         returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if         returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
2216         there  are none. The format of the table is described above in the sec-         there are none. The format of the table is described above in the  sec-
2217         tion entitled Information about a  pattern.   Given  all  the  relevant         tion  entitled  Information  about  a  pattern.  Given all the relevant
2218         entries  for the name, you can extract each of their numbers, and hence         entries for the name, you can extract each of their numbers, and  hence
2219         the captured data, if any.         the captured data, if any.
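
The walk over the table entries might look like the sketch below. It assumes the entry layout described in the section on information about a pattern (two bytes holding the subpattern number, most significant byte first, followed by the zero-terminated name). The helper names are hypothetical, and the first, last, and entry_size parameters stand for the values obtained from pcre_get_stringtable_entries().

```c
#include <assert.h>
#include <stddef.h>

/* Read the subpattern number stored in the first two bytes of a
 * name-table entry (most significant byte first). */
static int entry_group_number(const char *entry)
{
    const unsigned char *p = (const unsigned char *)entry;
    return (p[0] << 8) | p[1];
}

/* Hypothetical helper: collect the subpattern numbers of every entry
 * from first to last inclusive, stepping by entry_size bytes. Returns
 * the number of values stored in numbers[] (at most max_numbers). */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *numbers,
                                 int max_numbers)
{
    const char *entry;
    int count = 0;

    assert(numbers != NULL);
    if (first == NULL || last == NULL || entry_size <= 2)
        return 0;

    for (entry = first; entry <= last && count < max_numbers;
         entry += entry_size)
        numbers[count++] = entry_group_number(entry);
    return count;
}
```

Each number collected this way can then be used to index ovector, or passed to pcre_copy_substring() or pcre_get_substring(), to find which of the duplicate groups were actually set.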
2220    
2221    
2222  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2223    
2224         The traditional matching function uses a  similar  algorithm  to  Perl,         The  traditional  matching  function  uses a similar algorithm to Perl,
2225         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2226         the subject. If you want to find all possible matches, or  the  longest         the  subject.  If you want to find all possible matches, or the longest
2227         possible  match,  consider using the alternative matching function (see         possible match, consider using the alternative matching  function  (see
2228         below) instead. If you cannot use the alternative function,  but  still         below)  instead.  If you cannot use the alternative function, but still
2229         need  to  find all possible matches, you can kludge it up by making use         need to find all possible matches, you can kludge it up by  making  use
2230         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2231         tation.         tation.
2232    
2233         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2234         tern.  When your callout function is called, extract and save the  cur-         tern.   When your callout function is called, extract and save the cur-
2235         rent  matched  substring.  Then  return  1, which forces pcre_exec() to         rent matched substring. Then return  1,  which  forces  pcre_exec()  to
2236         backtrack and try other alternatives. Ultimately, when it runs  out  of         backtrack  and  try other alternatives. Ultimately, when it runs out of
2237         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2238    
2239    
# Line 2226  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2244  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2244              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2245              int *workspace, int wscount);              int *workspace, int wscount);
2246    
2247         The  function  pcre_dfa_exec()  is  called  to  match  a subject string         The function pcre_dfa_exec()  is  called  to  match  a  subject  string
2248         against a compiled pattern, using a matching algorithm that  scans  the         against  a  compiled pattern, using a matching algorithm that scans the
2249         subject  string  just  once, and does not backtrack. This has different         subject string just once, and does not backtrack.  This  has  different
2250         characteristics to the normal algorithm, and  is  not  compatible  with         characteristics  to  the  normal  algorithm, and is not compatible with
2251         Perl.  Some  of the features of PCRE patterns are not supported. Never-         Perl. Some of the features of PCRE patterns are not  supported.  Never-
2252         theless, there are times when this kind of matching can be useful.  For         theless,  there are times when this kind of matching can be useful. For
2253         a discussion of the two matching algorithms, see the pcrematching docu-         a discussion of the two matching algorithms, see the pcrematching docu-
2254         mentation.         mentation.
2255    
2256         The arguments for the pcre_dfa_exec() function  are  the  same  as  for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2257         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
2258         ent way, and this is described below. The other  common  arguments  are         ent  way,  and  this is described below. The other common arguments are
2259         used  in  the  same way as for pcre_exec(), so their description is not         used in the same way as for pcre_exec(), so their  description  is  not
2260         repeated here.         repeated here.
2261    
2262         The two additional arguments provide workspace for  the  function.  The         The  two  additional  arguments provide workspace for the function. The
2263         workspace  vector  should  contain at least 20 elements. It is used for         workspace vector should contain at least 20 elements. It  is  used  for
2264         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2265         workspace  will  be  needed for patterns and subjects where there are a         workspace will be needed for patterns and subjects where  there  are  a
2266         lot of potential matches.         lot of potential matches.
2267    
2268         Here is an example of a simple call to pcre_dfa_exec():         Here is an example of a simple call to pcre_dfa_exec():
# Line 2266  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2284  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2284    
2285     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2286    
2287         The unused bits of the options argument  for  pcre_dfa_exec()  must  be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2288         zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2289         LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,  PCRE_NO_UTF8_CHECK,         LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
2290         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2291         three of these are the same as for pcre_exec(), so their description is         three of these are the same as for pcre_exec(), so their description is
2292         not repeated here.         not repeated here.
2293    
2294           PCRE_PARTIAL           PCRE_PARTIAL
2295    
2296         This  has  the  same general effect as it does for pcre_exec(), but the         This has the same general effect as it does for  pcre_exec(),  but  the
2297         details  are  slightly  different.  When  PCRE_PARTIAL   is   set   for         details   are   slightly   different.  When  PCRE_PARTIAL  is  set  for
2298         pcre_dfa_exec(),  the  return code PCRE_ERROR_NOMATCH is converted into         pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is  converted  into
2299         PCRE_ERROR_PARTIAL if the end of the subject  is  reached,  there  have         PCRE_ERROR_PARTIAL  if  the  end  of the subject is reached, there have
2300         been no complete matches, but there is still at least one matching pos-         been no complete matches, but there is still at least one matching pos-
2301         sibility. The portion of the string that provided the partial match  is         sibility.  The portion of the string that provided the partial match is
2302         set as the first matching string.         set as the first matching string.
2303    
2304           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2305    
2306         Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2307         stop as soon as it has found one match. Because of the way the alterna-         stop as soon as it has found one match. Because of the way the alterna-
2308         tive  algorithm  works, this is necessarily the shortest possible match         tive algorithm works, this is necessarily the shortest  possible  match
2309         at the first possible matching point in the subject string.         at the first possible matching point in the subject string.
2310    
2311           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2312    
2313         When pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option,  and         When  pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option, and
2314         returns  a  partial  match, it is possible to call it again, with addi-         returns a partial match, it is possible to call it  again,  with  addi-
2315         tional subject characters, and have it continue with  the  same  match.         tional  subject  characters,  and have it continue with the same match.
2316         The  PCRE_DFA_RESTART  option requests this action; when it is set, the         The PCRE_DFA_RESTART option requests this action; when it is  set,  the
2317         workspace and wscount arguments must reference the same vector as before         workspace and wscount arguments must reference the same vector as before
2318         because  data  about  the  match so far is left in them after a partial         because data about the match so far is left in  them  after  a  partial
2319         match. There is more discussion of this  facility  in  the  pcrepartial         match.  There  is  more  discussion of this facility in the pcrepartial
2320         documentation.         documentation.
2321    
2322     Successful returns from pcre_dfa_exec()     Successful returns from pcre_dfa_exec()
2323    
2324         When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-         When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
2325         string in the subject. Note, however, that all the matches from one run         string in the subject. Note, however, that all the matches from one run
2326         of  the  function  start  at the same point in the subject. The shorter         of the function start at the same point in  the  subject.  The  shorter
2327         matches are all initial substrings of the longer matches. For  example,         matches  are all initial substrings of the longer matches. For example,
2328         if the pattern         if the pattern
2329    
2330           <.*>           <.*>
# Line 2321  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2339  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2339           <something> <something else>           <something> <something else>
2340           <something> <something else> <something further>           <something> <something else> <something further>
2341    
2342         On  success,  the  yield of the function is a number greater than zero,         On success, the yield of the function is a number  greater  than  zero,
2343         which is the number of matched substrings.  The  substrings  themselves         which  is  the  number of matched substrings. The substrings themselves
2344         are  returned  in  ovector. Each string uses two elements; the first is         are returned in ovector. Each string uses two elements;  the  first  is
2345         the offset to the start, and the second is the offset to  the  end.  In         the  offset  to  the start, and the second is the offset to the end. In
2346         fact,  all  the  strings  have the same start offset. (Space could have         fact, all the strings have the same start  offset.  (Space  could  have
2347         been saved by giving this only once, but it was decided to retain  some         been  saved by giving this only once, but it was decided to retain some
2348         compatibility  with  the  way pcre_exec() returns data, even though the         compatibility with the way pcre_exec() returns data,  even  though  the
2349         meaning of the strings is different.)         meaning of the strings is different.)
2350    
2351         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
2352         est  matching  string is given first. If there were too many matches to         est matching string is given first. If there were too many  matches  to
2353         fit into ovector, the yield of the function is zero, and the vector  is         fit  into ovector, the yield of the function is zero, and the vector is
2354         filled with the longest matches.         filled with the longest matches.
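
The layout of the returned pairs can be illustrated with a short sketch. The function names are hypothetical and nothing here calls PCRE; count stands for a positive value returned by pcre_dfa_exec(), and the offsets used in any example are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the i-th match reported by pcre_dfa_exec(), where match 0
 * is the longest. Returns -1 if i is out of range. */
static int dfa_match_length(const int *ovector, int count, int i)
{
    assert(ovector != NULL);
    if (i < 0 || i >= count)
        return -1;
    return ovector[2 * i + 1] - ovector[2 * i];
}

/* Because matches are ordered longest first, the shortest match is the
 * last pair. Returns its length, or -1 if there are no matches. */
static int dfa_shortest_length(const int *ovector, int count)
{
    return dfa_match_length(ovector, count, count - 1);
}
```

For three matches stored as {8,56, 8,36, 8,19}, the lengths come out as 48, 28, and 11, and every pair shares the start offset 8.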
2355    
2356     Error returns from pcre_dfa_exec()     Error returns from pcre_dfa_exec()
2357    
2358         The  pcre_dfa_exec()  function returns a negative number when it fails.         The pcre_dfa_exec() function returns a negative number when  it  fails.
2359         Many of the errors are the same  as  for  pcre_exec(),  and  these  are         Many  of  the  errors  are  the  same as for pcre_exec(), and these are
2360         described  above.   There are in addition the following errors that are         described above.  There are in addition the following errors  that  are
2361         specific to pcre_dfa_exec():         specific to pcre_dfa_exec():
2362    
2363           PCRE_ERROR_DFA_UITEM      (-16)           PCRE_ERROR_DFA_UITEM      (-16)
2364    
2365         This return is given if pcre_dfa_exec() encounters an item in the  pat-         This  return is given if pcre_dfa_exec() encounters an item in the pat-
2366         tern  that  it  does not support, for instance, the use of \C or a back         tern that it does not support, for instance, the use of \C  or  a  back
2367         reference.         reference.
2368    
2369           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
2370    
2371         This return is given if pcre_dfa_exec()  encounters  a  condition  item         This  return  is  given  if pcre_dfa_exec() encounters a condition item
2372         that  uses  a back reference for the condition, or a test for recursion         that uses a back reference for the condition, or a test  for  recursion
2373         in a specific group. These are not supported.         in a specific group. These are not supported.
2374    
2375           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
2376    
2377         This return is given if pcre_dfa_exec() is called with an  extra  block         This  return  is given if pcre_dfa_exec() is called with an extra block
2378         that contains a setting of the match_limit field. This is not supported         that contains a setting of the match_limit field. This is not supported
2379         (it is meaningless).         (it is meaningless).
2380    
2381           PCRE_ERROR_DFA_WSSIZE     (-19)           PCRE_ERROR_DFA_WSSIZE     (-19)
2382    
2383         This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the         This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
2384         workspace vector.         workspace vector.
2385    
2386           PCRE_ERROR_DFA_RECURSE    (-20)           PCRE_ERROR_DFA_RECURSE    (-20)
2387    
2388         When  a  recursive subpattern is processed, the matching function calls         When a recursive subpattern is processed, the matching  function  calls
2389         itself recursively, using private vectors for  ovector  and  workspace.         itself  recursively,  using  private vectors for ovector and workspace.
2390         This  error  is  given  if  the output vector is not large enough. This         This error is given if the output vector  is  not  large  enough.  This
2391         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
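
A caller may want to separate these five codes from the errors shared with pcre_exec(). The sketch below shows one way to do that. The enum constants are local, hypothetical names that mirror the numeric values listed above, chosen so that they cannot clash with the macros in pcre.h, and the description strings are paraphrases written for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the numeric values listed above (hypothetical names). */
enum {
    EX_DFA_UITEM   = -16,
    EX_DFA_UCOND   = -17,
    EX_DFA_UMLIMIT = -18,
    EX_DFA_WSSIZE  = -19,
    EX_DFA_RECURSE = -20
};

/* True if rc is one of the errors specific to pcre_dfa_exec(). */
static int is_dfa_specific_error(int rc)
{
    return rc <= EX_DFA_UITEM && rc >= EX_DFA_RECURSE;
}

/* Short description of a negative return code from pcre_dfa_exec(). */
static const char *dfa_error_description(int rc)
{
    assert(rc < 0);
    switch (rc) {
    case EX_DFA_UITEM:   return "unsupported item in pattern";
    case EX_DFA_UCOND:   return "unsupported condition in pattern";
    case EX_DFA_UMLIMIT: return "match_limit is not supported";
    case EX_DFA_WSSIZE:  return "workspace vector too small";
    case EX_DFA_RECURSE: return "output vector too small for recursion";
    default:             return "error shared with pcre_exec()";
    }
}
```

Of these, only the workspace error is one that a caller can reasonably retry, by calling again with a larger workspace vector.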
2392    
2393    
2394  SEE ALSO  SEE ALSO
2395    
2396         pcrebuild(3), pcrecallout(3), pcrecpp(3),  pcrematching(3),  pcrepar-         pcrebuild(3),  pcrecallout(3),  pcrecpp(3), pcrematching(3), pcrepar-
2397         tial(3),  pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).         tial(3), pcreposix(3), pcreprecompile(3), pcresample(3),  pcrestack(3).
2398    
2399    
2400  AUTHOR  AUTHOR
# Line 2388  AUTHOR Line 2406  AUTHOR
2406    
2407  REVISION  REVISION
2408    
2409         Last updated: 24 April 2007         Last updated: 04 June 2007
2410         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2411  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2412    
# Line 2491  THE CALLOUT INTERFACE Line 2509  THE CALLOUT INTERFACE
2509         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
2510         were passed to pcre_exec().         were passed to pcre_exec().
2511    
2512         The start_match field contains the offset within the subject  at  which         The start_match field normally contains the offset within  the  subject
2513         the  current match attempt started. If the pattern is not anchored, the         at  which  the  current  match  attempt started. However, if the escape
2514         callout function may be called several times from the same point in the         sequence \K has been encountered, this value is changed to reflect  the
2515         pattern for different starting points in the subject.         modified  starting  point.  If the pattern is not anchored, the callout
2516           function may be called several times from the same point in the pattern
2517           for different starting points in the subject.
2518    
2519         The  current_position  field  contains the offset within the subject of         The  current_position  field  contains the offset within the subject of
2520         the current match pointer.         the current match pointer.
# Line 2557  AUTHOR Line 2577  AUTHOR
2577    
2578  REVISION  REVISION
2579    
2580         Last updated: 06 March 2007         Last updated: 29 May 2007
2581         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2582  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2583    
# Line 2718  PCRE REGULAR EXPRESSION DETAILS Line 2738  PCRE REGULAR EXPRESSION DETAILS
2738         ported  by  PCRE when its main matching function, pcre_exec(), is used.         ported  by  PCRE when its main matching function, pcre_exec(), is used.
2739         From  release  6.0,   PCRE   offers   a   second   matching   function,         From  release  6.0,   PCRE   offers   a   second   matching   function,
2740         pcre_dfa_exec(),  which matches using a different algorithm that is not         pcre_dfa_exec(),  which matches using a different algorithm that is not
2741         Perl-compatible. The advantages and disadvantages  of  the  alternative         Perl-compatible. Some of the features discussed below are not available
2742         function, and how it differs from the normal function, are discussed in         when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
2743         the pcrematching page.         alternative function, and how it differs from the normal function,  are
2744           discussed in the pcrematching page.
2745    
2746    
2747  CHARACTERS AND METACHARACTERS  CHARACTERS AND METACHARACTERS
2748    
2749         A regular expression is a pattern that is  matched  against  a  subject         A  regular  expression  is  a pattern that is matched against a subject
2750         string  from  left  to right. Most characters stand for themselves in a         string from left to right. Most characters stand for  themselves  in  a
2751         pattern, and match the corresponding characters in the  subject.  As  a         pattern,  and  match  the corresponding characters in the subject. As a
2752         trivial example, the pattern         trivial example, the pattern
2753    
2754           The quick brown fox           The quick brown fox
2755    
2756         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
2757         caseless matching is specified (the PCRE_CASELESS option), letters  are         caseless  matching is specified (the PCRE_CASELESS option), letters are
2758         matched  independently  of case. In UTF-8 mode, PCRE always understands         matched independently of case. In UTF-8 mode, PCRE  always  understands
2759         the concept of case for characters whose values are less than  128,  so         the  concept  of case for characters whose values are less than 128, so
2760         caseless  matching  is always possible. For characters with higher val-         caseless matching is always possible. For characters with  higher  val-
2761         ues, the concept of case is supported if PCRE is compiled with  Unicode         ues,  the concept of case is supported if PCRE is compiled with Unicode
2762         property  support,  but  not  otherwise.   If  you want to use caseless         property support, but not otherwise.   If  you  want  to  use  caseless
2763         matching for characters 128 and above, you must  ensure  that  PCRE  is         matching  for  characters  128  and above, you must ensure that PCRE is
2764         compiled with Unicode property support as well as with UTF-8 support.         compiled with Unicode property support as well as with UTF-8 support.
2765    
2766         The  power  of  regular  expressions  comes from the ability to include         The power of regular expressions comes  from  the  ability  to  include
2767         alternatives and repetitions in the pattern. These are encoded  in  the         alternatives  and  repetitions in the pattern. These are encoded in the
2768         pattern by the use of metacharacters, which do not stand for themselves         pattern by the use of metacharacters, which do not stand for themselves
2769         but instead are interpreted in some special way.         but instead are interpreted in some special way.
2770    
2771         There are two different sets of metacharacters: those that  are  recog-         There  are  two different sets of metacharacters: those that are recog-
2772         nized  anywhere in the pattern except within square brackets, and those         nized anywhere in the pattern except within square brackets, and  those
2773         that are recognized within square brackets.  Outside  square  brackets,         that  are  recognized  within square brackets. Outside square brackets,
2774         the metacharacters are as follows:         the metacharacters are as follows:
2775    
2776           \      general escape character with several uses           \      general escape character with several uses
# Line 2768  CHARACTERS AND METACHARACTERS Line 2789  CHARACTERS AND METACHARACTERS
2789                  also "possessive quantifier"                  also "possessive quantifier"
2790           {      start min/max quantifier           {      start min/max quantifier
2791    
2792         Part  of  a  pattern  that is in square brackets is called a "character         Part of a pattern that is in square brackets  is  called  a  "character
2793         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
2794    
2795           \      general escape character           \      general escape character
# Line 2778  CHARACTERS AND METACHARACTERS Line 2799  CHARACTERS AND METACHARACTERS
2799                    syntax)                    syntax)
2800           ]      terminates the character class           ]      terminates the character class
2801    
2802         The following sections describe the use of each of the  metacharacters.         The  following sections describe the use of each of the metacharacters.
2803    
2804    
2805  BACKSLASH  BACKSLASH
2806    
2807         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2808         a non-alphanumeric character, it takes away any  special  meaning  that         a  non-alphanumeric  character,  it takes away any special meaning that
2809         character  may  have.  This  use  of  backslash  as an escape character         character may have. This  use  of  backslash  as  an  escape  character
2810         applies both inside and outside character classes.         applies both inside and outside character classes.
2811    
2812         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
2813         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
2814         character would otherwise be interpreted as a metacharacter, so  it  is         character  would  otherwise be interpreted as a metacharacter, so it is
2815         always  safe  to  precede  a non-alphanumeric with backslash to specify         always safe to precede a non-alphanumeric  with  backslash  to  specify
2816         that it stands for itself. In particular, if you want to match a  back-         that  it stands for itself. In particular, if you want to match a back-
2817         slash, you write \\.         slash, you write \\.
2818    
2819         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
2820         the pattern (other than in a character class) and characters between  a         the  pattern (other than in a character class) and characters between a
2821         # outside a character class and the next newline are ignored. An escap-         # outside a character class and the next newline are ignored. An escap-
2822         ing backslash can be used to include a whitespace  or  #  character  as         ing  backslash  can  be  used to include a whitespace or # character as
2823         part of the pattern.         part of the pattern.
2824    
2825         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
2826         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
2827         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
2828         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
2829         tion. Note the following examples:         tion. Note the following examples:
2830    
2831           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 2814  BACKSLASH Line 2835  BACKSLASH
2835           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
2836           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
2837    
2838         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
2839         classes.         classes.
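As a rough analogue outside PCRE, the following sketch uses Python's standard re module. That module has no \Q...\E, but its re.escape() function serves the same purpose: every character of the quoted text stands for itself. The helper name quote_literal is hypothetical and is not part of PCRE or of Python.

```python
import re

def quote_literal(text):
    # Hypothetical helper: plays the role of \Q...\E by escaping every
    # character that would otherwise be treated as a metacharacter.
    return re.escape(text)

# "$" is matched literally rather than being treated as an assertion.
pattern = re.compile(quote_literal("abc$xyz"))
found = pattern.search("price: abc$xyz!")
matched = found.group(0)
```

Quoting the text in one call avoids having to decide, character by character, which ones need a backslash.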
2840    
2841     Non-printing characters     Non-printing characters
2842    
2843         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
2844         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
2845         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
2846         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
2847         editing, it is usually easier  to  use  one  of  the  following  escape         editing,  it  is  usually  easier  to  use  one of the following escape
2848         sequences than the binary character it represents:         sequences than the binary character it represents:
2849    
2850           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 2837  BACKSLASH Line 2858  BACKSLASH
2858           \xhh      character with hex code hh           \xhh      character with hex code hh
2859           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh..
2860    
2861         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
2862         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
2863         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
2864         becomes hex 7B.         becomes hex 7B.
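The \cx rule can be sketched in a few lines of Python. This is an illustration of the arithmetic described above, not PCRE's own code; the function name control_escape is an assumption, and the sketch assumes x is a single character with a code below 128.

```python
def control_escape(x):
    # Assumption: x is one character whose code is below 128.
    code = ord(x)
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Then bit 6 (hex 40) is inverted.
    return code ^ 0x40

cz = control_escape("z")      # hex 1A
cbrace = control_escape("{")  # hex 3B
csemi = control_escape(";")   # hex 7B
```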
2865    
2866         After \x, from zero to two hexadecimal digits are read (letters can  be         After  \x, from zero to two hexadecimal digits are read (letters can be
2867         in  upper  or  lower case). Any number of hexadecimal digits may appear         in upper or lower case). Any number of hexadecimal  digits  may  appear
2868         between \x{ and }, but the value of the character  code  must  be  less         between  \x{  and  },  but the value of the character code must be less
2869         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2870         the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than         the  maximum  hexadecimal  value is 7FFFFFFF). If characters other than
2871         hexadecimal  digits  appear between \x{ and }, or if there is no termi-         hexadecimal digits appear between \x{ and }, or if there is  no  termi-
2872         nating }, this form of escape is not recognized.  Instead, the  initial         nating  }, this form of escape is not recognized.  Instead, the initial
2873         \x will be interpreted as a basic hexadecimal escape, with no following         \x will be interpreted as a basic hexadecimal escape, with no following
2874         digits, giving a character whose value is zero.         digits, giving a character whose value is zero.
2875    
2876         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2877         two  syntaxes  for  \x. There is no difference in the way they are han-         two syntaxes for \x. There is no difference in the way  they  are  han-
2878         dled. For example, \xdc is exactly the same as \x{dc}.         dled. For example, \xdc is exactly the same as \x{dc}.
2879    
2880         After \0 up to two further octal digits are read. If  there  are  fewer         After  \0  up  to two further octal digits are read. If there are fewer
2881         than  two  digits,  just  those  that  are  present  are used. Thus the         than two digits, just  those  that  are  present  are  used.  Thus  the
2882         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
2883         (code  value 7). Make sure you supply two digits after the initial zero         (code value 7). Make sure you supply two digits after the initial  zero
2884         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
2885    
2886         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2887         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
2888         its as a decimal number. If the number is less than  10,  or  if  there         its  as  a  decimal  number. If the number is less than 10, or if there
2889         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
2890         expression, the entire  sequence  is  taken  as  a  back  reference.  A         expression,  the  entire  sequence  is  taken  as  a  back reference. A
2891         description  of how this works is given later, following the discussion         description of how this works is given later, following the  discussion
2892         of parenthesized subpatterns.         of parenthesized subpatterns.
2893    
2894         Inside a character class, or if the decimal number is  greater  than  9         Inside  a  character  class, or if the decimal number is greater than 9
2895         and  there have not been that many capturing subpatterns, PCRE re-reads         and there have not been that many capturing subpatterns, PCRE  re-reads
2896         up to three octal digits following the backslash, and uses them to gen-         up to three octal digits following the backslash, and uses them to gen-
2897         erate  a data character. Any subsequent digits stand for themselves. In         erate a data character. Any subsequent digits stand for themselves.  In
2898         non-UTF-8 mode, the value of a character specified  in  octal  must  be         non-UTF-8  mode,  the  value  of a character specified in octal must be
2899         less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For         less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
2900         example:         example:
2901    
2902           \040   is another way of writing a space           \040   is another way of writing a space
# Line 2893  BACKSLASH Line 2914  BACKSLASH
2914           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
2915                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
2916    
2917         Note that octal values of 100 or greater must not be  introduced  by  a         Note  that  octal  values of 100 or greater must not be introduced by a
2918         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
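The decision between a back reference and an octal character can be sketched as follows. This is a simplified Python illustration of the rules above, not PCRE's implementation. The function name interpret_digits and its parameters are assumptions: digits is the run of decimal digits after the backslash (not starting with 0), and groups_so_far is the number of capturing left parentheses seen earlier in the pattern. The upper limit on octal values (\400 or \777) is not checked.

```python
def interpret_digits(digits, groups_so_far, in_class=False):
    # Outside a class, the digits are first read as a decimal number.
    number = int(digits)
    if not in_class and (number < 10 or number <= groups_so_far):
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits; anything after them
    # stands for itself.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    value = int(octal, 8) if octal else 0
    return ("char", value, digits[len(octal):])

# With fewer than 81 groups, \81 is a binary zero followed by "8" and "1".
no_groups = interpret_digits("81", 0)
# With at least 81 groups, the same sequence is a back reference.
many_groups = interpret_digits("81", 81)
```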
2919    
2920         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
2921         inside and outside character classes. In addition, inside  a  character         inside  and  outside character classes. In addition, inside a character
2922         class,  the  sequence \b is interpreted as the backspace character (hex         class, the sequence \b is interpreted as the backspace  character  (hex
2923         08), and the sequences \R and \X are interpreted as the characters  "R"         08),  and the sequences \R and \X are interpreted as the characters "R"
2924         and  "X", respectively. Outside a character class, these sequences have         and "X", respectively. Outside a character class, these sequences  have
2925         different meanings (see below).         different meanings (see below).
2926    
2927     Absolute and relative back references     Absolute and relative back references
2928    
2929         The sequence \g followed by a positive or negative  number,  optionally         The  sequence  \g followed by a positive or negative number, optionally
2930         enclosed  in  braces,  is  an absolute or relative back reference. Back         enclosed in braces, is an absolute or relative back reference. A  named
2931         references are discussed later, following the discussion  of  parenthe-         back  reference can be coded as \g{name}. Back references are discussed
2932         sized subpatterns.         later, following the discussion of parenthesized subpatterns.
2933    
2934     Generic character types     Generic character types
2935    
# Line 2923  BACKSLASH Line 2944  BACKSLASH
2944           \W     any "non-word" character           \W     any "non-word" character
2945    
2946         Each pair of escape sequences partitions the complete set of characters         Each pair of escape sequences partitions the complete set of characters
2947         into  two disjoint sets. Any given character matches one, and only one,         into two disjoint sets. Any given character matches one, and only  one,
2948         of each pair.         of each pair.
2949    
2950         These character type sequences can appear both inside and outside char-         These character type sequences can appear both inside and outside char-
2951         acter  classes.  They each match one character of the appropriate type.         acter classes. They each match one character of the  appropriate  type.
2952         If the current matching point is at the end of the subject string,  all         If  the current matching point is at the end of the subject string, all
2953         of them fail, since there is no character to match.         of them fail, since there is no character to match.
2954    
2955         For  compatibility  with Perl, \s does not match the VT character (code         For compatibility with Perl, \s does not match the VT  character  (code
2956         11).  This makes it different from the POSIX "space" class. The  \s         11).   This makes it different from the POSIX "space" class. The \s
2957         characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If         characters are HT (9), LF (10), FF (12), CR (13), and space  (32).  (If
2958         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
2959         ter. In PCRE, it never does.)         ter. In PCRE, it never does.)
2960    
2961         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
2962         is a letter or digit. The definition of  letters  and  digits  is  con-         is  a  letter  or  digit.  The definition of letters and digits is con-
2963         trolled  by PCRE's low-valued character tables, and may vary if locale-         trolled by PCRE's low-valued character tables, and may vary if  locale-
2964         specific matching is taking place (see "Locale support" in the  pcreapi         specific  matching is taking place (see "Locale support" in the pcreapi
2965         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
2966         systems, or "french" in Windows, some character codes greater than  128         systems,  or "french" in Windows, some character codes greater than 128
2967         are used for accented letters, and these are matched by \w.         are used for accented letters, and these are matched by \w.
2968    
2969         In  UTF-8 mode, characters with values greater than 128 never match \d,         In UTF-8 mode, characters with values greater than 128 never match  \d,
2970         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2971         code  character  property support is available. The use of locales with         code character property support is available. The use of  locales  with
2972         Unicode is discouraged.         Unicode is discouraged.
2973    
2974     Newline sequences     Newline sequences
2975    
2976         Outside a character class, the escape sequence \R matches  any  Unicode         Outside  a  character class, the escape sequence \R matches any Unicode
2977         newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is         newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is
2978         equivalent to the following:         equivalent to the following:
2979    
2980           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
2981    
2982         This is an example of an "atomic group", details  of  which  are  given         This  is  an  example  of an "atomic group", details of which are given
2983         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
2984         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
2985         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
2986         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
2987         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
2988    
2989         In  UTF-8  mode, two additional characters whose codepoints are greater         In UTF-8 mode, two additional characters whose codepoints  are  greater
2990         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
2991         rator,  U+2029).   Unicode character property support is not needed for         rator, U+2029).  Unicode character property support is not  needed  for
2992         these characters to be recognized.         these characters to be recognized.
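The set of sequences that \R recognizes can be sketched with Python's standard re module, which has no \R of its own. The function name newline_pattern and the utf8_mode flag are assumptions made for this illustration. A plain non-capturing group is used, so the atomic behaviour of the real \R (never splitting CR LF when backtracking) is not reproduced; for a pattern consisting of the group alone this makes no difference.

```python
import re

def newline_pattern(utf8_mode=False):
    # CR LF is listed first so that it is taken as a single unit.
    alternatives = [r"\r\n", r"\n", r"\x0b", r"\f", r"\r", r"\x85"]
    if utf8_mode:
        # LS and PS are recognized only in UTF-8 mode.
        alternatives += [r"\u2028", r"\u2029"]
    return re.compile("(?:" + "|".join(alternatives) + ")")

subject = "a\r\nb\nc\x85d\u2028e"
basic = newline_pattern().findall(subject)
with_utf8 = newline_pattern(utf8_mode=True).findall(subject)
```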
2993    
2994         Inside a character class, \R matches the letter "R".         Inside a character class, \R matches the letter "R".
# Line 2975  BACKSLASH Line 2996  BACKSLASH
2996     Unicode character properties     Unicode character properties
2997    
2998         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
2999         tional  escape  sequences  to  match character properties are available         tional escape sequences to match  character  properties  are  available
3000         when UTF-8 mode is selected. They are:         when UTF-8 mode is selected. They are:
3001    
3002           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
3003           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
3004           \X       an extended Unicode sequence           \X       an extended Unicode sequence
3005    
3006         The property names represented by xx above are limited to  the  Unicode         The  property  names represented by xx above are limited to the Unicode
3007         script names, the general category properties, and "Any", which matches         script names, the general category properties, and "Any", which matches
3008         any character (including newline). Other properties such as "InMusical-         any character (including newline). Other properties such as "InMusical-
3009         Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does         Symbols" are not currently supported by PCRE. Note  that  \P{Any}  does
3010         not match any characters, so always causes a match failure.         not match any characters, so always causes a match failure.
3011    
3012         Sets of Unicode characters are defined as belonging to certain scripts.         Sets of Unicode characters are defined as belonging to certain scripts.
3013         A  character from one of these sets can be matched using a script name.         A character from one of these sets can be matched using a script  name.
3014         For example:         For example:
3015    
3016           \p{Greek}           \p{Greek}
3017           \P{Han}           \P{Han}
3018    
3019         Those that are not part of an identified script are lumped together  as         Those  that are not part of an identified script are lumped together as
3020         "Common". The current list of scripts is:         "Common". The current list of scripts is:
3021    
3022         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
3023         Buhid,  Canadian_Aboriginal,  Cherokee,  Common,   Coptic,   Cuneiform,         Buhid,   Canadian_Aboriginal,   Cherokee,  Common,  Coptic,  Cuneiform,
3024         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
3025         Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-         Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
3026         gana,  Inherited,  Kannada,  Katakana,  Kharoshthi,  Khmer, Lao, Latin,         gana, Inherited, Kannada,  Katakana,  Kharoshthi,  Khmer,  Lao,  Latin,
3027         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
3028         Ogham,  Old_Italic,  Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,         Ogham, Old_Italic, Old_Persian, Oriya, Osmanya,  Phags_Pa,  Phoenician,
3029         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,
3030         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
3031    
3032         Each  character has exactly one general category property, specified by         Each character has exactly one general category property, specified  by
3033         a two-letter abbreviation. For compatibility with Perl, negation can be         a two-letter abbreviation. For compatibility with Perl, negation can be
3034         specified  by  including a circumflex between the opening brace and the         specified by including a circumflex between the opening brace  and  the
3035         property name. For example, \p{^Lu} is the same as \P{Lu}.         property name. For example, \p{^Lu} is the same as \P{Lu}.
3036    
3037         If only one letter is specified with \p or \P, it includes all the gen-         If only one letter is specified with \p or \P, it includes all the gen-
3038         eral  category properties that start with that letter. In this case, in         eral category properties that start with that letter. In this case,  in
3039         the absence of negation, the curly brackets in the escape sequence  are         the  absence of negation, the curly brackets in the escape sequence are
3040         optional; these two examples have the same effect:         optional; these two examples have the same effect:
3041    
3042           \p{L}           \p{L}
# Line 3067  BACKSLASH Line 3088  BACKSLASH
3088           Zp    Paragraph separator           Zp    Paragraph separator
3089           Zs    Space separator           Zs    Space separator
3090    
3091         The  special property L& is also supported: it matches a character that         The special property L& is also supported: it matches a character  that
3092         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
3093         classified as a modifier or "other".         classified as a modifier or "other".
3094    
3095         The  long  synonyms  for  these  properties that Perl supports (such as         The long synonyms for these properties  that  Perl  supports  (such  as
3096         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
3097         any of these properties with "Is".         any of these properties with "Is".
3098    
3099         No character that is in the Unicode table has the Cn (unassigned) prop-         No character that is in the Unicode table has the Cn (unassigned) prop-
3100         erty.  Instead, this property is assumed for any code point that is not         erty.  Instead, this property is assumed for any code point that is not
3101         in the Unicode table.         in the Unicode table.
3102    
3103         Specifying  caseless  matching  does not affect these escape sequences.         Specifying caseless matching does not affect  these  escape  sequences.
3104         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
3105    
3106         The \X escape matches any number of Unicode  characters  that  form  an         The  \X  escape  matches  any number of Unicode characters that form an
3107         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
3108    
3109           (?>\PM\pM*)           (?>\PM\pM*)
3110    
3111         That  is,  it matches a character without the "mark" property, followed         That is, it matches a character without the "mark"  property,  followed
3112         by zero or more characters with the "mark"  property,  and  treats  the         by  zero  or  more  characters with the "mark" property, and treats the
3113         sequence  as  an  atomic group (see below).  Characters with the "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
3114         property are typically accents that affect the preceding character.         property are typically accents that affect the preceding character.
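The grouping that \X performs can be approximated in Python with the standard unicodedata module. This is a sketch under stated assumptions, not PCRE's algorithm: the function name extended_sequences is hypothetical, and a mark that has no preceding base character is returned as a sequence of its own, whereas \X would not match at that point.

```python
import unicodedata

def extended_sequences(text):
    sequences = []
    for ch in text:
        # General categories Mn, Mc and Me all carry the "mark" property.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            # A mark attaches to the sequence that precedes it.
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences

# "e" followed by U+0301 (combining acute accent) forms one sequence.
parts = extended_sequences("e\u0301a")
```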
3115    
3116         Matching characters by Unicode property is not fast, because  PCRE  has         Matching  characters  by Unicode property is not fast, because PCRE has
3117         to  search  a  structure  that  contains data for over fifteen thousand         to search a structure that contains  data  for  over  fifteen  thousand
3118         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
3119         \w do not use Unicode properties in PCRE.         \w do not use Unicode properties in PCRE.
3120    
3121       Resetting the match start
3122    
3123           The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
3124           ously  matched  characters  not  to  be  included  in the final matched
3125           sequence. For example, the pattern:
3126    
3127             foo\Kbar
3128    
3129           matches "foobar", but reports that it has matched "bar".  This  feature
3130           is  similar  to  a lookbehind assertion (described below).  However, in
3131           this case, the part of the subject before the real match does not  have
3132           to  be of fixed length, as lookbehind assertions do. The use of \K does
3133           not interfere with the setting of captured  substrings.   For  example,
3134           when the pattern
3135    
3136             (foo)\Kbar
3137    
3138           matches "foobar", the first substring is still set to "foo".
3139    
3140     Simple assertions     Simple assertions
3141    
3142         The  final use of backslash is for certain simple assertions. An asser-         The  final use of backslash is for certain simple assertions. An asser-
# Line 3858  BACK REFERENCES Line 3898  BACK REFERENCES
3898         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
3899         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
3900    
3901         Back references to named subpatterns use the Perl  syntax  \k<name>  or         There are several different ways of writing back  references  to  named
3902         \k'name'  or  the  Python  syntax (?P=name). We could rewrite the above         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
3903         example in either of the following ways:         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
3904           unified back reference syntax, in which \g can be used for both numeric
3905           and named references, is also supported. We  could  rewrite  the  above
3906           example in any of the following ways:
3907    
3908           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
3909             (?'p1'(?i)rah)\s+\k{p1}
3910           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
3911             (?<p1>(?i)rah)\s+\g{p1}
3912    
3913         A subpattern that is referenced by  name  may  appear  in  the  pattern         A  subpattern  that  is  referenced  by  name may appear in the pattern
3914         before or after the reference.         before or after the reference.
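
The named back reference behaviour described above can be tried out with a short sketch. This uses Python's re module rather than PCRE itself, on the assumption that it behaves the same way for the Python-style (?P<name>...) and (?P=name) syntax; the helper name matches() is purely illustrative. The caseless option is written as the scoped form (?i:...) because recent Python versions reject an unscoped (?i) in mid-pattern.

```python
import re

# Named group "p1" is matched caselessly; the back reference (?P=p1) sits
# outside the scoped (?i:...) and so compares case-sensitively against
# whatever text the group actually captured.
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def matches(subject):
    """Return True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None


for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, "->", matches(subject))
```

The third subject is rejected because the back reference must repeat the captured text exactly, even though the group itself was matched caselessly.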
3915    
3916         There  may be more than one back reference to the same subpattern. If a         There may be more than one back reference to the same subpattern. If  a
3917         subpattern has not actually been used in a particular match,  any  back         subpattern  has  not actually been used in a particular match, any back
3918         references to it always fail. For example, the pattern         references to it always fail. For example, the pattern
3919    
3920           (a|(bc))\2           (a|(bc))\2
3921    
3922         always  fails if it starts to match "a" rather than "bc". Because there         always fails if it starts to match "a" rather than "bc". Because  there
3923         may be many capturing parentheses in a pattern,  all  digits  following         may  be  many  capturing parentheses in a pattern, all digits following
3924         the  backslash  are taken as part of a potential back reference number.         the backslash are taken as part of a potential back  reference  number.
3925         If the pattern continues with a digit character, some delimiter must be         If the pattern continues with a digit character, some delimiter must be
3926         used  to  terminate  the back reference. If the PCRE_EXTENDED option is         used to terminate the back reference. If the  PCRE_EXTENDED  option  is
3927         set, this can be whitespace.  Otherwise an  empty  comment  (see  "Com-         set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
3928         ments" below) can be used.         ments" below) can be used.
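
As a small illustration of a back reference to a subpattern that did not take part in the match, the following sketch again uses Python's re module as a stand-in for PCRE (an assumption; the two agree on this point as far as the author of this sketch is aware). The helper first_match() is a hypothetical name.

```python
import re

# Group 2 is set only when the "bc" alternative is taken, so \2 can
# succeed only on that path.
pattern = re.compile(r"(a|(bc))\2")


def first_match(subject):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(subject)
    return m.group(0) if m else None


print(first_match("aa"))    # group 2 never set, so \2 always fails
print(first_match("bcbc"))  # group 2 is "bc", and \2 repeats it
```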
3929    
3930         A  back reference that occurs inside the parentheses to which it refers         A back reference that occurs inside the parentheses to which it  refers
3931         fails when the subpattern is first used, so, for example,  (a\1)  never         fails  when  the subpattern is first used, so, for example, (a\1) never
3932         matches.   However,  such references can be useful inside repeated sub-         matches.  However, such references can be useful inside  repeated  sub-
3933         patterns. For example, the pattern         patterns. For example, the pattern
3934    
3935           (a|b\1)+           (a|b\1)+
3936    
3937         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
3938         ation  of  the  subpattern,  the  back  reference matches the character         ation of the subpattern,  the  back  reference  matches  the  character
3939         string corresponding to the previous iteration. In order  for  this  to         string  corresponding  to  the previous iteration. In order for this to
3940         work,  the  pattern must be such that the first iteration does not need         work, the pattern must be such that the first iteration does  not  need
3941         to match the back reference. This can be done using alternation, as  in         to  match the back reference. This can be done using alternation, as in
3942         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
3943    
3944    
3945  ASSERTIONS  ASSERTIONS
3946    
3947         An  assertion  is  a  test on the characters following or preceding the         An assertion is a test on the characters  following  or  preceding  the
3948         current matching point that does not actually consume  any  characters.         current  matching  point that does not actually consume any characters.
3949         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
3950         described above.         described above.
3951    
3952         More complicated assertions are coded as  subpatterns.  There  are  two         More  complicated  assertions  are  coded as subpatterns. There are two
3953         kinds:  those  that  look  ahead of the current position in the subject         kinds: those that look ahead of the current  position  in  the  subject
3954         string, and those that look  behind  it.  An  assertion  subpattern  is         string,  and  those  that  look  behind  it. An assertion subpattern is
3955         matched  in  the  normal way, except that it does not cause the current         matched in the normal way, except that it does not  cause  the  current
3956         matching position to be changed.         matching position to be changed.
3957    
3958         Assertion subpatterns are not capturing subpatterns,  and  may  not  be         Assertion  subpatterns  are  not  capturing subpatterns, and may not be
3959         repeated,  because  it  makes no sense to assert the same thing several         repeated, because it makes no sense to assert the  same  thing  several
3960         times. If any kind of assertion contains capturing  subpatterns  within         times.  If  any kind of assertion contains capturing subpatterns within
3961         it,  these are counted for the purposes of numbering the capturing sub-         it, these are counted for the purposes of numbering the capturing  sub-
3962         patterns in the whole pattern.  However, substring capturing is carried         patterns in the whole pattern.  However, substring capturing is carried
3963         out  only  for  positive assertions, because it does not make sense for         out only for positive assertions, because it does not  make  sense  for
3964         negative assertions.         negative assertions.
3965    
3966     Lookahead assertions     Lookahead assertions
# Line 3925  ASSERTIONS Line 3970  ASSERTIONS
3970    
3971           \w+(?=;)           \w+(?=;)
3972    
3973         matches  a word followed by a semicolon, but does not include the semi-         matches a word followed by a semicolon, but does not include the  semi-
3974         colon in the match, and         colon in the match, and
3975    
3976           foo(?!bar)           foo(?!bar)
3977    
3978         matches any occurrence of "foo" that is not  followed  by  "bar".  Note         matches  any  occurrence  of  "foo" that is not followed by "bar". Note
3979         that the apparently similar pattern         that the apparently similar pattern
3980    
3981           (?!foo)bar           (?!foo)bar
3982    
3983         does  not  find  an  occurrence  of "bar" that is preceded by something         does not find an occurrence of "bar"  that  is  preceded  by  something
3984         other than "foo"; it finds any occurrence of "bar" whatsoever,  because         other  than "foo"; it finds any occurrence of "bar" whatsoever, because
3985         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
3986         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
3987    
3988         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
3989         most  convenient  way  to  do  it  is with (?!) because an empty string         most convenient way to do it is  with  (?!)  because  an  empty  string
3990         always matches, so an assertion that requires there not to be an  empty         always  matches, so an assertion that requires there not to be an empty
3991         string must always fail.         string must always fail.
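
The lookahead examples above can be checked with a brief sketch. It uses Python's re module, which accepts the same (?=...) and (?!...) syntax; the subject strings and variable names are made up for illustration.

```python
import re

# \w+(?=;) : words followed by a semicolon, semicolon not included.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# foo(?!bar) : "foo" not followed by "bar" (report match offsets).
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar : finds "bar" even when it is preceded by "foo", because the
# assertion looks at the three characters starting at "bar" itself.
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar")]

# (?!) always fails, so the first alternative can never match.
forced = re.search(r"a(?!)|b", "ab")

print(words, foo_not_bar, any_bar, forced.group(0))
```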
3992    
3993     Lookbehind assertions     Lookbehind assertions
3994    
3995         Lookbehind  assertions start with (?<= for positive assertions and (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
3996         for negative assertions. For example,         for negative assertions. For example,
3997    
3998           (?<!foo)bar           (?<!foo)bar
3999    
4000         does find an occurrence of "bar" that is not  preceded  by  "foo".  The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
4001         contents  of  a  lookbehind  assertion are restricted such that all the         contents of a lookbehind assertion are restricted  such  that  all  the
4002         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4003         eral  top-level  alternatives,  they  do  not all have to have the same         eral top-level alternatives, they do not all  have  to  have  the  same
4004         fixed length. Thus         fixed length. Thus
4005    
4006           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 3964  ASSERTIONS Line 4009  ASSERTIONS
4009    
4010           (?<!dogs?|cats?)           (?<!dogs?|cats?)
4011    
4012         causes an error at compile time. Branches that match  different  length         causes  an  error at compile time. Branches that match different length
4013         strings  are permitted only at the top level of a lookbehind assertion.         strings are permitted only at the top level of a lookbehind  assertion.
4014         This is an extension compared with  Perl  (at  least  for  5.8),  which         This  is  an  extension  compared  with  Perl (at least for 5.8), which
4015         requires  all branches to match the same length of string. An assertion         requires all branches to match the same length of string. An  assertion
4016         such as         such as
4017    
4018           (?<=ab(c|de))           (?<=ab(c|de))
4019    
4020         is not permitted, because its single top-level  branch  can  match  two         is  not  permitted,  because  its single top-level branch can match two
4021         different  lengths,  but  it is acceptable if rewritten to use two top-         different lengths, but it is acceptable if rewritten to  use  two  top-
4022         level branches:         level branches:
4023    
4024           (?<=abc|abde)           (?<=abc|abde)
4025    
4026         The implementation of lookbehind assertions is, for  each  alternative,         In some cases, the Perl 5.10 escape sequence \K (see above) can be used
4027         to  temporarily  move the current position back by the fixed length and         instead of a lookbehind assertion; the text before \K is not restricted
4028           to a fixed length.
4029    
4030           The  implementation  of lookbehind assertions is, for each alternative,
4031           to temporarily move the current position back by the fixed  length  and
4032         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
4033         rent position, the assertion fails.         rent position, the assertion fails.
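
A minimal sketch of the "move back and try to match" behaviour, using Python's re module as an approximation of PCRE. Note one difference: Python requires every alternative in a lookbehind to have the same length, so the PCRE extension that allows top-level alternatives of different fixed lengths is not shown here. Subject strings and variable names are illustrative only.

```python
import re

# "bar" that is not preceded by "foo": offsets of each match.
subject = "foobar bazbar bar"
not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]

# Fewer than three characters precede "d", so the positive lookbehind
# cannot move back far enough and the assertion fails.
too_short = re.search(r"(?<=abc)d", "bcd")

# For the negative form, the same shortage makes the inner test fail,
# so the negative assertion as a whole succeeds.
at_start = re.search(r"(?<!foo)bar", "bar")

print(not_after_foo, too_short, at_start.start())
```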
4034    
4035         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
4036         mode) to appear in lookbehind assertions, because it makes it  impossi-         mode)  to appear in lookbehind assertions, because it makes it impossi-
4037         ble  to  calculate the length of the lookbehind. The \X and \R escapes,         ble to calculate the length of the lookbehind. The \X and  \R  escapes,
4038         which can match different numbers of bytes, are also not permitted.         which can match different numbers of bytes, are also not permitted.
4039    
4040         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
4041         assertions  to  specify  efficient  matching  at the end of the subject         assertions to specify efficient matching at  the  end  of  the  subject
4042         string. Consider a simple pattern such as         string. Consider a simple pattern such as
4043    
4044           abcd$           abcd$
4045    
4046         when applied to a long string that does  not  match.  Because  matching         when  applied  to  a  long string that does not match. Because matching
4047         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
4048         and then see if what follows matches the rest of the  pattern.  If  the         and  then  see  if what follows matches the rest of the pattern. If the
4049         pattern is specified as         pattern is specified as
4050    
4051           ^.*abcd$           ^.*abcd$
4052    
4053         the  initial .* matches the entire string at first, but when this fails         the initial .* matches the entire string at first, but when this  fails
4054         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
4055         last  character,  then all but the last two characters, and so on. Once         last character, then all but the last two characters, and so  on.  Once
4056         again the search for "a" covers the entire string, from right to  left,         again  the search for "a" covers the entire string, from right to left,
4057         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
4058    
4059           ^.*+(?<=abcd)           ^.*+(?<=abcd)
4060    
4061         there  can  be  no backtracking for the .*+ item; it can match only the         there can be no backtracking for the .*+ item; it can  match  only  the
4062         entire string. The subsequent lookbehind assertion does a  single  test         entire  string.  The subsequent lookbehind assertion does a single test
4063         on  the last four characters. If it fails, the match fails immediately.         on the last four characters. If it fails, the match fails  immediately.
4064         For long strings, this approach makes a significant difference  to  the         For  long  strings, this approach makes a significant difference to the
4065         processing time.         processing time.
4066    
4067     Using multiple assertions     Using multiple assertions
# Line 4021  ASSERTIONS Line 4070  ASSERTIONS
4070    
4071           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
4072    
4073         matches  "foo" preceded by three digits that are not "999". Notice that         matches "foo" preceded by three digits that are not "999". Notice  that
4074         each of the assertions is applied independently at the  same  point  in         each  of  the  assertions is applied independently at the same point in
4075         the  subject  string.  First  there  is a check that the previous three         the subject string. First there is a  check  that  the  previous  three
4076         characters are all digits, and then there is  a  check  that  the  same         characters  are  all  digits,  and  then there is a check that the same
4077         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
4078         ceded by six characters, the first three of which are digits  and  the  last         ceded  by  six  characters,  the first three of which are digits and the last
4079         three  of  which  are not "999". For example, it doesn't match "123abc-         three of which are not "999". For example, it  doesn't  match  "123abc-
4080         foo". A pattern to do that is         foo". A pattern to do that is
4081    
4082           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
4083    
4084         This time the first assertion looks at the  preceding  six  characters,         This  time  the  first assertion looks at the preceding six characters,
4085         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
4086         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
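
The difference between the two patterns can be seen in a short sketch. Python's re module is used here as a stand-in, since it accepts both lookbehinds unchanged; the variable names are hypothetical.

```python
import re

# Both assertions test the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters; the second looks back three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          bool(same_three.search(subject)),
          bool(six_back.search(subject)))
```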
4087    
# Line 4040  ASSERTIONS Line 4089  ASSERTIONS
4089    
4090           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
4091    
4092         matches an occurrence of "baz" that is preceded by "bar" which in  turn         matches  an occurrence of "baz" that is preceded by "bar" which in turn
4093         is not preceded by "foo", while         is not preceded by "foo", while
4094    
4095           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
4096    
4097         is  another pattern that matches "foo" preceded by three digits and any         is another pattern that matches "foo" preceded by three digits and  any
4098         three characters that are not "999".         three characters that are not "999".
4099    
4100    
4101  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
4102    
4103         It is possible to cause the matching process to obey a subpattern  con-         It  is possible to cause the matching process to obey a subpattern con-
4104         ditionally  or to choose between two alternative subpatterns, depending         ditionally or to choose between two alternative subpatterns,  depending
4105         on the result of an assertion, or whether a previous capturing  subpat-         on  the result of an assertion, or whether a previous capturing subpat-
4106         tern  matched  or not. The two possible forms of conditional subpattern         tern matched or not. The two possible forms of  conditional  subpattern
4107         are         are
4108    
4109           (?(condition)yes-pattern)           (?(condition)yes-pattern)
4110           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
4111    
4112         If the condition is satisfied, the yes-pattern is used;  otherwise  the         If  the  condition is satisfied, the yes-pattern is used; otherwise the
4113         no-pattern  (if  present)  is used. If there are more than two alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
4114         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
4115    
4116         There are four kinds of condition: references  to  subpatterns,  refer-         There  are  four  kinds of condition: references to subpatterns, refer-
4117         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
4118    
4119     Checking for a used subpattern by number     Checking for a used subpattern by number
4120    
4121         If  the  text between the parentheses consists of a sequence of digits,         If the text between the parentheses consists of a sequence  of  digits,
4122         the condition is true if the capturing subpattern of  that  number  has         the  condition  is  true if the capturing subpattern of that number has
4123         previously matched.         previously matched. An alternative notation is to  precede  the  digits
4124           with a plus or minus sign. In this case, the subpattern number is rela-
4125           tive rather than absolute.  The most recently opened parentheses can be
4126           referenced  by  (?(-1),  the  next most recent by (?(-2), and so on. In
4127           looping constructs it can also make sense to refer to subsequent groups
4128           with constructs such as (?(+2).
4129    
4130         Consider  the  following  pattern, which contains non-significant white         Consider  the  following  pattern, which contains non-significant white
4131         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
# Line 4090  CONDITIONAL SUBPATTERNS Line 4144  CONDITIONAL SUBPATTERNS
4144         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
4145         optionally enclosed in parentheses.         optionally enclosed in parentheses.
4146    
4147           If you were embedding this pattern in a larger one,  you  could  use  a
4148           relative reference:
4149    
4150             ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
4151    
4152           This  makes  the  fragment independent of the parentheses in the larger
4153           pattern.
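
The conditional fragment can be exercised as follows. This sketch uses Python's re module, which supports only the absolute form (?(1)...) and not the relative (?(-1)...) notation, so the group is referenced by number; re.VERBOSE plays the role of PCRE_EXTENDED. The helper name accepts() is illustrative.

```python
import re

# Optional opening parenthesis, a run of non-parentheses, and a closing
# parenthesis that is required only if group 1 (the opening one) matched.
pattern = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the fragment."""
    return pattern.fullmatch(subject) is not None


for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, "->", accepts(subject))
```

With a relative reference, as in the PCRE fragment above, the condition would not need renumbering when the fragment is embedded in a larger pattern.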
4154    
4155     Checking for a used subpattern by name     Checking for a used subpattern by name
4156    
4157         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
# Line 4231  RECURSIVE PATTERNS Line 4293  RECURSIVE PATTERNS
4293           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
4294    
4295         We  have  put the pattern into parentheses, and caused the recursion to         We  have  put the pattern into parentheses, and caused the recursion to
4296         refer to them instead of the whole pattern. In a larger pattern,  keep-         refer to them instead of the whole pattern.
4297         ing  track  of parenthesis numbers can be tricky. It may be more conve-  
4298         nient to use named parentheses instead. The Perl  syntax  for  this  is         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
4299         (?&name);  PCRE's  earlier syntax (?P>name) is also supported. We could         tricky.  This is made easier by the use of relative references. (A Perl
4300         rewrite the above example as follows:         5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write
4301           (?-2) to refer to the second most recently opened parentheses preceding
4302           the recursion. In other  words,  a  negative  number  counts  capturing
4303           parentheses leftwards from the point at which it is encountered.
4304    
4305           It  is  also  possible  to refer to subsequently opened parentheses, by
4306           writing references such as (?+2). However, these  cannot  be  recursive
4307           because  the  reference  is  not inside the parentheses that are refer-
4308           enced. They are always "subroutine" calls, as  described  in  the  next
4309           section.
4310    
4311           An  alternative  approach is to use named parentheses instead. The Perl
4312           syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
4313           supported. We could rewrite the above example as follows:
4314    
4315           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
4316    
4317         If there is more than one subpattern with the same name,  the  earliest         If  there  is more than one subpattern with the same name, the earliest
4318         one  is used. This particular example pattern contains nested unlimited         one is used.
4319         repeats, and so the use of atomic grouping for matching strings of non-  
4320         parentheses  is  important when applying the pattern to strings that do         This particular example pattern that we have been looking  at  contains
4321         not match. For example, when this pattern is applied to         nested  unlimited repeats, and so the use of atomic grouping for match-
4322           ing strings of non-parentheses is important when applying  the  pattern
4323           to strings that do not match. For example, when this pattern is applied
4324           to
4325    
4326           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4327    
# Line 4293  SUBPATTERNS AS SUBROUTINES Line 4371  SUBPATTERNS AS SUBROUTINES
4371         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
4372         by  name)  is used outside the parentheses to which it refers, it oper-         by  name)  is used outside the parentheses to which it refers, it oper-
4373         ates like a subroutine in a programming language. The "called"  subpat-         ates like a subroutine in a programming language. The "called"  subpat-
4374         tern  may  be defined before or after the reference. An earlier example         tern may be defined before or after the reference. A numbered reference
4375         pointed out that the pattern         can be absolute or relative, as in these examples:
4376    
4377             (...(absolute)...)...(?2)...
4378             (...(relative)...)...(?-1)...
4379             (...(?+1)...(relative)...
4380    
4381           An earlier example pointed out that the pattern
4382    
4383           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4384    
# Line 4316  SUBPATTERNS AS SUBROUTINES Line 4400  SUBPATTERNS AS SUBROUTINES
4400         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
4401         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
4402    
4403           (abc)(?i:(?1))           (abc)(?i:(?-1))
4404    
4405         It matches "abcabc". It does not match "abcABC" because the  change  of         It matches "abcabc". It does not match "abcABC" because the  change  of
4406         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
# Line 4371  AUTHOR Line 4455  AUTHOR
4455    
4456  REVISION  REVISION
4457    
4458         Last updated: 06 March 2007         Last updated: 29 May 2007
4459         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4460  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4461    
# Line 4452  RESTRICTED PATTERNS FOR PCRE_PARTIAL Line 4536  RESTRICTED PATTERNS FOR PCRE_PARTIAL
4536    
4537         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the
4538         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
4539         (-13).         (-13).  You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo()  to
4540           find out if a compiled pattern can be used for partial matching.
4541    
4542    
4543  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
4544    
4545         If the escape sequence \P is present  in  a  pcretest  data  line,  the         If  the  escape  sequence  \P  is  present in a pcretest data line, the
4546         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
4547         uses the date example quoted above:         uses the date example quoted above:
4548    
# Line 4474  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 4559  EXAMPLE OF PARTIAL MATCHING USING PCRETE
4559           data> j\P           data> j\P
4560           No match           No match
4561    
4562         The first data string is matched  completely,  so  pcretest  shows  the         The  first  data  string  is  matched completely, so pcretest shows the
4563         matched  substrings.  The  remaining four strings do not match the com-         matched substrings. The remaining four strings do not  match  the  com-
4564         plete pattern, but the first two are partial matches.  The  same  test,         plete  pattern,  but  the first two are partial matches. The same test,
4565         using  pcre_dfa_exec()  matching  (by means of the \D escape sequence),         using pcre_dfa_exec() matching (by means of the  \D  escape  sequence),
4566         produces the following output:         produces the following output:
4567    
4568             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
# Line 4492  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 4577  EXAMPLE OF PARTIAL MATCHING USING PCRETE
4577           data> j\P\D           data> j\P\D
4578           No match           No match
4579    
4580         Notice that in this case the portion of the string that was matched  is         Notice  that in this case the portion of the string that was matched is
4581         made available.         made available.
4582    
4583    
4584  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
4585    
4586         When a partial match has been found using pcre_dfa_exec(), it is possi-         When a partial match has been found using pcre_dfa_exec(), it is possi-
4587         ble to continue the match by  providing  additional  subject  data  and         ble  to  continue  the  match  by providing additional subject data and
4588         calling  pcre_dfa_exec()  again  with the same compiled regular expres-         calling pcre_dfa_exec() again with the same  compiled  regular  expres-
4589         sion, this time setting the PCRE_DFA_RESTART option. You must also pass         sion, this time setting the PCRE_DFA_RESTART option. You must also pass
4590         the  same working space as before, because this is where details of the         the same working space as before, because this is where details of  the
4591         previous partial match are stored. Here is an example  using  pcretest,         previous  partial  match are stored. Here is an example using pcretest,
4592         using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and         using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and
4593         \D are as above):         \D are as above):
4594    
# Line 4513  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4598  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4598           data> n05\R\D           data> n05\R\D
4599            0: n05            0: n05
4600    
4601         The first call has "23ja" as the subject, and requests  partial  match-         The  first  call has "23ja" as the subject, and requests partial match-
4602         ing;  the  second  call  has  "n05"  as  the  subject for the continued         ing; the second call  has  "n05"  as  the  subject  for  the  continued
4603         (restarted) match.  Notice that when the match is  complete,  only  the         (restarted)  match.   Notice  that when the match is complete, only the
4604         last  part  is  shown;  PCRE  does not retain the previously partially-         last part is shown; PCRE does  not  retain  the  previously  partially-
4605         matched string. It is up to the calling program to do that if it  needs         matched  string. It is up to the calling program to do that if it needs
4606         to.         to.
4607    
4608         You  can  set  PCRE_PARTIAL  with  PCRE_DFA_RESTART to continue partial         You can set PCRE_PARTIAL  with  PCRE_DFA_RESTART  to  continue  partial
4609         matching over multiple segments. This facility can be used to pass very         matching over multiple segments. This facility can be used to pass very
4610         long  subject  strings to pcre_dfa_exec(). However, some care is needed         long subject strings to pcre_dfa_exec(). However, some care  is  needed
4611         for certain types of pattern.         for certain types of pattern.
4612    
4613         1. If the pattern contains tests for the beginning or end  of  a  line,         1.  If  the  pattern contains tests for the beginning or end of a line,
4614         you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-         you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-
4615         ate, when the subject string for any call does not contain  the  begin-         ate,  when  the subject string for any call does not contain the begin-
4616         ning or end of a line.         ning or end of a line.
4617    
4618         2.  If  the  pattern contains backward assertions (including \b or \B),         2. If the pattern contains backward assertions (including  \b  or  \B),
4619         you need to arrange for some overlap in the subject  strings  to  allow         you  need  to  arrange for some overlap in the subject strings to allow
4620         for  this.  For  example, you could pass the subject in chunks that are         for this. For example, you could pass the subject in  chunks  that  are
4621         500 bytes long, but in a buffer of 700 bytes, with the starting  offset         500  bytes long, but in a buffer of 700 bytes, with the starting offset
4622         set to 200 and the previous 200 bytes at the start of the buffer.         set to 200 and the previous 200 bytes at the start of the buffer.
4623    
4624         3.  Matching a subject string that is split into multiple segments does         3. Matching a subject string that is split into multiple segments  does
4625         not always produce exactly the same result as matching over one  single         not  always produce exactly the same result as matching over one single
4626         long  string.   The  difference arises when there are multiple matching         long string.  The difference arises when there  are  multiple  matching
4627         possibilities, because a partial match result is given only when  there         possibilities,  because a partial match result is given only when there
4628         are  no completed matches in a call to pcre_dfa_exec(). This means that         are no completed matches in a call to pcre_dfa_exec(). This means  that
4629         as soon as the shortest match has been found,  continuation  to  a  new         as  soon  as  the  shortest match has been found, continuation to a new
4630         subject segment is no longer possible.  Consider this pcretest example:         subject segment is no longer possible.  Consider this pcretest example:
4631    
4632             re> /dog(sbody)?/             re> /dog(sbody)?/
# Line 4553  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4638  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4638            0: dogsbody            0: dogsbody
4639            1: dog            1: dog
4640    
4641         The pattern matches the words "dog" or "dogsbody". When the subject  is         The  pattern matches the words "dog" or "dogsbody". When the subject is
4642         presented  in  several  parts  ("do" and "gsb" being the first two) the         presented in several parts ("do" and "gsb" being  the  first  two)  the
4643         match stops when "dog" has been found, and it is not possible  to  con-         match  stops  when "dog" has been found, and it is not possible to con-
4644         tinue.  On  the  other  hand,  if  "dogsbody"  is presented as a single         tinue. On the other hand,  if  "dogsbody"  is  presented  as  a  single
4645         string, both matches are found.         string, both matches are found.
4646    
4647         Because of this phenomenon, it does not usually make  sense  to  end  a         Because  of  this  phenomenon,  it does not usually make sense to end a
4648         pattern that is going to be matched in this way with a variable repeat.         pattern that is going to be matched in this way with a variable repeat.
4649    
4650         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
# Line 4568  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4653  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4653    
4654           1234|3789           1234|3789
4655    
4656         If the first part of the subject is "ABC123", a partial  match  of  the         If  the  first  part of the subject is "ABC123", a partial match of the
4657         first  alternative  is found at offset 3. There is no partial match for         first alternative is found at offset 3. There is no partial  match  for
4658         the second alternative, because such a match does not start at the same         the second alternative, because such a match does not start at the same
4659         point  in  the  subject  string. Attempting to continue with the string         point in the subject string. Attempting to  continue  with  the  string
4660         "789" does not yield a match because only those alternatives that match         "789" does not yield a match because only those alternatives that match
4661         at  one point in the subject are remembered. The problem arises because         at one point in the subject are remembered. The problem arises  because
4662         the start of the second alternative matches within the  first  alterna-         the  start  of the second alternative matches within the first alterna-
4663         tive. There is no problem with anchored patterns or patterns such as:         tive. There is no problem with anchored patterns or patterns such as:
4664    
4665           1234|ABCD           1234|ABCD
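
The overlap arrangement suggested in point 2 (500-byte chunks in a 700-byte buffer, with the previous 200 bytes kept at the front) could be managed along the following lines. This is only a sketch under stated assumptions: the segment_buffer type and its functions are hypothetical helpers invented for this example and are not part of PCRE. The start_offset field is the value a caller would pass as the starting offset to the matching function.

```c
#include <stddef.h>
#include <string.h>

enum {
    OVERLAP     = 200,             /* bytes of old data kept for lookbehind */
    CHUNK       = 500,             /* maximum new bytes per segment */
    BUFFER_SIZE = OVERLAP + CHUNK  /* 700 bytes in total */
};

/* Hypothetical helper type, not part of PCRE. */
typedef struct {
    char   data[BUFFER_SIZE];
    size_t length;        /* number of valid bytes in data */
    size_t start_offset;  /* where matching of the new chunk should begin */
} segment_buffer;

static void segment_buffer_init(segment_buffer *sb)
{
    sb->length = 0;
    sb->start_offset = 0;
}

/* Load the next chunk (at most CHUNK bytes), keeping up to OVERLAP bytes
   from the end of the previous contents at the start of the buffer.
   Returns 0 on success, -1 if the chunk is too large. */
static int segment_buffer_load(segment_buffer *sb, const char *chunk, size_t n)
{
    size_t keep;

    if (n > CHUNK)
        return -1;

    keep = sb->length < OVERLAP ? sb->length : OVERLAP;

    /* Source and destination may overlap, so memmove rather than memcpy. */
    memmove(sb->data, sb->data + sb->length - keep, keep);
    memcpy(sb->data + keep, chunk, n);

    sb->length = keep + n;
    sb->start_offset = keep;
    return 0;
}
```

For the first chunk there is nothing to keep, so start_offset is 0. From the second chunk onwards it is 200, which leaves backward assertions such as \b with 200 bytes of earlier subject data to inspect.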
# Line 4591  AUTHOR Line 4676  AUTHOR
4676    
4677  REVISION  REVISION
4678    
4679         Last updated: 06 March 2007         Last updated: 04 June 2007
4680         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4681  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4682    

