Diff of /code/trunk/doc/pcre.txt

revision 148 by ph10, Mon Apr 16 13:25:10 2007 UTC         revision 172 by ph10, Tue Jun 5 10:40:13 2007 UTC
# Line 72  USER DOCUMENTATION Line 72  USER DOCUMENTATION
72         of searching. The sections are as follows:         of searching. The sections are as follows:
73    
74           pcre              this document           pcre              this document
75             pcre-config       show PCRE installation configuration information
76           pcreapi           details of PCRE's native C API           pcreapi           details of PCRE's native C API
77           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
78           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
# Line 215  AUTHOR Line 216  AUTHOR
216         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
217    
218         Putting an actual email address here seems to have been a spam  magnet,         Putting an actual email address here seems to have been a spam  magnet,
219         so I've taken it away. If you want to email me, use my initial and sur-         so  I've  taken  it away. If you want to email me, use my two initials,
220         name, separated by a dot, at the domain ucs.cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
221    
222    
223  REVISION  REVISION
224    
225         Last updated: 06 March 2007         Last updated: 18 April 2007
226         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
227  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
228    
# Line 312  CODE VALUE OF NEWLINE Line 313  CODE VALUE OF NEWLINE
313    
314         to the configure command. There is a fourth option, specified by         to the configure command. There is a fourth option, specified by
315    
316             --enable-newline-is-anycrlf
317    
318           which causes PCRE to recognize any of the three sequences  CR,  LF,  or
319           CRLF as indicating a line ending. Finally, a fifth option, specified by
320    
321           --enable-newline-is-any           --enable-newline-is-any
322    
323         which causes PCRE to recognize any Unicode newline sequence.         causes PCRE to recognize any Unicode newline sequence.
324    
325         Whatever line ending convention is selected when PCRE is built  can  be         Whatever line ending convention is selected when PCRE is built  can  be
326         overridden  when  the library functions are called. At build time it is         overridden  when  the library functions are called. At build time it is
# Line 468  AUTHOR Line 474  AUTHOR
474    
475  REVISION  REVISION
476    
477         Last updated: 20 March 2007         Last updated: 16 April 2007
478         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
479  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
480    
# Line 604  THE ALTERNATIVE MATCHING ALGORITHM Line 610  THE ALTERNATIVE MATCHING ALGORITHM
610         ence  as  the  condition or test for a specific group recursion are not         ence  as  the  condition or test for a specific group recursion are not
611         supported.         supported.
612    
613         5. Callouts are supported, but the value of the  capture_top  field  is         5. Because many paths through the tree may be  active,  the  \K  escape
614           sequence, which resets the start of the match when encountered (but may
615           be on some paths and not on others), is not  supported.  It  causes  an
616           error if encountered.
617    
618           6.  Callouts  are  supported, but the value of the capture_top field is
619         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
620    
621         6.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The \C escape sequence, which (in the standard algorithm) matches a
622         single byte, even in UTF-8 mode, is not supported because the  alterna-         single  byte, even in UTF-8 mode, is not supported because the alterna-
623         tive  algorithm  moves  through  the  subject string one character at a         tive algorithm moves through the subject  string  one  character  at  a
624         time, for all active paths through the tree.         time, for all active paths through the tree.
625    
626    
627  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
628    
629         Using the alternative matching algorithm provides the following  advan-         Using  the alternative matching algorithm provides the following advan-
630         tages:         tages:
631    
632         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
633         ically found, and in particular, the longest match is  found.  To  find         ically  found,  and  in particular, the longest match is found. To find
634         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
635         things with callouts.         things with callouts.
636    
637         2. There is much better support for partial matching. The  restrictions         2.  There is much better support for partial matching. The restrictions
638         on  the content of the pattern that apply when using the standard algo-         on the content of the pattern that apply when using the standard  algo-
639         rithm for partial matching do not apply to the  alternative  algorithm.         rithm  for  partial matching do not apply to the alternative algorithm.
640         For  non-anchored patterns, the starting position of a partial match is         For non-anchored patterns, the starting position of a partial match  is
641         available.         available.
642    
643         3. Because the alternative algorithm  scans  the  subject  string  just         3.  Because  the  alternative  algorithm  scans the subject string just
644         once,  and  never  needs to backtrack, it is possible to pass very long         once, and never needs to backtrack, it is possible to  pass  very  long
645         subject strings to the matching function in  several  pieces,  checking         subject  strings  to  the matching function in several pieces, checking
646         for partial matching each time.         for partial matching each time.
647    
648    
# Line 639  DISADVANTAGES OF THE ALTERNATIVE ALGORIT Line 650  DISADVANTAGES OF THE ALTERNATIVE ALGORIT
650    
651         The alternative algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
652    
653         1.  It  is  substantially  slower  than the standard algorithm. This is         1. It is substantially slower than  the  standard  algorithm.  This  is
654         partly because it has to search for all possible matches, but  is  also         partly  because  it has to search for all possible matches, but is also
655         because it is less susceptible to optimization.         because it is less susceptible to optimization.
656    
657         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
# Line 658  AUTHOR Line 669  AUTHOR
669    
670  REVISION  REVISION
671    
672         Last updated: 06 March 2007         Last updated: 29 May 2007
673         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
674  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
675    
# Line 841  PCRE API OVERVIEW Line 852  PCRE API OVERVIEW
852    
853  NEWLINES  NEWLINES
854    
855         PCRE  supports four different conventions for indicating line breaks in         PCRE  supports five different conventions for indicating line breaks in
856         strings: a single CR (carriage return) character, a  single  LF  (line-         strings: a single CR (carriage return) character, a  single  LF  (line-
857         feed)  character,  the two-character sequence CRLF, or any Unicode new-         feed) character, the two-character sequence CRLF, any of the three pre-
858         line sequence.  The Unicode newline sequences are the three  just  men-         ceding, or any Unicode newline sequence. The Unicode newline  sequences
859         tioned, plus the single characters VT (vertical tab, U+000B), FF (form-         are  the  three just mentioned, plus the single characters VT (vertical
860         feed, U+000C), NEL (next line, U+0085), LS  (line  separator,  U+2028),         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
861         and PS (paragraph separator, U+2029).         separator, U+2028), and PS (paragraph separator, U+2029).
862    
863         Each  of  the first three conventions is used by at least one operating         Each  of  the first three conventions is used by at least one operating
864         system as its standard newline sequence. When PCRE is built, a  default         system as its standard newline sequence. When PCRE is built, a  default
# Line 881  SAVING PRECOMPILED PATTERNS FOR LATER US Line 892  SAVING PRECOMPILED PATTERNS FOR LATER US
892         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
893         later time, possibly by a different program, and even on a  host  other         later time, possibly by a different program, and even on a  host  other
894         than  the  one  on  which  it  was  compiled.  Details are given in the         than  the  one  on  which  it  was  compiled.  Details are given in the
895         pcreprecompile documentation.         pcreprecompile documentation. However, compiling a  regular  expression
896           with  one version of PCRE for use with a different version is not guar-
897           anteed to work and may cause crashes.
898    
899    
900  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
# Line 912  CHECKING BUILD-TIME OPTIONS Line 925  CHECKING BUILD-TIME OPTIONS
925    
926         The output is an integer whose value specifies  the  default  character         The output is an integer whose value specifies  the  default  character
927         sequence  that is recognized as meaning "newline". The four values that         sequence  that is recognized as meaning "newline". The four values that
928         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, and -1 for ANY.         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
929         The default should normally be the standard sequence for your operating         and  -1  for  ANY. The default should normally be the standard sequence
930         system.         for your operating system.
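
The integer values listed above can be decoded with a small lookup. The
following is a minimal sketch, not part of PCRE: the helper names
newline_name() and crlf_code() are hypothetical, and the -2 (ANYCRLF) case
applies only to the newer revision shown in this diff. The sketch also shows
where 3338 comes from: it is the CR code (13) shifted left by eight bits,
combined with the LF code (10).

```c
/* Hypothetical helper: map the value documented for PCRE_CONFIG_NEWLINE
   to a readable name. Only the values listed in the text are handled. */
static const char *newline_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";      /* (13 << 8) | 10 */
    case -2:   return "ANYCRLF";   /* newer revision only */
    case -1:   return "ANY";
    default:   return "unknown";
    }
}

/* Hypothetical helper: build the two-character CRLF code from the
   individual character codes for CR (13) and LF (10). */
static int crlf_code(void)
{
    return (13 << 8) | 10;
}
```

In a real program the value would come from a call such as
pcre_config(PCRE_CONFIG_NEWLINE, &value), with value declared as an int.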
931    
932           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
933    
# Line 1138  COMPILING A PATTERN Line 1151  COMPILING A PATTERN
1151           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1152           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1153           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1154             PCRE_NEWLINE_ANYCRLF
1155           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1156    
1157         These  options  override the default newline definition that was chosen         These  options  override the default newline definition that was chosen
1158         when PCRE was built. Setting the first or the second specifies  that  a         when PCRE was built. Setting the first or the second specifies  that  a
1159         newline  is  indicated  by a single character (CR or LF, respectively).         newline  is  indicated  by a single character (CR or LF, respectively).
1160         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1161         two-character  CRLF  sequence.  Setting PCRE_NEWLINE_ANY specifies that         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1162         any Unicode newline sequence should be recognized. The Unicode  newline         that any of the three preceding sequences should be recognized. Setting
1163         sequences  are  the three just mentioned, plus the single characters VT         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1164         (vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085),         recognized. The Unicode newline sequences are the three just mentioned,
1165         LS  (line separator, U+2028), and PS (paragraph separator, U+2029). The         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1166         last two are recognized only in UTF-8 mode.         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1167           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1168           UTF-8 mode.
1169    
1170         The newline setting in the  options  word  uses  three  bits  that  are         The newline setting in the  options  word  uses  three  bits  that  are
1171         treated  as  a  number, giving eight possibilities. Currently only five         treated as a number, giving eight possibilities. Currently only six are
1172         are used (default plus the four values above). This means that  if  you         used (default plus the five values above). This means that if  you  set
1173         set  more  than  one  newline option, the combination may or may not be         more  than one newline option, the combination may or may not be sensi-
1174         sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is  equiva-         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1175         lent  to PCRE_NEWLINE_CRLF, but other combinations yield unused numbers         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1176         and cause an error.         cause an error.
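
The three-bit encoding described above can be illustrated with a short
sketch. The NL_* constants below are assumptions: they restate locally the
values that pcre.h of this era is believed to use for the PCRE_NEWLINE_*
options, so that the sketch compiles without the PCRE headers. Check your
own pcre.h before relying on them. The function newline_bits_valid() is
hypothetical and only models the "unused numbers cause an error" rule.

```c
/* Assumed values, restated locally; believed to match the
   PCRE_NEWLINE_* definitions in pcre.h, but not taken from this text. */
#define NL_CR      0x00100000
#define NL_LF      0x00200000
#define NL_CRLF    0x00300000
#define NL_ANY     0x00400000
#define NL_ANYCRLF 0x00500000
#define NL_MASK    0x00700000   /* the three-bit newline field */

/* Return 1 if the newline bits in options form one of the six used
   values (0 means "use the build-time default"), otherwise 0. */
static int newline_bits_valid(unsigned long options)
{
    switch (options & NL_MASK) {
    case 0:
    case NL_CR:
    case NL_LF:
    case NL_CRLF:
    case NL_ANY:
    case NL_ANYCRLF:
        return 1;
    default:
        return 0;   /* an unused number in the three-bit field */
    }
}
```

Under these assumed values, NL_CR | NL_LF equals NL_CRLF, which matches the
equivalence stated in the text, whereas NL_LF | NL_ANY gives an unused
number. NL_CR | NL_ANY happens to equal NL_ANYCRLF, which may be why the
newer wording says that other combinations "may" yield unused numbers.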
1177    
1178         The only time that a line break is specially recognized when  compiling         The only time that a line break is specially recognized when  compiling
1179         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
# Line 1460  INFORMATION ABOUT A PATTERN Line 1476  INFORMATION ABOUT A PATTERN
1476         returned. The fourth argument should point to an unsigned char *  vari-         returned. The fourth argument should point to an unsigned char *  vari-
1477         able.         able.
1478    
1479             PCRE_INFO_JCHANGED
1480    
1481           Return  1  if the (?J) option setting is used in the pattern, otherwise
1482           0. The fourth argument should point to an int variable. The (?J) inter-
1483           nal option setting changes the local PCRE_DUPNAMES value.
1484    
1485           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1486    
1487         Return  the  value of the rightmost literal byte that must exist in any         Return  the  value of the rightmost literal byte that must exist in any
# Line 1514  INFORMATION ABOUT A PATTERN Line 1536  INFORMATION ABOUT A PATTERN
1536         name-to-number map, remember that the length of the entries  is  likely         name-to-number map, remember that the length of the entries  is  likely
1537         to be different for each compiled pattern.         to be different for each compiled pattern.
1538    
1539             PCRE_INFO_OKPARTIAL
1540    
1541           Return  1 if the pattern can be used for partial matching, otherwise 0.
1542           The fourth argument should point to an int  variable.  The  pcrepartial
1543           documentation  lists  the restrictions that apply to patterns when par-
1544           tial matching is used.
1545    
1546           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1547    
1548         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1549         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1550         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1551         by any top-level option settings within the pattern itself.         by any top-level option settings within the pattern itself.
1552    
1553         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1554         alternatives begin with one of the following:         alternatives begin with one of the following:
1555    
1556           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1535  INFORMATION ABOUT A PATTERN Line 1564  INFORMATION ABOUT A PATTERN
1564    
1565           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1566    
1567         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1568         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1569         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1570         size_t variable.         size_t variable.
# Line 1543  INFORMATION ABOUT A PATTERN Line 1572  INFORMATION ABOUT A PATTERN
1572           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1573    
1574         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1575         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1576         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1577         created by pcre_study(). The fourth argument should point to  a  size_t         created  by  pcre_study(). The fourth argument should point to a size_t
1578         variable.         variable.
1579    
1580    
# Line 1553  OBSOLETE INFO FUNCTION Line 1582  OBSOLETE INFO FUNCTION
1582    
1583         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1584    
1585         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1586         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1587         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1588         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1589         lowing negative numbers:         lowing negative numbers:
1590    
1591           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1592           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1593    
1594         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1595         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1596         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1597    
1598         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1599         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1600         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1601    
1602    
# Line 1575  REFERENCE COUNTS Line 1604  REFERENCE COUNTS
1604    
1605         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1606    
1607         The  pcre_refcount()  function is used to maintain a reference count in         The pcre_refcount() function is used to maintain a reference  count  in
1608         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1609         benefit  of  applications  that  operate  in an object-oriented manner,         benefit of applications that  operate  in  an  object-oriented  manner,
1610         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1611         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1612    
1613         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1614         zero.  It is changed only by calling this function, whose action is  to         zero.   It is changed only by calling this function, whose action is to
1615         add  the  adjust  value  (which may be positive or negative) to it. The         add the adjust value (which may be positive or  negative)  to  it.  The
1616         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1617         is  constrained to lie between 0 and 65535, inclusive. If the new value         is constrained to lie between 0 and 65535, inclusive. If the new  value
1618         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
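
The clamping rule can be modelled in a few lines. This is a sketch of the
documented arithmetic only, not PCRE's source code; the function name
adjust_refcount() is hypothetical.

```c
/* Model of the documented behaviour: add adjust to the current count,
   then force the result into the range 0..65535. */
static int adjust_refcount(int current, int adjust)
{
    long long next = (long long)current + adjust;  /* avoid int overflow */
    if (next < 0) next = 0;
    if (next > 65535) next = 65535;
    return (int)next;
}
```

With this model, decrementing a count of zero leaves it at zero, and
incrementing a count of 65535 leaves it at 65535, so a caller cannot tell
from the yield alone that an adjustment was discarded at either limit.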
1619    
1620         Except when it is zero, the reference count is not correctly  preserved         Except  when it is zero, the reference count is not correctly preserved
1621         if  a  pattern  is  compiled on one host and then transferred to a host         if a pattern is compiled on one host and then  transferred  to  a  host
1622         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
1623    
1624    
# Line 1599  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1628  MATCHING A PATTERN: THE TRADITIONAL FUNC
1628              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1629              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1630    
1631         The  function pcre_exec() is called to match a subject string against a         The function pcre_exec() is called to match a subject string against  a
1632         compiled pattern, which is passed in the code argument. If the  pattern         compiled  pattern, which is passed in the code argument. If the pattern
1633         has been studied, the result of the study should be passed in the extra         has been studied, the result of the study should be passed in the extra
1634         argument. This function is the main matching facility of  the  library,         argument.  This  function is the main matching facility of the library,
1635         and it operates in a Perl-like manner. For specialist use there is also         and it operates in a Perl-like manner. For specialist use there is also
1636         an alternative matching function, which is described below in the  sec-         an  alternative matching function, which is described below in the sec-
1637         tion about the pcre_dfa_exec() function.         tion about the pcre_dfa_exec() function.
1638    
1639         In  most applications, the pattern will have been compiled (and option-         In most applications, the pattern will have been compiled (and  option-
1640         ally studied) in the same process that calls pcre_exec().  However,  it         ally  studied)  in the same process that calls pcre_exec(). However, it
1641         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1642         later in different processes, possibly even on different hosts.  For  a         later  in  different processes, possibly even on different hosts. For a
1643         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1644    
1645         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1629  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1658  MATCHING A PATTERN: THE TRADITIONAL FUNC
1658    
1659     Extra data for pcre_exec()     Extra data for pcre_exec()
1660    
1661         If  the  extra argument is not NULL, it must point to a pcre_extra data         If the extra argument is not NULL, it must point to a  pcre_extra  data
1662         block. The pcre_study() function returns such a block (when it  doesn't         block.  The pcre_study() function returns such a block (when it doesn't
1663         return  NULL), but you can also create one for yourself, and pass addi-         return NULL), but you can also create one for yourself, and pass  addi-
1664         tional information in it. The pcre_extra block contains  the  following         tional  information  in it. The pcre_extra block contains the following
1665         fields (not necessarily in this order):         fields (not necessarily in this order):
1666    
1667           unsigned long int flags;           unsigned long int flags;
# Line 1642  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1671  MATCHING A PATTERN: THE TRADITIONAL FUNC
1671           void *callout_data;           void *callout_data;
1672           const unsigned char *tables;           const unsigned char *tables;
1673    
1674         The  flags  field  is a bitmap that specifies which of the other fields         The flags field is a bitmap that specifies which of  the  other  fields
1675         are set. The flag bits are:         are set. The flag bits are:
1676    
1677           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
# Line 1651  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1680  MATCHING A PATTERN: THE TRADITIONAL FUNC
1680           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1681           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1682    
1683         Other flag bits should be set to zero. The study_data field is  set  in         Other  flag  bits should be set to zero. The study_data field is set in
1684         the  pcre_extra  block  that is returned by pcre_study(), together with         the pcre_extra block that is returned by  pcre_study(),  together  with
1685         the appropriate flag bit. You should not set this yourself, but you may         the appropriate flag bit. You should not set this yourself, but you may
1686         add  to  the  block by setting the other fields and their corresponding         add to the block by setting the other fields  and  their  corresponding
1687         flag bits.         flag bits.
1688    
1689         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1690         a  vast amount of resources when running patterns that are not going to         a vast amount of resources when running patterns that are not going  to
1691         match, but which have a very large number  of  possibilities  in  their         match,  but  which  have  a very large number of possibilities in their
1692         search  trees.  The  classic  example  is  the  use of nested unlimited         search trees. The classic  example  is  the  use  of  nested  unlimited
1693         repeats.         repeats.
1694    
1695         Internally, PCRE uses a function called match() which it calls  repeat-         Internally,  PCRE uses a function called match() which it calls repeat-
1696         edly  (sometimes  recursively). The limit set by match_limit is imposed         edly (sometimes recursively). The limit set by match_limit  is  imposed
1697         on the number of times this function is called during  a  match,  which         on  the  number  of times this function is called during a match, which
1698         has  the  effect  of  limiting the amount of backtracking that can take         has the effect of limiting the amount of  backtracking  that  can  take
1699         place. For patterns that are not anchored, the count restarts from zero         place. For patterns that are not anchored, the count restarts from zero
1700         for each position in the subject string.         for each position in the subject string.
1701    
1702         The  default  value  for  the  limit can be set when PCRE is built; the         The default value for the limit can be set  when  PCRE  is  built;  the
1703         default default is 10 million, which handles all but the  most  extreme         default  default  is 10 million, which handles all but the most extreme
1704         cases.  You  can  override  the  default by supplying pcre_exec() with a         cases. You can override the default  by  supplying  pcre_exec()  with  a
1705         pcre_extra    block    in    which    match_limit    is    set,     and         pcre_extra     block    in    which    match_limit    is    set,    and
1706         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1707         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1708    
1709         The match_limit_recursion field is similar to match_limit, but  instead         The  match_limit_recursion field is similar to match_limit, but instead
1710         of limiting the total number of times that match() is called, it limits         of limiting the total number of times that match() is called, it limits
1711         the depth of recursion. The recursion depth is a  smaller  number  than         the  depth  of  recursion. The recursion depth is a smaller number than
1712         the  total number of calls, because not all calls to match() are recur-         the total number of calls, because not all calls to match() are  recur-
1713         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
1714    
1715         Limiting  the  recursion  depth  limits the amount of stack that can be         Limiting the recursion depth limits the amount of  stack  that  can  be
1716         used, or, when PCRE has been compiled to use memory on the heap instead         used, or, when PCRE has been compiled to use memory on the heap instead
1717         of the stack, the amount of heap memory that can be used.         of the stack, the amount of heap memory that can be used.
1718    
1719         The  default  value  for  match_limit_recursion can be set when PCRE is         The default value for match_limit_recursion can be  set  when  PCRE  is
1720         built; the default default  is  the  same  value  as  the  default  for         built;  the  default  default  is  the  same  value  as the default for
1721         match_limit. You can override the default by supplying pcre_exec() with         match_limit. You can override the default by supplying pcre_exec() with
1722         a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and         a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1723         PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1724         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1725    
1726         The pcre_callout field is used in conjunction with the  "callout"  fea-         The  pcre_callout  field is used in conjunction with the "callout" fea-
1727         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1728    
1729         The  tables  field  is  used  to  pass  a  character  tables pointer to         The tables field  is  used  to  pass  a  character  tables  pointer  to
1730         pcre_exec(); this overrides the value that is stored with the  compiled         pcre_exec();  this overrides the value that is stored with the compiled
1731         pattern.  A  non-NULL value is stored with the compiled pattern only if         pattern. A non-NULL value is stored with the compiled pattern  only  if
1732         custom tables were supplied to pcre_compile() via  its  tableptr  argu-         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1733         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1734         PCRE's internal tables to be used. This facility is  helpful  when  re-         PCRE's  internal  tables  to be used. This facility is helpful when re-
1735         using  patterns  that  have been saved after compiling with an external         using patterns that have been saved after compiling  with  an  external
1736         set of tables, because the external tables  might  be  at  a  different         set  of  tables,  because  the  external tables might be at a different
1737         address  when  pcre_exec() is called. See the pcreprecompile documenta-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
1738         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
1739    
1740     Option bits for pcre_exec()     Option bits for pcre_exec()
1741    
1742         The unused bits of the options argument for pcre_exec() must  be  zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
1743         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1744         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1745         PCRE_PARTIAL.         PCRE_PARTIAL.
1746    
1747           PCRE_ANCHORED           PCRE_ANCHORED
1748    
1749         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1750         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1751         turned  out to be anchored by virtue of its contents, it cannot be made         turned out to be anchored by virtue of its contents, it cannot be  made
1752         unanchored at matching time.         unanchored at matching time.
1753    
1754           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1755           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1756           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1757             PCRE_NEWLINE_ANYCRLF
1758           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1759    
1760         These options override  the  newline  definition  that  was  chosen  or         These  options  override  the  newline  definition  that  was chosen or
1761         defaulted  when the pattern was compiled. For details, see the descrip-         defaulted when the pattern was compiled. For details, see the  descrip-
1762         tion of pcre_compile()  above.  During  matching,  the  newline  choice         tion  of  pcre_compile()  above.  During  matching,  the newline choice
1763         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
1764         ters. It may also alter the way the match position is advanced after  a         ters.  It may also alter the way the match position is advanced after a
1765         match  failure  for  an  unanchored  pattern. When PCRE_NEWLINE_CRLF or         match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,
1766         PCRE_NEWLINE_ANY is set, and a match attempt  fails  when  the  current         PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY is set, and a match attempt
1767         position  is  at a CRLF sequence, the match position is advanced by two         fails when the current position is at a CRLF sequence, the match  posi-
1768         characters instead of one, in other words, to after the CRLF.         tion  is  advanced by two characters instead of one, in other words, to
1769           after the CRLF.
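
The start-position rule described above can be sketched in C. This is only an illustration under stated assumptions, not PCRE's internal code: the function name next_start_offset and its crlf_is_newline parameter are hypothetical, and the sketch assumes non-UTF-8 mode, where one character is one byte.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API. Returns the next start
   offset to try after a match attempt at offset pos has failed.
   crlf_is_newline is nonzero when the newline setting is CRLF, ANYCRLF,
   or ANY. Assumes non-UTF-8 mode (one byte per character). */
size_t next_start_offset(const char *subject, size_t length, size_t pos,
                         int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over the whole CRLF sequence */
    return pos + 1;       /* otherwise advance by one character */
}
```

With the subject "a\r\nb" and a failure at offset 1, the sketch returns 3 when CRLF counts as a newline and 2 when it does not.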
1770    
1771           PCRE_NOTBOL           PCRE_NOTBOL
1772    
# Line 2375  AUTHOR Line 2406  AUTHOR
2406    
2407  REVISION  REVISION
2408    
2409         Last updated: 06 March 2007         Last updated: 04 June 2007
2410         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2411  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2412    
# Line 2403  PCRE CALLOUTS Line 2434  PCRE CALLOUTS
2434         default value is zero.  For  example,  this  pattern  has  two  callout         default value is zero.  For  example,  this  pattern  has  two  callout
2435         points:         points:
2436    
2437           (?C1)eabc(?C2)def           (?C1)abc(?C2)def
2438    
2439         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
2440         called, PCRE automatically  inserts  callouts,  all  with  number  255,         called, PCRE automatically  inserts  callouts,  all  with  number  255,
# Line 2478  THE CALLOUT INTERFACE Line 2509  THE CALLOUT INTERFACE
2509         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
2510         were passed to pcre_exec().         were passed to pcre_exec().
2511    
2512         The start_match field contains the offset within the subject  at  which         The start_match field normally contains the offset within  the  subject
2513         the  current match attempt started. If the pattern is not anchored, the         at  which  the  current  match  attempt started. However, if the escape
2514         callout function may be called several times from the same point in the         sequence \K has been encountered, this value is changed to reflect  the
2515         pattern for different starting points in the subject.         modified  starting  point.  If the pattern is not anchored, the callout
2516           function may be called several times from the same point in the pattern
2517           for different starting points in the subject.
2518    
2519         The  current_position  field  contains the offset within the subject of         The  current_position  field  contains the offset within the subject of
2520         the current match pointer.         the current match pointer.
# Line 2544  AUTHOR Line 2577  AUTHOR
2577    
2578  REVISION  REVISION
2579    
2580         Last updated: 06 March 2007         Last updated: 29 May 2007
2581         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2582  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2583    
# Line 2705  PCRE REGULAR EXPRESSION DETAILS Line 2738  PCRE REGULAR EXPRESSION DETAILS
2738         ported  by  PCRE when its main matching function, pcre_exec(), is used.         ported  by  PCRE when its main matching function, pcre_exec(), is used.
2739         From  release  6.0,   PCRE   offers   a   second   matching   function,         From  release  6.0,   PCRE   offers   a   second   matching   function,
2740         pcre_dfa_exec(),  which matches using a different algorithm that is not         pcre_dfa_exec(),  which matches using a different algorithm that is not
2741         Perl-compatible. The advantages and disadvantages  of  the  alternative         Perl-compatible. Some of the features discussed below are not available
2742         function, and how it differs from the normal function, are discussed in         when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
2743         the pcrematching page.         alternative function, and how it differs from the normal function,  are
2744           discussed in the pcrematching page.
2745    
2746    
2747  CHARACTERS AND METACHARACTERS  CHARACTERS AND METACHARACTERS
2748    
2749         A regular expression is a pattern that is  matched  against  a  subject         A  regular  expression  is  a pattern that is matched against a subject
2750         string  from  left  to right. Most characters stand for themselves in a         string from left to right. Most characters stand for  themselves  in  a
2751         pattern, and match the corresponding characters in the  subject.  As  a         pattern,  and  match  the corresponding characters in the subject. As a
2752         trivial example, the pattern         trivial example, the pattern
2753    
2754           The quick brown fox           The quick brown fox
2755    
2756         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
2757         caseless matching is specified (the PCRE_CASELESS option), letters  are         caseless  matching is specified (the PCRE_CASELESS option), letters are
2758         matched  independently  of case. In UTF-8 mode, PCRE always understands         matched independently of case. In UTF-8 mode, PCRE  always  understands
2759         the concept of case for characters whose values are less than  128,  so         the  concept  of case for characters whose values are less than 128, so
2760         caseless  matching  is always possible. For characters with higher val-         caseless matching is always possible. For characters with  higher  val-
2761         ues, the concept of case is supported if PCRE is compiled with  Unicode         ues,  the concept of case is supported if PCRE is compiled with Unicode
2762         property  support,  but  not  otherwise.   If  you want to use caseless         property support, but not otherwise.   If  you  want  to  use  caseless
2763         matching for characters 128 and above, you must  ensure  that  PCRE  is         matching  for  characters  128  and above, you must ensure that PCRE is
2764         compiled with Unicode property support as well as with UTF-8 support.         compiled with Unicode property support as well as with UTF-8 support.
2765    
2766         The  power  of  regular  expressions  comes from the ability to include         The power of regular expressions comes  from  the  ability  to  include
2767         alternatives and repetitions in the pattern. These are encoded  in  the         alternatives  and  repetitions in the pattern. These are encoded in the
2768         pattern by the use of metacharacters, which do not stand for themselves         pattern by the use of metacharacters, which do not stand for themselves
2769         but instead are interpreted in some special way.         but instead are interpreted in some special way.
2770    
2771         There are two different sets of metacharacters: those that  are  recog-         There  are  two different sets of metacharacters: those that are recog-
2772         nized  anywhere in the pattern except within square brackets, and those         nized anywhere in the pattern except within square brackets, and  those
2773         that are recognized within square brackets.  Outside  square  brackets,         that  are  recognized  within square brackets. Outside square brackets,
2774         the metacharacters are as follows:         the metacharacters are as follows:
2775    
2776           \      general escape character with several uses           \      general escape character with several uses
# Line 2755  CHARACTERS AND METACHARACTERS Line 2789  CHARACTERS AND METACHARACTERS
2789                  also "possessive quantifier"                  also "possessive quantifier"
2790           {      start min/max quantifier           {      start min/max quantifier
2791    
2792         Part  of  a  pattern  that is in square brackets is called a "character         Part of a pattern that is in square brackets  is  called  a  "character
2793         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
2794    
2795           \      general escape character           \      general escape character
# Line 2765  CHARACTERS AND METACHARACTERS Line 2799  CHARACTERS AND METACHARACTERS
2799                    syntax)                    syntax)
2800           ]      terminates the character class           ]      terminates the character class
2801    
2802         The following sections describe the use of each of the  metacharacters.         The  following sections describe the use of each of the metacharacters.
2803    
2804    
2805  BACKSLASH  BACKSLASH
2806    
2807         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2808         a non-alphanumeric character, it takes away any  special  meaning  that         a  non-alphanumeric  character,  it takes away any special meaning that
2809         character  may  have.  This  use  of  backslash  as an escape character         character may have. This  use  of  backslash  as  an  escape  character
2810         applies both inside and outside character classes.         applies both inside and outside character classes.
2811    
2812         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
2813         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
2814         character would otherwise be interpreted as a metacharacter, so  it  is         character  would  otherwise be interpreted as a metacharacter, so it is
2815         always  safe  to  precede  a non-alphanumeric with backslash to specify         always safe to precede a non-alphanumeric  with  backslash  to  specify
2816         that it stands for itself. In particular, if you want to match a  back-         that  it stands for itself. In particular, if you want to match a back-
2817         slash, you write \\.         slash, you write \\.
2818    
2819         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
2820         the pattern (other than in a character class) and characters between  a         the  pattern (other than in a character class) and characters between a
2821         # outside a character class and the next newline are ignored. An escap-         # outside a character class and the next newline are ignored. An escap-
2822         ing backslash can be used to include a whitespace  or  #  character  as         ing  backslash  can  be  used to include a whitespace or # character as
2823         part of the pattern.         part of the pattern.
2824    
2825         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
2826         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
2827         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
2828         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
2829         tion. Note the following examples:         tion. Note the following examples:
2830    
2831           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 2801  BACKSLASH Line 2835  BACKSLASH
2835           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
2836           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
2837    
2838         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
2839         classes.         classes.
2840    
2841     Non-printing characters     Non-printing characters
2842    
2843         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
2844         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
2845         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
2846         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
2847         editing, it is usually easier  to  use  one  of  the  following  escape         editing,  it  is  usually  easier  to  use  one of the following escape
2848         sequences than the binary character it represents:         sequences than the binary character it represents:
2849    
2850           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 2824  BACKSLASH Line 2858  BACKSLASH
2858           \xhh      character with hex code hh           \xhh      character with hex code hh
2859           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh..
2860    
2861         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
2862         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
2863         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
2864         becomes hex 7B.         becomes hex 7B.
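
The \cx rule above amounts to two steps, which can be sketched as follows. The function name control_escape_value is made up for this illustration, and the sketch assumes an ASCII-based character set.

```c
/* Illustrative only, not PCRE source: computes the character value that
   the escape \cx stands for. Assumes ASCII character codes. */
int control_escape_value(int x)
{
    if (x >= 'a' && x <= 'z')
        x -= 'a' - 'A';   /* convert a lower case letter to upper case */
    return x ^ 0x40;      /* invert bit 6 (hex 40) */
}
```

Applied to 'z', '{' and ';' this yields hex 1A, 3B and 7B respectively, matching the values given in the text.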
2865    
2866         After \x, from zero to two hexadecimal digits are read (letters can  be         After  \x, from zero to two hexadecimal digits are read (letters can be
2867         in  upper  or  lower case). Any number of hexadecimal digits may appear         in upper or lower case). Any number of hexadecimal  digits  may  appear
2868         between \x{ and }, but the value of the character  code  must  be  less         between  \x{  and  },  but the value of the character code must be less
2869         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2870         the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than         the  maximum  hexadecimal  value is 7FFFFFFF). If characters other than
2871         hexadecimal  digits  appear between \x{ and }, or if there is no termi-         hexadecimal digits appear between \x{ and }, or if there is  no  termi-
2872         nating }, this form of escape is not recognized.  Instead, the  initial         nating  }, this form of escape is not recognized.  Instead, the initial
2873         \x will be interpreted as a basic hexadecimal escape, with no following         \x will be interpreted as a basic hexadecimal escape, with no following
2874         digits, giving a character whose value is zero.         digits, giving a character whose value is zero.
2875    
2876         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2877         two  syntaxes  for  \x. There is no difference in the way they are han-         two syntaxes for \x. There is no difference in the way  they  are  han-
2878         dled. For example, \xdc is exactly the same as \x{dc}.         dled. For example, \xdc is exactly the same as \x{dc}.
2879    
2880         After \0 up to two further octal digits are read. If  there  are  fewer         After  \0  up  to two further octal digits are read. If there are fewer
2881         than  two  digits,  just  those  that  are  present  are used. Thus the         than two digits, just  those  that  are  present  are  used.  Thus  the
2882         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
2883         (code  value 7). Make sure you supply two digits after the initial zero         (code value 7). Make sure you supply two digits after the initial  zero
2884         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
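
A minimal sketch of the \0 rule, assuming a hypothetical helper read_octal_after_zero that is called with a pointer to the pattern text just past the "\0". It is not taken from the PCRE source.

```c
#include <stddef.h>

/* Hypothetical helper: p points just past "\0" in the pattern. Reads up to
   two further octal digits, stores the resulting character value in *value,
   and returns the number of digits consumed. */
size_t read_octal_after_zero(const char *p, int *value)
{
    size_t n = 0;
    int v = 0;
    while (n < 2 && p[n] >= '0' && p[n] <= '7') {
        v = v * 8 + (p[n] - '0');
        n++;
    }
    *value = v;
    return n;
}
```

The sketch shows why two digits should be supplied when an octal digit follows: given the text "123" it consumes "12" (value 10) and leaves the "3" as a literal, whereas given "\x" it consumes nothing and yields a binary zero.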
2885    
2886         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2887         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
2888         its as a decimal number. If the number is less than  10,  or  if  there         its  as  a  decimal  number. If the number is less than 10, or if there
2889         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
2890         expression, the entire  sequence  is  taken  as  a  back  reference.  A         expression,  the  entire  sequence  is  taken  as  a  back reference. A
2891         description  of how this works is given later, following the discussion         description of how this works is given later, following the  discussion
2892         of parenthesized subpatterns.         of parenthesized subpatterns.
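
The decision described above, for a backslash followed by a non-zero digit outside a character class, can be written as a single predicate. This is a sketch only; the name is_back_reference and its parameters are assumptions made for illustration.

```c
/* Illustrative predicate, not PCRE source. number is the decimal value read
   after the backslash; groups_so_far is the count of capturing left
   parentheses seen earlier in the pattern. Returns nonzero when the
   sequence is taken as a back reference. */
int is_back_reference(int number, int groups_so_far)
{
    return number < 10 || groups_so_far >= number;
}
```

So \7 counts as a back reference even with no preceding groups, while \11 does so only once at least eleven capturing parentheses have been opened.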
2893    
2894         Inside a character class, or if the decimal number is  greater  than  9         Inside  a  character  class, or if the decimal number is greater than 9
2895         and  there have not been that many capturing subpatterns, PCRE re-reads         and there have not been that many capturing subpatterns, PCRE  re-reads
2896         up to three octal digits following the backslash, and uses them to gen-         up to three octal digits following the backslash, and uses them to gen-
2897         erate  a data character. Any subsequent digits stand for themselves. In         erate a data character. Any subsequent digits stand for themselves.  In
2898         non-UTF-8 mode, the value of a character specified  in  octal  must  be         non-UTF-8  mode,  the  value  of a character specified in octal must be
2899         less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For         less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
2900         example:         example:
2901    
2902           \040   is another way of writing a space           \040   is another way of writing a space
# Line 2880  BACKSLASH Line 2914  BACKSLASH
2914           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
2915                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
2916    
2917         Note that octal values of 100 or greater must not be  introduced  by  a         Note  that  octal  values of 100 or greater must not be introduced by a
2918         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
2919    
2920         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
2921         inside and outside character classes. In addition, inside  a  character         inside  and  outside character classes. In addition, inside a character
2922         class,  the  sequence \b is interpreted as the backspace character (hex         class, the sequence \b is interpreted as the backspace  character  (hex
2923         08), and the sequences \R and \X are interpreted as the characters  "R"         08),  and the sequences \R and \X are interpreted as the characters "R"
2924         and  "X", respectively. Outside a character class, these sequences have         and "X", respectively. Outside a character class, these sequences  have
2925         different meanings (see below).         different meanings (see below).
2926    
2927     Absolute and relative back references     Absolute and relative back references
2928    
2929         The sequence \g followed by a positive or negative  number,  optionally         The  sequence  \g followed by a positive or negative number, optionally
2930         enclosed  in  braces,  is  an absolute or relative back reference. Back         enclosed in braces, is an absolute or relative back reference. A  named
2931         references are discussed later, following the discussion  of  parenthe-         back  reference can be coded as \g{name}. Back references are discussed
2932         sized subpatterns.         later, following the discussion of parenthesized subpatterns.
2933    
2934     Generic character types     Generic character types
2935    
# Line 2910  BACKSLASH Line 2944  BACKSLASH
2944           \W     any "non-word" character           \W     any "non-word" character
2945    
2946         Each pair of escape sequences partitions the complete set of characters         Each pair of escape sequences partitions the complete set of characters
2947         into  two disjoint sets. Any given character matches one, and only one,         into two disjoint sets. Any given character matches one, and only  one,
2948         of each pair.         of each pair.
2949    
2950         These character type sequences can appear both inside and outside char-         These character type sequences can appear both inside and outside char-
2951         acter  classes.  They each match one character of the appropriate type.         acter classes. They each match one character of the  appropriate  type.
2952         If the current matching point is at the end of the subject string,  all         If  the current matching point is at the end of the subject string, all
2953         of them fail, since there is no character to match.         of them fail, since there is no character to match.
2954    
2955         For  compatibility  with Perl, \s does not match the VT character (code         For compatibility with Perl, \s does not match the VT  character  (code
2956         11).  This makes it different from the the POSIX "space" class. The  \s         11).   This makes it different from the the POSIX "space" class. The \s
2957         characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If         characters are HT (9), LF (10), FF (12), CR (13), and space  (32).  (If
2958         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
2959         ter. In PCRE, it never does.)         ter. In PCRE, it never does.)
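
The set of characters listed above for \s can be expressed directly. The function below is a sketch with a hypothetical name, and it assumes PCRE's default character tables rather than locale-specific ones.

```c
/* Illustrative only: nonzero for the characters that \s matches according
   to the list above, namely HT (9), LF (10), FF (12), CR (13) and
   space (32). VT (11) is deliberately excluded. */
int is_pcre_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}
```

The standard C function isspace() differs from this sketch in the "C" locale because it also accepts VT.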
2960    
2961         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
2962         is a letter or digit. The definition of  letters  and  digits  is  con-         is  a  letter  or  digit.  The definition of letters and digits is con-
2963         trolled  by PCRE's low-valued character tables, and may vary if locale-         trolled by PCRE's low-valued character tables, and may vary if  locale-
2964         specific matching is taking place (see "Locale support" in the  pcreapi         specific  matching is taking place (see "Locale support" in the pcreapi
2965         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
2966         systems, or "french" in Windows, some character codes greater than  128         systems,  or "french" in Windows, some character codes greater than 128
2967         are used for accented letters, and these are matched by \w.         are used for accented letters, and these are matched by \w.
2968    
2969         In  UTF-8 mode, characters with values greater than 128 never match \d,         In UTF-8 mode, characters with values greater than 128 never match  \d,
2970         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2971         code  character  property support is available. The use of locales with         code character property support is available. The use of  locales  with
2972         Unicode is discouraged.         Unicode is discouraged.
2973    
2974     Newline sequences     Newline sequences
2975    
2976         Outside a character class, the escape sequence \R matches  any  Unicode         Outside  a  character class, the escape sequence \R matches any Unicode
2977         newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is         newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is
2978         equivalent to the following:         equivalent to the following:
2979    
2980           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
2981    
2982         This is an example of an "atomic group", details  of  which  are  given         This  is  an  example  of an "atomic group", details of which are given
2983         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
2984         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
2985         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
2986         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
2987         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
2988    
2989         In  UTF-8  mode, two additional characters whose codepoints are greater         In UTF-8 mode, two additional characters whose codepoints  are  greater
2990         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
2991         rator,  U+2029).   Unicode character property support is not needed for         rator, U+2029).  Unicode character property support is not  needed  for
2992         these characters to be recognized.         these characters to be recognized.
2993    
2994         Inside a character class, \R matches the letter "R".         Inside a character class, \R matches the letter "R".
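
As a rough illustration only, the non-UTF-8 equivalent shown above can be tried in Python's re module, which is not PCRE and has no \R escape. This sketch uses an ordinary non-capturing group rather than an atomic group, so it is an approximation; the names NEWLINE and split_lines are hypothetical.

```python
import re

# Approximation of PCRE's non-UTF-8 \R. CRLF is listed first so that it is
# consumed as one unit before the single-character alternatives are tried.
NEWLINE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

def split_lines(text):
    """Split text at every newline sequence recognised by NEWLINE."""
    return NEWLINE.split(text)

print(split_lines("one\r\ntwo\nthree\x85four"))
```

Because the group here is not atomic, a larger pattern built around it could in principle backtrack into the CRLF pair, which the real \R does not allow.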
# Line 2962  BACKSLASH Line 2996  BACKSLASH
2996     Unicode character properties     Unicode character properties
2997    
2998         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
2999         tional  escape  sequences  to  match character properties are available         tional escape sequences to match  character  properties  are  available
3000         when UTF-8 mode is selected. They are:         when UTF-8 mode is selected. They are:
3001    
3002           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
3003           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
3004           \X       an extended Unicode sequence           \X       an extended Unicode sequence
3005    
3006         The property names represented by xx above are limited to  the  Unicode         The  property  names represented by xx above are limited to the Unicode
3007         script names, the general category properties, and "Any", which matches         script names, the general category properties, and "Any", which matches
3008         any character (including newline). Other properties such as "InMusical-         any character (including newline). Other properties such as "InMusical-
3009         Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does         Symbols" are not currently supported by PCRE. Note  that  \P{Any}  does
3010         not match any characters, so always causes a match failure.         not match any characters, so always causes a match failure.
3011    
3012         Sets of Unicode characters are defined as belonging to certain scripts.         Sets of Unicode characters are defined as belonging to certain scripts.
3013         A  character from one of these sets can be matched using a script name.         A character from one of these sets can be matched using a script  name.
3014         For example:         For example:
3015    
3016           \p{Greek}           \p{Greek}
3017           \P{Han}           \P{Han}
3018    
3019         Those that are not part of an identified script are lumped together  as         Those  that are not part of an identified script are lumped together as
3020         "Common". The current list of scripts is:         "Common". The current list of scripts is:
3021    
3022         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
3023         Buhid,  Canadian_Aboriginal,  Cherokee,  Common,   Coptic,   Cuneiform,         Buhid,   Canadian_Aboriginal,   Cherokee,  Common,  Coptic,  Cuneiform,
3024         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
3025         Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-         Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
3026         gana,  Inherited,  Kannada,  Katakana,  Kharoshthi,  Khmer, Lao, Latin,         gana, Inherited, Kannada,  Katakana,  Kharoshthi,  Khmer,  Lao,  Latin,
3027         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
3028         Ogham,  Old_Italic,  Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,         Ogham, Old_Italic, Old_Persian, Oriya, Osmanya,  Phags_Pa,  Phoenician,
3029         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,
3030         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
3031    
3032         Each  character has exactly one general category property, specified by         Each character has exactly one general category property, specified  by
3033         a two-letter abbreviation. For compatibility with Perl, negation can be         a two-letter abbreviation. For compatibility with Perl, negation can be
3034         specified  by  including a circumflex between the opening brace and the         specified by including a circumflex between the opening brace  and  the
3035         property name. For example, \p{^Lu} is the same as \P{Lu}.         property name. For example, \p{^Lu} is the same as \P{Lu}.
3036    
3037         If only one letter is specified with \p or \P, it includes all the gen-         If only one letter is specified with \p or \P, it includes all the gen-
3038         eral  category properties that start with that letter. In this case, in         eral category properties that start with that letter. In this case,  in
3039         the absence of negation, the curly brackets in the escape sequence  are         the  absence of negation, the curly brackets in the escape sequence are
3040         optional; these two examples have the same effect:         optional; these two examples have the same effect:
3041    
3042           \p{L}           \p{L}
# Line 3054  BACKSLASH Line 3088  BACKSLASH
3088           Zp    Paragraph separator           Zp    Paragraph separator
3089           Zs    Space separator           Zs    Space separator
3090    
3091         The  special property L& is also supported: it matches a character that         The special property L& is also supported: it matches a character  that
3092         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
3093         classified as a modifier or "other".         classified as a modifier or "other".
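
The meaning of L& can be sketched outside PCRE with Python's unicodedata module, which reports the two-letter general category of a character. This is only an illustration of the category test, not PCRE's implementation; the function name has_l_ampersand is hypothetical.

```python
import unicodedata

def has_l_ampersand(ch):
    """True if ch has the Lu, Ll, or Lt general category."""
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")

# U+01C5 is a titlecase letter (Lt); U+02B0 is a modifier letter (Lm).
for ch in ("A", "a", "\u01c5", "\u02b0", "1"):
    print(repr(ch), unicodedata.category(ch), has_l_ampersand(ch))
```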
3094    
3095         The  long  synonyms  for  these  properties that Perl supports (such as         The long synonyms for these properties  that  Perl  supports  (such  as
3096         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
3097         any of these properties with "Is".         any of these properties with "Is".
3098    
3099         No character that is in the Unicode table has the Cn (unassigned) prop-         No character that is in the Unicode table has the Cn (unassigned) prop-
3100         erty.  Instead, this property is assumed for any code point that is not         erty.  Instead, this property is assumed for any code point that is not
3101         in the Unicode table.         in the Unicode table.
3102    
3103         Specifying  caseless  matching  does not affect these escape sequences.         Specifying caseless matching does not affect  these  escape  sequences.
3104         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
3105    
3106         The \X escape matches any number of Unicode  characters  that  form  an         The  \X  escape  matches  any number of Unicode characters that form an
3107         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
3108    
3109           (?>\PM\pM*)           (?>\PM\pM*)
3110    
3111         That  is,  it matches a character without the "mark" property, followed         That is, it matches a character without the "mark"  property,  followed
3112         by zero or more characters with the "mark"  property,  and  treats  the         by  zero  or  more  characters with the "mark" property, and treats the
3113         sequence  as  an  atomic group (see below).  Characters with the "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
3114         property are typically accents that affect the preceding character.         property are typically accents that affect the preceding character.
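
The grouping that \X performs can be sketched in Python, whose re module has neither \X nor \p. The sketch assumes that "mark" means a general category beginning with "M", as reported by unicodedata; the function name extended_sequences is hypothetical.

```python
import unicodedata

def extended_sequences(text):
    """Group each non-mark character with the mark characters that follow it."""
    sequences = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            # A mark attaches to the sequence that is currently open.
            sequences[-1] += ch
        else:
            # A non-mark character starts a new sequence. A mark at the very
            # start also lands here, which differs from \X: \X would not
            # match at that position at all.
            sequences.append(ch)
    return sequences

# "e" followed by U+0301 (combining acute accent), then a plain "a".
print(extended_sequences("e\u0301a"))
```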
3115    
3116         Matching characters by Unicode property is not fast, because  PCRE  has         Matching  characters  by Unicode property is not fast, because PCRE has
3117         to  search  a  structure  that  contains data for over fifteen thousand         to search a structure that contains  data  for  over  fifteen  thousand
3118         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
3119         \w do not use Unicode properties in PCRE.         \w do not use Unicode properties in PCRE.
3120    
3121       Resetting the match start
3122    
3123           The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
3124           ously  matched  characters  not  to  be  included  in the final matched
3125           sequence. For example, the pattern:
3126    
3127             foo\Kbar
3128    
3129           matches "foobar", but reports that it has matched "bar".  This  feature
3130           is  similar  to  a lookbehind assertion (described below).  However, in
3131           this case, the part of the subject before the real match does not  have
3132           to  be of fixed length, as lookbehind assertions do. The use of \K does
3133           not interfere with the setting of captured  substrings.   For  example,
3134           when the pattern
3135    
3136             (foo)\Kbar
3137    
3138           matches "foobar", the first substring is still set to "foo".
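
Python's re module does not support \K, so the following is only an emulation of the reported result, not of the mechanism. It captures the part that \K would keep in a second group and treats that group as the reported match; the function name match_after_reset is hypothetical.

```python
import re

def match_after_reset(subject):
    """Emulate (foo)\\Kbar: return (reported match, its start, group 1)."""
    m = re.search(r"(foo)(bar)", subject)
    if m is None:
        return None
    # Group 2 stands in for what PCRE would report as the whole match;
    # group 1 is still set to "foo", as the text describes.
    return m.group(2), m.start(2), m.group(1)

print(match_after_reset("foobar"))
```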
3139    
3140     Simple assertions     Simple assertions
3141    
3142         The  final use of backslash is for certain simple assertions. An asser-         The  final use of backslash is for certain simple assertions. An asser-
# Line 3845  BACK REFERENCES Line 3898  BACK REFERENCES
3898         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
3899         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
3900    
3901         Back references to named subpatterns use the Perl  syntax  \k<name>  or         There are several different ways of writing back  references  to  named
3902         \k'name'  or  the  Python  syntax (?P=name). We could rewrite the above         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
3903         example in either of the following ways:         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
3904           unified back reference syntax, in which \g can be used for both numeric
3905           and named references, is also supported. We  could  rewrite  the  above
3906           example in any of the following ways:
3907    
3908           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
3909             (?'p1'(?i)rah)\s+\k{p1}
3910           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
3911             (?<p1>(?i)rah)\s+\g{p1}
3912    
3913         A subpattern that is referenced by  name  may  appear  in  the  pattern         A  subpattern  that  is  referenced  by  name may appear in the pattern
3914         before or after the reference.         before or after the reference.
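
Of the syntaxes listed, the Python form (?P=name) can be tried directly in Python's re module. This is a sketch under one stated change: recent Python versions do not accept an unscoped (?i) in the middle of a pattern, so the scoped form (?i:rah) is used here instead. Behaviour in PCRE itself may differ in detail.

```python
import re

# The group is matched caselessly, but the back reference is outside the
# scoped (?i:...) and therefore compares case-sensitively.
PATTERN = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, bool(PATTERN.search(subject)))
```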
3915    
3916         There  may be more than one back reference to the same subpattern. If a         There may be more than one back reference to the same subpattern. If  a
3917         subpattern has not actually been used in a particular match,  any  back         subpattern  has  not actually been used in a particular match, any back
3918         references to it always fail. For example, the pattern         references to it always fail. For example, the pattern
3919    
3920           (a|(bc))\2           (a|(bc))\2
3921    
3922         always  fails if it starts to match "a" rather than "bc". Because there         always fails if it starts to match "a" rather than "bc". Because  there
3923         may be many capturing parentheses in a pattern,  all  digits  following         may  be  many  capturing parentheses in a pattern, all digits following
3924         the  backslash  are taken as part of a potential back reference number.         the backslash are taken as part of a potential back  reference  number.
3925         If the pattern continues with a digit character, some delimiter must be         If the pattern continues with a digit character, some delimiter must be
3926         used  to  terminate  the back reference. If the PCRE_EXTENDED option is         used to terminate the back reference. If the  PCRE_EXTENDED  option  is
3927         set, this can be whitespace.  Otherwise an  empty  comment  (see  "Com-         set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
3928         ments" below) can be used.         ments" below) can be used.
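
The rule that a back reference to an unused subpattern always fails can be observed with Python's re module, which behaves the same way for this pattern. This is an illustration, not a statement about PCRE internals; the helper name matches is hypothetical.

```python
import re

PATTERN = re.compile(r"(a|(bc))\2")

def matches(subject):
    """True if the pattern matches at the start of subject."""
    return PATTERN.match(subject) is not None

# When the first alternative "a" is taken, group 2 is never set,
# so \2 cannot match anything.
print(matches("aa"), matches("bcbc"))
```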
3929    
3930         A  back reference that occurs inside the parentheses to which it refers         A back reference that occurs inside the parentheses to which it  refers
3931         fails when the subpattern is first used, so, for example,  (a\1)  never         fails  when  the subpattern is first used, so, for example, (a\1) never
3932         matches.   However,  such references can be useful inside repeated sub-         matches.  However, such references can be useful inside  repeated  sub-
3933         patterns. For example, the pattern         patterns. For example, the pattern
3934    
3935           (a|b\1)+           (a|b\1)+
3936    
3937         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
3938         ation  of  the  subpattern,  the  back  reference matches the character         ation of the subpattern,  the  back  reference  matches  the  character
3939         string corresponding to the previous iteration. In order  for  this  to         string  corresponding  to  the previous iteration. In order for this to
3940         work,  the  pattern must be such that the first iteration does not need         work, the pattern must be such that the first iteration does  not  need
3941         to match the back reference. This can be done using alternation, as  in         to  match the back reference. This can be done using alternation, as in
3942         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
3943    
3944    
3945  ASSERTIONS  ASSERTIONS
3946    
3947         An  assertion  is  a  test on the characters following or preceding the         An assertion is a test on the characters  following  or  preceding  the
3948         current matching point that does not actually consume  any  characters.         current  matching  point that does not actually consume any characters.
3949         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
3950         described above.         described above.
3951    
3952         More complicated assertions are coded as  subpatterns.  There  are  two         More  complicated  assertions  are  coded as subpatterns. There are two
3953         kinds:  those  that  look  ahead of the current position in the subject         kinds: those that look ahead of the current  position  in  the  subject
3954         string, and those that look  behind  it.  An  assertion  subpattern  is         string,  and  those  that  look  behind  it. An assertion subpattern is
3955         matched  in  the  normal way, except that it does not cause the current         matched in the normal way, except that it does not  cause  the  current
3956         matching position to be changed.         matching position to be changed.
3957    
3958         Assertion subpatterns are not capturing subpatterns,  and  may  not  be         Assertion  subpatterns  are  not  capturing subpatterns, and may not be
3959         repeated,  because  it  makes no sense to assert the same thing several         repeated, because it makes no sense to assert the  same  thing  several
3960         times. If any kind of assertion contains capturing  subpatterns  within         times.  If  any kind of assertion contains capturing subpatterns within
3961         it,  these are counted for the purposes of numbering the capturing sub-         it, these are counted for the purposes of numbering the capturing  sub-
3962         patterns in the whole pattern.  However, substring capturing is carried         patterns in the whole pattern.  However, substring capturing is carried
3963         out  only  for  positive assertions, because it does not make sense for         out only for positive assertions, because it does not  make  sense  for
3964         negative assertions.         negative assertions.
3965    
3966     Lookahead assertions     Lookahead assertions
# Line 3912  ASSERTIONS Line 3970  ASSERTIONS
3970    
3971           \w+(?=;)           \w+(?=;)
3972    
3973         matches  a word followed by a semicolon, but does not include the semi-         matches a word followed by a semicolon, but does not include the  semi-
3974         colon in the match, and         colon in the match, and
3975    
3976           foo(?!bar)           foo(?!bar)
3977    
3978         matches any occurrence of "foo" that is not  followed  by  "bar".  Note         matches  any  occurrence  of  "foo" that is not followed by "bar". Note
3979         that the apparently similar pattern         that the apparently similar pattern
3980    
3981           (?!foo)bar           (?!foo)bar
3982    
3983         does  not  find  an  occurrence  of "bar" that is preceded by something         does not find an occurrence of "bar"  that  is  preceded  by  something
3984         other than "foo"; it finds any occurrence of "bar" whatsoever,  because         other  than "foo"; it finds any occurrence of "bar" whatsoever, because
3985         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
3986         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
3987    
3988         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
3989         most  convenient  way  to  do  it  is with (?!) because an empty string         most convenient way to do it is  with  (?!)  because  an  empty  string
3990         always matches, so an assertion that requires there not to be an  empty         always  matches, so an assertion that requires there not to be an empty
3991         string must always fail.         string must always fail.
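
The lookahead patterns in this section use syntax that Python's re module also accepts, so they can be tried there as a sketch; the results shown are Python's and are assumed, not verified, to agree with PCRE for these simple cases.

```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
word = re.search(r"\w+(?=;)", "end;").group()

# "foo" not followed by "bar": only the second "foo" qualifies.
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds "bar" even when it is preceded by "foo".
any_bar = re.search(r"(?!foo)bar", "foobar").start()

# (?!) can never succeed, so it forces a failure.
forced_failure = re.search(r"x(?!)", "xx")

print(word, not_followed, any_bar, forced_failure)
```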
3992    
3993     Lookbehind assertions     Lookbehind assertions
3994    
3995         Lookbehind  assertions start with (?<= for positive assertions and (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
3996         for negative assertions. For example,         for negative assertions. For example,
3997    
3998           (?<!foo)bar           (?<!foo)bar
3999    
4000         does find an occurrence of "bar" that is not  preceded  by  "foo".  The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
4001         contents  of  a  lookbehind  assertion are restricted such that all the         contents of a lookbehind assertion are restricted  such  that  all  the
4002         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4003         eral  top-level  alternatives,  they  do  not all have to have the same         eral top-level alternatives, they do not all  have  to  have  the  same
4004         fixed length. Thus         fixed length. Thus
4005    
4006           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 3951  ASSERTIONS Line 4009  ASSERTIONS
4009    
4010           (?<!dogs?|cats?)           (?<!dogs?|cats?)
4011    
4012         causes an error at compile time. Branches that match  different  length         causes  an  error at compile time. Branches that match different length
4013         strings  are permitted only at the top level of a lookbehind assertion.         strings are permitted only at the top level of a lookbehind  assertion.
4014         This is an extension compared with  Perl  (at  least  for  5.8),  which         This  is  an  extension  compared  with  Perl (at least for 5.8), which
4015         requires  all branches to match the same length of string. An assertion         requires all branches to match the same length of string. An  assertion
4016         such as         such as
4017    
4018           (?<=ab(c|de))           (?<=ab(c|de))
4019    
4020         is not permitted, because its single top-level  branch  can  match  two         is  not  permitted,  because  its single top-level branch can match two
4021         different  lengths,  but  it is acceptable if rewritten to use two top-         different lengths, but it is acceptable if rewritten to  use  two  top-
4022         level branches:         level branches:
4023    
4024           (?<=abc|abde)           (?<=abc|abde)
4025    
4026         The implementation of lookbehind assertions is, for  each  alternative,         In some cases, the Perl 5.10 escape sequence \K (see above) can be used
4027         to  temporarily  move the current position back by the fixed length and         instead of a lookbehind assertion; this is not restricted to  a  fixed-
4028           length.
4029    
4030           The  implementation  of lookbehind assertions is, for each alternative,
4031           to temporarily move the current position back by the fixed  length  and
4032         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
4033         rent position, the assertion fails.         rent position, the assertion fails.
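
The procedure described above can be sketched as follows. This is a simplified model, assuming that every top-level alternative is a literal string, which is not how PCRE stores a compiled pattern; the function name lookbehind_matches is hypothetical.

```python
def lookbehind_matches(subject, pos, alternatives):
    """True if some fixed-length alternative ends exactly at pos."""
    for alt in alternatives:
        # Temporarily move back by this alternative's fixed length.
        start = pos - len(alt)
        if start < 0:
            # Insufficient characters before the current position.
            continue
        # Try to match the alternative from the moved-back position.
        if subject.startswith(alt, start):
            return True
    return False

# Models (?<=bullock|donkey) tested just after "donkey" in the subject.
print(lookbehind_matches("a donkey cart", 8, ["bullock", "donkey"]))
```

Each alternative has its own length, which is why different lengths are acceptable at the top level but not inside a single branch.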
4034    
4035         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
4036         mode) to appear in lookbehind assertions, because it makes it  impossi-         mode)  to appear in lookbehind assertions, because it makes it impossi-
4037         ble  to  calculate the length of the lookbehind. The \X and \R escapes,         ble to calculate the length of the lookbehind. The \X and  \R  escapes,
4038         which can match different numbers of bytes, are also not permitted.         which can match different numbers of bytes, are also not permitted.
4039    
4040         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
4041         assertions  to  specify  efficient  matching  at the end of the subject         assertions to specify efficient matching at  the  end  of  the  subject
4042         string. Consider a simple pattern such as         string. Consider a simple pattern such as
4043    
4044           abcd$           abcd$
4045    
4046         when applied to a long string that does  not  match.  Because  matching         when  applied  to  a  long string that does not match. Because matching
4047         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
4048         and then see if what follows matches the rest of the  pattern.  If  the         and  then  see  if what follows matches the rest of the pattern. If the
4049         pattern is specified as         pattern is specified as
4050    
4051           ^.*abcd$           ^.*abcd$
4052    
4053         the  initial .* matches the entire string at first, but when this fails         the initial .* matches the entire string at first, but when this  fails
4054         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
4055         last  character,  then all but the last two characters, and so on. Once         last character, then all but the last two characters, and so  on.  Once
4056         again the search for "a" covers the entire string, from right to  left,         again  the search for "a" covers the entire string, from right to left,
4057         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
4058    
4059           ^.*+(?<=abcd)           ^.*+(?<=abcd)
4060    
4061         there  can  be  no backtracking for the .*+ item; it can match only the         there can be no backtracking for the .*+ item; it can  match  only  the
4062         entire string. The subsequent lookbehind assertion does a  single  test         entire  string.  The subsequent lookbehind assertion does a single test
4063         on  the last four characters. If it fails, the match fails immediately.         on the last four characters. If it fails, the match fails  immediately.
4064         For long strings, this approach makes a significant difference  to  the         For  long  strings, this approach makes a significant difference to the
4065         processing time.         processing time.
4066    
4067     Using multiple assertions     Using multiple assertions
# Line 4008  ASSERTIONS Line 4070  ASSERTIONS
4070    
4071           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
4072    
4073         matches  "foo" preceded by three digits that are not "999". Notice that         matches "foo" preceded by three digits that are not "999". Notice  that
4074         each of the assertions is applied independently at the  same  point  in         each  of  the  assertions is applied independently at the same point in
4075         the  subject  string.  First  there  is a check that the previous three         the subject string. First there is a  check  that  the  previous  three
4076         characters are all digits, and then there is  a  check  that  the  same         characters  are  all  digits,  and  then there is a check that the same
4077         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
4078         ceded by six characters, the first three of which are digits and the last         ceded by six characters, the first three of which are digits and the last
4079         three  of  which  are not "999". For example, it doesn't match "123abc-         three of which are not "999". For example, it  doesn't  match  "123abc-
4080         foo". A pattern to do that is         foo". A pattern to do that is
4081    
4082           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
4083    
4084         This time the first assertion looks at the  preceding  six  characters,         This  time  the  first assertion looks at the preceding six characters,
4085         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
4086         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
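
Both patterns use fixed-length lookbehinds that Python's re module also accepts, so the difference between them can be checked there. This is a sketch using Python rather than PCRE; the helper name found is hypothetical.

```python
import re

SAME_POINT = re.compile(r"(?<=\d{3})(?<!999)foo")
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """True if pattern matches anywhere in subject."""
    return pattern.search(subject) is not None

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject, found(SAME_POINT, subject), found(SIX_BACK, subject))
```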
4087    
# Line 4027  ASSERTIONS Line 4089  ASSERTIONS
4089    
4090           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
4091    
4092         matches an occurrence of "baz" that is preceded by "bar" which in  turn         matches  an occurrence of "baz" that is preceded by "bar" which in turn
4093         is not preceded by "foo", while         is not preceded by "foo", while
4094    
4095           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
4096    
4097         is  another pattern that matches "foo" preceded by three digits and any         is another pattern that matches "foo" preceded by three digits and  any
4098         three characters that are not "999".         three characters that are not "999".
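
The two nested patterns can also be tried in Python's re module, which accepts an assertion inside a fixed-length lookbehind. The results below are Python's and are only assumed to agree with PCRE; the helper name found is hypothetical.

```python
import re

BAZ = re.compile(r"(?<=(?<!foo)bar)baz")
FOO = re.compile(r"(?<=\d{3}(?!999)...)foo")

def found(pattern, subject):
    """True if pattern matches anywhere in subject."""
    return pattern.search(subject) is not None

print(found(BAZ, "xbarbaz"), found(BAZ, "foobarbaz"))
print(found(FOO, "123abcfoo"), found(FOO, "123999foo"))
```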
4099    
4100    
4101  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
4102    
4103         It is possible to cause the matching process to obey a subpattern  con-         It  is possible to cause the matching process to obey a subpattern con-
4104         ditionally  or to choose between two alternative subpatterns, depending         ditionally or to choose between two alternative subpatterns,  depending
4105         on the result of an assertion, or whether a previous capturing  subpat-         on  the result of an assertion, or whether a previous capturing subpat-
4106         tern  matched  or not. The two possible forms of conditional subpattern         tern matched or not. The two possible forms of  conditional  subpattern
4107         are         are
4108    
4109           (?(condition)yes-pattern)           (?(condition)yes-pattern)
4110           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
4111    
4112         If the condition is satisfied, the yes-pattern is used;  otherwise  the         If  the  condition is satisfied, the yes-pattern is used; otherwise the
4113         no-pattern  (if  present)  is used. If there are more than two alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
4114         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
4115    
4116         There are four kinds of condition: references  to  subpatterns,  refer-         There  are  four  kinds of condition: references to subpatterns, refer-
4117         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
4118    
4119     Checking for a used subpattern by number     Checking for a used subpattern by number
4120    
4121         If  the  text between the parentheses consists of a sequence of digits,         If the text between the parentheses consists of a sequence  of  digits,
4122         the condition is true if the capturing subpattern of  that  number  has         the  condition  is  true if the capturing subpattern of that number has
4123         previously matched.         previously matched. An alternative notation is to  precede  the  digits
4124           with a plus or minus sign. In this case, the subpattern number is rela-
4125           tive rather than absolute.  The most recently opened parentheses can be
4126           referenced  by  (?(-1),  the  next most recent by (?(-2), and so on. In
4127           looping constructs it can also make sense to refer to subsequent groups
4128           with constructs such as (?(+2).
4129    
4130         Consider  the  following  pattern, which contains non-significant white         Consider  the  following  pattern, which contains non-significant white
4131         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
# Line 4077  CONDITIONAL SUBPATTERNS Line 4144  CONDITIONAL SUBPATTERNS
4144         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
4145         optionally enclosed in parentheses.         optionally enclosed in parentheses.
4146    
4147           If you were embedding this pattern in a larger one,  you  could  use  a
4148           relative reference:
4149    
4150             ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
4151    
4152           This  makes  the  fragment independent of the parentheses in the larger
4153           pattern.
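
The conditional pattern discussed above can be tried out with a short sketch. This one uses Python's re module as a stand-in, not PCRE itself. That module accepts the absolute form (?(1)...) but, as far as this sketch assumes, not the relative form (?(-1)...), so the group is referenced by number. The names PAREN_ITEM and is_item are hypothetical and exist only for this illustration.

```python
import re

# Stand-in for the PCRE pattern  ( \( )?  [^()]+  (?(1) \) )
# re.VERBOSE plays the role of PCRE_EXTENDED: white space in the pattern
# is ignored, so the pattern can be laid out readably.
PAREN_ITEM = re.compile(
    r"""
    ( \( )?      # group 1: an optional opening parenthesis
    [^()]+       # one or more characters that are not parentheses
    (?(1) \) )   # a closing parenthesis is required only if group 1 matched
    """,
    re.VERBOSE,
)


def is_item(text):
    """Return True if the whole of text is a run of non-parentheses,
    optionally enclosed in one pair of parentheses."""
    return PAREN_ITEM.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["abc", "(abc)", "(abc", "abc)"]:
        print(repr(candidate), is_item(candidate))
```

With PCRE, writing (?(-1) \) ) in place of (?(1) \) ) would keep the fragment correct even when it is embedded after other capturing groups, which is the point of the relative form.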
4154    
4155     Checking for a used subpattern by name     Checking for a used subpattern by name
4156    
4157         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
# Line 4218  RECURSIVE PATTERNS Line 4293  RECURSIVE PATTERNS
4293           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
4294    
4295         We  have  put the pattern into parentheses, and caused the recursion to         We  have  put the pattern into parentheses, and caused the recursion to
4296         refer to them instead of the whole pattern. In a larger pattern,  keep-         refer to them instead of the whole pattern.
4297         ing  track  of parenthesis numbers can be tricky. It may be more conve-  
4298         nient to use named parentheses instead. The Perl  syntax  for  this  is         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
4299         (?&name);  PCRE's  earlier syntax (?P>name) is also supported. We could         tricky.  This is made easier by the use of relative references. (A Perl
4300         rewrite the above example as follows:         5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write
4301           (?-2) to refer to the second most recently opened parentheses preceding
4302           the recursion. In other  words,  a  negative  number  counts  capturing
4303           parentheses leftwards from the point at which it is encountered.
4304    
4305           It  is  also  possible  to refer to subsequently opened parentheses, by
4306           writing references such as (?+2). However, these  cannot  be  recursive
4307           because  the  reference  is  not inside the parentheses that are refer-
4308           enced. They are always "subroutine" calls, as  described  in  the  next
4309           section.
4310    
4311           An  alternative  approach is to use named parentheses instead. The Perl
4312           syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
4313           supported. We could rewrite the above example as follows:
4314    
4315           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
4316    
4317         If there is more than one subpattern with the same name,  the  earliest         If  there  is more than one subpattern with the same name, the earliest
4318         one  is used. This particular example pattern contains nested unlimited         one is used.
4319         repeats, and so the use of atomic grouping for matching strings of non-  
4320         parentheses  is  important when applying the pattern to strings that do         This particular example pattern that we have been looking  at  contains
4321         not match. For example, when this pattern is applied to         nested  unlimited repeats, and so the use of atomic grouping for match-
4322           ing strings of non-parentheses is important when applying  the  pattern
4323           to strings that do not match. For example, when this pattern is applied
4324           to
4325    
4326           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4327    
# Line 4280  SUBPATTERNS AS SUBROUTINES Line 4371  SUBPATTERNS AS SUBROUTINES
4371         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
4372         by  name)  is used outside the parentheses to which it refers, it oper-         by  name)  is used outside the parentheses to which it refers, it oper-
4373         ates like a subroutine in a programming language. The "called"  subpat-         ates like a subroutine in a programming language. The "called"  subpat-
4374         tern  may  be defined before or after the reference. An earlier example         tern may be defined before or after the reference. A numbered reference
4375         pointed out that the pattern         can be absolute or relative, as in these examples:
4376    
4377             (...(absolute)...)...(?2)...
4378             (...(relative)...)...(?-1)...
4379             (...(?+1)...(relative)...
4380    
4381           An earlier example pointed out that the pattern
4382    
4383           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4384    
# Line 4303  SUBPATTERNS AS SUBROUTINES Line 4400  SUBPATTERNS AS SUBROUTINES
4400         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
4401         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
4402    
4403           (abc)(?i:(?1))           (abc)(?i:(?-1))
4404    
4405         It matches "abcabc". It does not match "abcABC" because the  change  of         It matches "abcabc". It does not match "abcABC" because the  change  of
4406         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
# Line 4358  AUTHOR Line 4455  AUTHOR
4455    
4456  REVISION  REVISION
4457    
4458         Last updated: 06 March 2007         Last updated: 29 May 2007
4459         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4460  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4461    
# Line 4439  RESTRICTED PATTERNS FOR PCRE_PARTIAL Line 4536  RESTRICTED PATTERNS FOR PCRE_PARTIAL
4536    
4537         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the
4538         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
4539         (-13).         (-13).  You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo()  to
4540           find out if a compiled pattern can be used for partial matching.
4541    
4542    
4543  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
4544    
4545         If the escape sequence \P is present  in  a  pcretest  data  line,  the         If  the  escape  sequence  \P  is  present in a pcretest data line, the
4546         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
4547         uses the date example quoted above:         uses the date example quoted above:
4548    
# Line 4461  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 4559  EXAMPLE OF PARTIAL MATCHING USING PCRETE
4559           data> j\P           data> j\P
4560           No match           No match
4561    
4562         The first data string is matched  completely,  so  pcretest  shows  the         The  first  data  string  is  matched completely, so pcretest shows the
4563         matched  substrings.  The  remaining four strings do not match the com-         matched substrings. The remaining four strings do not  match  the  com-
4564         plete pattern, but the first two are partial matches.  The  same  test,         plete  pattern,  but  the first two are partial matches. The same test,
4565         using  pcre_dfa_exec()  matching  (by means of the \D escape sequence),         using pcre_dfa_exec() matching (by means of the  \D  escape  sequence),
4566         produces the following output:         produces the following output:
4567    
4568             re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4569           data> 25jun04\P\D           data> 25jun04\P\D
4570            0: 25jun04            0: 25jun04
4571           data> 23dec3\P\D           data> 23dec3\P\D
# Line 4479  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 4577  EXAMPLE OF PARTIAL MATCHING USING PCRETE
4577           data> j\P\D           data> j\P\D
4578           No match           No match
4579    
4580         Notice that in this case the portion of the string that was matched  is         Notice  that in this case the portion of the string that was matched is
4581         made available.         made available.
4582    
4583    
4584  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
4585    
4586         When a partial match has been found using pcre_dfa_exec(), it is possi-         When a partial match has been found using pcre_dfa_exec(), it is possi-
4587         ble to continue the match by  providing  additional  subject  data  and         ble  to  continue  the  match  by providing additional subject data and
4588         calling  pcre_dfa_exec()  again  with the same compiled regular expres-         calling pcre_dfa_exec() again with the same  compiled  regular  expres-
4589         sion, this time setting the PCRE_DFA_RESTART option. You must also pass         sion, this time setting the PCRE_DFA_RESTART option. You must also pass
4590         the  same working space as before, because this is where details of the         the same working space as before, because this is where details of  the
4591         previous partial match are stored. Here is an example  using  pcretest,         previous  partial  match are stored. Here is an example using pcretest,
4592         using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and         using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and
4593         \D are as above):         \D are as above):
4594    
4595             re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4596           data> 23ja\P\D           data> 23ja\P\D
4597           Partial match: 23ja           Partial match: 23ja
4598           data> n05\R\D           data> n05\R\D
4599            0: n05            0: n05
4600    
4601         The first call has "23ja" as the subject, and requests  partial  match-         The  first  call has "23ja" as the subject, and requests partial match-
4602         ing;  the  second  call  has  "n05"  as  the  subject for the continued         ing; the second call  has  "n05"  as  the  subject  for  the  continued
4603         (restarted) match.  Notice that when the match is  complete,  only  the         (restarted)  match.   Notice  that when the match is complete, only the
4604         last  part  is  shown;  PCRE  does not retain the previously partially-         last part is shown; PCRE does  not  retain  the  previously  partially-
4605         matched string. It is up to the calling program to do that if it  needs         matched  string. It is up to the calling program to do that if it needs
4606         to.         to.
4607    
4608         You  can  set  PCRE_PARTIAL  with  PCRE_DFA_RESTART to continue partial         You can set PCRE_PARTIAL  with  PCRE_DFA_RESTART  to  continue  partial
4609         matching over multiple segments. This facility can be used to pass very         matching over multiple segments. This facility can be used to pass very
4610         long  subject  strings to pcre_dfa_exec(). However, some care is needed         long subject strings to pcre_dfa_exec(). However, some care  is  needed
4611         for certain types of pattern.         for certain types of pattern.
4612    
4613         1. If the pattern contains tests for the beginning or end  of  a  line,         1.  If  the  pattern contains tests for the beginning or end of a line,
4614         you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-         you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-
4615         ate, when the subject string for any call does not contain  the  begin-         ate,  when  the subject string for any call does not contain the begin-
4616         ning or end of a line.         ning or end of a line.
4617    
4618         2.  If  the  pattern contains backward assertions (including \b or \B),         2. If the pattern contains backward assertions (including  \b  or  \B),
4619         you need to arrange for some overlap in the subject  strings  to  allow         you  need  to  arrange for some overlap in the subject strings to allow
4620         for  this.  For  example, you could pass the subject in chunks that are         for this. For example, you could pass the subject in  chunks  that  are
4621         500 bytes long, but in a buffer of 700 bytes, with the starting  offset         500  bytes long, but in a buffer of 700 bytes, with the starting offset
4622         set to 200 and the previous 200 bytes at the start of the buffer.         set to 200 and the previous 200 bytes at the start of the buffer.
4623    
4624         3.  Matching a subject string that is split into multiple segments does         3. Matching a subject string that is split into multiple segments  does
4625         not always produce exactly the same result as matching over one  single         not  always produce exactly the same result as matching over one single
4626         long  string.   The  difference arises when there are multiple matching         long string.  The difference arises when there  are  multiple  matching
4627         possibilities, because a partial match result is given only when  there         possibilities,  because a partial match result is given only when there
4628         are  no completed matches in a call to pcre_dfa_exec(). This means that         are no completed matches in a call to pcre_dfa_exec(). This means  that
4629         as soon as the shortest match has been found,  continuation  to  a  new         as  soon  as  the  shortest match has been found, continuation to a new
4630         subject segment is no longer possible.  Consider this pcretest example:         subject segment is no longer possible.  Consider this pcretest example:
4631    
4632             re> /dog(sbody)?/             re> /dog(sbody)?/
# Line 4540  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4638  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4638            0: dogsbody            0: dogsbody
4639            1: dog            1: dog
4640    
4641         The pattern matches the words "dog" or "dogsbody". When the subject  is         The  pattern matches the words "dog" or "dogsbody". When the subject is
4642         presented  in  several  parts  ("do" and "gsb" being the first two) the         presented in several parts ("do" and "gsb" being  the  first  two)  the
4643         match stops when "dog" has been found, and it is not possible  to  con-         match  stops  when "dog" has been found, and it is not possible to con-
4644         tinue.  On  the  other  hand,  if  "dogsbody"  is presented as a single         tinue. On the other hand,  if  "dogsbody"  is  presented  as  a  single
4645         string, both matches are found.         string, both matches are found.
4646    
4647         Because of this phenomenon, it does not usually make  sense  to  end  a         Because  of  this  phenomenon,  it does not usually make sense to end a
4648         pattern that is going to be matched in this way with a variable repeat.         pattern that is going to be matched in this way with a variable repeat.
4649    
4650         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
# Line 4555  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4653  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4653    
4654           1234|3789           1234|3789
4655    
4656         If the first part of the subject is "ABC123", a partial  match  of  the         If  the  first  part of the subject is "ABC123", a partial match of the
4657         first  alternative  is found at offset 3. There is no partial match for         first alternative is found at offset 3. There is no partial  match  for
4658         the second alternative, because such a match does not start at the same         the second alternative, because such a match does not start at the same
4659         point  in  the  subject  string. Attempting to continue with the string         point in the subject string. Attempting to  continue  with  the  string
4660         "789" does not yield a match because only those alternatives that match         "789" does not yield a match because only those alternatives that match
4661         at  one point in the subject are remembered. The problem arises because         at one point in the subject are remembered. The problem arises  because
4662         the start of the second alternative matches within the  first  alterna-         the  start  of the second alternative matches within the first alterna-
4663         tive. There is no problem with anchored patterns or patterns such as:         tive. There is no problem with anchored patterns or patterns such as:
4664    
4665           1234|ABCD           1234|ABCD
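
The buffer arrangement suggested in point 2 above (500-byte chunks in a 700-byte buffer, with the previous 200 bytes kept at the start) can be sketched as follows. This is only an illustration of the bookkeeping, written in Python and without calling PCRE. The function name overlapped_chunks and its parameters are hypothetical. In a real program each buffer would be the subject passed to pcre_dfa_exec(), and the returned offset would be the starting offset for that call.

```python
def overlapped_chunks(data, chunk_size=500, overlap=200):
    """Yield (buffer, start_offset) pairs covering data.

    Each buffer holds up to `overlap` bytes of already-processed data
    followed by the next chunk of up to `chunk_size` new bytes.
    start_offset is the index in the buffer where the new data begins,
    so backward assertions can look at the bytes before it.
    """
    for pos in range(0, len(data), chunk_size):
        lookbehind = data[max(0, pos - overlap):pos]  # empty for the first chunk
        chunk = data[pos:pos + chunk_size]
        yield lookbehind + chunk, len(lookbehind)


if __name__ == "__main__":
    subject = bytes(range(256)) * 5  # 1280 bytes of sample data
    for buffer, start in overlapped_chunks(subject):
        print(len(buffer), start)
```

The first buffer has no preceding data, so its starting offset is 0. Later buffers start at offset 200, matching the figures in the text.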
# Line 4578  AUTHOR Line 4676  AUTHOR
4676    
4677  REVISION  REVISION
4678    
4679         Last updated: 06 March 2007         Last updated: 04 June 2007
4680         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4681  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4682    
# Line 4603  SAVING AND RE-USING PRECOMPILED PCRE PAT Line 4701  SAVING AND RE-USING PRECOMPILED PCRE PAT
4701         ent  host  and  run them there. This works even if the new host has the         ent  host  and  run them there. This works even if the new host has the
4702         opposite endianness to the one on which  the  patterns  were  compiled.         opposite endianness to the one on which  the  patterns  were  compiled.
4703         There  may  be a small performance penalty, but it should be insignifi-         There  may  be a small performance penalty, but it should be insignifi-
4704         cant.         cant. However, compiling regular expressions with one version  of  PCRE
4705           for  use  with  a  different  version is not guaranteed to work and may
4706           cause crashes.
4707    
4708    
4709  SAVING A COMPILED PATTERN  SAVING A COMPILED PATTERN
# Line 4710  AUTHOR Line 4810  AUTHOR
4810    
4811  REVISION  REVISION
4812    
4813         Last updated: 06 March 2007         Last updated: 24 April 2007
4814         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4815  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4816    
# Line 5178  MATCHING INTERFACE Line 5278  MATCHING INTERFACE
5278         return false (because the empty string is not a valid number):         return false (because the empty string is not a valid number):
5279    
5280            int number;            int number;
5281            pcrecpp::RE::FullMatch("abc", "[a-z]+(\d+)?", &number);            pcrecpp::RE::FullMatch("abc", "[a-z]+(\\d+)?", &number);
5282    
5283         The matching interface supports at most 16 arguments per call.  If  you         The matching interface supports at most 16 arguments per call.  If  you
5284         need    more,    consider    using    the    more   general   interface         need    more,    consider    using    the    more   general   interface
