Diff of /code/trunk/doc/pcre.txt

revision 566 by ph10, Fri Jun 25 14:42:00 2010 UTC revision 567 by ph10, Sat Nov 6 17:10:00 2010 UTC
# Line 2  Line 2 
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
5  synopses of each function in the library have not been included. Neither has  synopses of each function in the library have not been included. Neither has
6  the pcredemo program. There are separate text files for the pcregrep and  the pcredemo program. There are separate text files for the pcregrep and
7  pcretest commands.  pcretest commands.
8  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
# Line 226  UTF-8 AND UNICODE PROPERTY SUPPORT Line 226  UTF-8 AND UNICODE PROPERTY SUPPORT
226         PCRE  recognizes  as digits, spaces, or word characters remain the same         PCRE  recognizes  as digits, spaces, or word characters remain the same
227         set as before, all with values less than 256. This  remains  true  even         set as before, all with values less than 256. This  remains  true  even
228         when  PCRE  is built to include Unicode property support, because to do         when  PCRE  is built to include Unicode property support, because to do
229         otherwise would slow down PCRE in many common  cases.  Note  that  this         otherwise would slow down PCRE in many common cases. Note in particular
230         also applies to \b, because it is defined in terms of \w and \W. If you         that this applies to \b and \B, because they are defined in terms of \w
231         really want to test for a wider sense of, say,  "digit",  you  can  use         and \W. If you really want to test for a wider sense of, say,  "digit",
232         explicit  Unicode property tests such as \p{Nd}.  Alternatively, if you         you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna-
233         set the PCRE_UCP option, the way that the  character  escapes  work  is         tively, if you set the PCRE_UCP option,  the  way  that  the  character
234         changed  so that Unicode properties are used to determine which charac-         escapes  work  is changed so that Unicode properties are used to deter-
235         ters match. There are more details in the section on generic  character         mine which characters match. There are more details in the  section  on
236         types in the pcrepattern documentation.         generic character types in the pcrepattern documentation.
237    
238         7.  Similarly,  characters that match the POSIX named character classes         7.  Similarly,  characters that match the POSIX named character classes
239         are all low-valued characters, unless the PCRE_UCP option is set.         are all low-valued characters, unless the PCRE_UCP option is set.
# Line 267  AUTHOR Line 267  AUTHOR
267    
268  REVISION  REVISION
269    
270         Last updated: 12 May 2010         Last updated: 22 October 2010
271         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
272  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
273    
274    
275  PCREBUILD(3)                                                      PCREBUILD(3)  PCREBUILD(3)                                                      PCREBUILD(3)
276    
277    
# Line 601  REVISION Line 601  REVISION
601         Last updated: 29 September 2009         Last updated: 29 September 2009
602         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
603  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
604    
605    
606  PCREMATCHING(3)                                                PCREMATCHING(3)  PCREMATCHING(3)                                                PCREMATCHING(3)
607    
608    
# Line 771  ADVANTAGES OF THE ALTERNATIVE ALGORITHM Line 771  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
771         2.  Because  the  alternative  algorithm  scans the subject string just         2.  Because  the  alternative  algorithm  scans the subject string just
772         once, and never needs to backtrack, it is possible to  pass  very  long         once, and never needs to backtrack, it is possible to  pass  very  long
773         subject  strings  to  the matching function in several pieces, checking         subject  strings  to  the matching function in several pieces, checking
774         for partial matching each time.  The  pcrepartial  documentation  gives         for partial matching each time. It  is  possible  to  do  multi-segment
775         details of partial matching.         matching using pcre_exec() (by retaining partially matched substrings),
776           but it is more complicated. The pcrepartial documentation gives details
777           of partial matching and discusses multi-segment matching.
778    
779    
780  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
# Line 798  AUTHOR Line 800  AUTHOR
800    
801  REVISION  REVISION
802    
803         Last updated: 29 September 2009         Last updated: 22 October 2010
804         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
805  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
806    
807    
808  PCREAPI(3)                                                          PCREAPI(3)  PCREAPI(3)                                                          PCREAPI(3)
809    
810    
# Line 2123  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2125  MATCHING A PATTERN: THE TRADITIONAL FUNC
2125         set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that         set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that
2126         fails, by advancing the starting offset (see below) and trying an ordi-         fails, by advancing the starting offset (see below) and trying an ordi-
2127         nary  match  again. There is some code that demonstrates how to do this         nary  match  again. There is some code that demonstrates how to do this
2128         in the pcredemo sample program.         in the pcredemo sample program. In the most general case, you  have  to
2129           check  to  see  if the newline convention recognizes CRLF as a newline,
2130           and if so, and the current character is CR followed by LF, advance  the
2131           starting offset by two characters instead of one.
2132    
2133           PCRE_NO_START_OPTIMIZE           PCRE_NO_START_OPTIMIZE
2134    
2135         There are a number of optimizations that pcre_exec() uses at the  start         There  are a number of optimizations that pcre_exec() uses at the start
2136         of  a  match,  in  order to speed up the process. For example, if it is         of a match, in order to speed up the process. For  example,  if  it  is
2137         known that an unanchored match must start with a specific character, it         known that an unanchored match must start with a specific character, it
2138         searches  the  subject  for that character, and fails immediately if it         searches the subject for that character, and fails  immediately  if  it
2139         cannot find it, without actually running the  main  matching  function.         cannot  find  it,  without actually running the main matching function.
2140         This means that a special item such as (*COMMIT) at the start of a pat-         This means that a special item such as (*COMMIT) at the start of a pat-
2141         tern is not considered until after a suitable starting  point  for  the         tern  is  not  considered until after a suitable starting point for the
2142         match  has been found. When callouts or (*MARK) items are in use, these         match has been found. When callouts or (*MARK) items are in use,  these
2143         "start-up" optimizations can cause them to be skipped if the pattern is         "start-up" optimizations can cause them to be skipped if the pattern is
2144         never  actually  used.  The start-up optimizations are in effect a pre-         never actually used. The start-up optimizations are in  effect  a  pre-
2145         scan of the subject that takes place before the pattern is run.         scan of the subject that takes place before the pattern is run.
2146    
2147         The PCRE_NO_START_OPTIMIZE option disables the start-up  optimizations,         The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
2148         possibly  causing  performance  to  suffer,  but ensuring that in cases         possibly causing performance to suffer,  but  ensuring  that  in  cases
2149         where the result is "no match", the callouts do occur, and  that  items         where  the  result is "no match", the callouts do occur, and that items
2150         such as (*COMMIT) and (*MARK) are considered at every possible starting         such as (*COMMIT) and (*MARK) are considered at every possible starting
2151         position in the subject  string.   Setting  PCRE_NO_START_OPTIMIZE  can         position  in  the  subject  string.  Setting PCRE_NO_START_OPTIMIZE can
2152         change the outcome of a matching operation.  Consider the pattern         change the outcome of a matching operation.  Consider the pattern
2153    
2154           (*COMMIT)ABC           (*COMMIT)ABC
2155    
2156         When  this  is  compiled, PCRE records the fact that a match must start         When this is compiled, PCRE records the fact that a  match  must  start
2157         with the character "A". Suppose the subject  string  is  "DEFABC".  The         with  the  character  "A".  Suppose the subject string is "DEFABC". The
2158         start-up  optimization  scans along the subject, finds "A" and runs the         start-up optimization scans along the subject, finds "A" and  runs  the
2159         first match attempt from there. The (*COMMIT) item means that the  pat-         first  match attempt from there. The (*COMMIT) item means that the pat-
2160         tern  must  match the current starting position, which in this case, it         tern must match the current starting position, which in this  case,  it
2161         does. However, if the same match  is  run  with  PCRE_NO_START_OPTIMIZE         does.  However,  if  the  same match is run with PCRE_NO_START_OPTIMIZE
2162         set,  the  initial  scan  along the subject string does not happen. The         set, the initial scan along the subject string  does  not  happen.  The
2163         first match attempt is run starting  from  "D"  and  when  this  fails,         first  match  attempt  is  run  starting  from "D" and when this fails,
2164         (*COMMIT)  prevents  any  further  matches  being tried, so the overall         (*COMMIT) prevents any further matches  being  tried,  so  the  overall
2165         result is "no match". If the pattern is studied,  more  start-up  opti-         result  is  "no  match". If the pattern is studied, more start-up opti-
2166         mizations  may  be  used. For example, a minimum length for the subject         mizations may be used. For example, a minimum length  for  the  subject
2167         may be recorded. Consider the pattern         may be recorded. Consider the pattern
2168    
2169           (*MARK:A)(X|Y)           (*MARK:A)(X|Y)
2170    
2171         The minimum length for a match is one  character.  If  the  subject  is         The  minimum  length  for  a  match is one character. If the subject is
2172         "ABC",  there  will  be  attempts  to  match "ABC", "BC", "C", and then         "ABC", there will be attempts to  match  "ABC",  "BC",  "C",  and  then
2173         finally an empty string.  If the pattern is studied, the final  attempt         finally  an empty string.  If the pattern is studied, the final attempt
2174         does  not take place, because PCRE knows that the subject is too short,         does not take place, because PCRE knows that the subject is too  short,
2175         and so the (*MARK) is never encountered.  In this  case,  studying  the         and  so  the  (*MARK) is never encountered.  In this case, studying the
2176         pattern  does  not  affect the overall match result, which is still "no         pattern does not affect the overall match result, which  is  still  "no
2177         match", but it does affect the auxiliary information that is returned.         match", but it does affect the auxiliary information that is returned.
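
The (*COMMIT)ABC example above can be sketched with a toy model. This is
not PCRE's implementation and it calls no PCRE function; the name
toy_commit_abc and its flag are hypothetical. It assumes the only start-up
optimization is a scan for the first required character "A", and that
(*COMMIT) allows exactly one match attempt.

```c
#include <assert.h>
#include <string.h>

/* Toy model of matching the pattern (*COMMIT)ABC against a C string.
   Returns the match offset, or -1 for "no match".
   use_startup_scan != 0 imitates the default behaviour (pre-scan for 'A');
   use_startup_scan == 0 imitates PCRE_NO_START_OPTIMIZE. */
int toy_commit_abc(const char *subject, int use_startup_scan)
{
    const char *start;

    assert(subject != NULL);
    start = subject;

    if (use_startup_scan) {
        /* Pre-scan: skip ahead to the first 'A' before running the pattern. */
        start = strchr(subject, 'A');
        if (start == NULL)
            return -1;
    }

    /* (*COMMIT) is reached at the first attempt, so there is exactly one
       attempt: if it fails here, no later starting position is tried. */
    return strncmp(start, "ABC", 3) == 0 ? (int)(start - subject) : -1;
}
```

Under these assumptions, toy_commit_abc("DEFABC", 1) yields offset 3, while
toy_commit_abc("DEFABC", 0) yields -1, mirroring the two outcomes described
in the text.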
2178    
2179           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
2180    
2181         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
2182         UTF-8  string is automatically checked when pcre_exec() is subsequently         UTF-8 string is automatically checked when pcre_exec() is  subsequently
2183         called.  The value of startoffset is also checked  to  ensure  that  it         called.   The  value  of  startoffset is also checked to ensure that it
2184         points  to  the start of a UTF-8 character. There is a discussion about         points to the start of a UTF-8 character. There is a  discussion  about
2185         the validity of UTF-8 strings in the section on UTF-8  support  in  the         the  validity  of  UTF-8 strings in the section on UTF-8 support in the
2186         main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,         main pcre page. If  an  invalid  UTF-8  sequence  of  bytes  is  found,
2187         pcre_exec() returns the error PCRE_ERROR_BADUTF8. If  startoffset  con-         pcre_exec()  returns  the error PCRE_ERROR_BADUTF8. If startoffset con-
2188         tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.         tains a value that does not point to the start of a UTF-8 character (or
2189           to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
2190    
2191         If  you  already  know that your subject is valid, and you want to skip         If  you  already  know that your subject is valid, and you want to skip
2192         these   checks   for   performance   reasons,   you   can    set    the         these   checks   for   performance   reasons,   you   can    set    the
# Line 2188  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2194  MATCHING A PATTERN: THE TRADITIONAL FUNC
2194         do this for the second and subsequent calls to pcre_exec() if  you  are         do this for the second and subsequent calls to pcre_exec() if  you  are
2195         making  repeated  calls  to  find  all  the matches in a single subject         making  repeated  calls  to  find  all  the matches in a single subject
2196         string. However, you should be  sure  that  the  value  of  startoffset         string. However, you should be  sure  that  the  value  of  startoffset
2197         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is         points  to  the start of a UTF-8 character (or the end of the subject).
2198         set, the effect of passing an invalid UTF-8 string as a subject,  or  a         When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid  UTF-8
2199         value  of startoffset that does not point to the start of a UTF-8 char-         string  as  a  subject or an invalid value of startoffset is undefined.
2200         acter, is undefined. Your program may crash.         Your program may crash.
2201    
2202           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2203           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
# Line 2200  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2206  MATCHING A PATTERN: THE TRADITIONAL FUNC
2206         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2207         match occurs if the end of the subject string is reached  successfully,         match occurs if the end of the subject string is reached  successfully,
2208         but  there  are not enough subject characters to complete the match. If         but  there  are not enough subject characters to complete the match. If
2209         this happens when PCRE_PARTIAL_HARD  is  set,  pcre_exec()  immediately         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
2210         returns  PCRE_ERROR_PARTIAL.  Otherwise,  if  PCRE_PARTIAL_SOFT is set,         matching  continues  by  testing any remaining alternatives. Only if no
2211         matching continues by testing any other alternatives. Only if they  all         complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of
2212         fail  is  PCRE_ERROR_PARTIAL  returned (instead of PCRE_ERROR_NOMATCH).         PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the
2213         The portion of the string that was inspected when the partial match was         caller is prepared to handle a partial match, but only if  no  complete
2214         found  is  set  as  the first matching string. There is a more detailed         match can be found.
2215         discussion in the pcrepartial documentation.  
2216           If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this
2217           case, if a partial match  is  found,  pcre_exec()  immediately  returns
2218           PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In
2219           other words, when PCRE_PARTIAL_HARD is set, a partial match is
2220           considered to be more important than an alternative complete match.
2221    
2222           In  both  cases,  the portion of the string that was inspected when the
2223           partial match was found is set as the first matching string. There is a
2224           more  detailed  discussion  of partial and multi-segment matching, with
2225           examples, in the pcrepartial documentation.
2226    
2227     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2228    
2229         The subject string is passed to pcre_exec() as a pointer in subject,  a         The subject string is passed to pcre_exec() as a pointer in subject,  a
2230         length (in bytes) in length, and a starting byte offset in startoffset.         length (in bytes) in length, and a starting byte offset in startoffset.
2231           If this is  negative  or  greater  than  the  length  of  the  subject,
2232           pcre_exec() returns PCRE_ERROR_BADOFFSET.
2233    
2234         In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-         In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
2235         acter.  Unlike  the pattern string, the subject may contain binary zero         acter (or the end of the subject). Unlike the pattern string, the  sub-
2236         bytes. When the starting offset is zero, the search for a match  starts         ject  may  contain binary zero bytes. When the starting offset is zero,
2237         at  the  beginning  of  the subject, and this is by far the most common         the search for a match starts at the beginning of the subject, and this
2238         case.         is by far the most common case.
2239    
2240         A non-zero starting offset is useful when searching for  another  match         A  non-zero  starting offset is useful when searching for another match
2241         in  the same subject by calling pcre_exec() again after a previous suc-         in the same subject by calling pcre_exec() again after a previous  suc-
2242         cess.  Setting startoffset differs from just passing over  a  shortened         cess.   Setting  startoffset differs from just passing over a shortened
2243         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
2244         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
2245    
2246           \Biss\B           \Biss\B
2247    
2248         which finds occurrences of "iss" in the middle of  words.  (\B  matches         which  finds  occurrences  of "iss" in the middle of words. (\B matches
2249         only  if  the  current position in the subject is not a word boundary.)         only if the current position in the subject is not  a  word  boundary.)
2250         When applied to the string "Mississippi" the first call  to  pcre_exec()         When  applied  to the string "Mississippi" the first call to pcre_exec()
2251         finds  the  first  occurrence. If pcre_exec() is called again with just         finds the first occurrence. If pcre_exec() is called  again  with  just
2252         the remainder of the subject,  namely  "issippi",  it  does  not  match,         the  remainder  of  the  subject,  namely  "issippi", it does not match,
2253         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
2254         to be a word boundary. However, if pcre_exec()  is  passed  the  entire         to  be  a  word  boundary. However, if pcre_exec() is passed the entire
2255         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
2256         rence of "iss" because it is able to look behind the starting point  to         rence  of "iss" because it is able to look behind the starting point to
2257         discover that it is preceded by a letter.         discover that it is preceded by a letter.
2258    
2259           Finding all the matches in a subject is tricky  when  the  pattern  can
2260           match an empty string. It is possible to emulate Perl's /g behaviour by
2261           first  trying  the  match  again  at  the   same   offset,   with   the
2262           PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED  options,  and  then  if that
2263           fails, advancing the starting  offset  and  trying  an  ordinary  match
2264           again. There is some code that demonstrates how to do this in the pcre-
2265           demo sample program. In the most general case, you have to check to see
2266           if  the newline convention recognizes CRLF as a newline, and if so, and
2267           the current character is CR followed by LF, advance the starting offset
2268           by two characters instead of one.
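
The advancing step described above might look like the helper below. It is
a sketch under stated assumptions: the function name and its parameters are
hypothetical, the caller is assumed to have already determined whether the
newline convention recognizes CRLF, and the UTF-8 branch (skipping
continuation bytes so the new offset lands on a character start) is an
addition that the paragraph above does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the next starting offset after an empty match at `offset` when
   the anchored, non-empty retry at the same offset has failed.
   Returns -1 if the end of the subject has been reached. */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline, int utf8)
{
    assert(subject != NULL && offset >= 0 && offset <= length);

    if (offset >= length)
        return -1;                      /* nothing left to search */

    /* Step over CR LF as a unit when CRLF counts as a newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;                           /* otherwise advance by one */

    if (utf8) {
        /* Assumption: valid UTF-8, so 10xxxxxx bytes are continuations. */
        while (offset < length &&
               ((unsigned char)subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

The returned value would then be passed as startoffset to an ordinary
(unanchored) call of pcre_exec(); a return of -1 ends the loop.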
2269    
2270         If  a  non-zero starting offset is passed when the pattern is anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
2271         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
2272         if  the  pattern  does  not require the match to be at the start of the         if  the  pattern  does  not require the match to be at the start of the
# Line 2420  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2450  MATCHING A PATTERN: THE TRADITIONAL FUNC
2450    
2451         An invalid combination of PCRE_NEWLINE_xxx options was given.         An invalid combination of PCRE_NEWLINE_xxx options was given.
2452    
2453             PCRE_ERROR_BADOFFSET      (-24)
2454    
2455           The value of startoffset was negative or greater than the length of the
2456           subject, that is, the value in length.
2457    
2458         Error numbers -16 to -20 and -22 are not used by pcre_exec().         Error numbers -16 to -20 and -22 are not used by pcre_exec().
2459    
2460    
# Line 2436  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2471  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2471         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2472              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2473    
2474         Captured substrings can be  accessed  directly  by  using  the  offsets         Captured  substrings  can  be  accessed  directly  by using the offsets
2475         returned  by  pcre_exec()  in  ovector.  For convenience, the functions         returned by pcre_exec() in  ovector.  For  convenience,  the  functions
2476         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2477         string_list()  are  provided for extracting captured substrings as new,         string_list() are provided for extracting captured substrings  as  new,
2478         separate, zero-terminated strings. These functions identify  substrings         separate,  zero-terminated strings. These functions identify substrings
2479         by  number.  The  next section describes functions for extracting named         by number. The next section describes functions  for  extracting  named
2480         substrings.         substrings.
2481    
2482         A substring that contains a binary zero is correctly extracted and  has         A  substring that contains a binary zero is correctly extracted and has
2483         a  further zero added on the end, but the result is not, of course, a C         a further zero added on the end, but the result is not, of course, a  C
2484         string.  However, you can process such a string  by  referring  to  the         string.   However,  you  can  process such a string by referring to the
2485         length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-         length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
2486         string().  Unfortunately, the interface to pcre_get_substring_list() is         string().  Unfortunately, the interface to pcre_get_substring_list() is
2487         not  adequate for handling strings containing binary zeros, because the         not adequate for handling strings containing binary zeros, because  the
2488         end of the final string is not independently indicated.         end of the final string is not independently indicated.
2489    
2490         The first three arguments are the same for all  three  of  these  func-         The  first  three  arguments  are the same for all three of these func-
2491         tions:  subject  is  the subject string that has just been successfully         tions: subject is the subject string that has  just  been  successfully
2492         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2493         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2494         were captured by the match, including the substring  that  matched  the         were  captured  by  the match, including the substring that matched the
2495         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
2496         it is greater than zero. If pcre_exec() returned zero, indicating  that         it  is greater than zero. If pcre_exec() returned zero, indicating that
2497         it  ran out of space in ovector, the value passed as stringcount should         it ran out of space in ovector, the value passed as stringcount  should
2498         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
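
The rule for choosing stringcount can be written as a one-line helper. The
function name is hypothetical; rc stands for a non-negative value returned
by pcre_exec() and ovecsize for the number of elements in ovector.

```c
#include <assert.h>

/* Value to pass as stringcount to the substring extraction functions.
   rc > 0  : pcre_exec() reported that many substrings.
   rc == 0 : ovector was too small, so use a third of its element count. */
int substring_count(int rc, int ovecsize)
{
    assert(rc >= 0 && ovecsize >= 0);
    return rc > 0 ? rc : ovecsize / 3;
}
```

Negative return values from pcre_exec() are errors and should be handled
before this point, which is why the sketch rejects them.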
2499    
2500         The functions pcre_copy_substring() and pcre_get_substring() extract  a         The  functions pcre_copy_substring() and pcre_get_substring() extract a
2501         single  substring,  whose  number  is given as stringnumber. A value of         single substring, whose number is given as  stringnumber.  A  value  of
2502         zero extracts the substring that matched the  entire  pattern,  whereas         zero  extracts  the  substring that matched the entire pattern, whereas
2503         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
2504         string(), the string is placed in buffer,  whose  length  is  given  by         string(),  the  string  is  placed  in buffer, whose length is given by
2505         buffersize,  while  for  pcre_get_substring()  a new block of memory is         buffersize, while for pcre_get_substring() a new  block  of  memory  is
2506         obtained via pcre_malloc, and its address is  returned  via  stringptr.         obtained  via  pcre_malloc,  and its address is returned via stringptr.
2507         The  yield  of  the function is the length of the string, not including         The yield of the function is the length of the  string,  not  including
2508         the terminating zero, or one of these error codes:         the terminating zero, or one of these error codes:
2509    
2510           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2511    
2512         The buffer was too small for pcre_copy_substring(), or the  attempt  to         The  buffer  was too small for pcre_copy_substring(), or the attempt to
2513         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2514    
2515           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2516    
2517         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
2518    
2519         The  pcre_get_substring_list()  function  extracts  all  available sub-         The pcre_get_substring_list()  function  extracts  all  available  sub-
2520         strings and builds a list of pointers to them. All this is  done  in  a         strings  and  builds  a list of pointers to them. All this is done in a
2521         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
2522         the memory block is returned via listptr, which is also  the  start  of         the  memory  block  is returned via listptr, which is also the start of
2523         the  list  of  string pointers. The end of the list is marked by a NULL         the list of string pointers. The end of the list is marked  by  a  NULL
2524         pointer. The yield of the function is zero if all  went  well,  or  the         pointer.  The  yield  of  the function is zero if all went well, or the
2525         error code         error code
2526    
2527           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2528    
2529         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
2530    
2531         When  any of these functions encounter a substring that is unset, which         When any of these functions encounter a substring that is unset,  which
2532         can happen when capturing subpattern number n+1 matches  some  part  of         can  happen  when  capturing subpattern number n+1 matches some part of
2533         the  subject, but subpattern n has not been used at all, they return an         the subject, but subpattern n has not been used at all, they return  an
2534         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
2535         string  by inspecting the appropriate offset in ovector, which is nega-         string by inspecting the appropriate offset in ovector, which is  nega-
2536         tive for unset substrings.         tive for unset substrings.
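
       The check described above can be sketched as follows. This is only an
       illustration: the helper names are hypothetical and are not part of
       PCRE. It assumes that ovector was filled in by a successful call to
       pcre_exec() and that n is less than the stringcount value it returned.

```c
/* Hypothetical helper, not a PCRE function: returns 1 if capturing
   substring n is unset, 0 if it is set (possibly with zero length).
   The caller must ensure 0 <= n < stringcount. */
int substring_is_unset(const int *ovector, int n)
{
    /* Substring n occupies ovector[2*n] (start) and ovector[2*n+1] (end);
       a negative start offset marks an unset substring. */
    return ovector[2 * n] < 0;
}

/* Hypothetical helper: length of a substring that is known to be set.
   Zero is a legitimate result here. */
int substring_length(const int *ovector, int n)
{
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

       Calling substring_is_unset() before pcre_copy_substring() or
       pcre_get_substring() lets a program tell "did not participate in the
       match" apart from "matched an empty string", which the returned empty
       string alone cannot do.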
2537    
2538         The two convenience functions pcre_free_substring() and  pcre_free_sub-         The  two convenience functions pcre_free_substring() and pcre_free_sub-
2539         string_list()  can  be  used  to free the memory returned by a previous         string_list() can be used to free the memory  returned  by  a  previous
2540         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2541         tively.  They  do  nothing  more  than  call the function pointed to by         tively. They do nothing more than  call  the  function  pointed  to  by
2542         pcre_free, which of course could be called directly from a  C  program.         pcre_free,  which  of course could be called directly from a C program.
2543         However,  PCRE is used in some situations where it is linked via a spe-         However, PCRE is used in some situations where it is linked via a  spe-
2544         cial  interface  to  another  programming  language  that  cannot   use         cial   interface  to  another  programming  language  that  cannot  use
2545         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free directly; it is for these cases that the functions  are  pro-
2546         vided.         vided.
2547    
2548    
# Line 2526  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2561  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2561              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2562              const char **stringptr);              const char **stringptr);
2563    
2564         To extract a substring by name, you first have to find the associated         To extract a substring by name, you first have to find the associated
2565         number. For example, for this pattern         number. For example, for this pattern
2566    
2567           (a+)b(?<xxx>\d+)...           (a+)b(?<xxx>\d+)...
# Line 2535  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2570  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2570         be unique (PCRE_DUPNAMES was not set), you can find the number from the         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2571         name by calling pcre_get_stringnumber(). The first argument is the com-         name by calling pcre_get_stringnumber(). The first argument is the com-
2572         piled pattern, and the second is the name. The yield of the function is         piled pattern, and the second is the name. The yield of the function is
2573         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no         the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no
2574         subpattern of that name.         subpattern of that name.
2575    
2576         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2577         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
2578         are also two functions that do the whole job.         are also two functions that do the whole job.
2579    
2580         Most   of   the   arguments    of    pcre_copy_named_substring()    and         Most    of    the    arguments   of   pcre_copy_named_substring()   and
2581         pcre_get_named_substring()  are  the  same  as  those for the similarly         pcre_get_named_substring() are the same  as  those  for  the  similarly
2582         named functions that extract by number. As these are described  in  the         named  functions  that extract by number. As these are described in the
2583         previous  section,  they  are not re-described here. There are just two         previous section, they are not re-described here. There  are  just  two
2584         differences:         differences:
2585    
2586         First, instead of a substring number, a substring name is  given.  Sec-         First,  instead  of a substring number, a substring name is given. Sec-
2587         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2588         to the compiled pattern. This is needed in order to gain access to  the         to  the compiled pattern. This is needed in order to gain access to the
2589         name-to-number translation table.         name-to-number translation table.
2590    
2591         These  functions call pcre_get_stringnumber(), and if it succeeds, they         These functions call pcre_get_stringnumber(), and if it succeeds,  they
2592         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
2593         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the         ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the
2594         behaviour may not be what you want (see the next section).         behaviour may not be what you want (see the next section).
2595    
2596         Warning: If the pattern uses the (?| feature to set up multiple subpat-         Warning: If the pattern uses the (?| feature to set up multiple subpat-
2597         terns  with  the  same number, as described in the section on duplicate         terns with the same number, as described in the  section  on  duplicate
2598         subpattern numbers in the pcrepattern page, you  cannot  use  names  to         subpattern  numbers  in  the  pcrepattern page, you cannot use names to
2599         distinguish  the  different subpatterns, because names are not included         distinguish the different subpatterns, because names are  not  included
2600         in the compiled code. The matching process uses only numbers. For  this         in  the compiled code. The matching process uses only numbers. For this
2601         reason,  the  use of different names for subpatterns of the same number         reason, the use of different names for subpatterns of the  same  number
2602         causes an error at compile time.         causes an error at compile time.
2603    
2604    
# Line 2572  DUPLICATE SUBPATTERN NAMES Line 2607  DUPLICATE SUBPATTERN NAMES
2607         int pcre_get_stringtable_entries(const pcre *code,         int pcre_get_stringtable_entries(const pcre *code,
2608              const char *name, char **first, char **last);              const char *name, char **first, char **last);
2609    
2610         When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for         When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
2611         subpatterns  are not required to be unique. (Duplicate names are always         subpatterns are not required to be unique. (Duplicate names are  always
2612         allowed for subpatterns with the same number, created by using the  (?|         allowed  for subpatterns with the same number, created by using the (?|
2613         feature.  Indeed,  if  such subpatterns are named, they are required to         feature. Indeed, if such subpatterns are named, they  are  required  to
2614         use the same names.)         use the same names.)
2615    
2616         Normally, patterns with duplicate names are such that in any one match,         Normally, patterns with duplicate names are such that in any one match,
2617         only  one of the named subpatterns participates. An example is shown in         only one of the named subpatterns participates. An example is shown  in
2618         the pcrepattern documentation.         the pcrepattern documentation.
2619    
2620         When   duplicates   are   present,   pcre_copy_named_substring()    and         When    duplicates   are   present,   pcre_copy_named_substring()   and
2621         pcre_get_named_substring()  return the first substring corresponding to         pcre_get_named_substring() return the first substring corresponding  to
2622         the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING         the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
2623         (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()         (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
2624         function returns one of the numbers that are associated with the  name,         function  returns one of the numbers that are associated with the name,
2625         but it is not defined which it is.         but it is not defined which it is.
2626    
2627         If  you want to get full details of all captured substrings for a given         If you want to get full details of all captured substrings for a  given
2628         name, you must use  the  pcre_get_stringtable_entries()  function.  The         name,  you  must  use  the pcre_get_stringtable_entries() function. The
2629         first argument is the compiled pattern, and the second is the name. The         first argument is the compiled pattern, and the second is the name. The
2630         third and fourth are pointers to variables which  are  updated  by  the         third  and  fourth  are  pointers to variables which are updated by the
2631         function. After it has run, they point to the first and last entries in         function. After it has run, they point to the first and last entries in
2632         the name-to-number table  for  the  given  name.  The  function  itself         the  name-to-number  table  for  the  given  name.  The function itself
2633         returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if         returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
2634         there are none. The format of the table is described above in the section         there are none. The format of the table is described above in the section
2635         entitled "Information about a pattern". Given all the relevant         entitled "Information about a pattern". Given all the relevant
2636         entries for the name, you can extract each of their numbers, and  hence         entries  for the name, you can extract each of their numbers, and hence
2637         the captured data, if any.         the captured data, if any.
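
       Walking the entries between first and last might look like the sketch
       below. The function names are hypothetical, not part of PCRE. The
       sketch assumes the table layout given in the section on information
       about a pattern: each entry starts with the subpattern number in two
       bytes, most significant byte first, followed by the zero-terminated
       name. It also assumes that entry_size is the positive value returned
       by pcre_get_stringtable_entries().

```c
/* Hypothetical helper: decode the subpattern number stored in the first
   two bytes of one name-table entry (most significant byte first). */
int entry_group_number(const char *entry)
{
    const unsigned char *p = (const unsigned char *)entry;
    return (p[0] << 8) | p[1];
}

/* Hypothetical helper: visit every entry from first to last inclusive,
   stepping by entry_size, and store up to max_out subpattern numbers
   in out. Returns how many numbers were stored. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *out, int max_out)
{
    int count = 0;
    const char *entry = first;

    while (count < max_out) {
        out[count++] = entry_group_number(entry);
        if (entry == last)
            break;              /* last entry for this name was handled */
        entry += entry_size;
    }
    return count;
}
```

       Each number collected this way can then be passed to
       pcre_copy_substring() or pcre_get_substring(), or used to index
       ovector directly, to find which of the duplicates actually captured
       something.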
2638    
2639    
2640  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2641    
2642         The  traditional  matching  function  uses a similar algorithm to Perl,         The traditional matching function uses a  similar  algorithm  to  Perl,
2643         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2644         the  subject.  If you want to find all possible matches, or the longest         the subject. If you want to find all possible matches, or  the  longest
2645         possible match, consider using the alternative matching  function  (see         possible  match,  consider using the alternative matching function (see
2646         below)  instead.  If you cannot use the alternative function, but still         below) instead. If you cannot use the alternative function,  but  still
2647         need to find all possible matches, you can kludge it up by  making  use         need  to  find all possible matches, you can kludge it up by making use
2648         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2649         tation.         tation.
2650    
2651         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2652         tern.   When your callout function is called, extract and save the cur-         tern.  When your callout function is called, extract and save the  cur-
2653         rent matched substring. Then return  1,  which  forces  pcre_exec()  to         rent  matched  substring.  Then  return  1, which forces pcre_exec() to
2654         backtrack  and  try other alternatives. Ultimately, when it runs out of         backtrack and try other alternatives. Ultimately, when it runs  out  of
2655         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2656    
2657    
# Line 2627  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2662  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2662              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2663              int *workspace, int wscount);              int *workspace, int wscount);
2664    
2665         The function pcre_dfa_exec()  is  called  to  match  a  subject  string         The  function  pcre_dfa_exec()  is  called  to  match  a subject string
2666         against  a  compiled pattern, using a matching algorithm that scans the         against a compiled pattern, using a matching algorithm that  scans  the
2667         subject string just once, and does not backtrack.  This  has  different         subject  string  just  once, and does not backtrack. This has different
2668         characteristics  to  the  normal  algorithm, and is not compatible with         characteristics to the normal algorithm, and  is  not  compatible  with
2669         Perl. Some of the features of PCRE patterns are not  supported.  Never-         Perl.  Some  of the features of PCRE patterns are not supported. Never-
2670         theless,  there are times when this kind of matching can be useful. For         theless, there are times when this kind of matching can be useful.  For
2671         a discussion of the two matching algorithms, and  a  list  of  features         a  discussion  of  the  two matching algorithms, and a list of features
2672         that  pcre_dfa_exec() does not support, see the pcrematching documenta-         that pcre_dfa_exec() does not support, see the pcrematching  documenta-
2673         tion.         tion.
2674    
2675         The arguments for the pcre_dfa_exec() function  are  the  same  as  for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2676         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
2677         ent way, and this is described below. The other  common  arguments  are         ent  way,  and  this is described below. The other common arguments are
2678         used  in  the  same way as for pcre_exec(), so their description is not         used in the same way as for pcre_exec(), so their  description  is  not
2679         repeated here.         repeated here.
2680    
2681         The two additional arguments provide workspace for  the  function.  The         The  two  additional  arguments provide workspace for the function. The
2682         workspace  vector  should  contain at least 20 elements. It is used for         workspace vector should contain at least 20 elements. It  is  used  for
2683         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2684         workspace  will  be  needed for patterns and subjects where there are a         workspace will be needed for patterns and subjects where  there  are  a
2685         lot of potential matches.         lot of potential matches.
2686    
2687         Here is an example of a simple call to pcre_dfa_exec():         Here is an example of a simple call to pcre_dfa_exec():
# Line 2668  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2703  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2703    
2704     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2705    
2706         The unused bits of the options argument  for  pcre_dfa_exec()  must  be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2707         zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2708         LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,         LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
2709         PCRE_NOTEMPTY_ATSTART,       PCRE_NO_UTF8_CHECK,      PCRE_BSR_ANYCRLF,         PCRE_NOTEMPTY_ATSTART,      PCRE_NO_UTF8_CHECK,       PCRE_BSR_ANYCRLF,
2710         PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD,  PCRE_PAR-         PCRE_BSR_UNICODE,  PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
2711         TIAL_SOFT,  PCRE_DFA_SHORTEST,  and PCRE_DFA_RESTART.  All but the last         TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART.  All but  the  last
2712         four of these are  exactly  the  same  as  for  pcre_exec(),  so  their         four  of  these  are  exactly  the  same  as  for pcre_exec(), so their
2713         description is not repeated here.         description is not repeated here.
2714    
2715           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2716           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
2717    
2718         These  have the same general effect as they do for pcre_exec(), but the         These have the same general effect as they do for pcre_exec(), but  the
2719         details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for         details  are  slightly  different.  When  PCRE_PARTIAL_HARD  is set for
2720         pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-         pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of  the  sub-
2721         ject is reached and there is still at least  one  matching  possibility         ject  is  reached  and there is still at least one matching possibility
2722         that requires additional characters. This happens even if some complete         that requires additional characters. This happens even if some complete
2723         matches have also been found. When PCRE_PARTIAL_SOFT is set, the return         matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
2724         code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end         code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
2725         of the subject is reached, there have been  no  complete  matches,  but         of  the  subject  is  reached, there have been no complete matches, but
2726         there  is  still  at least one matching possibility. The portion of the         there is still at least one matching possibility. The  portion  of  the
2727         string that was inspected when the longest partial match was  found  is         string  that  was inspected when the longest partial match was found is
2728         set as the first matching string in both cases.         set as the first matching string  in  both  cases.   There  is  a  more
2729           detailed  discussion  of partial and multi-segment matching, with exam-
2730           ples, in the pcrepartial documentation.
2731    
2732           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2733    
2734         Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2735         stop as soon as it has found one match. Because of the way the alterna-         stop as soon as it has found one match. Because of the way the alterna-
2736         tive  algorithm  works, this is necessarily the shortest possible match         tive algorithm works, this is necessarily the shortest  possible  match
2737         at the first possible matching point in the subject string.         at the first possible matching point in the subject string.
2738    
2739           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2740    
2741         When pcre_dfa_exec() returns a partial match, it is possible to call it         When pcre_dfa_exec() returns a partial match, it is possible to call it
2742         again,  with  additional  subject characters, and have it continue with         again, with additional subject characters, and have  it  continue  with
2743         the same match. The PCRE_DFA_RESTART option requests this action;  when         the  same match. The PCRE_DFA_RESTART option requests this action; when
2744         it is set, the workspace and wscount arguments must reference the same         it is set, the workspace and wscount arguments must reference the same
2745         vector as before because data about the match so far is  left  in  them         vector  as  before  because data about the match so far is left in them
2746         after a partial match. There is more discussion of this facility in the         after a partial match. There is more discussion of this facility in the
2747         pcrepartial documentation.         pcrepartial documentation.
2748    
2749     Successful returns from pcre_dfa_exec()     Successful returns from pcre_dfa_exec()
2750    
2751         When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-         When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
2752         string in the subject. Note, however, that all the matches from one run         string in the subject. Note, however, that all the matches from one run
2753         of the function start at the same point in  the  subject.  The  shorter         of  the  function  start  at the same point in the subject. The shorter
2754         matches  are all initial substrings of the longer matches. For example,         matches are all initial substrings of the longer matches. For  example,
2755         if the pattern         if the pattern
2756    
2757           <.*>           <.*>
# Line 2729  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2766  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2766           <something> <something else>           <something> <something else>
2767           <something> <something else> <something further>           <something> <something else> <something further>
2768    
2769         On success, the yield of the function is a number  greater  than  zero,         On  success,  the  yield of the function is a number greater than zero,
2770         which  is  the  number of matched substrings. The substrings themselves         which is the number of matched substrings.  The  substrings  themselves
2771         are returned in ovector. Each string uses two elements;  the  first  is         are  returned  in  ovector. Each string uses two elements; the first is
2772         the  offset  to  the start, and the second is the offset to the end. In         the offset to the start, and the second is the offset to  the  end.  In
2773         fact, all the strings have the same start  offset.  (Space  could  have         fact,  all  the  strings  have the same start offset. (Space could have
2774         been  saved by giving this only once, but it was decided to retain some         been saved by giving this only once, but it was decided to retain  some
2775         compatibility with the way pcre_exec() returns data,  even  though  the         compatibility  with  the  way pcre_exec() returns data, even though the
2776         meaning of the strings is different.)         meaning of the strings is different.)
2777    
2778         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
2779         est matching string is given first. If there were too many  matches  to         est  matching  string is given first. If there were too many matches to
2780         fit  into ovector, the yield of the function is zero, and the vector is         fit into ovector, the yield of the function is zero, and the vector  is
2781         filled with the longest matches.         filled with the longest matches.
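
       Reading the matches back out of ovector can be sketched as follows.
       The function name dfa_copy_match() is hypothetical and not part of
       PCRE. The sketch assumes that the yield of pcre_dfa_exec() was
       greater than zero and that i is less than that yield, so that pair i
       of ovector is valid.

```c
#include <string.h>

/* Hypothetical helper: copy match i (0 is the longest) out of subject
   into buf as a zero-terminated string. Returns the length of the match,
   or -1 if buf is too small to hold it. */
int dfa_copy_match(const char *subject, const int *ovector, int i,
                   char *buf, int bufsize)
{
    int start = ovector[2 * i];               /* same for every match */
    int length = ovector[2 * i + 1] - start;

    if (length + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)length);
    buf[length] = '\0';
    return length;
}
```

       Because the longest match comes first, index 0 gives the longest
       string and index yield-1 gives the shortest. A caller that only wants
       the shortest match may find PCRE_DFA_SHORTEST simpler than scanning
       the vector.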
2782    
2783     Error returns from pcre_dfa_exec()     Error returns from pcre_dfa_exec()
2784    
2785         The pcre_dfa_exec() function returns a negative number when  it  fails.         The  pcre_dfa_exec()  function returns a negative number when it fails.
2786         Many  of  the  errors  are  the  same as for pcre_exec(), and these are         Many of the errors are the same  as  for  pcre_exec(),  and  these  are
2787         described above.  There are in addition the following errors  that  are         described  above.   There are in addition the following errors that are
2788         specific to pcre_dfa_exec():         specific to pcre_dfa_exec():
2789    
2790           PCRE_ERROR_DFA_UITEM      (-16)           PCRE_ERROR_DFA_UITEM      (-16)
2791    
2792         This  return is given if pcre_dfa_exec() encounters an item in the pat-         This return is given if pcre_dfa_exec() encounters an item in the  pat-
2793         tern that it does not support, for instance, the use of \C  or  a  back         tern  that  it  does not support, for instance, the use of \C or a back
2794         reference.         reference.
2795    
2796           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
2797    
2798         This  return  is  given  if pcre_dfa_exec() encounters a condition item         This return is given if pcre_dfa_exec()  encounters  a  condition  item
2799         that uses a back reference for the condition, or a test  for  recursion         that  uses  a back reference for the condition, or a test for recursion
2800         in a specific group. These are not supported.         in a specific group. These are not supported.
2801    
2802           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
2803    
2804         This  return  is given if pcre_dfa_exec() is called with an extra block         This return is given if pcre_dfa_exec() is called with an  extra  block
2805         that contains a setting of the match_limit field. This is not supported         that contains a setting of the match_limit field. This is not supported
2806         (it is meaningless).         (it is meaningless).
2807    
2808           PCRE_ERROR_DFA_WSSIZE     (-19)           PCRE_ERROR_DFA_WSSIZE     (-19)
2809    
2810         This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the         This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
2811         workspace vector.         workspace vector.
2812    
2813           PCRE_ERROR_DFA_RECURSE    (-20)           PCRE_ERROR_DFA_RECURSE    (-20)
2814    
2815         When a recursive subpattern is processed, the matching  function  calls         When  a  recursive subpattern is processed, the matching function calls
2816         itself  recursively,  using  private vectors for ovector and workspace.         itself recursively, using private vectors for  ovector  and  workspace.
2817         This error is given if the output vector  is  not  large  enough.  This         This  error  is  given  if  the output vector is not large enough. This
2818         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
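
       A program that reports these failures might map the codes to text
       along the following lines. This is a sketch only: the DEMO_ constants
       and both function names are hypothetical. The numeric values are
       copied from the list above so that the example stands alone; a real
       program would include pcre.h and use the PCRE_ERROR_DFA_xxx macros.

```c
/* Illustrative copies of the values listed above. Real code should use
   the PCRE_ERROR_DFA_xxx macros from pcre.h instead. */
enum {
    DEMO_ERROR_DFA_UITEM   = -16,
    DEMO_ERROR_DFA_UCOND   = -17,
    DEMO_ERROR_DFA_UMLIMIT = -18,
    DEMO_ERROR_DFA_WSSIZE  = -19,
    DEMO_ERROR_DFA_RECURSE = -20
};

/* Hypothetical helper: 1 if rc is one of the five codes that only
   pcre_dfa_exec() returns, 0 otherwise. */
int is_dfa_specific_error(int rc)
{
    return rc <= DEMO_ERROR_DFA_UITEM && rc >= DEMO_ERROR_DFA_RECURSE;
}

/* Hypothetical helper: short description of a DFA-specific error code. */
const char *dfa_error_text(int rc)
{
    switch (rc) {
    case DEMO_ERROR_DFA_UITEM:   return "unsupported pattern item";
    case DEMO_ERROR_DFA_UCOND:   return "unsupported condition item";
    case DEMO_ERROR_DFA_UMLIMIT: return "match_limit set in the extra block";
    case DEMO_ERROR_DFA_WSSIZE:  return "workspace vector too small";
    case DEMO_ERROR_DFA_RECURSE: return "recursion output vector too small";
    default:                     return "not a DFA-specific error";
    }
}
```

       Of the five, only the workspace error can be addressed by the caller
       at run time, by retrying with a larger workspace vector. The others
       point to a pattern or an extra block that is unsuitable for
       pcre_dfa_exec().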
2819    
2820    
2821  SEE ALSO  SEE ALSO
2822    
2823         pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepartial(3),         pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepartial(3),
2824         pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).         pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
2825    
2826    
# Line 2796  AUTHOR Line 2833  AUTHOR
2833    
2834  REVISION  REVISION
2835    
2836         Last updated: 21 June 2010         Last updated: 06 November 2010
2837         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
2838  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2839    
2840    
2841  PCRECALLOUT(3)                                                  PCRECALLOUT(3)  PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2842    
2843    
# Line 2980  REVISION Line 3017  REVISION
3017         Last updated: 29 September 2009         Last updated: 29 September 2009
3018         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
3019  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3020    
3021    
3022  PCRECOMPAT(3)                                                    PCRECOMPAT(3)  PCRECOMPAT(3)                                                    PCRECOMPAT(3)
3023    
3024    
# Line 2993  DIFFERENCES BETWEEN PCRE AND PERL Line 3030  DIFFERENCES BETWEEN PCRE AND PERL
3030    
3031         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
3032         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
3033         respect to Perl 5.10/5.11.         respect to Perl versions 5.10 and above.
3034    
3035         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
3036         of what it does have are given in the section on UTF-8 support  in  the         of what it does have are given in the section on UTF-8 support  in  the
# Line 3075  DIFFERENCES BETWEEN PCRE AND PERL Line 3112  DIFFERENCES BETWEEN PCRE AND PERL
3112         turing subpattern number 1. To avoid this confusing situation, an error         turing subpattern number 1. To avoid this confusing situation, an error
3113         is given at compile time.         is given at compile time.
3114    
3115         12. PCRE provides some extensions to the Perl regular expression facil-         12. Perl recognizes comments in some  places  that  PCRE  doesn't,  for
3116         ities.   Perl  5.10  includes new features that are not in earlier ver-         example, between the ( and ? at the start of a subpattern.
3117         sions of Perl, some of which (such as named parentheses) have  been  in  
3118           13. PCRE provides some extensions to the Perl regular expression facil-
3119           ities.  Perl 5.10 includes new features that are not  in  earlier  ver-
3120           sions  of  Perl, some of which (such as named parentheses) have been in
3121         PCRE for some time. This list is with respect to Perl 5.10:         PCRE for some time. This list is with respect to Perl 5.10:
3122    
3123         (a)  Although  lookbehind  assertions  in  PCRE must match fixed length         (a) Although lookbehind assertions in  PCRE  must  match  fixed  length
3124         strings, each alternative branch of a lookbehind assertion can match  a         strings,  each alternative branch of a lookbehind assertion can match a
3125         different  length  of  string.  Perl requires them all to have the same         different length of string. Perl requires them all  to  have  the  same
3126         length.         length.
3127    
3128         (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $         (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
3129         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
3130    
3131         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
3132         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
3133         ignored.  (Perl can be made to issue a warning.)         ignored.  (Perl can be made to issue a warning.)
3134    
3135         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
3136         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
3137         lowed by a question mark they are.         lowed by a question mark they are.
3138    
# Line 3100  DIFFERENCES BETWEEN PCRE AND PERL Line 3140  DIFFERENCES BETWEEN PCRE AND PERL
3140         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
3141    
3142         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
3143         and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-         and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no  Perl  equiva-
3144         lents.         lents.
3145    
3146         (g) The \R escape sequence can be restricted to match only CR,  LF,  or         (g)  The  \R escape sequence can be restricted to match only CR, LF, or
3147         CRLF by the PCRE_BSR_ANYCRLF option.         CRLF by the PCRE_BSR_ANYCRLF option.
3148    
3149         (h) The callout facility is PCRE-specific.         (h) The callout facility is PCRE-specific.
# Line 3113  DIFFERENCES BETWEEN PCRE AND PERL Line 3153  DIFFERENCES BETWEEN PCRE AND PERL
3153         (j) Patterns compiled by PCRE can be saved and re-used at a later time,         (j) Patterns compiled by PCRE can be saved and re-used at a later time,
3154         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
3155    
3156         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a         (k)  The  alternative  matching function (pcre_dfa_exec()) matches in a
3157         different way and is not Perl-compatible.         different way and is not Perl-compatible.
3158    
3159         (l)  PCRE  recognizes some special sequences such as (*CR) at the start         (l) PCRE recognizes some special sequences such as (*CR) at  the  start
3160         of a pattern that set overall options that cannot be changed within the         of a pattern that set overall options that cannot be changed within the
3161         pattern.         pattern.
3162    
# Line 3130  AUTHOR Line 3170  AUTHOR
3170    
3171  REVISION  REVISION
3172    
3173         Last updated: 12 May 2010         Last updated: 31 October 2010
3174         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
3175  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3176    
3177    
3178  PCREPATTERN(3)                                                  PCREPATTERN(3)  PCREPATTERN(3)                                                  PCREPATTERN(3)
3179    
3180    
# Line 3324  BACKSLASH Line 3364  BACKSLASH
3364           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
3365    
3366         The  \Q...\E  sequence  is recognized both inside and outside character         The  \Q...\E  sequence  is recognized both inside and outside character
3367         classes.         classes.  An isolated \E that is not preceded by \Q is ignored.
3368    
3369     Non-printing characters     Non-printing characters
3370    
# Line 4862  CONDITIONAL SUBPATTERNS Line 4902  CONDITIONAL SUBPATTERNS
4902    
4903         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If  the  condition is satisfied, the yes-pattern is used; otherwise the
4904         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
4905         tives in the subpattern, a compile-time error occurs.         tives  in  the subpattern, a compile-time error occurs. Each of the two
4906           alternatives may itself contain nested subpatterns of any form, includ-
4907           ing  conditional  subpatterns;  the  restriction  to  two  alternatives
4908           applies only at the level of the condition. This pattern fragment is an
4909           example where the alternatives are complex:
4910    
4911             (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
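
As a rough illustration of the two-alternative rule, the sketch below uses Python's re module rather than PCRE. Python is a different engine, but it accepts the same (?(1)yes|no) syntax. The pattern and the helper names classify and three_branches_rejected are invented for this example and are not part of PCRE.

```python
import re

# Group 1 is an optional "x". The condition (?(1)...) selects the yes-branch
# when group 1 took part in the match, otherwise the no-branch. Each branch
# wraps its own alternation in (?:...), so the condition itself still sees
# only two top-level alternatives.
PATTERN = re.compile(r"^(x)?(?(1)(?:A|B|C)|(?:D|E))$")


def classify(subject):
    """Return True if subject matches PATTERN (hypothetical helper)."""
    return PATTERN.match(subject) is not None


def three_branches_rejected():
    """Return True if a third top-level alternative is a compile error."""
    try:
        re.compile(r"(x)?(?(1)a|b|c)")
    except re.error:
        return True
    return False
```

Nesting the extra alternatives inside a group is what keeps the pattern legal; writing them directly at the level of the condition is rejected when the pattern is compiled.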
4912    
4913    
4914         There  are  four  kinds of condition: references to subpatterns, refer-         There  are  four  kinds of condition: references to subpatterns, refer-
4915         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
# Line 4987  CONDITIONAL SUBPATTERNS Line 5034  CONDITIONAL SUBPATTERNS
5034    
5035  COMMENTS  COMMENTS
5036    
5037         The  sequence (?# marks the start of a comment that continues up to the         There are two ways of including comments in patterns that are processed
5038         next closing parenthesis. Nested parentheses  are  not  permitted.  The         by PCRE. In both cases, the start of the comment must not be in a char-
5039         characters  that make up a comment play no part in the pattern matching         acter class, nor in the middle of any other sequence of related charac-
5040         at all.         ters such as (?: or a subpattern name or number.  The  characters  that
5041           make up a comment play no part in the pattern matching.
5042    
5043         If the PCRE_EXTENDED option is set, an unescaped # character outside  a         The  sequence (?# marks the start of a comment that continues up to the
5044         character  class  introduces  a  comment  that continues to immediately         next closing parenthesis. Nested parentheses are not permitted. If  the
5045         after the next newline in the pattern.         PCRE_EXTENDED option is set, an unescaped # character also introduces a
5046           comment, which in this case continues to  immediately  after  the  next
5047           newline  character  or character sequence in the pattern. Which charac-
5048           ters are interpreted as newlines is controlled by the options passed to
5049           pcre_compile() or by a special sequence at the start of the pattern, as
5050           described in the section entitled  "Newline  conventions"  above.  Note
5051           that the end of this type of comment is a literal newline sequence in the
5052           pattern; escape sequences that happen to represent  a  newline  do  not
5053           count.   For  example, consider this pattern when PCRE_EXTENDED is set,
5054           and the default newline convention is in force:
5055    
5056             abc #comment \n still comment
5057    
5058           On encountering the # character, pcre_compile()  skips  along,  looking
5059           for  a newline in the pattern. The sequence \n is still literal at this
5060           stage, so it does not terminate the comment. Only an  actual  character
5061           with the code value 0x0a (the default newline) does so.
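
The same distinction can be observed with Python's re module, whose VERBOSE flag plays a role similar to PCRE_EXTENDED. This is only an analogy under that assumption: Python is not PCRE, and the variable names escaped and literal are invented for the example.

```python
import re

# Raw string: "\n" here is two characters, a backslash and an "n", that is,
# an escape sequence. It sits inside the comment and does not end it, so the
# effective pattern is just "abc".
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real newline character (0x0a). It ends the
# comment, so "def" is part of the pattern again and the effective pattern
# is "abcdef" (white space is ignored in verbose mode).
literal = re.compile("abc #comment \n def", re.VERBOSE)
```

With these two objects, escaped accepts the subject "abc" in full, whereas literal requires "abcdef".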
5062    
5063    
5064  RECURSIVE PATTERNS  RECURSIVE PATTERNS
5065    
5066         Consider the problem of matching a string in parentheses, allowing  for         Consider  the problem of matching a string in parentheses, allowing for
5067         unlimited  nested  parentheses.  Without the use of recursion, the best         unlimited nested parentheses. Without the use of  recursion,  the  best
5068         that can be done is to use a pattern that  matches  up  to  some  fixed         that  can  be  done  is  to use a pattern that matches up to some fixed
5069         depth  of  nesting.  It  is not possible to handle an arbitrary nesting         depth of nesting. It is not possible to  handle  an  arbitrary  nesting
5070         depth.         depth.
5071    
5072         For some time, Perl has provided a facility that allows regular expres-         For some time, Perl has provided a facility that allows regular expres-
5073         sions  to recurse (amongst other things). It does this by interpolating         sions to recurse (amongst other things). It does this by  interpolating
5074         Perl code in the expression at run time, and the code can refer to  the         Perl  code in the expression at run time, and the code can refer to the
5075         expression itself. A Perl pattern using code interpolation to solve the         expression itself. A Perl pattern using code interpolation to solve the
5076         parentheses problem can be created like this:         parentheses problem can be created like this:
5077    
# Line 5017  RECURSIVE PATTERNS Line 5081  RECURSIVE PATTERNS
5081         refers recursively to the pattern in which it appears.         refers recursively to the pattern in which it appears.
5082    
5083         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
5084         it supports special syntax for recursion of  the  entire  pattern,  and         it  supports  special  syntax  for recursion of the entire pattern, and
5085         also  for  individual  subpattern  recursion. After its introduction in         also for individual subpattern recursion.  After  its  introduction  in
5086         PCRE and Python, this kind of  recursion  was  subsequently  introduced         PCRE  and  Python,  this  kind of recursion was subsequently introduced
5087         into Perl at release 5.10.         into Perl at release 5.10.
5088    
5089         A  special  item  that consists of (? followed by a number greater than         A special item that consists of (? followed by a  number  greater  than
5090         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
5091         the  given  number, provided that it occurs inside that subpattern. (If         the given number, provided that it occurs inside that  subpattern.  (If
5092         not, it is a "subroutine" call, which is described  in  the  next  sec-         not,  it  is  a  "subroutine" call, which is described in the next sec-
5093         tion.)  The special item (?R) or (?0) is a recursive call of the entire         tion.) The special item (?R) or (?0) is a recursive call of the  entire
5094         regular expression.         regular expression.
5095    
5096         This PCRE pattern solves the nested  parentheses  problem  (assume  the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
5097         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
5098    
5099           \( ( [^()]++ | (?R) )* \)           \( ( [^()]++ | (?R) )* \)
5100    
5101         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
5102         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
5103         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
5104         sized substring).  Finally there is a closing parenthesis. Note the use         sized substring).  Finally there is a closing parenthesis. Note the use
5105         of a possessive quantifier to avoid backtracking into sequences of non-         of a possessive quantifier to avoid backtracking into sequences of non-
5106         parentheses.         parentheses.
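
To make the control flow of the recursive pattern concrete, here is a hand-written Python sketch that follows the same three steps: an opening parenthesis, any number of non-parenthesis runs or nested groups, and a closing parenthesis. It is not how PCRE is implemented, and the function name match_parens is invented for this illustration.

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced group starting at pos, or None.

    Mirrors  \\( ( [^()]++ | (?R) )* \\)  step by step.
    """
    # \(  : the group must start with an opening parenthesis.
    if pos >= len(s) or s[pos] != "(":
        return None
    pos += 1
    while pos < len(s):
        if s[pos] == "(":
            # (?R) : a nested group is matched by calling ourselves.
            end = match_parens(s, pos)
            if end is None:
                return None
            pos = end
        elif s[pos] == ")":
            # \)  : the closing parenthesis ends this level.
            return pos + 1
        else:
            # [^()]++ : consume a non-parenthesis character, never give it back.
            pos += 1
    # Ran out of subject before the closing parenthesis.
    return None
```

Because the loop never revisits a character it has consumed, the sketch fails quickly on unbalanced input, which is the behaviour the possessive quantifier gives the pattern.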
5107    
5108         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
5109         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
5110    
5111           ( \( ( [^()]++ | (?1) )* \) )           ( \( ( [^()]++ | (?1) )* \) )
5112    
5113         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
5114         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
5115    
5116         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
5117         tricky.  This  is made easier by the use of relative references (a Perl         tricky. This is made easier by the use of relative references  (a  Perl
5118         5.10 feature).  Instead of (?1) in the  pattern  above  you  can  write         5.10  feature).   Instead  of  (?1)  in the pattern above you can write
5119         (?-2) to refer to the second most recently opened parentheses preceding         (?-2) to refer to the second most recently opened parentheses preceding
5120         the recursion. In other  words,  a  negative  number  counts  capturing         the  recursion.  In  other  words,  a  negative number counts capturing
5121         parentheses leftwards from the point at which it is encountered.         parentheses leftwards from the point at which it is encountered.
5122    
5123         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
5124         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
5125         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
5126         enced. They are always "subroutine" calls, as  described  in  the  next         enced.  They  are  always  "subroutine" calls, as described in the next
5127         section.         section.
5128    
5129         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
5130         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
5131         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
5132    
5133           (?<pn> \( ( [^()]++ | (?&pn) )* \) )           (?<pn> \( ( [^()]++ | (?&pn) )* \) )
5134    
5135         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
5136         one is used.         one is used.
5137    
5138         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
5139         nested unlimited repeats, and so the use of a possessive quantifier for         nested unlimited repeats, and so the use of a possessive quantifier for
5140         matching strings of non-parentheses is important when applying the pat-         matching strings of non-parentheses is important when applying the pat-
5141         tern  to  strings  that do not match. For example, when this pattern is         tern to strings that do not match. For example, when  this  pattern  is
5142         applied to         applied to
5143    
5144           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
5145    
5146         it yields "no match" quickly. However, if a  possessive  quantifier  is         it  yields  "no  match" quickly. However, if a possessive quantifier is
5147         not  used, the match runs for a very long time indeed because there are         not used, the match runs for a very long time indeed because there  are
5148         so many different ways the + and * repeats can carve  up  the  subject,         so  many  different  ways the + and * repeats can carve up the subject,
5149         and all have to be tested before failure can be reported.         and all have to be tested before failure can be reported.
5150    
5151         At  the  end  of a match, the values of capturing parentheses are those         At the end of a match, the values of capturing  parentheses  are  those
5152         from the outermost level. If you want to obtain intermediate values,  a         from  the outermost level. If you want to obtain intermediate values, a
5153         callout  function can be used (see below and the pcrecallout documenta-         callout function can be used (see below and the pcrecallout  documenta-
5154         tion). If the pattern above is matched against         tion). If the pattern above is matched against
5155    
5156           (ab(cd)ef)           (ab(cd)ef)
5157    
5158         the value for the inner capturing parentheses  (numbered  2)  is  "ef",         the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
5159         which  is the last value taken on at the top level. If a capturing sub-         which is the last value taken on at the top level. If a capturing  sub-
5160         pattern is not matched at the top level, its final value is unset, even         pattern is not matched at the top level, its final value is unset, even
5161         if it is (temporarily) set at a deeper level.         if it is (temporarily) set at a deeper level.
5162    
5163         If  there are more than 15 capturing parentheses in a pattern, PCRE has         If there are more than 15 capturing parentheses in a pattern, PCRE  has
5164         to obtain extra memory to store data during a recursion, which it  does         to  obtain extra memory to store data during a recursion, which it does
5165         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
5166         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
5167    
5168         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do  not  confuse  the (?R) item with the condition (R), which tests for
5169         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.  Consider this pattern, which matches text in  angle  brack-
5170         ets, allowing for arbitrary nesting. Only digits are allowed in  nested         ets,  allowing for arbitrary nesting. Only digits are allowed in nested
5171         brackets  (that is, when recursing), whereas any characters are permit-         brackets (that is, when recursing), whereas any characters are  permit-
5172         ted at the outer level.         ted at the outer level.
5173    
5174           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
5175    
5176         In this pattern, (?(R) is the start of a conditional  subpattern,  with         In  this  pattern, (?(R) is the start of a conditional subpattern, with
5177         two  different  alternatives for the recursive and non-recursive cases.         two different alternatives for the recursive and  non-recursive  cases.
5178         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
5179    
5180     Recursion difference from Perl     Recursion difference from Perl
5181    
5182         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
5183         always treated as an atomic group. That is, once it has matched some of         always treated as an atomic group. That is, once it has matched some of
5184         the subject string, it is never re-entered, even if it contains untried         the subject string, it is never re-entered, even if it contains untried
5185         alternatives  and  there  is a subsequent matching failure. This can be         alternatives and there is a subsequent matching failure.  This  can  be
5186         illustrated by the following pattern, which purports to match a  palin-         illustrated  by the following pattern, which purports to match a palin-
5187         dromic  string  that contains an odd number of characters (for example,         dromic string that contains an odd number of characters  (for  example,
5188         "a", "aba", "abcba", "abcdcba"):         "a", "aba", "abcba", "abcdcba"):
5189    
5190           ^(.|(.)(?1)\2)$           ^(.|(.)(?1)\2)$
5191    
5192         The idea is that it either matches a single character, or two identical         The idea is that it either matches a single character, or two identical
5193         characters  surrounding  a sub-palindrome. In Perl, this pattern works;         characters surrounding a sub-palindrome. In Perl, this  pattern  works;
5194         in PCRE it does not if the subject string is longer than three characters.         in PCRE it does not if the subject string is longer than three characters.
5195         Consider the subject string "abcba":         Consider the subject string "abcba":
5196    
5197         At  the  top level, the first character is matched, but as it is not at         At the top level, the first character is matched, but as it is  not  at
5198         the end of the string, the first alternative fails; the second alterna-         the end of the string, the first alternative fails; the second alterna-
5199         tive is taken and the recursion kicks in. The recursive call to subpat-         tive is taken and the recursion kicks in. The recursive call to subpat-
5200         tern 1 successfully matches the next character ("b").  (Note  that  the         tern  1  successfully  matches the next character ("b"). (Note that the
5201         beginning and end of line tests are not part of the recursion.)         beginning and end of line tests are not part of the recursion.)
5202    
5203         Back  at  the top level, the next character ("c") is compared with what         Back at the top level, the next character ("c") is compared  with  what
5204         subpattern 2 matched, which was "a". This fails. Because the  recursion         subpattern  2 matched, which was "a". This fails. Because the recursion
5205         is  treated  as  an atomic group, there are now no backtracking points,         is treated as an atomic group, there are now  no  backtracking  points,
5206         and so the entire match fails. (Perl is able, at  this  point,  to  re-         and  so  the  entire  match fails. (Perl is able, at this point, to re-
5207         enter  the  recursion  and try the second alternative.) However, if the         enter the recursion and try the second alternative.)  However,  if  the
5208         pattern is written with the alternatives in the other order, things are         pattern is written with the alternatives in the other order, things are
5209         different:         different:
5210    
5211           ^((.)(?1)\2|.)$           ^((.)(?1)\2|.)$
5212    
5213         This  time,  the recursing alternative is tried first, and continues to         This time, the recursing alternative is tried first, and  continues  to
5214         recurse until it runs out of characters, at which point  the  recursion         recurse  until  it runs out of characters, at which point the recursion
5215         fails.  But  this  time  we  do  have another alternative to try at the         fails. But this time we do have  another  alternative  to  try  at  the
5216         higher level. That is the big difference:  in  the  previous  case  the         higher  level.  That  is  the  big difference: in the previous case the
5217         remaining alternative is at a deeper recursion level, which PCRE cannot         remaining alternative is at a deeper recursion level, which PCRE cannot
5218         use.         use.
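
The difference between the two orderings can be traced with a small simulation. The Python sketch below is a toy model written for this explanation, not PCRE or Perl code: group1 enumerates the end positions at which subpattern 1 can match, and the atomic flag decides whether a recursive call may be re-entered for further alternatives. All names (group1, matches, atomic, recurse_first) are invented here.

```python
def group1(s, pos, atomic, recurse_first):
    """Yield each end position where group 1 can match, in the order tried."""

    def single():
        # Alternative "." : any one character.
        if pos < len(s):
            yield pos + 1

    def wrapped():
        # Alternative "(.)(?1)\2" : a character, a recursive call, the same
        # character again.
        if pos < len(s):
            inner = group1(s, pos + 1, atomic, recurse_first)
            if atomic:
                # Atomic call: only the first successful result is kept; the
                # recursion is never re-entered for other alternatives.
                first = next(inner, None)
                ends = [] if first is None else [first]
            else:
                # Non-atomic call: every result of the recursion may be tried.
                ends = inner
            for end in ends:
                if end < len(s) and s[end] == s[pos]:
                    yield end + 1

    order = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in order:
        yield from alternative()


def matches(s, atomic, recurse_first):
    """Model ^(...)$ : some way of matching group 1 must reach the end of s."""
    return any(end == len(s) for end in group1(s, 0, atomic, recurse_first))
```

With atomic=True and recurse_first=False (the first pattern, PCRE behaviour) the subject "abcba" is rejected; with atomic=False (Perl behaviour) it is accepted; and with atomic=True and recurse_first=True (the reordered pattern) it is accepted again, because the untried alternative now lives at the higher level.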
5219    
5220         To change the pattern so that it matches all palindromic strings, not just         To change the pattern so that it matches all palindromic strings, not just
5221         those  with  an  odd number of characters, it is tempting to change the         those with an odd number of characters, it is tempting  to  change  the
5222         pattern to this:         pattern to this:
5223    
5224           ^((.)(?1)\2|.?)$           ^((.)(?1)\2|.?)$
5225    
5226         Again, this works in Perl, but not in PCRE, and for  the  same  reason.         Again,  this  works  in Perl, but not in PCRE, and for the same reason.
5227         When  a  deeper  recursion has matched a single character, it cannot be         When a deeper recursion has matched a single character,  it  cannot  be
5228         entered again in order to match an empty string.  The  solution  is  to         entered  again  in  order  to match an empty string. The solution is to
5229         separate  the two cases, and write out the odd and even cases as alter-         separate the two cases, and write out the odd and even cases as  alter-
5230         natives at the higher level:         natives at the higher level:
5231    
5232           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
5233    
5234         If you want to match typical palindromic phrases, the  pattern  has  to         If  you  want  to match typical palindromic phrases, the pattern has to
5235         ignore all non-word characters, which can be done like this:         ignore all non-word characters, which can be done like this:
5236    
5237           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
5238    
5239         If run with the PCRE_CASELESS option, this pattern matches phrases such         If run with the PCRE_CASELESS option, this pattern matches phrases such
5240         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
5241         Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-         Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
5242         ing into sequences of non-word characters. Without this, PCRE  takes  a         ing  into  sequences of non-word characters. Without this, PCRE takes a
5243         great  deal  longer  (ten  times or more) to match typical phrases, and         great deal longer (ten times or more) to  match  typical  phrases,  and
5244         Perl takes so long that you think it has gone into a loop.         Perl takes so long that you think it has gone into a loop.
5245    
5246         WARNING: The palindrome-matching patterns above work only if  the  sub-         WARNING:  The  palindrome-matching patterns above work only if the sub-
5247         ject  string  does not start with a palindrome that is shorter than the         ject string does not start with a palindrome that is shorter  than  the
5248         entire string.  For example, although "abcba" is correctly matched,  if         entire  string.  For example, although "abcba" is correctly matched, if
5249         the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,         the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
5250         then fails at top level because the end of the string does not  follow.         then  fails at top level because the end of the string does not follow.
5251         Once  again, it cannot jump back into the recursion to try other alter-         Once again, it cannot jump back into the recursion to try other  alter-
5252         natives, so the entire match fails.         natives, so the entire match fails.
5253    
5254    
5255  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
5256    
5257         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
5258         by  name)  is used outside the parentheses to which it refers, it oper-         by name) is used outside the parentheses to which it refers,  it  oper-
5259         ates like a subroutine in a programming language. The "called"  subpat-         ates  like a subroutine in a programming language. The "called" subpat-
5260         tern may be defined before or after the reference. A numbered reference         tern may be defined before or after the reference. A numbered reference
5261         can be absolute or relative, as in these examples:         can be absolute or relative, as in these examples:
5262    
# Line 5204  SUBPATTERNS AS SUBROUTINES Line 5268  SUBPATTERNS AS SUBROUTINES
5268    
5269           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
5270    
5271         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
5272         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
5273    
5274           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
5275    
5276         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
5277         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
5278         above.         above.
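
The contrast between a back reference and a subroutine call can be approximated in Python, with one stated assumption: Python's re module has no (?1) syntax, so the sketch repeats the text of the subpattern by hand. That reproduces "match the pattern again" as opposed to "match the same text again", but it does not model the atomic behaviour described next. The names WORD, backref and subroutine_like are invented for the example.

```python
import re

# The body of subpattern 1, kept in one place so it can be reused.
WORD = r"sens|respons"

# Back reference: \1 must match the same text that group 1 matched.
backref = re.compile(rf"^({WORD})e and \1ibility$")

# Stand-in for (?1): the subpattern is matched afresh, so it may pick a
# different alternative from the one group 1 picked.
subroutine_like = re.compile(rf"^({WORD})e and (?:{WORD})ibility$")
```

Only the second pattern accepts "sense and responsibility"; both accept "sense and sensibility" and "response and responsibility".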
5279    
5280         Like  recursive  subpatterns, a subroutine call is always treated as an         Like recursive subpatterns, a subroutine call is always treated  as  an
5281         atomic group. That is, once it has matched some of the subject  string,         atomic  group. That is, once it has matched some of the subject string,
5282         it  is  never  re-entered, even if it contains untried alternatives and         it is never re-entered, even if it contains  untried  alternatives  and
5283         there is a subsequent matching failure. Any capturing parentheses  that         there  is a subsequent matching failure. Any capturing parentheses that
5284         are  set  during  the  subroutine  call revert to their previous values         are set during the subroutine call  revert  to  their  previous  values
5285         afterwards.         afterwards.
5286    
5287         When a subpattern is used as a subroutine, processing options  such  as         When  a  subpattern is used as a subroutine, processing options such as
5288         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
5289         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
5290    
5291           (abc)(?i:(?-1))           (abc)(?i:(?-1))
5292    
5293         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
5294         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
5295    
5296    
5297  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
5298    
5299         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
5300         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
5301         an  alternative  syntax  for  referencing a subpattern as a subroutine,         an alternative syntax for referencing a  subpattern  as  a  subroutine,
5302         possibly recursively. Here are two of the examples used above,  rewrit-         possibly  recursively. Here are two of the examples used above, rewrit-
5303         ten using this syntax:         ten using this syntax:
5304    
5305           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
5306           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
5307    
5308         PCRE  supports  an extension to Oniguruma: if a number is preceded by a         PCRE supports an extension to Oniguruma: if a number is preceded  by  a
5309         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
5310    
5311           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
5312    
5313         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
5314         synonymous.  The former is a back reference; the latter is a subroutine         synonymous. The former is a back reference; the latter is a  subroutine
5315         call.         call.
5316    
5317    
5318  CALLOUTS  CALLOUTS
5319    
5320         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
5321         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
5322         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
5323         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
5324         tion.         tion.
5325    
5326         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
5327         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
5328         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
5329         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
5330         all calling out.         all calling out.
5331    
5332         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
5333         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
5334         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
5335         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
5336         points:         points:
5337    
5338           (?C1)abc(?C2)def           (?C1)abc(?C2)def
5339    
5340         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
5341         automatically  installed  before each item in the pattern. They are all         automatically installed before each item in the pattern. They  are  all
5342         numbered 255.         numbered 255.
5343    
5344         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
5345         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
5346         of the callout, the position in the pattern, and, optionally, one  item         of  the callout, the position in the pattern, and, optionally, one item
5347         of  data  originally supplied by the caller of pcre_exec(). The callout         of data originally supplied by the caller of pcre_exec().  The  callout
5348         function may cause matching to proceed, to backtrack, or to fail  alto-         function  may cause matching to proceed, to backtrack, or to fail alto-
5349         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
5350         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
5351    
5352    
5353  BACKTRACKING CONTROL  BACKTRACKING CONTROL
5354    
5355         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
5356         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
5357         ject to change or removal in a future version of Perl". It goes  on  to         ject  to  change or removal in a future version of Perl". It goes on to
5358         say:  "Their usage in production code should be noted to avoid problems         say: "Their usage in production code should be noted to avoid  problems
5359         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
5360         in this section.         in this section.
5361    
5362         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
5363         them can be  used  only  when  the  pattern  is  to  be  matched  using         them  can  be  used  only  when  the  pattern  is  to  be matched using
5364         pcre_exec(), which uses a backtracking algorithm. With the exception of         pcre_exec(), which uses a backtracking algorithm. With the exception of
5365         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5366         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5367    
5368         If any of these verbs are used in an assertion or subroutine subpattern         If any of these verbs are used in an assertion or subroutine subpattern
5369         (including recursive subpatterns), their effect  is  confined  to  that         (including  recursive  subpatterns),  their  effect is confined to that
5370         subpattern;  it  does  not extend to the surrounding pattern. Note that         subpattern; it does not extend to the surrounding  pattern.  Note  that
5371         such subpatterns are processed as anchored at the point where they  are         such  subpatterns are processed as anchored at the point where they are
5372         tested.         tested.
5373    
5374         The  new verbs make use of what was previously invalid syntax: an open-         The new verbs make use of what was previously invalid syntax: an  open-
5375         ing parenthesis followed by an asterisk. They are generally of the form         ing parenthesis followed by an asterisk. They are generally of the form
5376         (*VERB)  or (*VERB:NAME). Some may take either form, with differing be-         (*VERB) or (*VERB:NAME). Some may take either form, with differing  be-
5377         haviour, depending on whether or not an argument is present. A name is         haviour, depending on whether or not an argument is present. A name is
5378         a  sequence  of letters, digits, and underscores. If the name is empty,         a sequence of letters, digits, and underscores. If the name  is  empty,
5379         that is, if the closing parenthesis immediately follows the colon,  the         that  is, if the closing parenthesis immediately follows the colon, the
5380         effect is as if the colon were not there. Any number of these verbs may         effect is as if the colon were not there. Any number of these verbs may
5381         occur in a pattern.         occur in a pattern.
5382    
5383         PCRE contains some optimizations that are used to speed up matching  by         PCRE  contains some optimizations that are used to speed up matching by
5384         running some checks at the start of each match attempt. For example, it         running some checks at the start of each match attempt. For example, it
5385         may know the minimum length of matching subject, or that  a  particular         may  know  the minimum length of matching subject, or that a particular
5386         character  must  be present. When one of these optimizations suppresses         character must be present. When one of these  optimizations  suppresses
5387         the running of a match, any included backtracking verbs  will  not,  of         the  running  of  a match, any included backtracking verbs will not, of
5388         course, be processed. You can suppress the start-of-match optimizations         course, be processed. You can suppress the start-of-match optimizations
5389         by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_exec().         by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_exec().
5390    
5391     Verbs that act immediately     Verbs that act immediately
5392    
5393         The following verbs act as soon as they are encountered. They  may  not         The  following  verbs act as soon as they are encountered. They may not
5394         be followed by a name.         be followed by a name.
5395    
5396            (*ACCEPT)            (*ACCEPT)
5397    
5398         This  verb causes the match to end successfully, skipping the remainder         This verb causes the match to end successfully, skipping the  remainder
5399         of the pattern. When inside a recursion, only the innermost pattern  is         of  the pattern. When inside a recursion, only the innermost pattern is
5400         ended  immediately.  If  (*ACCEPT) is inside capturing parentheses, the         ended immediately. If (*ACCEPT) is inside  capturing  parentheses,  the
5401         data so far is captured. (This feature was added  to  PCRE  at  release         data  so  far  is  captured. (This feature was added to PCRE at release
5402         8.00.) For example:         8.00.) For example:
5403    
5404           A((?:A|B(*ACCEPT)|C)D)           A((?:A|B(*ACCEPT)|C)D)
5405    
5406         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-         This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
5407         tured by the outer parentheses.         tured by the outer parentheses.
5408    
5409           (*FAIL) or (*F)           (*FAIL) or (*F)
5410    
5411         This verb causes the match to fail, forcing backtracking to  occur.  It         This  verb  causes the match to fail, forcing backtracking to occur. It
5412         is  equivalent to (?!) but easier to read. The Perl documentation notes         is equivalent to (?!) but easier to read. The Perl documentation  notes
5413         that it is probably useful only when combined  with  (?{})  or  (??{}).         that  it  is  probably  useful only when combined with (?{}) or (??{}).
5414         Those  are,  of course, Perl features that are not present in PCRE. The         Those are, of course, Perl features that are not present in  PCRE.  The
5415         nearest equivalent is the callout feature, as for example in this  pat-         nearest  equivalent is the callout feature, as for example in this pat-
5416         tern:         tern:
5417    
5418           a+(?C)(*FAIL)           a+(?C)(*FAIL)
5419    
5420         A  match  with the string "aaaa" always fails, but the callout is taken         A match with the string "aaaa" always fails, but the callout  is  taken
5421         before each backtrack happens (in this example, 10 times).         before each backtrack happens (in this example, 10 times).
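The figure of 10 callouts can be checked by counting how often the greedy a+ reaches the callout point. This is a hand-written emulation in Python of that one pattern, not a use of PCRE's callout interface; the function name count_callouts is hypothetical:

```python
def count_callouts(subject):
    """Count visits to (?C) for the pattern a+(?C)(*FAIL), by emulation."""
    calls = 0
    # An unanchored match is attempted at each starting position in turn.
    for start in range(len(subject)):
        # Greedy a+ first takes the longest run of "a" from this position.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ then gives back one character per backtrack: run, run-1, ..., 1.
        # Each of these lengths reaches (?C) before (*FAIL) forces a failure.
        calls += run
    return calls


print(count_callouts("aaaa"))
```

For "aaaa" the four starting positions contribute 4, 3, 2, and 1 visits respectively.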
5422    
5423     Recording which path was taken     Recording which path was taken
5424    
5425         There is one verb whose main purpose  is  to  track  how  a  match  was         There  is  one  verb  whose  main  purpose  is to track how a match was
5426         arrived  at,  though  it  also  has a secondary use in conjunction with         arrived at, though it also has a  secondary  use  in  conjunction  with
5427         advancing the match starting point (see (*SKIP) below).         advancing the match starting point (see (*SKIP) below).
5428    
5429           (*MARK:NAME) or (*:NAME)           (*MARK:NAME) or (*:NAME)
5430    
5431         A name is always  required  with  this  verb.  There  may  be  as  many         A  name  is  always  required  with  this  verb.  There  may be as many
5432         instances  of  (*MARK) as you like in a pattern, and their names do not         instances of (*MARK) as you like in a pattern, and their names  do  not
5433         have to be unique.         have to be unique.
5434    
5435         When a match succeeds, the name  of  the  last-encountered  (*MARK)  is         When  a  match  succeeds,  the  name of the last-encountered (*MARK) is
5436         passed  back  to  the  caller  via  the  pcre_extra  data structure, as         passed back to  the  caller  via  the  pcre_extra  data  structure,  as
5437         described in the section on pcre_extra in the pcreapi documentation. No         described in the section on pcre_extra in the pcreapi documentation. No
5438         data  is  returned  for a partial match. Here is an example of pcretest         data is returned for a partial match. Here is an  example  of  pcretest
5439         output, where the /K modifier requests the retrieval and outputting  of         output,  where the /K modifier requests the retrieval and outputting of
5440         (*MARK) data:         (*MARK) data:
5441    
5442           /X(*MARK:A)Y|X(*MARK:B)Z/K           /X(*MARK:A)Y|X(*MARK:B)Z/K
# Line 5384  BACKTRACKING CONTROL Line 5448  BACKTRACKING CONTROL
5448           MK: B           MK: B
5449    
5450         The (*MARK) name is tagged with "MK:" in this output, and in this exam-         The (*MARK) name is tagged with "MK:" in this output, and in this exam-
5451         ple it indicates which of the two alternatives matched. This is a  more         ple  it indicates which of the two alternatives matched. This is a more
5452         efficient  way of obtaining this information than putting each alterna-         efficient way of obtaining this information than putting each  alterna-
5453         tive in its own capturing parentheses.         tive in its own capturing parentheses.
5454    
5455         A name may also be returned after a failed  match  if  the  final  path         A  name  may  also  be  returned after a failed match if the final path
5456         through  the  pattern involves (*MARK). However, unless (*MARK) is used in         through the pattern involves (*MARK). However, unless (*MARK) is used in
5457         conjunction with (*COMMIT), this is unlikely to  happen  for  an  unan-         conjunction  with  (*COMMIT),  this  is unlikely to happen for an unan-
5458         chored pattern because, as the starting point for matching is advanced,         chored pattern because, as the starting point for matching is advanced,
5459         the final check is often with an empty string, causing a failure before         the final check is often with an empty string, causing a failure before
5460         (*MARK) is reached. For example:         (*MARK) is reached. For example:
# Line 5400  BACKTRACKING CONTROL Line 5464  BACKTRACKING CONTROL
5464           No match           No match
5465    
5466         There are three potential starting points for this match (starting with         There are three potential starting points for this match (starting with
5467         X, starting with P, and with  an  empty  string).  If  the  pattern  is         X,  starting  with  P,  and  with  an  empty string). If the pattern is
5468         anchored, the result is different:         anchored, the result is different:
5469    
5470           /^X(*MARK:A)Y|^X(*MARK:B)Z/K           /^X(*MARK:A)Y|^X(*MARK:B)Z/K
5471           XP           XP
5472           No match, mark = B           No match, mark = B
5473    
5474         PCRE's  start-of-match  optimizations can also interfere with this. For         PCRE's start-of-match optimizations can also interfere with  this.  For
5475         example, if, as a result of a call to pcre_study(), it knows the  mini-         example,  if, as a result of a call to pcre_study(), it knows the mini-
5476         mum  subject  length for a match, a shorter subject will not be scanned         mum subject length for a match, a shorter subject will not  be  scanned
5477         at all.         at all.
5478    
5479         Note that similar anomalies (though different in detail) exist in Perl,         Note that similar anomalies (though different in detail) exist in Perl,
5480         no  doubt  for the same reasons. The use of (*MARK) data after a failed         no doubt for the same reasons. The use of (*MARK) data after  a  failed
5481         match of an unanchored pattern is not recommended, unless (*COMMIT)  is         match  of an unanchored pattern is not recommended, unless (*COMMIT) is
5482         involved.         involved.
5483    
5484     Verbs that act after backtracking     Verbs that act after backtracking
5485    
5486         The following verbs do nothing when they are encountered. Matching con-         The following verbs do nothing when they are encountered. Matching con-
5487         tinues with what follows, but if there is no subsequent match,  causing         tinues  with what follows, but if there is no subsequent match, causing
5488         a  backtrack  to  the  verb, a failure is forced. That is, backtracking         a backtrack to the verb, a failure is  forced.  That  is,  backtracking
5489         cannot pass to the left of the verb. However, when one of  these  verbs         cannot  pass  to the left of the verb. However, when one of these verbs
5490         appears  inside  an atomic group, its effect is confined to that group,         appears inside an atomic group, its effect is confined to  that  group,
5491         because once the group has been matched, there is never any  backtrack-         because  once the group has been matched, there is never any backtrack-
5492         ing  into  it.  In  this situation, backtracking can "jump back" to the         ing into it. In this situation, backtracking can  "jump  back"  to  the
5493         left of the entire atomic group. (Remember also, as stated above,  that         left  of the entire atomic group. (Remember also, as stated above, that
5494         this localization also applies in subroutine calls and assertions.)         this localization also applies in subroutine calls and assertions.)
5495    
5496         These  verbs  differ  in exactly what kind of failure occurs when back-         These verbs differ in exactly what kind of failure  occurs  when  back-
5497         tracking reaches them.         tracking reaches them.
5498    
5499           (*COMMIT)           (*COMMIT)
5500    
5501         This verb, which may not be followed by a name, causes the whole  match         This  verb, which may not be followed by a name, causes the whole match
5502         to fail outright if the rest of the pattern does not match. Even if the         to fail outright if the rest of the pattern does not match. Even if the
5503         pattern is unanchored, no further attempts to find a match by advancing         pattern is unanchored, no further attempts to find a match by advancing
5504         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
5505         pcre_exec() is committed to finding a match  at  the  current  starting         pcre_exec()  is  committed  to  finding a match at the current starting
5506         point, or not at all. For example:         point, or not at all. For example:
5507    
5508           a+(*COMMIT)b           a+(*COMMIT)b
5509    
5510         This  matches  "xxaab" but not "aacaab". It can be thought of as a kind         This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
5511         of dynamic anchor, or "I've started, so I must finish." The name of the         of dynamic anchor, or "I've started, so I must finish." The name of the
5512         most  recently passed (*MARK) in the path is passed back when (*COMMIT)         most recently passed (*MARK) in the path is passed back when  (*COMMIT)
5513         forces a match failure.         forces a match failure.
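The "dynamic anchor" behaviour can be sketched as a hand-written matcher for this one pattern. The Python below is an emulation under the assumption that start-of-match optimizations are off; it does not call PCRE, and the function name match_a_plus_commit_b is made up for the illustration:

```python
def match_a_plus_commit_b(subject):
    """Emulate a+(*COMMIT)b; return the matched text or None."""
    for start in range(len(subject)):
        pos = start
        # Greedy a+ consumes every "a" from the starting position.
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        if pos == start:
            # a+ failed before (*COMMIT) was reached, so the usual
            # advance to the next starting position still happens.
            continue
        # (*COMMIT) has now been passed: match here or not at all.
        if pos < len(subject) and subject[pos] == "b":
            return subject[start:pos + 1]
        # Backtracking onto (*COMMIT) fails the whole match outright.
        return None
    return None


print(match_a_plus_commit_b("xxaab"))
print(match_a_plus_commit_b("aacaab"))
```

In "aacaab" the later "aab" is never tried, because the failure at "c" happens after (*COMMIT) has been passed.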
5514    
5515         Note that (*COMMIT) at the start of a pattern is not  the  same  as  an         Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
5516         anchor,  unless  PCRE's start-of-match optimizations are turned off, as         anchor, unless PCRE's start-of-match optimizations are turned  off,  as
5517         shown in this pcretest example:         shown in this pcretest example:
5518    
5519           /(*COMMIT)abc/           /(*COMMIT)abc/
# Line 5458  BACKTRACKING CONTROL Line 5522  BACKTRACKING CONTROL
5522           xyzabc\Y           xyzabc\Y
5523           No match           No match
5524    
5525         PCRE knows that any match must start  with  "a",  so  the  optimization         PCRE  knows  that  any  match  must start with "a", so the optimization
5526         skips  along the subject to "a" before running the first match attempt,         skips along the subject to "a" before running the first match  attempt,
5527         which succeeds. When the optimization is disabled by the \Y  escape  in         which  succeeds.  When the optimization is disabled by the \Y escape in
5528         the second subject, the match starts at "x" and so the (*COMMIT) causes         the second subject, the match starts at "x" and so the (*COMMIT) causes
5529         it to fail without trying any other starting points.         it to fail without trying any other starting points.
5530    
5531           (*PRUNE) or (*PRUNE:NAME)           (*PRUNE) or (*PRUNE:NAME)
5532    
5533         This verb causes the match to fail at the current starting position  in         This  verb causes the match to fail at the current starting position in
5534         the  subject  if the rest of the pattern does not match. If the pattern         the subject if the rest of the pattern does not match. If  the  pattern
5535         is unanchored, the normal "bumpalong"  advance  to  the  next  starting         is  unanchored,  the  normal  "bumpalong"  advance to the next starting
5536         character  then happens. Backtracking can occur as usual to the left of         character then happens. Backtracking can occur as usual to the left  of
5537         (*PRUNE), before it is reached,  or  when  matching  to  the  right  of         (*PRUNE),  before  it  is  reached,  or  when  matching to the right of
5538         (*PRUNE),  but  if  there is no match to the right, backtracking cannot         (*PRUNE), but if there is no match to the  right,  backtracking  cannot
5539         cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an  alter-         cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
5540         native  to an atomic group or possessive quantifier, but there are some         native to an atomic group or possessive quantifier, but there are  some
5541         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
5542         iour  of  (*PRUNE:NAME)  is  the  same as (*MARK:NAME)(*PRUNE) when the         iour of (*PRUNE:NAME) is the  same  as  (*MARK:NAME)(*PRUNE)  when  the
5543         match fails completely; the name is passed back if this  is  the  final         match  fails  completely;  the name is passed back if this is the final
5544         attempt.   (*PRUNE:NAME)  does  not  pass back a name if the match suc-         attempt.  (*PRUNE:NAME) does not pass back a name  if  the  match  suc-
5545         ceeds. In an anchored pattern (*PRUNE) has the same  effect  as  (*COM-         ceeds.  In  an  anchored pattern (*PRUNE) has the same effect as (*COM-
5546         MIT).         MIT).
5547    
5548           (*SKIP)           (*SKIP)
5549    
5550         This  verb, when given without a name, is like (*PRUNE), except that if         This verb, when given without a name, is like (*PRUNE), except that  if
5551         the pattern is unanchored, the "bumpalong" advance is not to  the  next         the  pattern  is unanchored, the "bumpalong" advance is not to the next
5552         character, but to the position in the subject where (*SKIP) was encoun-         character, but to the position in the subject where (*SKIP) was encoun-
5553         tered. (*SKIP) signifies that whatever text was matched leading  up  to         tered.  (*SKIP)  signifies that whatever text was matched leading up to
5554         it cannot be part of a successful match. Consider:         it cannot be part of a successful match. Consider:
5555    
5556           a+(*SKIP)b           a+(*SKIP)b
5557    
5558         If  the  subject  is  "aaaac...",  after  the first match attempt fails         If the subject is "aaaac...",  after  the  first  match  attempt  fails
5559         (starting at the first character in the  string),  the  starting  point         (starting  at  the  first  character in the string), the starting point
5560         skips on to start the next attempt at "c". Note that a possessive quan-         skips on to start the next attempt at "c". Note that a possessive quan-
5561         tifier does not have the same effect as this example; although it  would         tifier  does not have the same effect as this example; although it would
5562         suppress  backtracking  during  the  first  match  attempt,  the second         suppress backtracking  during  the  first  match  attempt,  the  second
5563         attempt would start at the second character instead of skipping  on  to         attempt  would  start at the second character instead of skipping on to
5564         "c".         "c".
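The effect on the starting point can be made visible by recording which positions are tried. The Python below is a hand-written emulation of a+(*SKIP)b and of the plain a+b (whose starting positions are the same as for the possessive a++b); it is a sketch rather than PCRE itself, and the name start_positions is hypothetical:

```python
def start_positions(subject, skip):
    """Return (positions tried, match span or None) for a+b.

    With skip=True the pattern is treated as a+(*SKIP)b.
    """
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        pos = start
        # Greedy a+ consumes every "a" from the starting position.
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        if pos > start and pos < len(subject) and subject[pos] == "b":
            return tried, (start, pos + 1)
        if skip and pos > start:
            # (*SKIP) was encountered at pos: resume from there.
            start = pos
        else:
            # Normal "bumpalong" of one character.
            start += 1
    return tried, None


print(start_positions("aaaac", skip=True))
print(start_positions("aaaac", skip=False))
```

With (*SKIP) only positions 0 and 4 are tried; without it every position from 0 to 4 is tried.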
5565    
5566           (*SKIP:NAME)           (*SKIP:NAME)
5567    
5568         When  (*SKIP) has an associated name, its behaviour is modified. If the         When (*SKIP) has an associated name, its behaviour is modified. If  the
5569         following pattern fails to match, the previous path through the pattern         following pattern fails to match, the previous path through the pattern
5570         is  searched for the most recent (*MARK) that has the same name. If one         is searched for the most recent (*MARK) that has the same name. If  one
5571         is found, the "bumpalong" advance is to the subject position that  cor-         is  found, the "bumpalong" advance is to the subject position that cor-
5572         responds  to  that (*MARK) instead of to where (*SKIP) was encountered.         responds to that (*MARK) instead of to where (*SKIP)  was  encountered.
5573         If no (*MARK) with a matching name is found, normal "bumpalong" of  one         If  no (*MARK) with a matching name is found, normal "bumpalong" of one
5574         character happens (the (*SKIP) is ignored).         character happens (the (*SKIP) is ignored).
5575    
5576           (*THEN) or (*THEN:NAME)           (*THEN) or (*THEN:NAME)
5577    
5578         This verb causes a skip to the next alternation if the rest of the pat-         This verb causes a skip  to  the  next  alternation  in  the  innermost
5579         tern does not match. That is, it cancels pending backtracking, but only         enclosing  group if the rest of the pattern does not match. That is, it
5580         within  the  current  alternation.  Its name comes from the observation         cancels pending backtracking, but only within the current  alternation.
5581         that it can be used for a pattern-based if-then-else block:         Its  name comes from the observation that it can be used for a pattern-
5582           based if-then-else block:
5583    
5584           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
5585    
# Line 5525  BACKTRACKING CONTROL Line 5590  BACKTRACKING CONTROL
5590         (*MARK:NAME)(*THEN) if the overall  match  fails.  If  (*THEN)  is  not         (*MARK:NAME)(*THEN) if the overall  match  fails.  If  (*THEN)  is  not
5591         directly inside an alternation, it acts like (*PRUNE).         directly inside an alternation, it acts like (*PRUNE).
5592    
5593           The above verbs provide four different "strengths" of control when sub-
5594           sequent matching fails. (*THEN) is the weakest, carrying on  the  match
5595           at  the next alternation. (*PRUNE) comes next, failing the match at the
5596           current starting position, but allowing an advance to the next  charac-
5597           ter  (for  an  unanchored pattern). (*SKIP) is similar, except that the
5598           advance may be more than one character.  (*COMMIT)  is  the  strongest,
5599           causing the entire match to fail.
5600    
5601           If  more than one is present in a pattern, the "strongest" one wins. For
5602           example, consider this pattern, where A, B, etc.  are  complex  pattern
5603           fragments:
5604    
5605             (A(*COMMIT)B(*THEN)C|D)
5606    
5607           Once  A  has  matched,  PCRE is committed to this match, at the current
5608           starting position. If subsequently B matches, but C does not, the  nor-
5609           mal (*THEN) action of trying the next alternation (that is, D) does not
5610           happen because (*COMMIT) overrides.
5611    
5612    
5613  SEE ALSO  SEE ALSO
5614    
# Line 5540  AUTHOR Line 5624  AUTHOR
5624    
5625  REVISION  REVISION
5626    
5627         Last updated: 18 May 2010         Last updated: 31 October 2010
5628         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
5629  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5630    
5631    
5632  PCRESYNTAX(3)                                                    PCRESYNTAX(3)  PCRESYNTAX(3)                                                    PCRESYNTAX(3)
5633    
5634    
# Line 5912  REVISION Line 5996  REVISION
5996         Last updated: 12 May 2010         Last updated: 12 May 2010
5997         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
5998  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5999    
6000    
6001  PCREPARTIAL(3)                                                  PCREPARTIAL(3)  PCREPARTIAL(3)                                                  PCREPARTIAL(3)
6002    
6003    
# Line 5941  PARTIAL MATCHING IN PCRE Line 6025  PARTIAL MATCHING IN PCRE
6025         reflecting the character that has been typed, for example. This immedi-         reflecting the character that has been typed, for example. This immedi-
6026         ate  feedback is likely to be a better user interface than a check that         ate  feedback is likely to be a better user interface than a check that
6027         is delayed until the entire string has been entered.  Partial  matching         is delayed until the entire string has been entered.  Partial  matching
6028         can  also  sometimes be useful when the subject string is very long and         can  also be useful when the subject string is very long and is not all
6029         is not all available at once.         available at once.
6030    
6031         PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and         PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
6032         PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or         PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
# Line 5964  PARTIAL MATCHING IN PCRE Line 6048  PARTIAL MATCHING IN PCRE
6048    
6049  PARTIAL MATCHING USING pcre_exec()  PARTIAL MATCHING USING pcre_exec()
6050    
6051         A partial match occurs during a call to pcre_exec() whenever the end of         A partial match occurs during a call to pcre_exec() when the end of the
6052         the subject string is reached successfully, but  matching  cannot  con-         subject string is reached successfully, but  matching  cannot  continue
6053         tinue because more characters are needed. However, at least one charac-         because  more characters are needed. However, at least one character in
6054         ter must have been matched. (In other words, a partial match can  never         the subject must have been inspected. This character need not form part
6055         be an empty string.)         of  the  final  matched string; lookbehind assertions and the \K escape
6056           sequence provide ways of inspecting characters before the  start  of  a
6057         If  PCRE_PARTIAL_SOFT  is  set,  the  partial  match is remembered, but         matched  substring. The requirement for inspecting at least one charac-
6058         matching continues as normal, and other alternatives in the pattern are         ter exists because an empty string can always be matched; without  such
6059         tried.   If  no  complete  match  can  be  found,  pcre_exec()  returns         a  restriction there would always be a partial match of an empty string
6060         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least         at the end of the subject.
6061         two slots in the offsets vector, the first of them is set to the offset  
6062         of the earliest character that was inspected when the partial match was         If there are at least two slots in the offsets vector when  pcre_exec()
6063         found.  For  convenience,  the  second  offset points to the end of the         returns  with  a  partial match, the first slot is set to the offset of
6064         string so that a substring can easily be identified.         the earliest character that was inspected when the  partial  match  was
6065           found. For convenience, the second offset points to the end of the sub-
6066           ject so that a substring can easily be identified.
6067    
6068         For the majority of patterns, the first offset identifies the start  of         For the majority of patterns, the first offset identifies the start  of
6069         the  partially matched string. However, for patterns that contain look-         the  partially matched string. However, for patterns that contain look-
# Line 5989  PARTIAL MATCHING USING pcre_exec() Line 6075  PARTIAL MATCHING USING pcre_exec()
6075         This pattern matches "123", but only if it is preceded by "abc". If the         This pattern matches "123", but only if it is preceded by "abc". If the
6076         subject string is "xyzabc12", the offsets after a partial match are for         subject string is "xyzabc12", the offsets after a partial match are for
6077         the  substring  "abc12",  because  all  these  characters are needed if         the  substring  "abc12",  because  all  these  characters are needed if
6078         another match is tried with extra characters added.         another match is tried with extra characters added to the subject.
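
The substring identified by the two offsets is what a caller has to keep if the match is to be retried with more data. A minimal sketch in C of that buffer handling follows. It does not call PCRE; keep_partial_tail() is a hypothetical helper, and the offsets 3 and 8 are the positions of "abc12" within "xyzabc12", worked out by hand rather than quoted from the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build the subject for the next match attempt: the text between the two
 * offsets reported for a partial match, followed by the newly arrived data.
 * keep_partial_tail() is a hypothetical helper, not part of the PCRE API.
 * Returns a malloc'd NUL-terminated string (the caller frees it), or NULL
 * if memory cannot be obtained. */
char *keep_partial_tail(const char *subject, int start, int end,
                        const char *segment)
{
    assert(start >= 0 && start <= end);

    size_t kept = (size_t)(end - start);   /* length of e.g. "abc12" */
    size_t added = strlen(segment);
    char *next = malloc(kept + added + 1);

    if (next == NULL)
        return NULL;
    memcpy(next, subject + start, kept);       /* text still needed */
    memcpy(next + kept, segment, added + 1);   /* new data and its NUL */
    return next;
}
```

With the subject "xyzabc12", offsets 3 and 8, and a new segment "3", the helper returns "abc123", which is the shortest subject that still contains the lookbehind text.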
6079    
6080           What happens when a partial match is identified depends on which of the
6081           two partial matching options is set.
6082    
6083       PCRE_PARTIAL_SOFT with pcre_exec()
6084    
6085           If  PCRE_PARTIAL_SOFT  is  set  when  pcre_exec()  identifies a partial
6086           match, the partial match is remembered, but matching continues as  nor-
6087           mal,  and  other  alternatives in the pattern are tried. If no complete
6088           match can be found, pcre_exec() returns PCRE_ERROR_PARTIAL  instead  of
6089           PCRE_ERROR_NOMATCH.
6090    
6091           This  option  is "soft" because it prefers a complete match over a par-
6092           tial match.  All the various matching items in a pattern behave  as  if
6093           the  subject string is potentially complete. For example, \z, \Z, and $
6094           match at the end of the subject, as normal, and for \b and \B  the  end
6095           of the subject is treated as a non-alphanumeric.
6096    
6097         If there is more than one partial match, the first one that  was  found         If  there  is more than one partial match, the first one that was found
6098         provides the data that is returned. Consider this pattern:         provides the data that is returned. Consider this pattern:
6099    
6100           /123\w+X|dogY/           /123\w+X|dogY/
6101    
6102         If  this is matched against the subject string "abc123dog", both alter-         If this is matched against the subject string "abc123dog", both  alter-
6103         natives fail to match, but the end of the  subject  is  reached  during         natives  fail  to  match,  but the end of the subject is reached during
6104         matching,    so    PCRE_ERROR_PARTIAL    is    returned    instead   of         matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set  to  3
6105         PCRE_ERROR_NOMATCH. The  offsets  are  set  to  3  and  9,  identifying         and  9, identifying "123dog" as the first partial match that was found.
6106         "123dog"  as  the first partial match that was found. (In this example,         (In this example, there are two partial matches, because "dog"  on  its
6107         there are two partial matches,  because  "dog"  on  its  own  partially         own partially matches the second alternative.)
6108         matches the second alternative.)  
6109       PCRE_PARTIAL_HARD with pcre_exec()
6110    
6111         If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-         If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-
6112         TIAL as soon as a partial match is found, without continuing to  search         TIAL as soon as a partial match is found, without continuing to  search
6113         for  possible  complete matches. The difference between the two options         for possible complete matches. This option is "hard" because it prefers
6114         can be illustrated by a pattern such as:         an earlier partial match over a later complete match. For this  reason,
6115           the  assumption is made that the end of the supplied subject string may
6116           not be the true end of the available data, and so, if \z, \Z,  \b,  \B,
6117           or  $  are  encountered  at  the  end  of  the  subject,  the result is
6118           PCRE_ERROR_PARTIAL.
6119    
6120       Comparing hard and soft partial matching
6121    
6122           The difference between the two partial matching options can  be  illus-
6123           trated by a pattern such as:
6124    
6125           /dog(sbody)?/           /dog(sbody)?/
6126    
6127         This matches either "dog" or "dogsbody", greedily (that is, it  prefers         This  matches either "dog" or "dogsbody", greedily (that is, it prefers
6128         the  longer  string  if  possible). If it is matched against the string         the longer string if possible). If it is  matched  against  the  string
6129         "dog" with PCRE_PARTIAL_SOFT, it yields a  complete  match  for  "dog".         "dog"  with  PCRE_PARTIAL_SOFT,  it  yields a complete match for "dog".
6130         However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.         However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
6131         On the other hand, if the pattern is made ungreedy the result  is  dif-         On  the  other hand, if the pattern is made ungreedy the result is dif-
6132         ferent:         ferent:
6133    
6134           /dog(sbody)??/           /dog(sbody)??/
6135    
6136         In  this case the result is always a complete match because pcre_exec()         In this case the result is always a complete match because  pcre_exec()
6137         finds that first, and it never continues  after  finding  a  match.  It         finds  that  first,  and  it  never continues after finding a match. It
6138         might  be easier to follow this explanation by thinking of the two pat-         might be easier to follow this explanation by thinking of the two  pat-
6139         terns like this:         terns like this:
6140    
6141           /dog(sbody)?/    is the same as  /dogsbody|dog/           /dog(sbody)?/    is the same as  /dogsbody|dog/
6142           /dog(sbody)??/   is the same as  /dog|dogsbody/           /dog(sbody)??/   is the same as  /dog|dogsbody/
6143    
6144         The second pattern will never  match  "dogsbody"  when  pcre_exec()  is         The  second  pattern  will  never  match "dogsbody" when pcre_exec() is
6145         used, because it will always find the shorter match first.         used, because it will always find the shorter match first.
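
The comparison above can be modelled with a toy matcher that tries literal alternatives in order, anchored at the start of the subject. This is only a sketch of the precedence rules under that simplification; try_in_order() and the outcome names are assumptions made for illustration, and nothing here is PCRE code.

```c
#include <assert.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL, COMPLETE };

/* Toy model, not PCRE: try literal alternatives in the order given,
 * anchored at the start of the subject, as a backtracking matcher would.
 * "hard" is non-zero to model PCRE_PARTIAL_HARD, zero for
 * PCRE_PARTIAL_SOFT. */
enum outcome try_in_order(const char *alts[], int n,
                          const char *subject, int hard)
{
    size_t slen = strlen(subject);
    int saw_partial = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        size_t alen = strlen(alts[i]);

        /* The whole alternative is present: a complete match ends the
         * search, whatever was remembered before. */
        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            return COMPLETE;

        /* The subject ran out while the alternative was still matching,
         * and at least one character was inspected: a partial match. */
        if (slen > 0 && slen < alen &&
            strncmp(subject, alts[i], slen) == 0) {
            if (hard)
                return PARTIAL;    /* hard: stop at the first partial match */
            saw_partial = 1;       /* soft: remember it and carry on */
        }
    }
    return saw_partial ? PARTIAL : NO_MATCH;
}
```

Called with the alternatives "dogsbody", "dog" and the subject "dog", the model gives COMPLETE for the soft setting and PARTIAL for the hard one. With the alternatives in the order "dog", "dogsbody" it gives COMPLETE for both.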
6146    
6147    
6148  PARTIAL MATCHING USING pcre_dfa_exec()  PARTIAL MATCHING USING pcre_dfa_exec()
6149    
6150         The  pcre_dfa_exec()  function moves along the subject string character         The pcre_dfa_exec() function moves along the subject  string  character
6151         by character, without backtracking, searching for all possible  matches         by  character, without backtracking, searching for all possible matches
6152         simultaneously.  If the end of the subject is reached before the end of         simultaneously. If the end of the subject is reached before the end  of
6153         the pattern, there is the possibility of a partial  match,  again  pro-         the  pattern,  there  is the possibility of a partial match, again pro-
6154         vided that at least one character has matched.         vided that at least one character has been inspected.
6155    
6156         When  PCRE_PARTIAL_SOFT  is set, PCRE_ERROR_PARTIAL is returned only if         When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned  only  if
6157         there have been no complete matches. Otherwise,  the  complete  matches         there  have  been  no complete matches. Otherwise, the complete matches
6158         are  returned.   However,  if PCRE_PARTIAL_HARD is set, a partial match         are returned.  However, if PCRE_PARTIAL_HARD is set,  a  partial  match
6159         takes precedence over any complete matches. The portion of  the  string         takes  precedence  over any complete matches. The portion of the string
6160         that  was  inspected when the longest partial match was found is set as         that was inspected when the longest partial match was found is  set  as
6161         the first matching string, provided there are at least two slots in the         the first matching string, provided there are at least two slots in the
6162         offsets vector.         offsets vector.
6163    
6164         Because  pcre_dfa_exec()  always searches for all possible matches, and         Because pcre_dfa_exec() always searches for all possible  matches,  and
6165         there is no difference between greedy and ungreedy repetition, its  be-         there  is no difference between greedy and ungreedy repetition, its be-
6166         haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-         haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-
6167         sider the string "dog"  matched  against  the  ungreedy  pattern  shown         sider  the  string  "dog"  matched  against  the ungreedy pattern shown
6168         above:         above:
6169    
6170           /dog(sbody)??/           /dog(sbody)??/
6171    
6172         Whereas  pcre_exec()  stops  as soon as it finds the complete match for         Whereas pcre_exec() stops as soon as it finds the  complete  match  for
6173         "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and         "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
6174         so returns that when PCRE_PARTIAL_HARD is set.         so returns that when PCRE_PARTIAL_HARD is set.
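
The difference can be pictured with a toy model that inspects every literal alternative before deciding, so that the order of the alternatives has no effect. This is a sketch under that simplification, not the algorithm used by pcre_dfa_exec(); scan_all() and the outcome names are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL, COMPLETE };

/* Toy model, not PCRE: examine all literal alternatives, anchored at the
 * start of the subject, and only then choose the result. "hard" is
 * non-zero to model PCRE_PARTIAL_HARD, zero for PCRE_PARTIAL_SOFT. */
enum outcome scan_all(const char *alts[], int n,
                      const char *subject, int hard)
{
    size_t slen = strlen(subject);
    int complete = 0, partial = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        size_t alen = strlen(alts[i]);

        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            complete = 1;      /* whole alternative found */
        else if (slen > 0 && slen < alen &&
                 strncmp(subject, alts[i], slen) == 0)
            partial = 1;       /* subject ended part way through */
    }

    /* Hard: a partial match takes precedence over complete matches.
     * Soft: a partial match is reported only if nothing completed. */
    if (partial && (hard || !complete))
        return PARTIAL;
    return complete ? COMPLETE : NO_MATCH;
}
```

For the alternatives "dog", "dogsbody" and the subject "dog", the model gives PARTIAL for the hard setting and COMPLETE for the soft one, whichever way round the alternatives are listed.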
6175    
6176    
6177  PARTIAL MATCHING AND WORD BOUNDARIES  PARTIAL MATCHING AND WORD BOUNDARIES
6178    
6179         If  a  pattern ends with one of the sequences \b or \B, which test for word         If a pattern ends with one of the sequences \b or \B, which test  for  word
6180         boundaries, partial matching with PCRE_PARTIAL_SOFT can  give  counter-         boundaries,  partial  matching with PCRE_PARTIAL_SOFT can give counter-
6181         intuitive results. Consider this pattern:         intuitive results. Consider this pattern:
6182    
6183           /\bcat\b/           /\bcat\b/
6184    
6185         This matches "cat", provided there is a word boundary at either end. If         This matches "cat", provided there is a word boundary at either end. If
6186         the subject string is "the cat", the comparison of the final "t" with a         the subject string is "the cat", the comparison of the final "t" with a
6187         following  character  cannot  take  place, so a partial match is found.         following character cannot take place, so a  partial  match  is  found.
6188         However, pcre_exec() carries on with normal matching, which matches  \b         However,  pcre_exec() carries on with normal matching, which matches \b
6189         at  the  end  of  the subject when the last character is a letter, thus         at the end of the subject when the last character  is  a  letter,  thus
6190         finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-         finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
6191         TIAL.  The  same  thing  happens  with pcre_dfa_exec(), because it also         TIAL. The same thing happens  with  pcre_dfa_exec(),  because  it  also
6192         finds the complete match.         finds the complete match.
6193    
6194         Using PCRE_PARTIAL_HARD in this  case  does  yield  PCRE_ERROR_PARTIAL,         Using  PCRE_PARTIAL_HARD  in  this  case does yield PCRE_ERROR_PARTIAL,
6195         because then the partial match takes precedence.         because then the partial match takes precedence.
6196    
6197    
6198  FORMERLY RESTRICTED PATTERNS  FORMERLY RESTRICTED PATTERNS
6199    
6200         For releases of PCRE prior to 8.00, because of the way certain internal         For releases of PCRE prior to 8.00, because of the way certain internal
6201         optimizations  were  implemented  in  the  pcre_exec()  function,   the         optimizations   were  implemented  in  the  pcre_exec()  function,  the
6202         PCRE_PARTIAL  option  (predecessor  of  PCRE_PARTIAL_SOFT) could not be         PCRE_PARTIAL option (predecessor of  PCRE_PARTIAL_SOFT)  could  not  be
6203         used with all patterns. From release 8.00 onwards, the restrictions  no         used  with all patterns. From release 8.00 onwards, the restrictions no
6204         longer  apply,  and  partial matching with pcre_exec() can be requested         longer apply, and partial matching with pcre_exec()  can  be  requested
6205         for any pattern.         for any pattern.
6206    
6207         Items that were formerly restricted were repeated single characters and         Items that were formerly restricted were repeated single characters and
6208         repeated  metasequences. If PCRE_PARTIAL was set for a pattern that did         repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
6209         not conform to the restrictions, pcre_exec() returned  the  error  code         not  conform  to  the restrictions, pcre_exec() returned the error code
6210         PCRE_ERROR_BADPARTIAL  (-13).  This error code is no longer in use. The         PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
6211         PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if  a  compiled         PCRE_INFO_OKPARTIAL  call  to pcre_fullinfo() to find out if a compiled
6212         pattern can be used for partial matching now always returns 1.         pattern can be used for partial matching now always returns 1.
6213    
6214    
6215  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
6216    
6217         If  the  escape  sequence  \P  is  present in a pcretest data line, the         If the escape sequence \P is present  in  a  pcretest  data  line,  the
6218         PCRE_PARTIAL_SOFT option is used for  the  match.  Here  is  a  run  of         PCRE_PARTIAL_SOFT  option  is  used  for  the  match.  Here is a run of
6219         pcretest that uses the date example quoted above:         pcretest that uses the date example quoted above:
6220    
6221             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
# Line 6118  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 6231  EXAMPLE OF PARTIAL MATCHING USING PCRETE
6231           data> j\P           data> j\P
6232           No match           No match
6233    
6234         The  first  data  string  is  matched completely, so pcretest shows the         The first data string is matched  completely,  so  pcretest  shows  the
6235         matched substrings. The remaining four strings do not  match  the  com-         matched  substrings.  The  remaining four strings do not match the com-
6236         plete pattern, but the first two are partial matches. Similar output is         plete pattern, but the first two are partial matches. Similar output is
6237         obtained when pcre_dfa_exec() is used.         obtained when pcre_dfa_exec() is used.
6238    
6239         If the escape sequence \P is present more than once in a pcretest  data         If  the escape sequence \P is present more than once in a pcretest data
6240         line, the PCRE_PARTIAL_HARD option is set for the match.         line, the PCRE_PARTIAL_HARD option is set for the match.
6241    
6242    
6243  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
6244    
6245         When a partial match has been found using pcre_dfa_exec(), it is possi-         When a partial match has been found using pcre_dfa_exec(), it is possi-
6246         ble to continue the match by  providing  additional  subject  data  and         ble  to  continue  the  match  by providing additional subject data and
6247         calling  pcre_dfa_exec()  again  with the same compiled regular expres-         calling pcre_dfa_exec() again with the same  compiled  regular  expres-
6248         sion, this time setting the PCRE_DFA_RESTART option. You must pass  the         sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
6249         same working space as before, because this is where details of the pre-         same working space as before, because this is where details of the pre-
6250         vious partial match are stored. Here  is  an  example  using  pcretest,         vious  partial  match  are  stored.  Here is an example using pcretest,
6251         using  the  \R  escape  sequence to set the PCRE_DFA_RESTART option (\D         using the \R escape sequence to set  the  PCRE_DFA_RESTART  option  (\D
6252         specifies the use of pcre_dfa_exec()):         specifies the use of pcre_dfa_exec()):
6253    
6254             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
# Line 6144  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 6257  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
6257           data> n05\R\D           data> n05\R\D
6258            0: n05            0: n05
6259    
6260         The first call has "23ja" as the subject, and requests  partial  match-         The  first  call has "23ja" as the subject, and requests partial match-
6261         ing;  the  second  call  has  "n05"  as  the  subject for the continued         ing; the second call  has  "n05"  as  the  subject  for  the  continued
6262         (restarted) match.  Notice that when the match is  complete,  only  the         (restarted)  match.   Notice  that when the match is complete, only the
6263         last  part  is  shown;  PCRE  does not retain the previously partially-         last part is shown; PCRE does  not  retain  the  previously  partially-
6264         matched string. It is up to the calling program to do that if it  needs         matched  string. It is up to the calling program to do that if it needs
6265         to.         to.
6266    
6267         You  can  set  the  PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with         You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD  options  with
6268         PCRE_DFA_RESTART to continue partial matching over  multiple  segments.         PCRE_DFA_RESTART  to  continue partial matching over multiple segments.
6269         This  facility  can  be  used  to  pass  very  long  subject strings to         This facility can  be  used  to  pass  very  long  subject  strings  to
6270         pcre_dfa_exec().         pcre_dfa_exec().
6271    
6272    
6273  MULTI-SEGMENT MATCHING WITH pcre_exec()  MULTI-SEGMENT MATCHING WITH pcre_exec()
6274    
6275         From release 8.00, pcre_exec() can also be  used  to  do  multi-segment         From  release  8.00,  pcre_exec()  can also be used to do multi-segment
6276         matching.  Unlike  pcre_dfa_exec(),  it  is not possible to restart the         matching. Unlike pcre_dfa_exec(), it is not  possible  to  restart  the
6277         previous match with a new segment of data. Instead, new  data  must  be         previous  match  with  a new segment of data. Instead, new data must be
6278         added  to  the  previous  subject  string, and the entire match re-run,         added to the previous subject string,  and  the  entire  match  re-run,
6279         starting from the point where the partial match occurred. Earlier  data         starting  from the point where the partial match occurred. Earlier data
6280         can be discarded.  Consider an unanchored pattern that matches dates:         can be discarded. It is best to use PCRE_PARTIAL_HARD  in  this  situa-
6281           tion,  because it does not treat the end of a segment as the end of the
6282           subject when matching \z, \Z, \b, \B, and  $.  Consider  an  unanchored
6283           pattern that matches dates:
6284    
6285             re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/             re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
6286           data> The date is 23ja\P           data> The date is 23ja\P\P
6287           Partial match: 23ja           Partial match: 23ja
6288    
6289         At  this stage, an application could discard the text preceding "23ja",         At  this stage, an application could discard the text preceding "23ja",
# Line 6188  ISSUES WITH MULTI-SEGMENT MATCHING Line 6304  ISSUES WITH MULTI-SEGMENT MATCHING
6304         Certain types of pattern may give problems with multi-segment matching,         Certain types of pattern may give problems with multi-segment matching,
6305         whichever matching function is used.         whichever matching function is used.
6306    
6307         1. If the pattern contains tests for the beginning or end  of  a  line,         1. If the pattern contains a test for the beginning of a line, you need
6308         you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-         to  pass  the  PCRE_NOTBOL  option when the subject string for any call
6309         ate, when the subject string for any call does not contain  the  begin-         does not start at the beginning of a line.  There  is  also  a  PCRE_NOTEOL
6310         ning or end of a line.         option, but in practice when doing multi-segment matching you should be
6311           using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.
6312         2.  Lookbehind  assertions at the start of a pattern are catered for in  
6313         the offsets that are returned for a partial match. However, in  theory,         2. Lookbehind assertions at the start of a pattern are catered  for  in
6314         a  lookbehind assertion later in the pattern could require even earlier         the  offsets that are returned for a partial match. However, in theory,
6315         characters to be inspected, and it might not have been reached  when  a         a lookbehind assertion later in the pattern could require even  earlier
6316         partial  match occurs. This is probably an extremely unlikely case; you         characters  to  be inspected, and it might not have been reached when a
6317         could guard against it to a certain extent by  always  including  extra         partial match occurs. This is probably an extremely unlikely case;  you
6318           could  guard  against  it to a certain extent by always including extra
6319         characters at the start.         characters at the start.
6320    
6321         3.  Matching  a subject string that is split into multiple segments may         3. Matching a subject string that is split into multiple  segments  may
6322         not always produce exactly the same result as matching over one  single         not  always produce exactly the same result as matching over one single
6323         long  string,  especially  when  PCRE_PARTIAL_SOFT is used. The section         long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
6324         "Partial Matching and Word Boundaries" above describes  an  issue  that         "Partial  Matching  and  Word Boundaries" above describes an issue that
6325         arises  if  the  pattern ends with \b or \B. Another kind of difference         arises if the pattern ends with \b or \B. Another  kind  of  difference
6326         may occur when there are multiple  matching  possibilities,  because  a         may  occur when there are multiple matching possibilities, because (for
6327         partial match result is given only when there are no completed matches.         PCRE_PARTIAL_SOFT) a partial match result is given only when there  are
6328         This means that as soon as the shortest match has been found, continua-         no completed matches. This means that as soon as the shortest match has
6329         tion  to  a  new subject segment is no longer possible.  Consider again         been found, continuation to a new subject segment is no  longer  possi-
6330         this pcretest example:         ble. Consider again this pcretest example:
6331    
6332             re> /dog(sbody)?/             re> /dog(sbody)?/
6333           data> dogsb\P           data> dogsb\P
# Line 6223  ISSUES WITH MULTI-SEGMENT MATCHING Line 6340  ISSUES WITH MULTI-SEGMENT MATCHING
6340            0: dogsbody            0: dogsbody
6341            1: dog            1: dog
6342    
6343         The first data line passes the string "dogsb" to  pcre_exec(),  setting         The  first  data line passes the string "dogsb" to pcre_exec(), setting
6344         the  PCRE_PARTIAL_SOFT  option.  Although the string is a partial match         the PCRE_PARTIAL_SOFT option. Although the string is  a  partial  match
6345         for "dogsbody", the  result  is  not  PCRE_ERROR_PARTIAL,  because  the         for  "dogsbody",  the  result  is  not  PCRE_ERROR_PARTIAL, because the
6346         shorter  string  "dog" is a complete match. Similarly, when the subject         shorter string "dog" is a complete match. Similarly, when  the  subject
6347         is presented to pcre_dfa_exec() in several parts ("do" and "gsb"  being         is  presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
6348         the first two) the match stops when "dog" has been found, and it is not         the first two) the match stops when "dog" has been found, and it is not
6349         possible to continue. On the other hand, if "dogsbody" is presented  as         possible  to continue. On the other hand, if "dogsbody" is presented as
6350         a single string, pcre_dfa_exec() finds both matches.         a single string, pcre_dfa_exec() finds both matches.
6351    
6352         Because of these problems, it is probably best to use PCRE_PARTIAL_HARD         Because of these problems, it is best  to  use  PCRE_PARTIAL_HARD  when
6353         when matching multi-segment data. The example above then  behaves  dif-         matching  multi-segment  data.  The  example above then behaves differ-
6354         ferently:         ently:
6355    
6356             re> /dog(sbody)?/             re> /dog(sbody)?/
6357           data> dogsb\P\P           data> dogsb\P\P
# Line 6246  ISSUES WITH MULTI-SEGMENT MATCHING Line 6363  ISSUES WITH MULTI-SEGMENT MATCHING
6363    
6364    
6365         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
6366         start with the  same  pattern  item  may  not  work  as  expected  when         start  with  the  same  pattern  item  may  not  work  as expected when
6367         PCRE_DFA_RESTART  is  used  with pcre_dfa_exec(). For example, consider         PCRE_DFA_RESTART is used with pcre_dfa_exec().  For  example,  consider
6368         this pattern:         this pattern:
6369    
6370           1234|3789           1234|3789
6371    
6372         If the first part of the subject is "ABC123", a partial  match  of  the         If  the  first  part of the subject is "ABC123", a partial match of the
6373         first  alternative  is found at offset 3. There is no partial match for         first alternative is found at offset 3. There is no partial  match  for
6374         the second alternative, because such a match does not start at the same         the second alternative, because such a match does not start at the same
6375         point  in  the  subject  string. Attempting to continue with the string         point in the subject string. Attempting to  continue  with  the  string
6376         "7890" does not yield a match  because  only  those  alternatives  that         "7890"  does  not  yield  a  match because only those alternatives that
6377         match  at  one  point in the subject are remembered. The problem arises         match at one point in the subject are remembered.  The  problem  arises
6378         because the start of the second alternative matches  within  the  first         because  the  start  of the second alternative matches within the first
6379         alternative.  There  is  no  problem with anchored patterns or patterns         alternative. There is no problem with  anchored  patterns  or  patterns
6380         such as:         such as:
6381    
6382           1234|ABCD           1234|ABCD
6383    
6384         where no string can be a partial match for both alternatives.  This  is         where  no  string can be a partial match for both alternatives. This is
6385         not  a  problem if pcre_exec() is used, because the entire match has to         not a problem if pcre_exec() is used, because the entire match  has  to
6386         be rerun each time:         be rerun each time:
6387    
6388             re> /1234|3789/             re> /1234|3789/
6389           data> ABC123\P           data> ABC123\P\P
6390           Partial match: 123           Partial match: 123
6391           data> 1237890           data> 1237890
6392            0: 3789            0: 3789
6393    
6394         Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-         Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
6395         running the entire match can also be used with pcre_dfa_exec(). Another         running the entire match can also be used with pcre_dfa_exec(). Another
6396         possibility is to work with two buffers. If a partial match at offset n         possibility is to work with two buffers. If a partial match at offset n
6397         in  the first buffer is followed by "no match" when PCRE_DFA_RESTART is         in the first buffer is followed by "no match" when PCRE_DFA_RESTART  is
6398         used on the second buffer, you can then try a  new  match  starting  at         used  on  the  second  buffer, you can then try a new match starting at
6399         offset n+1 in the first buffer.         offset n+1 in the first buffer.
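
The re-run technique described in point 4 can be sketched without PCRE itself. The block below is a minimal illustration under stated assumptions: toy_match() and toy_match_two_segments() are hypothetical helpers that handle only literal alternatives such as 1234|3789, and the TOY_* codes are invented for this sketch. They are not PCRE functions or PCRE return values. The sketch keeps the text from the partial match offset, appends the next segment, and matches the joined text from scratch, which is what the pcre_exec() session above does with "1237890".

```c
#include <stddef.h>
#include <string.h>

/* Result codes for the toy matcher; hypothetical, not PCRE's values. */
enum { TOY_NOMATCH = -1, TOY_PARTIAL = -2, TOY_NOSPACE = -3 };

/* Toy stand-in for a partial-matching call over literal alternatives.
 * Scans left to right and stops at the first offset where either some
 * alternative matches completely (returns its index) or the rest of the
 * subject is a proper prefix of an alternative (returns TOY_PARTIAL).
 * *start receives that offset. Returns TOY_NOMATCH otherwise. */
static int toy_match(const char *const *alts, int nalts,
                     const char *subject, size_t len, size_t *start)
{
    for (size_t i = 0; i < len; i++) {
        size_t avail = len - i;
        int partial = 0;
        for (int a = 0; a < nalts; a++) {
            size_t alen = strlen(alts[a]);
            if (avail >= alen) {
                if (memcmp(subject + i, alts[a], alen) == 0) {
                    *start = i;
                    return a;              /* complete match */
                }
            } else if (memcmp(subject + i, alts[a], avail) == 0) {
                partial = 1;               /* subject ends inside alts[a] */
            }
        }
        if (partial) {
            *start = i;
            return TOY_PARTIAL;
        }
    }
    return TOY_NOMATCH;
}

/* Re-run technique: after a partial match in seg1, copy the text from
 * the partial match offset into work, append seg2, and match the joined
 * text from the beginning. Every alternative is tried again, including
 * ones that start inside the earlier partial match. */
static int toy_match_two_segments(const char *const *alts, int nalts,
                                  const char *seg1, const char *seg2,
                                  char *work, size_t worksize)
{
    size_t start = 0;
    size_t len1 = strlen(seg1), len2 = strlen(seg2);
    int rc = toy_match(alts, nalts, seg1, len1, &start);

    if (rc == TOY_NOMATCH)                 /* nothing to carry over */
        return toy_match(alts, nalts, seg2, len2, &start);
    if (rc != TOY_PARTIAL)                 /* complete match in seg1 */
        return rc;

    size_t keep = len1 - start;            /* tail of seg1 to retain */
    if (keep + len2 + 1 > worksize)
        return TOY_NOSPACE;
    memcpy(work, seg1 + start, keep);
    memcpy(work + keep, seg2, len2 + 1);   /* also copies the final NUL */
    return toy_match(alts, nalts, work, keep + len2, &start);
}
```

With the alternatives { "1234", "3789" } and the segments "ABC123" and "7890", the first call reports a partial match at offset 3, the work buffer becomes "1237890", and the re-run finds the second alternative. Matching "7890" on its own finds nothing, which is why the retained tail matters.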
6400    
6401    
# Line 6291  AUTHOR Line 6408  AUTHOR
6408    
6409  REVISION  REVISION
6410    
6411         Last updated: 19 October 2009         Last updated: 22 October 2010
6412         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6413  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6414    
6415    
6416  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
6417    
6418    
# Line 6418  REVISION Line 6535  REVISION
6535         Last updated: 13 June 2007         Last updated: 13 June 2007
6536         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
6537  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6538    
6539    
6540  PCREPERFORM(3)                                                  PCREPERFORM(3)  PCREPERFORM(3)                                                  PCREPERFORM(3)
6541    
6542    
# Line 6586  REVISION Line 6703  REVISION
6703         Last updated: 16 May 2010         Last updated: 16 May 2010
6704         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6705  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6706    
6707    
6708  PCREPOSIX(3)                                                      PCREPOSIX(3)  PCREPOSIX(3)                                                      PCREPOSIX(3)
6709    
6710    
# Line 6849  REVISION Line 6966  REVISION
6966         Last updated: 16 May 2010         Last updated: 16 May 2010
6967         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6968  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6969    
6970    
6971  PCRECPP(3)                                                          PCRECPP(3)  PCRECPP(3)                                                          PCRECPP(3)
6972    
6973    
# Line 7190  REVISION Line 7307  REVISION
7307    
7308         Last updated: 17 March 2009         Last updated: 17 March 2009
7309  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7310    
7311    
7312  PCRESAMPLE(3)                                                    PCRESAMPLE(3)  PCRESAMPLE(3)                                                    PCRESAMPLE(3)
7313    
7314    
# Line 7426  REVISION Line 7543  REVISION
7543         Last updated: 03 January 2010         Last updated: 03 January 2010
7544         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
7545  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7546    
7547    
