Diff of /code/trunk/doc/pcre.txt

revision 548 by ph10, Fri Jun 25 14:42:00 2010 UTC revision 589 by ph10, Sat Jan 15 11:31:39 2011 UTC
# Line 26  INTRODUCTION Line 26  INTRODUCTION
26         give better JavaScript compatibility.         give better JavaScript compatibility.
27    
28         The  current implementation of PCRE corresponds approximately with Perl         The  current implementation of PCRE corresponds approximately with Perl
29         5.10/5.11, including support for UTF-8 encoded strings and Unicode gen-         5.12, including support for UTF-8 encoded strings and  Unicode  general
30         eral  category properties. However, UTF-8 and Unicode support has to be         category  properties.  However,  UTF-8  and  Unicode  support has to be
31         explicitly enabled; it is not the default. The  Unicode  tables  corre-         explicitly enabled; it is not the default. The  Unicode  tables  corre-
32         spond to Unicode release 5.2.0.         spond to Unicode release 5.2.0.
33    
# Line 226  UTF-8 AND UNICODE PROPERTY SUPPORT Line 226  UTF-8 AND UNICODE PROPERTY SUPPORT
226         PCRE  recognizes  as digits, spaces, or word characters remain the same         PCRE  recognizes  as digits, spaces, or word characters remain the same
227         set as before, all with values less than 256. This  remains  true  even         set as before, all with values less than 256. This  remains  true  even
228         when  PCRE  is built to include Unicode property support, because to do         when  PCRE  is built to include Unicode property support, because to do
229         otherwise would slow down PCRE in many common  cases.  Note  that  this         otherwise would slow down PCRE in many common cases. Note in particular
230         also applies to \b, because it is defined in terms of \w and \W. If you         that this applies to \b and \B, because they are defined in terms of \w
231         really want to test for a wider sense of, say,  "digit",  you  can  use         and \W. If you really want to test for a wider sense of, say,  "digit",
232         explicit  Unicode property tests such as \p{Nd}.  Alternatively, if you         you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna-
233         set the PCRE_UCP option, the way that the  character  escapes  work  is         tively, if you set the PCRE_UCP option,  the  way  that  the  character
234         changed  so that Unicode properties are used to determine which charac-         escapes  work  is changed so that Unicode properties are used to deter-
235         ters match. There are more details in the section on generic  character         mine which characters match. There are more details in the  section  on
236         types in the pcrepattern documentation.         generic character types in the pcrepattern documentation.
237    
238         7.  Similarly,  characters that match the POSIX named character classes         7.  Similarly,  characters that match the POSIX named character classes
239         are all low-valued characters, unless the PCRE_UCP option is set.         are all low-valued characters, unless the PCRE_UCP option is set.
240    
241         8. However, the Perl 5.10 horizontal and vertical  whitespace  matching         8. However, the horizontal and  vertical  whitespace  matching  escapes
242         escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-         (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
243         acters, whether or not PCRE_UCP is set.         whether or not PCRE_UCP is set.
244    
245         9. Case-insensitive matching applies only to  characters  whose  values         9. Case-insensitive matching applies only to  characters  whose  values
246         are  less than 128, unless PCRE is built with Unicode property support.         are  less than 128, unless PCRE is built with Unicode property support.
247         Even when Unicode property support is available, PCRE  still  uses  its         Even when Unicode property support is available, PCRE  still  uses  its
248         own  character  tables when checking the case of low-valued characters,         own  character  tables when checking the case of low-valued characters,
249         so as not to degrade performance.  The Unicode property information  is         so as not to degrade performance.  The Unicode property information  is
250         used only for characters with higher values. Even when Unicode property         used only for characters with higher values. Furthermore, PCRE supports
251         support is available, PCRE supports case-insensitive matching only when         case-insensitive matching only  when  there  is  a  one-to-one  mapping
252         there  is  a  one-to-one  mapping between a letter's cases. There are a         between  a letter's cases. There are a small number of many-to-one map-
253         small number of many-to-one mappings in Unicode;  these  are  not  sup-         pings in Unicode; these are not supported by PCRE.
        ported by PCRE.  
254    
255    
256  AUTHOR  AUTHOR
# Line 260  AUTHOR Line 259  AUTHOR
259         University Computing Service         University Computing Service
260         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
261    
262         Putting  an actual email address here seems to have been a spam magnet,         Putting an actual email address here seems to have been a spam  magnet,
263         so I've taken it away. If you want to email me, use  my  two  initials,         so  I've  taken  it away. If you want to email me, use my two initials,
264         followed by the two digits 10, at the domain cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
265    
266    
267  REVISION  REVISION
268    
269         Last updated: 12 May 2010         Last updated: 13 November 2010
270         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
271  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
272    
# Line 697  THE ALTERNATIVE MATCHING ALGORITHM Line 696  THE ALTERNATIVE MATCHING ALGORITHM
696         represent the different matching possibilities (if there are none,  the         represent the different matching possibilities (if there are none,  the
697         match  has  failed).   Thus,  if there is more than one possible match,         match  has  failed).   Thus,  if there is more than one possible match,
698         this algorithm finds all of them, and in particular, it finds the long-         this algorithm finds all of them, and in particular, it finds the long-
699         est.  There  is  an  option to stop the algorithm after the first match         est.  The  matches are returned in decreasing order of length. There is
700         (which is necessarily the shortest) is found.         an option to stop the algorithm after the first match (which is  neces-
701           sarily the shortest) is found.
702    
703         Note that all the matches that are found start at the same point in the         Note that all the matches that are found start at the same point in the
704         subject. If the pattern         subject. If the pattern
705    
706           cat(er(pillar)?)           cat(er(pillar)?)?
707    
708         is  matched  against the string "the caterpillar catchment", the result         is matched against the string "the caterpillar catchment",  the  result
709         will be the three strings "cat", "cater", and "caterpillar" that  start         will  be the three strings "caterpillar", "cater", and "cat" that start
710         at the fourth character of the subject. The algorithm does not automat-         at the fifth character of the subject. The algorithm does not automati-
711         ically move on to find matches that start at later positions.         cally move on to find matches that start at later positions.
712    
713         There are a number of features of PCRE regular expressions that are not         There are a number of features of PCRE regular expressions that are not
714         supported by the alternative matching algorithm. They are as follows:         supported by the alternative matching algorithm. They are as follows:
715    
716         1.  Because  the  algorithm  finds  all possible matches, the greedy or         1. Because the algorithm finds all  possible  matches,  the  greedy  or
717         ungreedy nature of repetition quantifiers is not relevant.  Greedy  and         ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
718         ungreedy quantifiers are treated in exactly the same way. However, pos-         ungreedy quantifiers are treated in exactly the same way. However, pos-
719         sessive quantifiers can make a difference when what follows could  also         sessive  quantifiers can make a difference when what follows could also
720         match what is quantified, for example in a pattern like this:         match what is quantified, for example in a pattern like this:
721    
722           ^a++\w!           ^a++\w!
723    
724         This  pattern matches "aaab!" but not "aaa!", which would be matched by         This pattern matches "aaab!" but not "aaa!", which would be matched  by
725         a non-possessive quantifier. Similarly, if an atomic group is  present,         a  non-possessive quantifier. Similarly, if an atomic group is present,
726         it  is matched as if it were a standalone pattern at the current point,         it is matched as if it were a standalone pattern at the current  point,
727         and the longest match is then "locked in" for the rest of  the  overall         and  the  longest match is then "locked in" for the rest of the overall
728         pattern.         pattern.
729    
730         2. When dealing with multiple paths through the tree simultaneously, it         2. When dealing with multiple paths through the tree simultaneously, it
731         is not straightforward to keep track of  captured  substrings  for  the         is  not  straightforward  to  keep track of captured substrings for the
732         different  matching  possibilities,  and  PCRE's implementation of this         different matching possibilities, and  PCRE's  implementation  of  this
733         algorithm does not attempt to do this. This means that no captured sub-         algorithm does not attempt to do this. This means that no captured sub-
734         strings are available.         strings are available.
735    
736         3.  Because no substrings are captured, back references within the pat-         3. Because no substrings are captured, back references within the  pat-
737         tern are not supported, and cause errors if encountered.         tern are not supported, and cause errors if encountered.
738    
739         4. For the same reason, conditional expressions that use  a  backrefer-         4.  For  the same reason, conditional expressions that use a backrefer-
740         ence  as  the  condition or test for a specific group recursion are not         ence as the condition or test for a specific group  recursion  are  not
741         supported.         supported.
742    
743         5. Because many paths through the tree may be  active,  the  \K  escape         5.  Because  many  paths  through the tree may be active, the \K escape
744         sequence, which resets the start of the match when encountered (but may         sequence, which resets the start of the match when encountered (but may
745         be on some paths and not on others), is not  supported.  It  causes  an         be  on  some  paths  and not on others), is not supported. It causes an
746         error if encountered.         error if encountered.
747    
748         6.  Callouts  are  supported, but the value of the capture_top field is         6. Callouts are supported, but the value of the  capture_top  field  is
749         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
750    
751         7. The \C escape sequence, which (in the standard algorithm) matches  a         7.  The \C escape sequence, which (in the standard algorithm) matches a
752         single  byte, even in UTF-8 mode, is not supported because the alterna-         single byte, even in UTF-8 mode, is not supported because the  alterna-
753         tive algorithm moves through the subject  string  one  character  at  a         tive  algorithm  moves  through  the  subject string one character at a
754         time, for all active paths through the tree.         time, for all active paths through the tree.
755    
756         8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
757         are not supported. (*FAIL) is supported, and  behaves  like  a  failing         are  not  supported.  (*FAIL)  is supported, and behaves like a failing
758         negative assertion.         negative assertion.
759    
760    
761  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
762    
763         Using  the alternative matching algorithm provides the following advan-         Using the alternative matching algorithm provides the following  advan-
764         tages:         tages:
765    
766         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
767         ically  found,  and  in particular, the longest match is found. To find         ically found, and in particular, the longest match is  found.  To  find
768         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
769         things with callouts.         things with callouts.
770    
771         2.  Because  the  alternative  algorithm  scans the subject string just         2. Because the alternative algorithm  scans  the  subject  string  just
772         once, and never needs to backtrack, it is possible to  pass  very  long         once,  and  never  needs to backtrack, it is possible to pass very long
773         subject  strings  to  the matching function in several pieces, checking         subject strings to the matching function in  several  pieces,  checking
774         for partial matching each time.  The  pcrepartial  documentation  gives         for  partial  matching  each time. Although it is possible to do multi-
775         details of partial matching.         segment matching using the standard algorithm (pcre_exec()), by retain-
776           ing  partially matched substrings, it is more complicated. The pcrepar-
777           tial documentation gives details  of  partial  matching  and  discusses
778           multi-segment matching.
779    
780    
781  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
# Line 798  AUTHOR Line 801  AUTHOR
801    
802  REVISION  REVISION
803    
804         Last updated: 29 September 2009         Last updated: 17 November 2010
805         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
806  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
807    
808    
# Line 1162  COMPILING A PATTERN Line 1165  COMPILING A PATTERN
1165         pcrepattern documentation). For those options that can be different  in         pcrepattern documentation). For those options that can be different  in
1166         different  parts  of  the pattern, the contents of the options argument         different  parts  of  the pattern, the contents of the options argument
1167         specifies their settings at the start of compilation and execution. The         specifies their settings at the start of compilation and execution. The
1168         PCRE_ANCHORED, PCRE_BSR_xxx, and PCRE_NEWLINE_xxx options can be set at         PCRE_ANCHORED,  PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
1169         the time of matching as well as at compile time.         PCRE_NO_START_OPT options can be set at the time of matching as well as
1170           at compile time.
1171    
1172         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1173         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
1174         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
1175         sage. This is a static string that is part of the library. You must not         sage. This is a static string that is part of the library. You must not
1176         try to free it. The byte offset from the start of the  pattern  to  the         try  to  free  it. The offset from the start of the pattern to the byte
1177         character  that  was  being  processed when the error was discovered is         that was being processed when the error was discovered is placed in the
1178         placed in the variable pointed to by erroffset, which must not be NULL.         variable  pointed to by erroffset, which must not be NULL. If it is, an
1179         If  it  is,  an  immediate error is given. Some errors are not detected         immediate error is given. Some errors are not detected until checks are
1180         until checks are carried out when the whole pattern has  been  scanned;         carried  out  when the whole pattern has been scanned; in this case the
1181         in this case the offset is set to the end of the pattern.         offset is set to the end of the pattern.
1182    
1183         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-         Note that the offset is in bytes, not characters, even in  UTF-8  mode.
1184         codeptr argument is not NULL, a non-zero error code number is  returned         It  may  point  into the middle of a UTF-8 character (for example, when
1185         via  this argument in the event of an error. This is in addition to the         PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
1186    
1187           If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
1188           codeptr  argument is not NULL, a non-zero error code number is returned
1189           via this argument in the event of an error. This is in addition to  the
1190         textual error message. Error codes and messages are listed below.         textual error message. Error codes and messages are listed below.
1191    
1192         If the final argument, tableptr, is NULL, PCRE uses a  default  set  of         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
1193         character  tables  that  are  built  when  PCRE  is compiled, using the         character tables that are  built  when  PCRE  is  compiled,  using  the
1194         default C locale. Otherwise, tableptr must be an address  that  is  the         default  C  locale.  Otherwise, tableptr must be an address that is the
1195         result  of  a  call to pcre_maketables(). This value is stored with the         result of a call to pcre_maketables(). This value is  stored  with  the
1196         compiled pattern, and used again by pcre_exec(), unless  another  table         compiled  pattern,  and used again by pcre_exec(), unless another table
1197         pointer is passed to it. For more discussion, see the section on locale         pointer is passed to it. For more discussion, see the section on locale
1198         support below.         support below.
1199    
1200         This code fragment shows a typical straightforward  call  to  pcre_com-         This  code  fragment  shows a typical straightforward call to pcre_com-
1201         pile():         pile():
1202    
1203           pcre *re;           pcre *re;
# Line 1202  COMPILING A PATTERN Line 1210  COMPILING A PATTERN
1210             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
1211             NULL);            /* use default character tables */             NULL);            /* use default character tables */
1212    
1213         The  following  names  for option bits are defined in the pcre.h header         The following names for option bits are defined in  the  pcre.h  header
1214         file:         file:
1215    
1216           PCRE_ANCHORED           PCRE_ANCHORED
1217    
1218         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
1219         is  constrained to match only at the first matching point in the string         is constrained to match only at the first matching point in the  string
1220         that is being searched (the "subject string"). This effect can also  be         that  is being searched (the "subject string"). This effect can also be
1221         achieved  by appropriate constructs in the pattern itself, which is the         achieved by appropriate constructs in the pattern itself, which is  the
1222         only way to do it in Perl.         only way to do it in Perl.
1223    
1224           PCRE_AUTO_CALLOUT           PCRE_AUTO_CALLOUT
1225    
1226         If this bit is set, pcre_compile() automatically inserts callout items,         If this bit is set, pcre_compile() automatically inserts callout items,
1227         all  with  number  255, before each pattern item. For discussion of the         all with number 255, before each pattern item. For  discussion  of  the
1228         callout facility, see the pcrecallout documentation.         callout facility, see the pcrecallout documentation.
1229    
1230           PCRE_BSR_ANYCRLF           PCRE_BSR_ANYCRLF
1231           PCRE_BSR_UNICODE           PCRE_BSR_UNICODE
1232    
1233         These options (which are mutually exclusive) control what the \R escape         These options (which are mutually exclusive) control what the \R escape
1234         sequence  matches.  The choice is either to match only CR, LF, or CRLF,         sequence matches. The choice is either to match only CR, LF,  or  CRLF,
1235         or to match any Unicode newline sequence. The default is specified when         or to match any Unicode newline sequence. The default is specified when
1236         PCRE is built. It can be overridden from within the pattern, or by set-         PCRE is built. It can be overridden from within the pattern, or by set-
1237         ting an option when a compiled pattern is matched.         ting an option when a compiled pattern is matched.
1238    
1239           PCRE_CASELESS           PCRE_CASELESS
1240    
1241         If this bit is set, letters in the pattern match both upper  and  lower         If  this  bit is set, letters in the pattern match both upper and lower
1242         case  letters.  It  is  equivalent  to  Perl's /i option, and it can be         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
1243         changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE         changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
1244         always  understands the concept of case for characters whose values are         always understands the concept of case for characters whose values  are
1245         less than 128, so caseless matching is always possible. For  characters         less  than 128, so caseless matching is always possible. For characters
1246         with  higher  values,  the concept of case is supported if PCRE is com-         with higher values, the concept of case is supported if  PCRE  is  com-
1247         piled with Unicode property support, but not otherwise. If you want  to         piled  with Unicode property support, but not otherwise. If you want to
1248         use  caseless  matching  for  characters 128 and above, you must ensure         use caseless matching for characters 128 and  above,  you  must  ensure
1249         that PCRE is compiled with Unicode property support  as  well  as  with         that  PCRE  is  compiled  with Unicode property support as well as with
1250         UTF-8 support.         UTF-8 support.
1251    
1252           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
1253    
1254         If  this bit is set, a dollar metacharacter in the pattern matches only         If this bit is set, a dollar metacharacter in the pattern matches  only
1255         at the end of the subject string. Without this option,  a  dollar  also         at  the  end  of the subject string. Without this option, a dollar also
1256         matches  immediately before a newline at the end of the string (but not         matches immediately before a newline at the end of the string (but  not
1257         before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored         before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
1258         if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in         if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
1259         Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
1260    
1261           PCRE_DOTALL           PCRE_DOTALL
1262    
1263         If this bit is set, a dot metacharater in the pattern matches all char-         If  this bit is set, a dot metacharacter in the pattern matches a char-
1264         acters,  including  those that indicate newline. Without it, a dot does         acter of any value, including one that indicates a newline. However, it
1265         not match when the current position is at a  newline.  This  option  is         only  ever  matches  one character, even if newlines are coded as CRLF.
1266         equivalent  to Perl's /s option, and it can be changed within a pattern         Without this option, a dot does not match when the current position  is
1267         by a (?s) option setting. A negative class such as [^a] always  matches         at a newline. This option is equivalent to Perl's /s option, and it can
1268         newline characters, independent of the setting of this option.         be changed within a pattern by a (?s) option setting. A negative  class
1269           such as [^a] always matches newline characters, independent of the set-
1270           ting of this option.
1271    
1272           PCRE_DUPNAMES           PCRE_DUPNAMES
1273    
1274         If  this  bit is set, names used to identify capturing subpatterns need         If this bit is set, names used to identify capturing  subpatterns  need
1275         not be unique. This can be helpful for certain types of pattern when it         not be unique. This can be helpful for certain types of pattern when it
1276         is  known  that  only  one instance of the named subpattern can ever be         is known that only one instance of the named  subpattern  can  ever  be
1277         matched. There are more details of named subpatterns  below;  see  also         matched.  There  are  more details of named subpatterns below; see also
1278         the pcrepattern documentation.         the pcrepattern documentation.
1279    
1280           PCRE_EXTENDED           PCRE_EXTENDED
1281    
1282         If  this  bit  is  set,  whitespace  data characters in the pattern are         If this bit is set, whitespace  data  characters  in  the  pattern  are
1283         totally ignored except when escaped or inside a character class. White-         totally ignored except when escaped or inside a character class. White-
1284         space does not include the VT character (code 11). In addition, charac-         space does not include the VT character (code 11). In addition, charac-
1285         ters between an unescaped # outside a character class and the next new-         ters between an unescaped # outside a character class and the next new-
1286         line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x
1287         option, and it can be changed within a pattern by a  (?x)  option  set-         option,  and  it  can be changed within a pattern by a (?x) option set-
1288         ting.         ting.
1289    
1290           Which characters are interpreted  as  newlines  is  controlled  by  the
1291           options  passed to pcre_compile() or by a special sequence at the start
1292           of the pattern, as described in the section entitled  "Newline  conven-
1293           tions" in the pcrepattern documentation. Note that the end of this type
1294           of comment is  a  literal  newline  sequence  in  the  pattern;  escape
1295           sequences that happen to represent a newline do not count.
1296    
1297         This  option  makes  it possible to include comments inside complicated         This  option  makes  it possible to include comments inside complicated
1298         patterns.  Note, however, that this applies only  to  data  characters.         patterns.  Note, however, that this applies only  to  data  characters.
1299         Whitespace   characters  may  never  appear  within  special  character         Whitespace   characters  may  never  appear  within  special  character
1300         sequences in a pattern, for  example  within  the  sequence  (?(  which         sequences in a pattern, for example within the sequence (?( that intro-
1301         introduces a conditional subpattern.         duces a conditional subpattern.
1302    
1303           PCRE_EXTRA           PCRE_EXTRA
1304    
# Line 1363  COMPILING A PATTERN Line 1380  COMPILING A PATTERN
1380         PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and         PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and
1381         cause an error.         cause an error.
1382    
1383         The  only time that a line break is specially recognized when compiling         The  only  time  that a line break in a pattern is specially recognized
1384         a pattern is if PCRE_EXTENDED is set, and  an  unescaped  #  outside  a         when compiling is when PCRE_EXTENDED is set. CR and LF  are  whitespace
1385         character  class  is  encountered.  This indicates a comment that lasts         characters,  and so are ignored in this mode. Also, an unescaped # out-
1386         until after the next line break sequence. In other circumstances,  line         side a character class indicates a comment that lasts until  after  the
1387         break   sequences   are   treated  as  literal  data,  except  that  in         next  line break sequence. In other circumstances, line break sequences
1388         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters         in patterns are treated as literal data.
        and are therefore ignored.  
1389    
1390         The newline option that is set at compile time becomes the default that         The newline option that is set at compile time becomes the default that
1391         is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.         is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
# Line 1377  COMPILING A PATTERN Line 1393  COMPILING A PATTERN
1393           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1394    
1395         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
1396         theses  in the pattern. Any opening parenthesis that is not followed by         theses in the pattern. Any opening parenthesis that is not followed  by
1397         ? behaves as if it were followed by ?: but named parentheses can  still         ?  behaves as if it were followed by ?: but named parentheses can still
1398         be  used  for  capturing  (and  they acquire numbers in the usual way).         be used for capturing (and they acquire  numbers  in  the  usual  way).
1399         There is no equivalent of this option in Perl.         There is no equivalent of this option in Perl.
1400    
1401             PCRE_NO_START_OPTIMIZE
1402    
1403           This  is an option that acts at matching time; that is, it is really an
1404           option for pcre_exec() or pcre_dfa_exec(). If  it  is  set  at  compile
1405           time,  it is remembered with the compiled pattern and assumed at match-
1406           ing time. For details  see  the  discussion  of  PCRE_NO_START_OPTIMIZE
1407           below.
1408    
1409           PCRE_UCP           PCRE_UCP
1410    
1411         This option changes the way PCRE processes \b, \d, \s, \w, and some  of         This  option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
1412         the POSIX character classes. By default, only ASCII characters are rec-         \w, and some of the POSIX character classes.  By  default,  only  ASCII
1413         ognized, but if PCRE_UCP is set, Unicode properties are used instead to         characters  are  recognized, but if PCRE_UCP is set, Unicode properties
1414         classify  characters.  More details are given in the section on generic         are used instead to classify characters. More details are given in  the
1415         character types in the pcrepattern page. If you set PCRE_UCP,  matching         section  on generic character types in the pcrepattern page. If you set
1416         one  of the items it affects takes much longer. The option is available         PCRE_UCP, matching one of the items it affects takes much  longer.  The
1417         only if PCRE has been compiled with Unicode property support.         option  is  available only if PCRE has been compiled with Unicode prop-
1418           erty support.
1419    
1420           PCRE_UNGREEDY           PCRE_UNGREEDY
1421    
# Line 1562  STUDYING A PATTERN Line 1587  STUDYING A PATTERN
1587         The two optimizations just described can be  disabled  by  setting  the         The two optimizations just described can be  disabled  by  setting  the
1588         PCRE_NO_START_OPTIMIZE    option    when    calling    pcre_exec()   or         PCRE_NO_START_OPTIMIZE    option    when    calling    pcre_exec()   or
1589         pcre_dfa_exec(). You might want to do this  if  your  pattern  contains         pcre_dfa_exec(). You might want to do this  if  your  pattern  contains
1590         callouts,  or  make  use of (*MARK), and you make use of these in cases         callouts  or  (*MARK),  and you want to make use of these facilities in
1591         where matching fails.  See  the  discussion  of  PCRE_NO_START_OPTIMIZE         cases where matching fails. See the discussion  of  PCRE_NO_START_OPTI-
1592         below.         MIZE below.
1593    
1594    
1595  LOCALE SUPPORT  LOCALE SUPPORT
# Line 2123  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2148  MATCHING A PATTERN: THE TRADITIONAL FUNC
2148         set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that         set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that
2149         fails, by advancing the starting offset (see below) and trying an ordi-         fails, by advancing the starting offset (see below) and trying an ordi-
2150         nary  match  again. There is some code that demonstrates how to do this         nary  match  again. There is some code that demonstrates how to do this
2151         in the pcredemo sample program.         in the pcredemo sample program. In the most general case, you  have  to
2152           check  to  see  if the newline convention recognizes CRLF as a newline,
2153           and if so, and the current character is CR followed by LF, advance  the
2154           starting offset by two characters instead of one.
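
The offset adjustment described above can be sketched as follows. This is only an illustration, not code from pcredemo: the function name and the crlf_is_newline flag are hypothetical, and the sketch assumes single-byte characters (in UTF-8 mode the one-character advance would have to step over a whole character).

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API. Call it after an empty
   match at `offset`, once the retry with PCRE_NOTEMPTY_ATSTART and
   PCRE_ANCHORED has failed. It returns the offset at which the next
   ordinary match should be attempted. `crlf_is_newline` is assumed to be
   non-zero when the newline convention in force recognizes CRLF. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
    if (offset >= length)
        return length;              /* already at the end of the subject */

    /* Step over CR LF as a unit when CRLF counts as a newline. */
    if (crlf_is_newline &&
        offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    return offset + 1;              /* otherwise advance by one character */
}
```

A caller would normally stop the search loop when the returned offset equals the subject length.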
2155    
2156           PCRE_NO_START_OPTIMIZE           PCRE_NO_START_OPTIMIZE
2157    
2158         There are a number of optimizations that pcre_exec() uses at the  start         There  are a number of optimizations that pcre_exec() uses at the start
2159         of  a  match,  in  order to speed up the process. For example, if it is         of a match, in order to speed up the process. For  example,  if  it  is
2160         known that an unanchored match must start with a specific character, it         known that an unanchored match must start with a specific character, it
2161         searches  the  subject  for that character, and fails immediately if it         searches the subject for that character, and fails  immediately  if  it
2162         cannot find it, without actually running the  main  matching  function.         cannot  find  it,  without actually running the main matching function.
2163         This means that a special item such as (*COMMIT) at the start of a pat-         This means that a special item such as (*COMMIT) at the start of a pat-
2164         tern is not considered until after a suitable starting  point  for  the         tern  is  not  considered until after a suitable starting point for the
2165         match  has been found. When callouts or (*MARK) items are in use, these         match has been found. When callouts or (*MARK) items are in use,  these
2166         "start-up" optimizations can cause them to be skipped if the pattern is         "start-up" optimizations can cause them to be skipped if the pattern is
2167         never  actually  used.  The start-up optimizations are in effect a pre-         never actually used. The start-up optimizations are in  effect  a  pre-
2168         scan of the subject that takes place before the pattern is run.         scan of the subject that takes place before the pattern is run.
2169    
2170         The PCRE_NO_START_OPTIMIZE option disables the start-up  optimizations,         The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
2171         possibly  causing  performance  to  suffer,  but ensuring that in cases         possibly causing performance to suffer,  but  ensuring  that  in  cases
2172         where the result is "no match", the callouts do occur, and  that  items         where  the  result is "no match", the callouts do occur, and that items
2173         such as (*COMMIT) and (*MARK) are considered at every possible starting         such as (*COMMIT) and (*MARK) are considered at every possible starting
2174         position in the subject  string.   Setting  PCRE_NO_START_OPTIMIZE  can         position  in  the  subject  string. If PCRE_NO_START_OPTIMIZE is set at
2175         change the outcome of a matching operation.  Consider the pattern         compile time, it cannot be unset at matching time.
2176    
2177           Setting PCRE_NO_START_OPTIMIZE can change the  outcome  of  a  matching
2178           operation.  Consider the pattern
2179    
2180           (*COMMIT)ABC           (*COMMIT)ABC
2181    
# Line 2179  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2210  MATCHING A PATTERN: THE TRADITIONAL FUNC
2210         points  to  the start of a UTF-8 character. There is a discussion about         points  to  the start of a UTF-8 character. There is a discussion about
2211         the validity of UTF-8 strings in the section on UTF-8  support  in  the         the validity of UTF-8 strings in the section on UTF-8  support  in  the
2212         main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,         main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
2213         pcre_exec() returns the error PCRE_ERROR_BADUTF8. If  startoffset  con-         pcre_exec() returns  the  error  PCRE_ERROR_BADUTF8  or,  if  PCRE_PAR-
2214         tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.         TIAL_HARD  is set and the problem is a truncated UTF-8 character at the
2215           end of the subject, PCRE_ERROR_SHORTUTF8.  If  startoffset  contains  a
2216         If  you  already  know that your subject is valid, and you want to skip         value  that does not point to the start of a UTF-8 character (or to the
2217         these   checks   for   performance   reasons,   you   can    set    the         end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
2218         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to  
2219         do this for the second and subsequent calls to pcre_exec() if  you  are         If you already know that your subject is valid, and you  want  to  skip
2220         making  repeated  calls  to  find  all  the matches in a single subject         these    checks    for   performance   reasons,   you   can   set   the
2221         string. However, you should be  sure  that  the  value  of  startoffset         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
2222         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is         do  this  for the second and subsequent calls to pcre_exec() if you are
2223         set, the effect of passing an invalid UTF-8 string as a subject,  or  a         making repeated calls to find all  the  matches  in  a  single  subject
2224         value  of startoffset that does not point to the start of a UTF-8 char-         string.  However,  you  should  be  sure  that the value of startoffset
2225         acter, is undefined. Your program may crash.         points to the start of a UTF-8 character (or the end of  the  subject).
2226           When  PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8
2227           string as a subject or an invalid value of  startoffset  is  undefined.
2228           Your program may crash.
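
When PCRE_NO_UTF8_CHECK is in use, a caller may want to verify startoffset itself. The following is a minimal sketch under the assumption that the subject is already known to be valid UTF-8; the function name is hypothetical and is not part of PCRE.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API. Returns 1 if `offset`
   points to the start of a UTF-8 character or to the end of the subject,
   and 0 otherwise. The subject is assumed to be valid UTF-8. */
int utf8_offset_is_char_start(const char *subject, size_t length,
                              size_t offset)
{
    if (offset > length)
        return 0;                   /* beyond the end of the subject */
    if (offset == length)
        return 1;                   /* the end of the subject is allowed */

    /* UTF-8 continuation bytes have the bit pattern 10xxxxxx; any other
       byte begins a character. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

For the four-byte subject "a\xC3\xA9z", offsets 0, 1, 3 and 4 are acceptable, whereas offset 2 falls inside the two-byte character.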
2229    
2230           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2231           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
2232    
2233         These options turn on the partial matching feature. For backwards  com-         These  options turn on the partial matching feature. For backwards com-
2234         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial         patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A  partial
2235         match occurs if the end of the subject string is reached  successfully,         match  occurs if the end of the subject string is reached successfully,
2236         but  there  are not enough subject characters to complete the match. If         but there are not enough subject characters to complete the  match.  If
2237         this happens when PCRE_PARTIAL_HARD  is  set,  pcre_exec()  immediately         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
2238         returns  PCRE_ERROR_PARTIAL.  Otherwise,  if  PCRE_PARTIAL_SOFT is set,         matching continues by testing any remaining alternatives.  Only  if  no
2239         matching continues by testing any other alternatives. Only if they  all         complete  match  can be found is PCRE_ERROR_PARTIAL returned instead of
2240         fail  is  PCRE_ERROR_PARTIAL  returned (instead of PCRE_ERROR_NOMATCH).         PCRE_ERROR_NOMATCH. In other words,  PCRE_PARTIAL_SOFT  says  that  the
2241         The portion of the string that was inspected when the partial match was         caller  is  prepared to handle a partial match, but only if no complete
2242         found  is  set  as  the first matching string. There is a more detailed         match can be found.
2243         discussion in the pcrepartial documentation.  
2244           If PCRE_PARTIAL_HARD is set, it overrides  PCRE_PARTIAL_SOFT.  In  this
2245           case,  if  a  partial  match  is found, pcre_exec() immediately returns
2246           PCRE_ERROR_PARTIAL, without  considering  any  other  alternatives.  In
2247           other  words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
2248           ered to be more important than an alternative complete match.
2249    
2250           In both cases, the portion of the string that was  inspected  when  the
2251           partial match was found is set as the first matching string. There is a
2252           more detailed discussion of partial and  multi-segment  matching,  with
2253           examples, in the pcrepartial documentation.
2254    
2255     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2256    
2257         The subject string is passed to pcre_exec() as a pointer in subject,  a         The  subject string is passed to pcre_exec() as a pointer in subject, a
2258         length (in bytes) in length, and a starting byte offset in startoffset.         length (in bytes) in length, and a starting byte offset in startoffset.
2259         In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-         If  this  is  negative  or  greater  than  the  length  of the subject,
2260         acter.  Unlike  the pattern string, the subject may contain binary zero         pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting  offset  is
2261         bytes. When the starting offset is zero, the search for a match  starts         zero,  the  search  for a match starts at the beginning of the subject,
2262         at  the  beginning  of  the subject, and this is by far the most common         and this is by far the most common case. In UTF-8 mode, the byte offset
2263         case.         must  point  to  the start of a UTF-8 character (or the end of the sub-
2264           ject). Unlike the pattern string, the subject may contain  binary  zero
2265         A non-zero starting offset is useful when searching for  another  match         bytes.
2266         in  the same subject by calling pcre_exec() again after a previous suc-  
2267         cess.  Setting startoffset differs from just passing over  a  shortened         A  non-zero  starting offset is useful when searching for another match
2268         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins         in the same subject by calling pcre_exec() again after a previous  suc-
2269           cess.   Setting  startoffset differs from just passing over a shortened
2270           string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
2271         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
2272    
2273           \Biss\B           \Biss\B
2274    
2275         which finds occurrences of "iss" in the middle of  words.  (\B  matches         which  finds  occurrences  of "iss" in the middle of words. (\B matches
2276         only  if  the  current position in the subject is not a word boundary.)         only if the current position in the subject is not  a  word  boundary.)
2277         When applied to the string "Mississippi" the first call to  pcre_exec()         When applied  to the string "Mississippi" the first call to pcre_exec()
2278         finds  the  first  occurrence. If pcre_exec() is called again with just         finds the first occurrence. If pcre_exec() is called  again  with  just
2279         the remainder of the subject, namely  "issippi",  it  does  not  match,         the remainder  of  the  subject,  namely  "issippi", it does not match,
2280         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
2281         to be a word boundary. However, if pcre_exec()  is  passed  the  entire         to  be  a  word  boundary. However, if pcre_exec() is passed the entire
2282         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
2283         rence of "iss" because it is able to look behind the starting point  to         rence  of "iss" because it is able to look behind the starting point to
2284         discover that it is preceded by a letter.         discover that it is preceded by a letter.
2285    
2286           Finding all the matches in a subject is tricky  when  the  pattern  can
2287           match an empty string. It is possible to emulate Perl's /g behaviour by
2288           first  trying  the  match  again  at  the   same   offset,   with   the
2289           PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED  options,  and  then  if that
2290           fails, advancing the starting  offset  and  trying  an  ordinary  match
2291           again. There is some code that demonstrates how to do this in the pcre-
2292           demo sample program. In the most general case, you have to check to see
2293           if  the newline convention recognizes CRLF as a newline, and if so, and
2294           the current character is CR followed by LF, advance the starting offset
2295           by two characters instead of one.
2296    
2297         If  a  non-zero starting offset is passed when the pattern is anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
2298         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
2299         if  the  pattern  does  not require the match to be at the start of the         if  the  pattern  does  not require the match to be at the start of the
# Line 2309  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2366  MATCHING A PATTERN: THE TRADITIONAL FUNC
2366         expression are also set to -1. For example,  if  the  string  "abc"  is         expression are also set to -1. For example,  if  the  string  "abc"  is
2367         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
2368         matched. The return from the function is 2, because  the  highest  used         matched. The return from the function is 2, because  the  highest  used
2369         capturing subpattern number is 1. However, you can refer to the offsets         capturing  subpattern  number  is 1,  and  the  offsets  for the second
2370         for the second and third capturing subpatterns if  you  wish  (assuming         and third capturing subpatterns (assuming the vector is  large  enough,
2371         the vector is large enough, of course).         of course) are set to -1.
2372    
2373           Note: Elements of ovector that do not correspond to capturing parenthe-
2374           ses in the pattern are never changed. That is, if a pattern contains  n
2375           capturing parentheses, no more than ovector[0] to ovector[2n+1] are set
2376           by pcre_exec(). The other elements retain whatever values  they  previ-
2377           ously had.
2378    
2379         Some  convenience  functions  are  provided for extracting the captured         Some  convenience  functions  are  provided for extracting the captured
2380         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
# Line 2381  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2444  MATCHING A PATTERN: THE TRADITIONAL FUNC
2444           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
2445    
2446         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A string that contains an invalid UTF-8 byte sequence was passed  as  a
2447         subject.         subject.   However,  if  PCRE_PARTIAL_HARD  is set and the problem is a
2448           truncated UTF-8 character at the end of the subject,  PCRE_ERROR_SHORT-
2449           UTF8 is used instead.
2450    
2451           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
2452    
2453         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
2454         value of startoffset did not point to the beginning of a UTF-8  charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
2455         ter.         ter or the end of the subject.
2456    
2457           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
2458    
# Line 2420  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2485  MATCHING A PATTERN: THE TRADITIONAL FUNC
2485    
2486         An invalid combination of PCRE_NEWLINE_xxx options was given.         An invalid combination of PCRE_NEWLINE_xxx options was given.
2487    
2488             PCRE_ERROR_BADOFFSET      (-24)
2489    
2490           The value of startoffset was negative or greater than the length of the
2491           subject, that is, the value in length.
2492    
2493             PCRE_ERROR_SHORTUTF8      (-25)
2494    
2495           The  subject  string ended with an incomplete (truncated) UTF-8 charac-
2496           ter, and the PCRE_PARTIAL_HARD option was  set.  Without  this  option,
2497           PCRE_ERROR_BADUTF8 is returned in this situation.
2498    
2499         Error numbers -16 to -20 and -22 are not used by pcre_exec().         Error numbers -16 to -20 and -22 are not used by pcre_exec().
2500    
2501    
# Line 2436  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2512  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2512         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2513              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2514    
2515         Captured substrings can be  accessed  directly  by  using  the  offsets         Captured  substrings  can  be  accessed  directly  by using the offsets
2516         returned  by  pcre_exec()  in  ovector.  For convenience, the functions         returned by pcre_exec() in  ovector.  For  convenience,  the  functions
2517         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2518         string_list()  are  provided for extracting captured substrings as new,         string_list() are provided for extracting captured substrings  as  new,
2519         separate, zero-terminated strings. These functions identify  substrings         separate,  zero-terminated strings. These functions identify substrings
2520         by  number.  The  next section describes functions for extracting named         by number. The next section describes functions  for  extracting  named
2521         substrings.         substrings.
2522    
2523         A substring that contains a binary zero is correctly extracted and  has         A  substring that contains a binary zero is correctly extracted and has
2524         a  further zero added on the end, but the result is not, of course, a C         a further zero added on the end, but the result is not, of course, a  C
2525         string.  However, you can process such a string  by  referring  to  the         string.   However,  you  can  process such a string by referring to the
2526         length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-         length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
2527         string().  Unfortunately, the interface to pcre_get_substring_list() is         string().  Unfortunately, the interface to pcre_get_substring_list() is
2528         not  adequate for handling strings containing binary zeros, because the         not adequate for handling strings containing binary zeros, because  the
2529         end of the final string is not independently indicated.         end of the final string is not independently indicated.
2530    
2531         The first three arguments are the same for all  three  of  these  func-         The  first  three  arguments  are the same for all three of these func-
2532         tions:  subject  is  the subject string that has just been successfully         tions: subject is the subject string that has  just  been  successfully
2533         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2534         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2535         were captured by the match, including the substring  that  matched  the         were  captured  by  the match, including the substring that matched the
2536         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
2537         it is greater than zero. If pcre_exec() returned zero, indicating  that         it  is greater than zero. If pcre_exec() returned zero, indicating that
2538         it  ran out of space in ovector, the value passed as stringcount should         it ran out of space in ovector, the value passed as stringcount  should
2539         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
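
The rule for choosing stringcount can be written as a small helper. This is a sketch only: the function name is hypothetical, rc stands for the value returned by pcre_exec(), and ovecsize for the number of elements in ovector.

```c
/* Hypothetical helper, not part of the PCRE API. Derives the stringcount
   argument for the substring extraction functions from the value returned
   by pcre_exec() (`rc`) and the number of elements in ovector (`ovecsize`). */
int substring_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;                  /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;        /* ovector was too small */
    return 0;                       /* negative rc: no match or an error */
}
```

With a 30-element vector, a return of zero from pcre_exec() therefore gives a stringcount of 10.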
2540    
2541         The functions pcre_copy_substring() and pcre_get_substring() extract  a         The  functions pcre_copy_substring() and pcre_get_substring() extract a
2542         single  substring,  whose  number  is given as stringnumber. A value of         single substring, whose number is given as  stringnumber.  A  value  of
2543         zero extracts the substring that matched the  entire  pattern,  whereas         zero  extracts  the  substring that matched the entire pattern, whereas
2544         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
2545         string(), the string is placed in buffer,  whose  length  is  given  by         string(),  the  string  is  placed  in buffer, whose length is given by
2546         buffersize,  while  for  pcre_get_substring()  a new block of memory is         buffersize, while for pcre_get_substring() a new  block  of  memory  is
2547         obtained via pcre_malloc, and its address is  returned  via  stringptr.         obtained  via  pcre_malloc,  and its address is returned via stringptr.
2548         The  yield  of  the function is the length of the string, not including         The yield of the function is the length of the  string,  not  including
2549         the terminating zero, or one of these error codes:         the terminating zero, or one of these error codes:
2550    
2551           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2552    
2553         The buffer was too small for pcre_copy_substring(), or the  attempt  to         The  buffer  was too small for pcre_copy_substring(), or the attempt to
2554         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2555    
2556           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2557    
2558         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
2559    
2560         The  pcre_get_substring_list()  function  extracts  all  available sub-         The pcre_get_substring_list()  function  extracts  all  available  sub-
2561         strings and builds a list of pointers to them. All this is  done  in  a         strings  and  builds  a list of pointers to them. All this is done in a
2562         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
2563         the memory block is returned via listptr, which is also  the  start  of         the  memory  block  is returned via listptr, which is also the start of
2564         the  list  of  string pointers. The end of the list is marked by a NULL         the list of string pointers. The end of the list is marked  by  a  NULL
2565         pointer. The yield of the function is zero if all  went  well,  or  the         pointer.  The  yield  of  the function is zero if all went well, or the
2566         error code         error code
2567    
2568           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2569    
2570         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
2571    
2572         When  any of these functions encounter a substring that is unset, which         When any of these functions encounter a substring that is unset,  which
2573         can happen when capturing subpattern number n+1 matches  some  part  of         can  happen  when  capturing subpattern number n+1 matches some part of
2574         the  subject, but subpattern n has not been used at all, they return an         the subject, but subpattern n has not been used at all, they return  an
2575         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
2576         string  by inspecting the appropriate offset in ovector, which is nega-         string by inspecting the appropriate offset in ovector, which is  nega-
2577         tive for unset substrings.         tive for unset substrings.
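
The distinction between an unset substring and a genuine zero-length one can be made by looking at ovector directly. The sketch below assumes the usual layout, in which substring n occupies ovector[2n] and ovector[2n+1]; the function name is hypothetical.

```c
/* Hypothetical helper, not part of the PCRE API. Returns the length of
   captured substring `n`, or -1 if the substring is unset or its number
   is out of range. `stringcount` is the value described above. */
int captured_length(const int *ovector, int stringcount, int n)
{
    if (n < 0 || n >= stringcount)
        return -1;                  /* no such substring */
    if (ovector[2 * n] < 0)
        return -1;                  /* unset: the offsets are negative */
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

A result of 0 indicates a substring that was set but is empty, whereas -1 indicates that the subpattern took no part in the match.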
2578    
2579         The two convenience functions pcre_free_substring() and  pcre_free_sub-         The  two convenience functions pcre_free_substring() and pcre_free_sub-
2580         string_list()  can  be  used  to free the memory returned by a previous         string_list() can be used to free the memory  returned  by  a  previous
2581         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2582         tively.  They  do  nothing  more  than  call the function pointed to by         tively. They do nothing more than  call  the  function  pointed  to  by
2583         pcre_free, which of course could be called directly from a  C  program.         pcre_free,  which  of course could be called directly from a C program.
2584         However,  PCRE is used in some situations where it is linked via a spe-         However, PCRE is used in some situations where it is linked via a  spe-
2585         cial  interface  to  another  programming  language  that  cannot   use         cial   interface  to  another  programming  language  that  cannot  use
2586         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free directly; it is for these cases that the functions  are  pro-
2587         vided.         vided.
2588    
2589    
# Line 2526  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2602  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2602              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2603              const char **stringptr);              const char **stringptr);
2604    
2605         To extract a substring by name, you first have to find the associated         To extract a substring by name, you first have to find the associated
2606         number.  For example, for this pattern         number.  For example, for this pattern
2607    
2608           (a+)b(?<xxx>\d+)...           (a+)b(?<xxx>\d+)...
# Line 2535  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2611  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2611         be unique (PCRE_DUPNAMES was not set), you can find the number from the         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2612         name by calling pcre_get_stringnumber(). The first argument is the com-         name by calling pcre_get_stringnumber(). The first argument is the com-
2613         piled pattern, and the second is the name. The yield of the function is         piled pattern, and the second is the name. The yield of the function is
2614         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no         the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no
2615         subpattern of that name.         subpattern of that name.
2616    
2617         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2618         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
2619         are also two functions that do the whole job.         are also two functions that do the whole job.
2620    
2621         Most   of   the   arguments    of    pcre_copy_named_substring()    and         Most    of    the    arguments   of   pcre_copy_named_substring()   and
2622         pcre_get_named_substring()  are  the  same  as  those for the similarly         pcre_get_named_substring() are the same  as  those  for  the  similarly
2623         named functions that extract by number. As these are described  in  the         named  functions  that extract by number. As these are described in the
2624         previous  section,  they  are not re-described here. There are just two         previous section, they are not re-described here. There  are  just  two
2625         differences:         differences:
2626    
2627         First, instead of a substring number, a substring name is  given.  Sec-         First,  instead  of a substring number, a substring name is given. Sec-
2628         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2629         to the compiled pattern. This is needed in order to gain access to  the         to  the compiled pattern. This is needed in order to gain access to the
2630         name-to-number translation table.         name-to-number translation table.
2631    
2632         These  functions call pcre_get_stringnumber(), and if it succeeds, they         These functions call pcre_get_stringnumber(), and if it succeeds,  they
2633         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
2634         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the         ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the
2635         behaviour may not be what you want (see the next section).         behaviour may not be what you want (see the next section).
2636    
2637         Warning: If the pattern uses the (?| feature to set up multiple subpat-         Warning: If the pattern uses the (?| feature to set up multiple subpat-
2638         terns  with  the  same number, as described in the section on duplicate         terns with the same number, as described in the  section  on  duplicate
2639         subpattern numbers in the pcrepattern page, you  cannot  use  names  to         subpattern  numbers  in  the  pcrepattern page, you cannot use names to
2640         distinguish  the  different subpatterns, because names are not included         distinguish the different subpatterns, because names are  not  included
2641         in the compiled code. The matching process uses only numbers. For  this         in  the compiled code. The matching process uses only numbers. For this
2642         reason,  the  use of different names for subpatterns of the same number         reason, the use of different names for subpatterns of the  same  number
2643         causes an error at compile time.         causes an error at compile time.
2644    
2645    
# Line 2572  DUPLICATE SUBPATTERN NAMES Line 2648  DUPLICATE SUBPATTERN NAMES
2648         int pcre_get_stringtable_entries(const pcre *code,         int pcre_get_stringtable_entries(const pcre *code,
2649              const char *name, char **first, char **last);              const char *name, char **first, char **last);
2650    
2651         When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for         When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
2652         subpatterns  are not required to be unique. (Duplicate names are always         subpatterns are not required to be unique. (Duplicate names are  always
2653         allowed for subpatterns with the same number, created by using the  (?|         allowed  for subpatterns with the same number, created by using the (?|
2654         feature.  Indeed,  if  such subpatterns are named, they are required to         feature. Indeed, if such subpatterns are named, they  are  required  to
2655         use the same names.)         use the same names.)
2656    
2657         Normally, patterns with duplicate names are such that in any one match,         Normally, patterns with duplicate names are such that in any one match,
2658         only  one of the named subpatterns participates. An example is shown in         only one of the named subpatterns participates. An example is shown  in
2659         the pcrepattern documentation.         the pcrepattern documentation.
2660    
2661         When   duplicates   are   present,   pcre_copy_named_substring()    and         When    duplicates   are   present,   pcre_copy_named_substring()   and
2662         pcre_get_named_substring()  return the first substring corresponding to         pcre_get_named_substring() return the first substring corresponding  to
2663         the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING         the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
2664         (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()         (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
2665         function returns one of the numbers that are associated with the  name,         function  returns one of the numbers that are associated with the name,
2666         but it is not defined which it is.         but it is not defined which it is.
2667    
2668         If  you want to get full details of all captured substrings for a given         If you want to get full details of all captured substrings for a  given
2669         name, you must use  the  pcre_get_stringtable_entries()  function.  The         name,  you  must  use  the pcre_get_stringtable_entries() function. The
2670         first argument is the compiled pattern, and the second is the name. The         first argument is the compiled pattern, and the second is the name. The
2671         third and fourth are pointers to variables which  are  updated  by  the         third  and  fourth  are  pointers to variables which are updated by the
2672         function. After it has run, they point to the first and last entries in         function. After it has run, they point to the first and last entries in
2673         the name-to-number table  for  the  given  name.  The  function  itself         the  name-to-number  table  for  the  given  name.  The function itself
2674         returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if         returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
2675         there are none. The format of the table is described above in the  sec-         there  are none. The format of the table is described above in the sec-
2676         tion  entitled  Information  about  a  pattern.  Given all the relevant         tion entitled Information about a  pattern.   Given  all  the  relevant
2677         entries for the name, you can extract each of their numbers, and  hence         entries  for the name, you can extract each of their numbers, and hence
2678         the captured data, if any.         the captured data, if any.
2679    
2680    
2681  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2682    
2683         The  traditional  matching  function  uses a similar algorithm to Perl,         The traditional matching function uses a  similar  algorithm  to  Perl,
2684         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2685         the  subject.  If you want to find all possible matches, or the longest         the subject. If you want to find all possible matches, or  the  longest
2686         possible match, consider using the alternative matching  function  (see         possible  match,  consider using the alternative matching function (see
2687         below)  instead.  If you cannot use the alternative function, but still         below) instead. If you cannot use the alternative function,  but  still
2688         need to find all possible matches, you can kludge it up by  making  use         need  to  find all possible matches, you can kludge it up by making use
2689         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2690         tation.         tation.
2691    
2692         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2693         tern.   When your callout function is called, extract and save the cur-         tern.  When your callout function is called, extract and save the  cur-
2694         rent matched substring. Then return  1,  which  forces  pcre_exec()  to         rent  matched  substring.  Then  return  1, which forces pcre_exec() to
2695         backtrack  and  try other alternatives. Ultimately, when it runs out of         backtrack and try other alternatives. Ultimately, when it runs  out  of
2696         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2697    
2698    
# Line 2627  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2703  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2703              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2704              int *workspace, int wscount);              int *workspace, int wscount);
2705    
2706         The function pcre_dfa_exec()  is  called  to  match  a  subject  string         The  function  pcre_dfa_exec()  is  called  to  match  a subject string
2707         against  a  compiled pattern, using a matching algorithm that scans the         against a compiled pattern, using a matching algorithm that  scans  the
2708         subject string just once, and does not backtrack.  This  has  different         subject  string  just  once, and does not backtrack. This has different
2709         characteristics  to  the  normal  algorithm, and is not compatible with         characteristics to the normal algorithm, and  is  not  compatible  with
2710         Perl. Some of the features of PCRE patterns are not  supported.  Never-         Perl.  Some  of the features of PCRE patterns are not supported. Never-
2711         theless,  there are times when this kind of matching can be useful. For         theless, there are times when this kind of matching can be useful.  For
2712         a discussion of the two matching algorithms, and  a  list  of  features         a  discussion  of  the  two matching algorithms, and a list of features
2713         that  pcre_dfa_exec() does not support, see the pcrematching documenta-         that pcre_dfa_exec() does not support, see the pcrematching  documenta-
2714         tion.         tion.
2715    
2716         The arguments for the pcre_dfa_exec() function  are  the  same  as  for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2717         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
2718         ent way, and this is described below. The other  common  arguments  are         ent  way,  and  this is described below. The other common arguments are
2719         used  in  the  same way as for pcre_exec(), so their description is not         used in the same way as for pcre_exec(), so their  description  is  not
2720         repeated here.         repeated here.
2721    
2722         The two additional arguments provide workspace for  the  function.  The         The  two  additional  arguments provide workspace for the function. The
2723         workspace  vector  should  contain at least 20 elements. It is used for         workspace vector should contain at least 20 elements. It  is  used  for
2724         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2725         workspace  will  be  needed for patterns and subjects where there are a         workspace will be needed for patterns and subjects where  there  are  a
2726         lot of potential matches.         lot of potential matches.
2727    
2728         Here is an example of a simple call to pcre_dfa_exec():         Here is an example of a simple call to pcre_dfa_exec():
# Line 2668  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2744  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2744    
2745     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2746    
2747         The unused bits of the options argument  for  pcre_dfa_exec()  must  be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2748         zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2749         LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,         LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
2750         PCRE_NOTEMPTY_ATSTART,       PCRE_NO_UTF8_CHECK,      PCRE_BSR_ANYCRLF,         PCRE_NOTEMPTY_ATSTART,      PCRE_NO_UTF8_CHECK,       PCRE_BSR_ANYCRLF,
2751         PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD,  PCRE_PAR-         PCRE_BSR_UNICODE,  PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
2752         TIAL_SOFT,  PCRE_DFA_SHORTEST,  and PCRE_DFA_RESTART.  All but the last         TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART.  All but  the  last
2753         four of these are  exactly  the  same  as  for  pcre_exec(),  so  their         four  of  these  are  exactly  the  same  as  for pcre_exec(), so their
2754         description is not repeated here.         description is not repeated here.
2755    
2756           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2757           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
2758    
2759         These  have the same general effect as they do for pcre_exec(), but the         These have the same general effect as they do for pcre_exec(), but  the
2760         details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for         details  are  slightly  different.  When  PCRE_PARTIAL_HARD  is set for
2761         pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-         pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of  the  sub-
2762         ject is reached and there is still at least  one  matching  possibility         ject  is  reached  and there is still at least one matching possibility
2763         that requires additional characters. This happens even if some complete         that requires additional characters. This happens even if some complete
2764         matches have also been found. When PCRE_PARTIAL_SOFT is set, the return         matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
2765         code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end         code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
2766         of the subject is reached, there have been  no  complete  matches,  but         of  the  subject  is  reached, there have been no complete matches, but
2767         there  is  still  at least one matching possibility. The portion of the         there is still at least one matching possibility. The  portion  of  the
2768         string that was inspected when the longest partial match was  found  is         string  that  was inspected when the longest partial match was found is
2769         set as the first matching string in both cases.         set as the first matching string  in  both  cases.   There  is  a  more
2770           detailed  discussion  of partial and multi-segment matching, with exam-
2771           ples, in the pcrepartial documentation.
2772    
2773           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2774    
2775         Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2776         stop as soon as it has found one match. Because of the way the alterna-         stop as soon as it has found one match. Because of the way the alterna-
2777         tive  algorithm  works, this is necessarily the shortest possible match         tive algorithm works, this is necessarily the shortest  possible  match
2778         at the first possible matching point in the subject string.         at the first possible matching point in the subject string.
2779    
2780           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2781    
2782         When pcre_dfa_exec() returns a partial match, it is possible to call it         When pcre_dfa_exec() returns a partial match, it is possible to call it
2783         again,  with  additional  subject characters, and have it continue with         again, with additional subject characters, and have  it  continue  with
2784         the same match. The PCRE_DFA_RESTART option requests this action;  when         the  same match. The PCRE_DFA_RESTART option requests this action; when
2785         it  is set, the workspace and wscount arguments must reference the same         it is set, the workspace and wscount arguments must reference the  same
2786         vector as before because data about the match so far is  left  in  them         vector  as  before  because data about the match so far is left in them
2787         after a partial match. There is more discussion of this facility in the         after a partial match. There is more discussion of this facility in the
2788         pcrepartial documentation.         pcrepartial documentation.
2789    
2790     Successful returns from pcre_dfa_exec()     Successful returns from pcre_dfa_exec()
2791    
2792         When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-         When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
2793         string in the subject. Note, however, that all the matches from one run         string in the subject. Note, however, that all the matches from one run
2794         of the function start at the same point in  the  subject.  The  shorter         of  the  function  start  at the same point in the subject. The shorter
2795         matches  are all initial substrings of the longer matches. For example,         matches are all initial substrings of the longer matches. For  example,
2796         if the pattern         if the pattern
2797    
2798           <.*>           <.*>
# Line 2729  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2807  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2807           <something> <something else>           <something> <something else>
2808           <something> <something else> <something further>           <something> <something else> <something further>
2809    
2810         On success, the yield of the function is a number  greater  than  zero,         On  success,  the  yield of the function is a number greater than zero,
2811         which  is  the  number of matched substrings. The substrings themselves         which is the number of matched substrings.  The  substrings  themselves
2812         are returned in ovector. Each string uses two elements;  the  first  is         are  returned  in  ovector. Each string uses two elements; the first is
2813         the  offset  to  the start, and the second is the offset to the end. In         the offset to the start, and the second is the offset to  the  end.  In
2814         fact, all the strings have the same start  offset.  (Space  could  have         fact,  all  the  strings  have the same start offset. (Space could have
2815         been  saved by giving this only once, but it was decided to retain some         been saved by giving this only once, but it was decided to retain  some
2816         compatibility with the way pcre_exec() returns data,  even  though  the         compatibility  with  the  way pcre_exec() returns data, even though the
2817         meaning of the strings is different.)         meaning of the strings is different.)
2818    
2819         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
2820         est matching string is given first. If there were too many  matches  to         est  matching  string is given first. If there were too many matches to
2821         fit  into ovector, the yield of the function is zero, and the vector is         fit into ovector, the yield of the function is zero, and the vector  is
2822         filled with the longest matches.         filled with the longest matches.
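
The layout just described can be read back with a few lines of C. The sketch below is only an illustration: dfa_match_count() and copy_dfa_match() are hypothetical helpers, not PCRE functions, and they assume that ovector was filled in by a call that returned zero or more.

```c
#include <assert.h>
#include <string.h>

/* Number of offset pairs that hold matches after pcre_dfa_exec() has
   returned rc. A yield of zero means that ovector was too small and has
   been filled with the longest matches, so every whole pair is in use. */
static int dfa_match_count(int rc, int ovecsize)
{
    if (rc < 0)
        return 0;                       /* error or no match */
    return (rc == 0) ? ovecsize / 2 : rc;
}

/* Copy match number index (0 is the longest) into buffer as a
   zero-terminated string. Returns its length, or -1 if it does not fit. */
static int copy_dfa_match(const char *subject, const int *ovector,
                          int index, char *buffer, int size)
{
    int start = ovector[2 * index];       /* same for every match */
    int end = ovector[2 * index + 1];
    int length = end - start;

    if (length < 0 || length >= size)
        return -1;
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

Looping index from 0 up to one less than the count visits the matches from longest to shortest, which is the order in which they are stored.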
2823    
2824     Error returns from pcre_dfa_exec()     Error returns from pcre_dfa_exec()
2825    
2826         The pcre_dfa_exec() function returns a negative number when  it  fails.         The  pcre_dfa_exec()  function returns a negative number when it fails.
2827         Many  of  the  errors  are  the  same as for pcre_exec(), and these are         Many of the errors are the same  as  for  pcre_exec(),  and  these  are
2828         described above.  There are in addition the following errors  that  are         described  above.   There are in addition the following errors that are
2829         specific to pcre_dfa_exec():         specific to pcre_dfa_exec():
2830    
2831           PCRE_ERROR_DFA_UITEM      (-16)           PCRE_ERROR_DFA_UITEM      (-16)
2832    
2833         This  return is given if pcre_dfa_exec() encounters an item in the pat-         This return is given if pcre_dfa_exec() encounters an item in the  pat-
2834         tern that it does not support, for instance, the use of \C  or  a  back         tern  that  it  does not support, for instance, the use of \C or a back
2835         reference.         reference.
2836    
2837           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
2838    
2839         This  return  is  given  if pcre_dfa_exec() encounters a condition item         This return is given if pcre_dfa_exec()  encounters  a  condition  item
2840         that uses a back reference for the condition, or a test  for  recursion         that  uses  a back reference for the condition, or a test for recursion
2841         in a specific group. These are not supported.         in a specific group. These are not supported.
2842    
2843           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
2844    
2845         This  return  is given if pcre_dfa_exec() is called with an extra block         This return is given if pcre_dfa_exec() is called with an  extra  block
2846         that contains a setting of the match_limit field. This is not supported         that contains a setting of the match_limit field. This is not supported
2847         (it is meaningless).         (it is meaningless).
2848    
2849           PCRE_ERROR_DFA_WSSIZE     (-19)           PCRE_ERROR_DFA_WSSIZE     (-19)
2850    
2851         This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the         This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
2852         workspace vector.         workspace vector.
2853    
2854           PCRE_ERROR_DFA_RECURSE    (-20)           PCRE_ERROR_DFA_RECURSE    (-20)
2855    
2856         When a recursive subpattern is processed, the matching  function  calls         When  a  recursive subpattern is processed, the matching function calls
2857         itself  recursively,  using  private vectors for ovector and workspace.         itself recursively, using private vectors for  ovector  and  workspace.
2858         This error is given if the output vector  is  not  large  enough.  This         This  error  is  given  if  the output vector is not large enough. This
2859         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
2860    
2861    
2862  SEE ALSO  SEE ALSO
2863    
2864         pcrebuild(3),  pcrecallout(3),  pcrecpp(3),  pcrematching(3),  pcrepar-         pcrebuild(3),  pcrecallout(3),  pcrecpp(3),  pcrematching(3),  pcrepar-
2865         tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).         tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
2866    
2867    
# Line 2796  AUTHOR Line 2874  AUTHOR
2874    
2875  REVISION  REVISION
2876    
2877         Last updated: 21 June 2010         Last updated: 21 November 2010
2878         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
2879  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2880    
# Line 2864  MISSING CALLOUTS Line 2942  MISSING CALLOUTS
2942         patterns, if it has been scanned far enough.         patterns, if it has been scanned far enough.
2943    
2944         You  can disable these optimizations by passing the PCRE_NO_START_OPTI-         You  can disable these optimizations by passing the PCRE_NO_START_OPTI-
2945         MIZE option to pcre_exec() or  pcre_dfa_exec().  This  slows  down  the         MIZE option to pcre_compile(), pcre_exec(), or pcre_dfa_exec(),  or  by
2946         matching  process,  but  does  ensure that callouts such as the example         starting the pattern with (*NO_START_OPT). This slows down the matching
2947         above are obeyed.         process, but does ensure that callouts such as the  example  above  are
2948           obeyed.
2949    
2950    
2951  THE CALLOUT INTERFACE  THE CALLOUT INTERFACE
2952    
2953         During matching, when PCRE reaches a callout point, the external  func-         During  matching, when PCRE reaches a callout point, the external func-
2954         tion  defined by pcre_callout is called (if it is set). This applies to         tion defined by pcre_callout is called (if it is set). This applies  to
2955         both the pcre_exec() and the pcre_dfa_exec()  matching  functions.  The         both  the  pcre_exec()  and the pcre_dfa_exec() matching functions. The
2956         only  argument  to  the callout function is a pointer to a pcre_callout         only argument to the callout function is a pointer  to  a  pcre_callout
2957         block. This structure contains the following fields:         block. This structure contains the following fields:
2958    
2959           int          version;           int          version;
# Line 2890  THE CALLOUT INTERFACE Line 2969  THE CALLOUT INTERFACE
2969           int          pattern_position;           int          pattern_position;
2970           int          next_item_length;           int          next_item_length;
2971    
2972         The version field is an integer containing the version  number  of  the         The  version  field  is an integer containing the version number of the
2973         block  format. The initial version was 0; the current version is 1. The         block format. The initial version was 0; the current version is 1.  The
2974         version number will change again in future  if  additional  fields  are         version  number  will  change  again in future if additional fields are
2975         added, but the intention is never to remove any of the existing fields.         added, but the intention is never to remove any of the existing fields.
2976    
2977         The  callout_number  field  contains the number of the callout, as com-         The callout_number field contains the number of the  callout,  as  com-
2978         piled into the pattern (that is, the number after ?C for  manual  call-         piled  into  the pattern (that is, the number after ?C for manual call-
2979         outs, and 255 for automatically generated callouts).         outs, and 255 for automatically generated callouts).
2980    
2981         The  offset_vector field is a pointer to the vector of offsets that was         The offset_vector field is a pointer to the vector of offsets that  was
2982         passed  by  the  caller  to  pcre_exec()   or   pcre_dfa_exec().   When         passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When
2983         pcre_exec()  is used, the contents can be inspected in order to extract         pcre_exec() is used, the contents can be inspected in order to  extract
2984         substrings that have been matched so  far,  in  the  same  way  as  for         substrings  that  have  been  matched  so  far,  in the same way as for
2985         extracting  substrings after a match has completed. For pcre_dfa_exec()         extracting substrings after a match has completed. For  pcre_dfa_exec()
2986         this field is not useful.         this field is not useful.
2987    
2988         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
2989         were passed to pcre_exec().         were passed to pcre_exec().
2990    
2991         The  start_match  field normally contains the offset within the subject         The start_match field normally contains the offset within  the  subject
2992         at which the current match attempt  started.  However,  if  the  escape         at  which  the  current  match  attempt started. However, if the escape
2993         sequence  \K has been encountered, this value is changed to reflect the         sequence \K has been encountered, this value is changed to reflect  the
2994         modified starting point. If the pattern is not  anchored,  the  callout         modified  starting  point.  If the pattern is not anchored, the callout
2995         function may be called several times from the same point in the pattern         function may be called several times from the same point in the pattern
2996         for different starting points in the subject.         for different starting points in the subject.
2997    
2998         The current_position field contains the offset within  the  subject  of         The  current_position  field  contains the offset within the subject of
2999         the current match pointer.         the current match pointer.
3000    
3001         When  the  pcre_exec() function is used, the capture_top field contains         When the pcre_exec() function is used, the capture_top  field  contains
3002         one more than the number of the highest numbered captured substring  so         one  more than the number of the highest numbered captured substring so
3003         far.  If  no substrings have been captured, the value of capture_top is         far. If no substrings have been captured, the value of  capture_top  is
3004         one. This is always the case when pcre_dfa_exec() is used,  because  it         one.  This  is always the case when pcre_dfa_exec() is used, because it
3005         does not support captured substrings.         does not support captured substrings.
3006    
3007         The  capture_last  field  contains the number of the most recently cap-         The capture_last field contains the number of the  most  recently  cap-
3008         tured substring. If no substrings have been captured, its value is  -1.         tured  substring. If no substrings have been captured, its value is -1.
3009         This is always the case when pcre_dfa_exec() is used.         This is always the case when pcre_dfa_exec() is used.
3010    
3011         The  callout_data  field contains a value that is passed to pcre_exec()         The callout_data field contains a value that is passed  to  pcre_exec()
3012         or pcre_dfa_exec() specifically so that it can be passed back in  call-         or  pcre_dfa_exec() specifically so that it can be passed back in call-
3013         outs.  It  is  passed  in the callout_data field of the pcre_extra data         outs. It is passed in the callout_data field  of  the  pcre_extra  data
3014         structure. If no such data was passed, the value of callout_data  in  a         structure.  If  no such data was passed, the value of callout_data in a
3015         pcre_callout  block  is  NULL. There is a description of the pcre_extra         pcre_callout block is NULL. There is a description  of  the  pcre_extra
3016         structure in the pcreapi documentation.         structure in the pcreapi documentation.
3017    
3018         The pattern_position field is present from version 1 of the  pcre_call-         The  pattern_position field is present from version 1 of the pcre_call-
3019         out structure. It contains the offset to the next item to be matched in         out structure. It contains the offset to the next item to be matched in
3020         the pattern string.         the pattern string.
3021    
3022         The next_item_length field is present from version 1 of the  pcre_call-         The  next_item_length field is present from version 1 of the pcre_call-
3023         out structure. It contains the length of the next item to be matched in         out structure. It contains the length of the next item to be matched in
3024         the pattern string. When the callout immediately precedes  an  alterna-         the  pattern  string. When the callout immediately precedes an alterna-
3025         tion  bar, a closing parenthesis, or the end of the pattern, the length         tion bar, a closing parenthesis, or the end of the pattern, the  length
3026         is zero. When the callout precedes an opening parenthesis,  the  length         is  zero.  When the callout precedes an opening parenthesis, the length
3027         is that of the entire subpattern.         is that of the entire subpattern.
3028    
3029         The  pattern_position  and next_item_length fields are intended to help         The pattern_position and next_item_length fields are intended  to  help
3030         in distinguishing between different automatic callouts, which all  have         in  distinguishing between different automatic callouts, which all have
3031         the same callout number. However, they are set for all callouts.         the same callout number. However, they are set for all callouts.
3032    
3033    
3034  RETURN VALUES  RETURN VALUES
3035    
3036         The  external callout function returns an integer to PCRE. If the value         The external callout function returns an integer to PCRE. If the  value
3037         is zero, matching proceeds as normal. If  the  value  is  greater  than         is  zero,  matching  proceeds  as  normal. If the value is greater than
3038         zero,  matching  fails  at  the current point, but the testing of other         zero, matching fails at the current point, but  the  testing  of  other
3039         matching possibilities goes ahead, just as if a lookahead assertion had         matching possibilities goes ahead, just as if a lookahead assertion had
3040         failed.  If  the  value  is less than zero, the match is abandoned, and         failed. If the value is less than zero, the  match  is  abandoned,  and
3041         pcre_exec() or pcre_dfa_exec() returns the negative value.         pcre_exec() or pcre_dfa_exec() returns the negative value.
3042    
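The three cases for the callout return value described above can be summarized in a small sketch. The enum and function names here are hypothetical and are not part of the PCRE API; the code only restates the rule (zero, positive, negative) and does not call PCRE.

```c
/* Hypothetical names, for illustration only; not part of the PCRE API. */
enum callout_outcome {
    CALLOUT_CONTINUE,   /* zero: matching proceeds as normal */
    CALLOUT_BACKTRACK,  /* greater than zero: fail at the current point,
                           but keep testing other matching possibilities */
    CALLOUT_ABANDON     /* less than zero: the match is abandoned and the
                           negative value is returned by pcre_exec() or
                           pcre_dfa_exec() */
};

/* Classify the integer that an external callout function returned. */
static enum callout_outcome classify_callout_return(int rc)
{
    if (rc == 0)
        return CALLOUT_CONTINUE;
    if (rc > 0)
        return CALLOUT_BACKTRACK;
    return CALLOUT_ABANDON;
}
```

A callout that wants an ordinary "no match" result would therefore return a negative PCRE_ERROR_xxx value such as PCRE_ERROR_NOMATCH, while a positive value only rejects the current path.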
3043         Negative  values  should  normally  be   chosen   from   the   set   of         Negative   values   should   normally   be   chosen  from  the  set  of
3044         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
3045         dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is         dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
3046         reserved  for  use  by callout functions; it will never be used by PCRE         reserved for use by callout functions; it will never be  used  by  PCRE
3047         itself.         itself.
3048    
3049    
# Line 2977  AUTHOR Line 3056  AUTHOR
3056    
3057  REVISION  REVISION
3058    
3059         Last updated: 29 September 2009         Last updated: 21 November 2010
3060         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
3061  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3062    
3063    
# Line 2993  DIFFERENCES BETWEEN PCRE AND PERL Line 3072  DIFFERENCES BETWEEN PCRE AND PERL
3072    
3073         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
3074         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
3075         respect to Perl 5.10/5.11.         respect to Perl versions 5.10 and above.
3076    
3077         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
3078         of what it does have are given in the section on UTF-8 support  in  the         of what it does have are given in the section on UTF-8 support  in  the
# Line 3075  DIFFERENCES BETWEEN PCRE AND PERL Line 3154  DIFFERENCES BETWEEN PCRE AND PERL
3154         turing subpattern number 1. To avoid this confusing situation, an error         turing subpattern number 1. To avoid this confusing situation, an error
3155         is given at compile time.         is given at compile time.
3156    
3157         12. PCRE provides some extensions to the Perl regular expression facil-         12. Perl recognizes comments in some  places  that  PCRE  doesn't,  for
3158         ities.   Perl  5.10  includes new features that are not in earlier ver-         example, between the ( and ? at the start of a subpattern.
3159         sions of Perl, some of which (such as named parentheses) have  been  in  
3160           13. PCRE provides some extensions to the Perl regular expression facil-
3161           ities.  Perl 5.10 includes new features that are not  in  earlier  ver-
3162           sions  of  Perl, some of which (such as named parentheses) have been in
3163         PCRE for some time. This list is with respect to Perl 5.10:         PCRE for some time. This list is with respect to Perl 5.10:
3164    
3165         (a)  Although  lookbehind  assertions  in  PCRE must match fixed length         (a) Although lookbehind assertions in  PCRE  must  match  fixed  length
3166         strings, each alternative branch of a lookbehind assertion can match  a         strings,  each alternative branch of a lookbehind assertion can match a
3167         different  length  of  string.  Perl requires them all to have the same         different length of string. Perl requires them all  to  have  the  same
3168         length.         length.
3169    
3170         (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $         (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
3171         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
3172    
3173         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
3174         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
3175         ignored.  (Perl can be made to issue a warning.)         ignored.  (Perl can be made to issue a warning.)
3176    
3177         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
3178         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
3179         lowed by a question mark they are.         lowed by a question mark they are.
3180    
# Line 3100  DIFFERENCES BETWEEN PCRE AND PERL Line 3182  DIFFERENCES BETWEEN PCRE AND PERL
3182         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
3183    
3184         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
3185         and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-         and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no  Perl  equiva-
3186         lents.         lents.
3187    
3188         (g) The \R escape sequence can be restricted to match only CR,  LF,  or         (g)  The  \R escape sequence can be restricted to match only CR, LF, or
3189         CRLF by the PCRE_BSR_ANYCRLF option.         CRLF by the PCRE_BSR_ANYCRLF option.
3190    
3191         (h) The callout facility is PCRE-specific.         (h) The callout facility is PCRE-specific.
# Line 3113  DIFFERENCES BETWEEN PCRE AND PERL Line 3195  DIFFERENCES BETWEEN PCRE AND PERL
3195         (j) Patterns compiled by PCRE can be saved and re-used at a later time,         (j) Patterns compiled by PCRE can be saved and re-used at a later time,
3196         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
3197    
3198         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a         (k)  The  alternative  matching function (pcre_dfa_exec()) matches in a
3199         different way and is not Perl-compatible.         different way and is not Perl-compatible.
3200    
3201         (l)  PCRE  recognizes some special sequences such as (*CR) at the start         (l) PCRE recognizes some special sequences such as (*CR) at  the  start
3202         of a pattern that set overall options that cannot be changed within the         of a pattern that set overall options that cannot be changed within the
3203         pattern.         pattern.
3204    
# Line 3130  AUTHOR Line 3212  AUTHOR
3212    
3213  REVISION  REVISION
3214    
3215         Last updated: 12 May 2010         Last updated: 31 October 2010
3216         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
3217  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3218    
# Line 3183  PCRE REGULAR EXPRESSION DETAILS Line 3265  PCRE REGULAR EXPRESSION DETAILS
3265         character types, instead of recognizing only characters with codes less         character types, instead of recognizing only characters with codes less
3266         than 128 via a lookup table.         than 128 via a lookup table.
3267    
3268         The remainder of this document discusses the  patterns  that  are  sup-         If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
3269         ported  by  PCRE when its main matching function, pcre_exec(), is used.         setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
3270         From  release  6.0,   PCRE   offers   a   second   matching   function,         time. There are also some more of these special sequences that are con-
3271         pcre_dfa_exec(),  which matches using a different algorithm that is not         cerned with the handling of newlines; they are described below.
3272    
3273           The  remainder  of  this  document discusses the patterns that are sup-
3274           ported by PCRE when its main matching function, pcre_exec(),  is  used.
3275           From   release   6.0,   PCRE   offers   a   second  matching  function,
3276           pcre_dfa_exec(), which matches using a different algorithm that is  not
3277         Perl-compatible. Some of the features discussed below are not available         Perl-compatible. Some of the features discussed below are not available
3278         when  pcre_dfa_exec()  is used. The advantages and disadvantages of the         when pcre_dfa_exec() is used. The advantages and disadvantages  of  the
3279         alternative function, and how it differs from the normal function,  are         alternative  function, and how it differs from the normal function, are
3280         discussed in the pcrematching page.         discussed in the pcrematching page.
3281    
3282    
3283  NEWLINE CONVENTIONS  NEWLINE CONVENTIONS
3284    
3285         PCRE  supports five different conventions for indicating line breaks in         PCRE supports five different conventions for indicating line breaks  in
3286         strings: a single CR (carriage return) character, a  single  LF  (line-         strings:  a  single  CR (carriage return) character, a single LF (line-
3287         feed) character, the two-character sequence CRLF, any of the three pre-         feed) character, the two-character sequence CRLF, any of the three pre-
3288         ceding, or any Unicode newline sequence. The pcreapi page  has  further         ceding,  or  any Unicode newline sequence. The pcreapi page has further
3289         discussion  about newlines, and shows how to set the newline convention         discussion about newlines, and shows how to set the newline  convention
3290         in the options arguments for the compiling and matching functions.         in the options arguments for the compiling and matching functions.
3291    
3292         It is also possible to specify a newline convention by starting a  pat-         It  is also possible to specify a newline convention by starting a pat-
3293         tern string with one of the following five sequences:         tern string with one of the following five sequences:
3294    
3295           (*CR)        carriage return           (*CR)        carriage return
# Line 3211  NEWLINE CONVENTIONS Line 3298  NEWLINE CONVENTIONS
3298           (*ANYCRLF)   any of the three above           (*ANYCRLF)   any of the three above
3299           (*ANY)       all Unicode newline sequences           (*ANY)       all Unicode newline sequences
3300    
3301         These  override  the default and the options given to pcre_compile() or         These override the default and the options given to  pcre_compile()  or
3302         pcre_compile2(). For example, on a Unix system where LF is the  default         pcre_compile2().  For example, on a Unix system where LF is the default
3303         newline sequence, the pattern         newline sequence, the pattern
3304    
3305           (*CR)a.b           (*CR)a.b
3306    
3307         changes the convention to CR. That pattern matches "a\nb" because LF is         changes the convention to CR. That pattern matches "a\nb" because LF is
3308         no longer a newline. Note that these special settings,  which  are  not         no  longer  a  newline. Note that these special settings, which are not
3309         Perl-compatible,  are  recognized  only at the very start of a pattern,         Perl-compatible, are recognized only at the very start  of  a  pattern,
3310         and that they must be in upper case.  If  more  than  one  of  them  is         and  that  they  must  be  in  upper  case. If more than one of them is
3311         present, the last one is used.         present, the last one is used.
3312    
3313         The  newline convention affects the interpretation of the dot metachar-         The newline convention affects the interpretation of the dot  metachar-
3314         acter when PCRE_DOTALL is not set, and also the behaviour of  \N.  How-         acter  when  PCRE_DOTALL is not set, and also the behaviour of \N. How-
3315         ever,  it  does  not  affect  what  the  \R escape sequence matches. By         ever, it does not affect  what  the  \R  escape  sequence  matches.  By
3316         default, this is any Unicode newline sequence, for Perl  compatibility.         default,  this is any Unicode newline sequence, for Perl compatibility.
3317         However,  this can be changed; see the description of \R in the section         However, this can be changed; see the description of \R in the  section
3318         entitled "Newline sequences" below. A change of \R setting can be  com-         entitled  "Newline sequences" below. A change of \R setting can be com-
3319         bined with a change of newline convention.         bined with a change of newline convention.
3320    
3321    
3322  CHARACTERS AND METACHARACTERS  CHARACTERS AND METACHARACTERS
3323    
3324         A  regular  expression  is  a pattern that is matched against a subject         A regular expression is a pattern that is  matched  against  a  subject
3325         string from left to right. Most characters stand for  themselves  in  a         string  from  left  to right. Most characters stand for themselves in a
3326         pattern,  and  match  the corresponding characters in the subject. As a         pattern, and match the corresponding characters in the  subject.  As  a
3327         trivial example, the pattern         trivial example, the pattern
3328    
3329           The quick brown fox           The quick brown fox
3330    
3331         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
3332         caseless  matching is specified (the PCRE_CASELESS option), letters are         caseless matching is specified (the PCRE_CASELESS option), letters  are
3333         matched independently of case. In UTF-8 mode, PCRE  always  understands         matched  independently  of case. In UTF-8 mode, PCRE always understands
3334         the  concept  of case for characters whose values are less than 128, so         the concept of case for characters whose values are less than  128,  so
3335         caseless matching is always possible. For characters with  higher  val-         caseless  matching  is always possible. For characters with higher val-
3336         ues,  the concept of case is supported if PCRE is compiled with Unicode         ues, the concept of case is supported if PCRE is compiled with  Unicode
3337         property support, but not otherwise.   If  you  want  to  use  caseless         property  support,  but  not  otherwise.   If  you want to use caseless
3338         matching  for  characters  128  and above, you must ensure that PCRE is         matching for characters 128 and above, you must  ensure  that  PCRE  is
3339         compiled with Unicode property support as well as with UTF-8 support.         compiled with Unicode property support as well as with UTF-8 support.
3340    
3341         The power of regular expressions comes  from  the  ability  to  include         The  power  of  regular  expressions  comes from the ability to include
3342         alternatives  and  repetitions in the pattern. These are encoded in the         alternatives and repetitions in the pattern. These are encoded  in  the
3343         pattern by the use of metacharacters, which do not stand for themselves         pattern by the use of metacharacters, which do not stand for themselves
3344         but instead are interpreted in some special way.         but instead are interpreted in some special way.
3345    
3346         There  are  two different sets of metacharacters: those that are recog-         There are two different sets of metacharacters: those that  are  recog-
3347         nized anywhere in the pattern except within square brackets, and  those         nized  anywhere in the pattern except within square brackets, and those
3348         that  are  recognized  within square brackets. Outside square brackets,         that are recognized within square brackets.  Outside  square  brackets,
3349         the metacharacters are as follows:         the metacharacters are as follows:
3350    
3351           \      general escape character with several uses           \      general escape character with several uses
# Line 3277  CHARACTERS AND METACHARACTERS Line 3364  CHARACTERS AND METACHARACTERS
3364                  also "possessive quantifier"                  also "possessive quantifier"
3365           {      start min/max quantifier           {      start min/max quantifier
3366    
3367         Part of a pattern that is in square brackets  is  called  a  "character         Part  of  a  pattern  that is in square brackets is called a "character
3368         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
3369    
3370           \      general escape character           \      general escape character
# Line 3293  CHARACTERS AND METACHARACTERS Line 3380  CHARACTERS AND METACHARACTERS
3380  BACKSLASH  BACKSLASH
3381    
3382         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
3383         a non-alphanumeric character, it takes away any  special  meaning  that         a character that is not a number or a letter, it takes away any special
3384         character  may  have.  This  use  of  backslash  as an escape character         meaning that character may have. This use of  backslash  as  an  escape
3385         applies both inside and outside character classes.         character applies both inside and outside character classes.
3386    
3387         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
3388         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
3389         character would otherwise be interpreted as a metacharacter, so  it  is         character  would  otherwise be interpreted as a metacharacter, so it is
3390         always  safe  to  precede  a non-alphanumeric with backslash to specify         always safe to precede a non-alphanumeric  with  backslash  to  specify
3391         that it stands for itself. In particular, if you want to match a  back-         that  it stands for itself. In particular, if you want to match a back-
3392         slash, you write \\.         slash, you write \\.
3393    
3394         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         In UTF-8 mode, only ASCII numbers and letters have any special  meaning
3395         the pattern (other than in a character class) and characters between  a         after  a  backslash.  All  other characters (in particular, those whose
3396           codepoints are greater than 127) are treated as literals.
3397    
3398           If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
3399           the  pattern (other than in a character class) and characters between a
3400         # outside a character class and the next newline are ignored. An escap-         # outside a character class and the next newline are ignored. An escap-
3401         ing backslash can be used to include a whitespace  or  #  character  as         ing  backslash  can  be  used to include a whitespace or # character as
3402         part of the pattern.         part of the pattern.
3403    
3404         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
3405         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
3406         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
3407         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
3408         tion. Note the following examples:         tion. Note the following examples:
3409    
3410           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 3323  BACKSLASH Line 3414  BACKSLASH
3414           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
3415           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
3416    
3417         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
3418         classes.         classes.  An isolated \E that is not preceded by \Q is ignored.
3419    
3420     Non-printing characters     Non-printing characters
3421    
3422         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
3423         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
3424         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
3425         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
3426         editing, it is  often  easier  to  use  one  of  the  following  escape         editing,  it  is  often  easier  to  use  one  of  the following escape
3427         sequences than the binary character it represents:         sequences than the binary character it represents:
3428    
3429           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
3430           \cx       "control-x", where x is any character           \cx       "control-x", where x is any ASCII character
3431           \e        escape (hex 1B)           \e        escape (hex 1B)
3432           \f        formfeed (hex 0C)           \f        formfeed (hex 0C)
3433           \n        linefeed (hex 0A)           \n        linefeed (hex 0A)
# Line 3346  BACKSLASH Line 3437  BACKSLASH
3437           \xhh      character with hex code hh           \xhh      character with hex code hh
3438           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh..
3439    
3440         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
3441         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
3442         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({
3443         becomes hex 7B.         is  7B),  while  \c; becomes hex 7B (; is 3B). If the byte following \c
3444           has a value greater than 127, a compile-time error occurs.  This  locks
3445         After \x, from zero to two hexadecimal digits are read (letters can  be         out  non-ASCII  characters in both byte mode and UTF-8 mode. (When PCRE
3446         in  upper  or  lower case). Any number of hexadecimal digits may appear         is compiled in EBCDIC mode, all byte values are  valid.  A  lower  case
3447         between \x{ and }, but the value of the character  code  must  be  less         letter is converted to upper case, and then the 0xc0 bits are flipped.)
3448    
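The \cx arithmetic can be checked with a short sketch. It follows the ASCII-mode rule as worded in the newer revision (lower case is converted to upper case, hex 40 is inverted, and a byte above 127 is an error). The function name is hypothetical, the error is modelled as a return value of -1, and EBCDIC mode is not covered.

```c
/* Hypothetical helper, not a PCRE function.
   Returns the character code that \cx stands for in ASCII mode, or -1
   when the byte after \c is greater than 127 (the case the newer text
   says causes a compile-time error). */
static int cx_escape_value(unsigned char x)
{
    if (x > 127)
        return -1;
    if (x >= 'a' && x <= 'z')
        x = (unsigned char)(x - 'a' + 'A');   /* lower case -> upper case */
    return x ^ 0x40;                          /* invert bit 6 (hex 40) */
}
```

With this rule, z (hex 7A) is first converted to Z (hex 5A) and then becomes hex 1A, { (hex 7B) becomes hex 3B, and ; (hex 3B) becomes hex 7B, which agrees with the examples in the text.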
3449           After  \x, from zero to two hexadecimal digits are read (letters can be
3450           in upper or lower case). Any number of hexadecimal  digits  may  appear
3451           between  \x{  and  },  but the value of the character code must be less
3452         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
3453         the maximum value in hexadecimal is 7FFFFFFF. Note that this is  bigger         the  maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
3454         than the largest Unicode code point, which is 10FFFF.         than the largest Unicode code point, which is 10FFFF.
3455    
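The limits on \x{...} values can be written as a one-line range check. This is only a sketch of the stated rule; the function name and the utf8_mode flag are assumptions made for illustration and do not correspond to a PCRE function.

```c
/* Hypothetical helper, not a PCRE function.
   Returns 1 if a character code given as \x{...} is within the limit
   stated in the text, otherwise 0. utf8_mode is nonzero in UTF-8 mode. */
static int hex_escape_value_ok(unsigned long value, int utf8_mode)
{
    /* less than 2**31 in UTF-8 mode, less than 256 otherwise */
    unsigned long limit = utf8_mode ? 0x80000000UL : 0x100UL;
    return value < limit;
}
```

Note that the check accepts 7FFFFFFF in UTF-8 mode even though the largest Unicode code point is 10FFFF, as the paragraph above points out.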
3456         If  characters  other than hexadecimal digits appear between \x{ and },         If characters other than hexadecimal digits appear between \x{  and  },
3457         or if there is no terminating }, this form of escape is not recognized.         or if there is no terminating }, this form of escape is not recognized.
3458         Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal         Instead, the initial \x will be  interpreted  as  a  basic  hexadecimal
3459         escape, with no following digits, giving a  character  whose  value  is         escape,  with  no  following  digits, giving a character whose value is
3460         zero.         zero.
3461    
3462         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
3463         two syntaxes for \x. There is no difference in the way  they  are  han-         two  syntaxes  for  \x. There is no difference in the way they are han-
3464         dled. For example, \xdc is exactly the same as \x{dc}.         dled. For example, \xdc is exactly the same as \x{dc}.
3465    
3466         After  \0  up  to two further octal digits are read. If there are fewer         After \0 up to two further octal digits are read. If  there  are  fewer
3467         than two digits, just  those  that  are  present  are  used.  Thus  the         than  two  digits,  just  those  that  are  present  are used. Thus the
3468         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
3469         (code value 7). Make sure you supply two digits after the initial  zero         (code  value 7). Make sure you supply two digits after the initial zero
3470         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
3471    
3472         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
3473         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
3474         its  as  a  decimal  number. If the number is less than 10, or if there         its as a decimal number. If the number is less than  10,  or  if  there
3475         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
3476         expression,  the  entire  sequence  is  taken  as  a  back reference. A         expression, the entire  sequence  is  taken  as  a  back  reference.  A
3477         description of how this works is given later, following the  discussion         description  of how this works is given later, following the discussion
3478         of parenthesized subpatterns.         of parenthesized subpatterns.
3479    
3480         Inside  a  character  class, or if the decimal number is greater than 9         Inside a character class, or if the decimal number is  greater  than  9
3481         and there have not been that many capturing subpatterns, PCRE  re-reads         and  there have not been that many capturing subpatterns, PCRE re-reads
3482         up to three octal digits following the backslash, and uses them to gen-         up to three octal digits following the backslash, and uses them to gen-
3483         erate a data character. Any subsequent digits stand for themselves.  In         erate  a data character. Any subsequent digits stand for themselves. In
3484         non-UTF-8  mode,  the  value  of a character specified in octal must be         non-UTF-8 mode, the value of a character specified  in  octal  must  be
3485         less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For         less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For
3486         example:         example:
3487    
3488           \040   is another way of writing a space           \040   is another way of writing a space
# Line 3405  BACKSLASH Line 3500  BACKSLASH
3500           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
3501                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
3502    
3503         Note  that  octal  values of 100 or greater must not be introduced by a         Note that octal values of 100 or greater must not be  introduced  by  a
3504         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
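The decision procedure for a backslash followed by a non-zero digit can be sketched as a small model. This is an illustrative Python function written from the rules above, not PCRE source code; the name interpret_backslash_digits and its parameters are hypothetical, and the non-UTF-8 limit of \400 is not modelled.

```python
def interpret_backslash_digits(digits, captures_so_far, in_class=False):
    """Model how a backslash followed by `digits` (first digit 1-9) is read.

    Returns a tuple (kind, value, literal_rest):
      kind          "backref" or "char"
      value         group number, or the character code built from octal digits
      literal_rest  digits that are left over and stand for themselves
    """
    if not digits or digits[0] == "0" or any(c not in "0123456789" for c in digits):
        raise ValueError("expected decimal digits starting with 1-9")

    number = int(digits)
    # Outside a class: small numbers, or numbers covered by earlier
    # capturing parentheses, are back references.
    if not in_class and (number < 10 or number <= captures_so_far):
        return ("backref", number, "")

    # Otherwise re-read up to three octal digits as a character code.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    value = int(octal, 8) if octal else 0  # no octal digits gives binary zero
    return ("char", value, digits[len(octal):])
```

For instance, with no capturing groups the digits "81" yield a binary zero followed by the literal characters "8" and "1", while "11" yields character 9 (a tab) unless at least 11 groups precede it.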
3505    
3506         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
3507         inside  and  outside character classes. In addition, inside a character         inside and outside character classes. In addition, inside  a  character
3508         class, the sequence \b is interpreted as the backspace  character  (hex         class,  the  sequence \b is interpreted as the backspace character (hex
3509         08).  The sequences \B, \N, \R, and \X are not special inside a charac-         08). The sequences \B, \N, \R, and \X are not special inside a  charac-
3510         ter class. Like any  other  unrecognized  escape  sequences,  they  are         ter  class.  Like  any  other  unrecognized  escape sequences, they are
3511         treated  as  the  literal characters "B", "N", "R", and "X" by default,         treated as the literal characters "B", "N", "R", and  "X"  by  default,
3512         but cause an error if the PCRE_EXTRA option is set. Outside a character         but cause an error if the PCRE_EXTRA option is set. Outside a character
3513         class, these sequences have different meanings.         class, these sequences have different meanings.
3514    
3515     Absolute and relative back references     Absolute and relative back references
3516    
3517         The  sequence  \g followed by an unsigned or a negative number, option-         The sequence \g followed by an unsigned or a negative  number,  option-
3518         ally enclosed in braces, is an absolute or relative back  reference.  A         ally  enclosed  in braces, is an absolute or relative back reference. A
3519         named back reference can be coded as \g{name}. Back references are dis-         named back reference can be coded as \g{name}. Back references are dis-
3520         cussed later, following the discussion of parenthesized subpatterns.         cussed later, following the discussion of parenthesized subpatterns.
3521    
3522     Absolute and relative subroutine calls     Absolute and relative subroutine calls
3523    
3524         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
3525         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
3526         an alternative syntax for referencing a subpattern as  a  "subroutine".         an  alternative  syntax for referencing a subpattern as a "subroutine".
3527         Details  are  discussed  later.   Note  that  \g{...} (Perl syntax) and         Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
3528         \g<...> (Oniguruma syntax) are not synonymous. The  former  is  a  back         \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
3529         reference; the latter is a subroutine call.         reference; the latter is a subroutine call.
3530    
3531     Generic character types     Generic character types
# Line 3449  BACKSLASH Line 3544  BACKSLASH
3544           \W     any "non-word" character           \W     any "non-word" character
3545    
3546         There is also the single sequence \N, which matches a non-newline char-         There is also the single sequence \N, which matches a non-newline char-
3547         acter.  This is the same as the "." metacharacter when  PCRE_DOTALL  is         acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is
3548         not set.         not set.
3549    
3550         Each  pair of lower and upper case escape sequences partitions the com-         Each pair of lower and upper case escape sequences partitions the  com-
3551         plete set of characters into two disjoint  sets.  Any  given  character         plete  set  of  characters  into two disjoint sets. Any given character
3552         matches  one, and only one, of each pair. The sequences can appear both         matches one, and only one, of each pair. The sequences can appear  both
3553         inside and outside character classes. They each match one character  of         inside  and outside character classes. They each match one character of
3554         the  appropriate  type.  If the current matching point is at the end of         the appropriate type. If the current matching point is at  the  end  of
3555         the subject string, all of them fail, because there is no character  to         the  subject string, all of them fail, because there is no character to
3556         match.         match.
3557    
3558         For  compatibility  with Perl, \s does not match the VT character (code         For compatibility with Perl, \s does not match the VT  character  (code
3559         11).  This makes it different from the POSIX "space" class. The  \s         11).   This makes it different from the POSIX "space" class. The \s
3560         characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If         characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If
3561         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
3562         ter. In PCRE, it never does.         ter. In PCRE, it never does.
3563    
3564         A  "word"  character is an underscore or any character that is a letter         A "word" character is an underscore or any character that is  a  letter
3565         or digit.  By default, the definition of letters  and  digits  is  con-         or  digit.   By  default,  the definition of letters and digits is con-
3566         trolled  by PCRE's low-valued character tables, and may vary if locale-         trolled by PCRE's low-valued character tables, and may vary if  locale-
3567         specific matching is taking place (see "Locale support" in the  pcreapi         specific  matching is taking place (see "Locale support" in the pcreapi
3568         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
3569         systems, or "french" in Windows, some character codes greater than  128         systems,  or "french" in Windows, some character codes greater than 128
3570         are  used  for  accented letters, and these are then matched by \w. The         are used for accented letters, and these are then matched  by  \w.  The
3571         use of locales with Unicode is discouraged.         use of locales with Unicode is discouraged.
3572    
3573         By default, in UTF-8 mode, characters  with  values  greater  than  127         By  default,  in  UTF-8  mode,  characters with values greater than 127
3574         never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These         never match \d, \s, or \w, and always  match  \D,  \S,  and  \W.  These
3575         sequences retain their original meanings from before UTF-8 support  was         sequences  retain their original meanings from before UTF-8 support was
3576         available,  mainly for efficiency reasons. However, if PCRE is compiled         available, mainly for efficiency reasons. However, if PCRE is  compiled
3577         with Unicode property support, and the PCRE_UCP option is set, the  be-         with  Unicode property support, and the PCRE_UCP option is set, the be-
3578         haviour  is  changed  so  that Unicode properties are used to determine         haviour is changed so that Unicode properties  are  used  to  determine
3579         character types, as follows:         character types, as follows:
3580    
3581           \d  any character that \p{Nd} matches (decimal digit)           \d  any character that \p{Nd} matches (decimal digit)
3582           \s  any character that \p{Z} matches, plus HT, LF, FF, CR           \s  any character that \p{Z} matches, plus HT, LF, FF, CR
3583           \w  any character that \p{L} or \p{N} matches, plus underscore           \w  any character that \p{L} or \p{N} matches, plus underscore
3584    
3585         The upper case escapes match the inverse sets of characters. Note  that         The  upper case escapes match the inverse sets of characters. Note that
3586         \d  matches  only decimal digits, whereas \w matches any Unicode digit,         \d matches only decimal digits, whereas \w matches any  Unicode  digit,
3587         as well as any Unicode letter, and underscore. Note also that  PCRE_UCP         as  well as any Unicode letter, and underscore. Note also that PCRE_UCP
3588         affects  \b  and  \B  because  they are defined in terms of \w and \W.         affects \b and \B because they are defined in  terms  of  \w  and  \W.
3589         Matching these sequences is noticeably slower when PCRE_UCP is set.         Matching these sequences is noticeably slower when PCRE_UCP is set.
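The three PCRE_UCP definitions above can be approximated with Python's standard unicodedata module. This is a sketch of the character sets only, not of PCRE itself: the function names are hypothetical, and Python's Unicode tables are generally newer than the Unicode 5.2.0 tables mentioned in this document, so results for recently added characters may differ.

```python
import unicodedata


def ucp_digit(ch):
    """\\d under PCRE_UCP: any character that \\p{Nd} matches."""
    return unicodedata.category(ch) == "Nd"


def ucp_space(ch):
    """\\s under PCRE_UCP: any character that \\p{Z} matches, plus HT, LF, FF, CR."""
    return unicodedata.category(ch).startswith("Z") or ch in "\t\n\f\r"


def ucp_word(ch):
    """\\w under PCRE_UCP: any character that \\p{L} or \\p{N} matches, plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

SUPERSCRIPT TWO (U+00B2) has general category No, so in this model it is a word character but not a digit, which illustrates the note that \w matches any Unicode digit whereas \d matches only decimal digits. VT (code 11) is a control character and is not matched by ucp_space.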
3590    
3591         The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to         The  sequences  \h, \H, \v, and \V are features that were added to Perl
3592         the  other  sequences,  which  match  only ASCII characters by default,         at release 5.10. In contrast to the other sequences, which  match  only
3593         these always  match  certain  high-valued  codepoints  in  UTF-8  mode,         ASCII  characters  by  default,  these always match certain high-valued
3594         whether or not PCRE_UCP is set. The horizontal space characters are:         codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The  horizon-
3595           tal space characters are:
3596    
3597           U+0009     Horizontal tab           U+0009     Horizontal tab
3598           U+0020     Space           U+0020     Space
# Line 3531  BACKSLASH Line 3627  BACKSLASH
3627     Newline sequences     Newline sequences
3628    
3629         Outside  a  character class, by default, the escape sequence \R matches         Outside  a  character class, by default, the escape sequence \R matches
3630         any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8         any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
3631         mode \R is equivalent to the following:         following:
3632    
3633           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
3634    
# Line 3740  BACKSLASH Line 3836  BACKSLASH
3836    
3837     Resetting the match start     Resetting the match start
3838    
3839         The escape sequence \K, which is a Perl 5.10 feature, causes any previ-         The  escape sequence \K causes any previously matched characters not to
3840         ously matched characters not  to  be  included  in  the  final  matched         be included in the final matched sequence. For example, the pattern:
        sequence. For example, the pattern:  
3841    
3842           foo\Kbar           foo\Kbar
3843    
3844         matches  "foobar",  but reports that it has matched "bar". This feature         matches "foobar", but reports that it has matched "bar".  This  feature
3845         is similar to a lookbehind assertion (described  below).   However,  in         is  similar  to  a lookbehind assertion (described below).  However, in
3846         this  case, the part of the subject before the real match does not have         this case, the part of the subject before the real match does not  have
3847         to be of fixed length, as lookbehind assertions do. The use of \K  does         to  be of fixed length, as lookbehind assertions do. The use of \K does
3848         not  interfere  with  the setting of captured substrings.  For example,         not interfere with the setting of captured  substrings.   For  example,
3849         when the pattern         when the pattern
3850    
3851           (foo)\Kbar           (foo)\Kbar
3852    
3853         matches "foobar", the first substring is still set to "foo".         matches "foobar", the first substring is still set to "foo".
3854    
3855         Perl documents that the use  of  \K  within  assertions  is  "not  well         Perl  documents  that  the  use  of  \K  within assertions is "not well
3856         defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive         defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive
3857         assertions, but is ignored in negative assertions.         assertions, but is ignored in negative assertions.
3858    
3859     Simple assertions     Simple assertions
3860    
3861         The final use of backslash is for certain simple assertions. An  asser-         The  final use of backslash is for certain simple assertions. An asser-
3862         tion  specifies a condition that has to be met at a particular point in         tion specifies a condition that has to be met at a particular point  in
3863         a match, without consuming any characters from the subject string.  The         a  match, without consuming any characters from the subject string. The
3864         use  of subpatterns for more complicated assertions is described below.         use of subpatterns for more complicated assertions is described  below.
3865         The backslashed assertions are:         The backslashed assertions are:
3866    
3867           \b     matches at a word boundary           \b     matches at a word boundary
# Line 3777  BACKSLASH Line 3872  BACKSLASH
3872           \z     matches only at the end of the subject           \z     matches only at the end of the subject
3873           \G     matches at the first matching position in the subject           \G     matches at the first matching position in the subject
3874    
3875         Inside a character class, \b has a different meaning;  it  matches  the         Inside  a  character  class, \b has a different meaning; it matches the
3876         backspace  character.  If  any  other  of these assertions appears in a         backspace character. If any other of  these  assertions  appears  in  a
3877         character class, by default it matches the corresponding literal  char-         character  class, by default it matches the corresponding literal char-
3878         acter  (for  example,  \B  matches  the  letter  B).  However,  if  the         acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
3879         PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-         PCRE_EXTRA  option is set, an "invalid escape sequence" error is gener-
3880         ated instead.         ated instead.
3881    
3882         A  word  boundary is a position in the subject string where the current         A word boundary is a position in the subject string where  the  current
3883         character and the previous character do not both match \w or  \W  (i.e.         character  and  the previous character do not both match \w or \W (i.e.
3884         one  matches  \w  and the other matches \W), or the start or end of the         one matches \w and the other matches \W), or the start or  end  of  the
3885         string if the first or last  character  matches  \w,  respectively.  In         string  if  the  first  or  last character matches \w, respectively. In
3886         UTF-8  mode,  the  meanings  of \w and \W can be changed by setting the         UTF-8 mode, the meanings of \w and \W can be  changed  by  setting  the
3887         PCRE_UCP option. When this is done, it also affects \b and \B.  Neither         PCRE_UCP  option. When this is done, it also affects \b and \B. Neither
3888         PCRE  nor  Perl has a separate "start of word" or "end of word" metase-         PCRE nor Perl has a separate "start of word" or "end of  word"  metase-
3889         quence. However, whatever follows \b normally determines which  it  is.         quence.  However,  whatever follows \b normally determines which it is.
3890         For example, the fragment \ba matches "a" at the start of a word.         For example, the fragment \ba matches "a" at the start of a word.
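The word boundary definition can be written out directly. The following Python sketch assumes the default (non-UCP) meaning of \w restricted to ASCII letters, digits, and underscore; is_word_char and is_word_boundary are hypothetical helper names, and locale-specific tables are not modelled.

```python
def is_word_char(ch):
    """Default \\w, assuming plain ASCII tables: letter, digit, or underscore."""
    return ch == "_" or (ch.isascii() and ch.isalnum())


def is_word_boundary(subject, pos):
    """True if \\b would match at offset `pos` of `subject`.

    The start and end of the string count as non-word neighbours, so a
    boundary exists there only if the first or last character matches \\w.
    """
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after


def boundaries(subject):
    """All offsets in `subject` at which \\b matches in this model."""
    return [i for i in range(len(subject) + 1) if is_word_boundary(subject, i)]
```

In the subject "a cat" this model reports boundaries at offsets 0, 1, 2, and 5. The fragment \ba can match only where a boundary is followed by "a", which is offset 0 here.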
3891    
3892         The  \A,  \Z,  and \z assertions differ from the traditional circumflex         The \A, \Z, and \z assertions differ from  the  traditional  circumflex
3893         and dollar (described in the next section) in that they only ever match         and dollar (described in the next section) in that they only ever match
3894         at  the  very start and end of the subject string, whatever options are         at the very start and end of the subject string, whatever  options  are
3895         set. Thus, they are independent of multiline mode. These  three  asser-         set.  Thus,  they are independent of multiline mode. These three asser-
3896         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
3897         affect only the behaviour of the circumflex and dollar  metacharacters.         affect  only the behaviour of the circumflex and dollar metacharacters.
3898         However,  if the startoffset argument of pcre_exec() is non-zero, indi-         However, if the startoffset argument of pcre_exec() is non-zero,  indi-
3899         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
3900         the  subject,  \A  can never match. The difference between \Z and \z is         the subject, \A can never match. The difference between \Z  and  \z  is
3901         that \Z matches before a newline at the end of the string as well as at         that \Z matches before a newline at the end of the string as well as at
3902         the very end, whereas \z matches only at the end.         the very end, whereas \z matches only at the end.
3903    
3904         The  \G assertion is true only when the current matching position is at         The \G assertion is true only when the current matching position is  at
3905         the start point of the match, as specified by the startoffset  argument         the  start point of the match, as specified by the startoffset argument
3906         of  pcre_exec().  It  differs  from \A when the value of startoffset is         of pcre_exec(). It differs from \A when the  value  of  startoffset  is
3907         non-zero. By calling pcre_exec() multiple times with appropriate  argu-         non-zero.  By calling pcre_exec() multiple times with appropriate argu-
3908         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
3909         mentation where \G can be useful.         mentation where \G can be useful.
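The repeated-call technique can be illustrated with an analogy in Python's re module, which is not PCRE. Here the pos argument of Pattern.match() plays the part of startoffset, and the requirement that the match begin exactly at pos plays the part of a leading \G. The function name match_all_from is hypothetical.

```python
import re


def match_all_from(pattern, subject, start=0):
    """Collect consecutive matches, each one starting where the last ended.

    Pattern.match(subject, pos) succeeds only if the match begins at pos,
    which is the role \\G plays when startoffset is non-zero.
    """
    compiled = re.compile(pattern)
    results = []
    pos = start
    while pos <= len(subject):
        m = compiled.match(subject, pos)
        if m is None:
            break  # the chain of adjacent matches is broken
        results.append(m.group(0))
        if m.end() == pos:
            break  # empty match: stop rather than loop forever
        pos = m.end()
    return results
```

With the pattern \d+, and the subject "1,22,333,x,4", the loop collects "1,", "22,", and "333," and then stops at "x", because the next match would not start at the current position.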
3910    
3911         Note, however, that PCRE's interpretation of \G, as the  start  of  the         Note,  however,  that  PCRE's interpretation of \G, as the start of the
3912         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
3913         end of the previous match. In Perl, these can  be  different  when  the         end  of  the  previous  match. In Perl, these can be different when the
3914         previously  matched  string was empty. Because PCRE does just one match         previously matched string was empty. Because PCRE does just  one  match
3915         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
3916    
3917         If all the alternatives of a pattern begin with \G, the  expression  is         If  all  the alternatives of a pattern begin with \G, the expression is
3918         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
3919         in the compiled regular expression.         in the compiled regular expression.
3920    
# Line 3827  BACKSLASH Line 3922  BACKSLASH
3922  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
3923    
3924         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
3925         character  is  an  assertion  that is true only if the current matching         character is an assertion that is true only  if  the  current  matching
3926         point is at the start of the subject string. If the  startoffset  argu-         point  is  at the start of the subject string. If the startoffset argu-
3927         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
3928         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
3929         has an entirely different meaning (see below).         has an entirely different meaning (see below).
3930    
3931         Circumflex  need  not be the first character of the pattern if a number         Circumflex need not be the first character of the pattern if  a  number
3932         of alternatives are involved, but it should be the first thing in  each         of  alternatives are involved, but it should be the first thing in each
3933         alternative  in  which  it appears if the pattern is ever to match that         alternative in which it appears if the pattern is ever  to  match  that
3934         branch. If all possible alternatives start with a circumflex, that  is,         branch.  If all possible alternatives start with a circumflex, that is,
3935         if  the  pattern  is constrained to match only at the start of the sub-         if the pattern is constrained to match only at the start  of  the  sub-
3936         ject, it is said to be an "anchored" pattern.  (There  are  also  other         ject,  it  is  said  to be an "anchored" pattern. (There are also other
3937         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
3938    
3939         A  dollar  character  is  an assertion that is true only if the current         A dollar character is an assertion that is true  only  if  the  current
3940         matching point is at the end of  the  subject  string,  or  immediately         matching  point  is  at  the  end of the subject string, or immediately
3941         before a newline at the end of the string (by default). Dollar need not         before a newline at the end of the string (by default). Dollar need not
3942         be the last character of the pattern if a number  of  alternatives  are         be  the  last  character of the pattern if a number of alternatives are
3943         involved,  but  it  should  be  the last item in any branch in which it         involved, but it should be the last item in  any  branch  in  which  it
3944         appears. Dollar has no special meaning in a character class.         appears. Dollar has no special meaning in a character class.
3945    
3946         The meaning of dollar can be changed so that it  matches  only  at  the         The  meaning  of  dollar  can be changed so that it matches only at the
3947         very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
3948         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
3949    
3950         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
3951         PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex         PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
3952         matches immediately after internal newlines as well as at the start  of         matches  immediately after internal newlines as well as at the start of
3953         the  subject  string.  It  does not match after a newline that ends the         the subject string. It does not match after a  newline  that  ends  the
3954         string. A dollar matches before any newlines in the string, as well  as         string.  A dollar matches before any newlines in the string, as well as
3955         at  the very end, when PCRE_MULTILINE is set. When newline is specified         at the very end, when PCRE_MULTILINE is set. When newline is  specified
3956         as the two-character sequence CRLF, isolated CR and  LF  characters  do         as  the  two-character  sequence CRLF, isolated CR and LF characters do
3957         not indicate newlines.         not indicate newlines.
3958    
3959         For  example, the pattern /^abc$/ matches the subject string "def\nabc"         For example, the pattern /^abc$/ matches the subject string  "def\nabc"
3960         (where \n represents a newline) in multiline mode, but  not  otherwise.         (where  \n  represents a newline) in multiline mode, but not otherwise.
3961         Consequently,  patterns  that  are anchored in single line mode because         Consequently, patterns that are anchored in single  line  mode  because
3962         all branches start with ^ are not anchored in  multiline  mode,  and  a         all  branches  start  with  ^ are not anchored in multiline mode, and a
3963         match  for  circumflex  is  possible  when  the startoffset argument of         match for circumflex is  possible  when  the  startoffset  argument  of
3964         pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if         pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
3965         PCRE_MULTILINE is set.         PCRE_MULTILINE is set.
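The rules for circumflex and dollar can be summarized as two position tests. This Python sketch is a model of the description above, assuming that newline is the single character LF and that startoffset is zero; the function names and keyword parameters are hypothetical stand-ins for the PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY options.

```python
def circumflex_matches(subject, pos, multiline=False):
    """True if ^ is satisfied at offset `pos`."""
    if pos == 0:
        return True
    # In multiline mode, also after an internal newline, but not after a
    # newline that ends the string.
    return multiline and pos < len(subject) and subject[pos - 1] == "\n"


def dollar_matches(subject, pos, multiline=False, dollar_endonly=False):
    """True if $ is satisfied at offset `pos`."""
    end = len(subject)
    if pos == end:
        return True
    if multiline:
        # Before any newline; dollar_endonly is ignored in this mode.
        return subject[pos] == "\n"
    if dollar_endonly:
        return False
    # Default: also immediately before a newline that ends the string.
    return pos == end - 1 and subject[pos] == "\n"


def anchored_literal_matches(subject, literal, multiline=False):
    """True if the pattern ^literal$ matches somewhere in `subject`."""
    return any(
        circumflex_matches(subject, start, multiline)
        and subject.startswith(literal, start)
        and dollar_matches(subject, start + len(literal), multiline)
        for start in range(len(subject) + 1)
    )
```

Applied to the example above, anchored_literal_matches("def\nabc", "abc") is false by default and true with multiline=True.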
3966    
3967         Note  that  the sequences \A, \Z, and \z can be used to match the start         Note that the sequences \A, \Z, and \z can be used to match  the  start
3968         and end of the subject in both modes, and if all branches of a  pattern         and  end of the subject in both modes, and if all branches of a pattern
3969         start  with  \A it is always anchored, whether or not PCRE_MULTILINE is         start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
3970         set.         set.
3971    
3972    
3973  FULL STOP (PERIOD, DOT) AND \N  FULL STOP (PERIOD, DOT) AND \N
3974    
3975         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
3976         ter  in  the subject string except (by default) a character that signi-         ter in the subject string except (by default) a character  that  signi-
3977         fies the end of a line. In UTF-8 mode, the  matched  character  may  be         fies  the  end  of  a line. In UTF-8 mode, the matched character may be
3978         more than one byte long.         more than one byte long.
3979    
3980         When  a line ending is defined as a single character, dot never matches         When a line ending is defined as a single character, dot never  matches
3981         that character; when the two-character sequence CRLF is used, dot  does         that  character; when the two-character sequence CRLF is used, dot does
3982         not  match  CR  if  it  is immediately followed by LF, but otherwise it         not match CR if it is immediately followed  by  LF,  but  otherwise  it
3983         matches all characters (including isolated CRs and LFs). When any  Uni-         matches  all characters (including isolated CRs and LFs). When any Uni-
3984         code  line endings are being recognized, dot does not match CR or LF or         code line endings are being recognized, dot does not match CR or LF  or
3985         any of the other line ending characters.         any of the other line ending characters.
3986    
3987         The behaviour of dot with regard to newlines can  be  changed.  If  the         The  behaviour  of  dot  with regard to newlines can be changed. If the
3988         PCRE_DOTALL  option  is  set,  a dot matches any one character, without         PCRE_DOTALL option is set, a dot matches  any  one  character,  without
3989         exception. If the two-character sequence CRLF is present in the subject         exception. If the two-character sequence CRLF is present in the subject
3990         string, it takes two dots to match it.         string, it takes two dots to match it.
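The interaction between dot and the newline convention can be modelled as follows. This Python sketch covers only a single-character newline or the two-character CRLF convention; the "any Unicode line ending" setting is not modelled, and the function name and parameters are hypothetical.

```python
def dot_matches(subject, pos, newline="\n", dotall=False):
    """True if a dot can match the character at offset `pos`.

    `newline` is either a single character or the sequence "\\r\\n".
    """
    if pos >= len(subject):
        return False  # no character left to match
    if dotall:
        return True  # PCRE_DOTALL: any one character, without exception
    if newline == "\r\n":
        # Only a CR immediately followed by LF is refused; isolated CRs
        # and LFs are ordinary characters under this convention.
        return subject[pos:pos + 2] != "\r\n"
    return subject[pos] != newline
```

Under the CRLF convention, dot_matches("a\r\nb", 1, newline="\r\n") is false, whereas an isolated CR as in "a\rb" is matched. With dotall=True both characters of a CRLF pair are matched, one dot each.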
3991    
3992         The  handling of dot is entirely independent of the handling of circum-         The handling of dot is entirely independent of the handling of  circum-
3993         flex and dollar, the only relationship being  that  they  both  involve         flex  and  dollar,  the  only relationship being that they both involve
3994         newlines. Dot has no special meaning in a character class.         newlines. Dot has no special meaning in a character class.
3995    
3996         The escape sequence \N always behaves as a dot does when PCRE_DOTALL is         The escape sequence \N behaves like  a  dot,  except  that  it  is  not
3997         not set. In other words, it matches any one character except  one  that         affected  by  the  PCRE_DOTALL  option.  In other words, it matches any
3998         signifies the end of a line.         character except one that signifies the end of a line.
3999    
4000    
4001  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
4002    
4003         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
4004         both in and out of UTF-8 mode. Unlike a  dot,  it  always  matches  any         both  in  and  out  of  UTF-8 mode. Unlike a dot, it always matches any
4005         line-ending  characters.  The  feature  is provided in Perl in order to         line-ending characters. The feature is provided in  Perl  in  order  to
4006         match individual bytes in UTF-8 mode. Because it breaks up UTF-8  char-         match  individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
4007         acters  into individual bytes, what remains in the string may be a mal-         acters into individual bytes, the rest of the string may start  with  a
4008         formed UTF-8 string. For this reason, the \C escape  sequence  is  best         malformed  UTF-8  character. For this reason, the \C escape sequence is
4009         avoided.         best avoided.
4010    
4011         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE does not allow \C to appear in  lookbehind  assertions  (described
4012         below), because in UTF-8 mode this would make it impossible  to  calcu-         below),  because  in UTF-8 mode this would make it impossible to calcu-
4013         late the length of the lookbehind.         late the length of the lookbehind.
4014    
4015    
# Line 3924  SQUARE BRACKETS AND CHARACTER CLASSES Line 4019  SQUARE BRACKETS AND CHARACTER CLASSES
4019         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
4020         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
4021         a lone closing square bracket causes a compile-time error. If a closing         a lone closing square bracket causes a compile-time error. If a closing
4022         square  bracket  is required as a member of the class, it should be the         square bracket is required as a member of the class, it should  be  the
4023         first data character in the class  (after  an  initial  circumflex,  if         first  data  character  in  the  class (after an initial circumflex, if
4024         present) or escaped with a backslash.         present) or escaped with a backslash.
4025    
4026         A  character  class matches a single character in the subject. In UTF-8         A character class matches a single character in the subject.  In  UTF-8
4027         mode, the character may be more than one byte long. A matched character         mode, the character may be more than one byte long. A matched character
4028         must be in the set of characters defined by the class, unless the first         must be in the set of characters defined by the class, unless the first
4029         character in the class definition is a circumflex, in  which  case  the         character  in  the  class definition is a circumflex, in which case the
4030         subject  character  must  not  be in the set defined by the class. If a         subject character must not be in the set defined by  the  class.  If  a
4031         circumflex is actually required as a member of the class, ensure it  is         circumflex  is actually required as a member of the class, ensure it is
4032         not the first character, or escape it with a backslash.         not the first character, or escape it with a backslash.
4033    
4034         For  example, the character class [aeiou] matches any lower case vowel,         For example, the character class [aeiou] matches any lower case  vowel,
4035         while [^aeiou] matches any character that is not a  lower  case  vowel.         while  [^aeiou]  matches  any character that is not a lower case vowel.
4036         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
4037         characters that are in the class by enumerating those that are  not.  A         characters  that  are in the class by enumerating those that are not. A
4038         class  that starts with a circumflex is not an assertion; it still con-         class that starts with a circumflex is not an assertion; it still  con-
4039         sumes a character from the subject string, and therefore  it  fails  if         sumes  a  character  from the subject string, and therefore it fails if
4040         the current pointer is at the end of the string.         the current pointer is at the end of the string.
4041    
4042         In  UTF-8 mode, characters with values greater than 255 can be included         In UTF-8 mode, characters with values greater than 255 can be  included
4043         in a class as a literal string of bytes, or by using the  \x{  escaping         in  a  class as a literal string of bytes, or by using the \x{ escaping
4044         mechanism.         mechanism.
4045    
4046         When  caseless  matching  is set, any letters in a class represent both         When caseless matching is set, any letters in a  class  represent  both
4047         their upper case and lower case versions, so for  example,  a  caseless         their  upper  case  and lower case versions, so for example, a caseless
4048         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
4049         match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always         match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
4050         understands  the  concept  of case for characters whose values are less         understands the concept of case for characters whose  values  are  less
4051         than 128, so caseless matching is always possible. For characters  with         than  128, so caseless matching is always possible. For characters with
4052         higher  values,  the  concept  of case is supported if PCRE is compiled         higher values, the concept of case is supported  if  PCRE  is  compiled
4053         with Unicode property support, but not otherwise.  If you want  to  use         with  Unicode  property support, but not otherwise.  If you want to use
4054         caseless  matching in UTF-8 mode for characters 128 and above, you must         caseless matching in UTF-8 mode for characters 128 and above, you  must
4055         ensure that PCRE is compiled with Unicode property support as  well  as         ensure  that  PCRE is compiled with Unicode property support as well as
4056         with UTF-8 support.         with UTF-8 support.
4057    
4058         Characters  that  might  indicate  line breaks are never treated in any         Characters that might indicate line breaks are  never  treated  in  any
4059         special way  when  matching  character  classes,  whatever  line-ending         special  way  when  matching  character  classes,  whatever line-ending
4060         sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and         sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
4061         PCRE_MULTILINE options is used. A class such as [^a] always matches one         PCRE_MULTILINE options is used. A class such as [^a] always matches one
4062         of these characters.         of these characters.
4063    
4064         The  minus (hyphen) character can be used to specify a range of charac-         The minus (hyphen) character can be used to specify a range of  charac-
4065         ters in a character  class.  For  example,  [d-m]  matches  any  letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
4066         between  d  and  m,  inclusive.  If  a minus character is required in a         between d and m, inclusive. If a  minus  character  is  required  in  a
4067         class, it must be escaped with a backslash  or  appear  in  a  position         class,  it  must  be  escaped  with a backslash or appear in a position
4068         where  it cannot be interpreted as indicating a range, typically as the         where it cannot be interpreted as indicating a range, typically as  the
4069         first or last character in the class.         first or last character in the class.
4070    
4071         It is not possible to have the literal character "]" as the end charac-         It is not possible to have the literal character "]" as the end charac-
4072         ter  of a range. A pattern such as [W-]46] is interpreted as a class of         ter of a range. A pattern such as [W-]46] is interpreted as a class  of
4073         two characters ("W" and "-") followed by a literal string "46]", so  it         two  characters ("W" and "-") followed by a literal string "46]", so it
4074         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a         would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
4075         backslash it is interpreted as the end of range, so [W-\]46] is  inter-         backslash  it is interpreted as the end of range, so [W-\]46] is inter-
4076         preted  as a class containing a range followed by two other characters.         preted as a class containing a range followed by two other  characters.
4077         The octal or hexadecimal representation of "]" can also be used to  end         The  octal or hexadecimal representation of "]" can also be used to end
4078         a range.         a range.
4079    
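The two readings of "]" near a hyphen can be checked with a small sketch. It again uses Python's re module as an assumed stand-in for PCRE; the two engines agree on this particular piece of class syntax, but that is an assumption of the example, not a statement about PCRE internals. The variable names are illustrative only.

```python
import re

# Class containing "W" and "-", followed by the literal string "46]".
literal_tail = re.compile(r"[W-]46]")

# Escaped "]" ends a range: W..] plus the characters "4" and "6".
escaped_range = re.compile(r"[W-\]46]")

tail_hits = [s for s in ("W46]", "-46]", "X46]") if literal_tail.fullmatch(s)]
range_hits = [c for c in "WXZ[]46-5" if escaped_range.fullmatch(c)]
```

Here tail_hits keeps only "W46]" and "-46]", while range_hits keeps every sample character except "-" and "5".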
4080         Ranges  operate in the collating sequence of character values. They can         Ranges operate in the collating sequence of character values. They  can
4081         also  be  used  for  characters  specified  numerically,  for   example         also   be  used  for  characters  specified  numerically,  for  example
4082         [\000-\037].  In UTF-8 mode, ranges can include characters whose values         [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
4083         are greater than 255, for example [\x{100}-\x{2ff}].         are greater than 255, for example [\x{100}-\x{2ff}].
4084    
4085         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
4086         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
4087         to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
4088         character  tables  for  a French locale are in use, [\xc8-\xcb] matches         character tables for a French locale are in  use,  [\xc8-\xcb]  matches
4089         accented E characters in both cases. In UTF-8 mode, PCRE  supports  the         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
4090         concept  of  case for characters with values greater than 127 only when         concept of case for characters with values greater than 127  only  when
4091         it is compiled with Unicode property support.         it is compiled with Unicode property support.
4092    
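As a rough check of the [W-c] example, the sketch below runs the same class caselessly through Python's re module. This is an assumed stand-in for PCRE, limited to ASCII subjects; locale-specific character tables such as the French example are not covered.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Letters w-z and a-c match in either case, as do the punctuation
# characters that lie between "W" and "c" in the collating sequence.
samples = "WwZzAaCc^_dDvV"
matched = [c for c in samples if caseless_range.fullmatch(c)]
```

The letters "d" and "v" fall outside the range in both cases, so they are absent from matched.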
4093         The character types \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and  \W         The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
4094         may  also appear in a character class, and add the characters that they         \w, and \W may appear in a character class, and add the characters that
4095         match to the class. For example,  [\dABCDEF]  matches  any  hexadecimal         they  match to the class. For example, [\dABCDEF] matches any hexadeci-
4096         digit.  A circumflex can conveniently be used with the upper case char-         mal digit. In UTF-8 mode, the PCRE_UCP option affects the  meanings  of
4097         acter types to specify a more restricted set  of  characters  than  the         \d,  \s,  \w  and  their upper case partners, just as it does when they
4098         matching  lower  case  type.  For example, the class [^\W_] matches any         appear outside a character class, as described in the section  entitled
4099         letter or digit, but not underscore.         "Generic character types" above. The escape sequence \b has a different
4100           meaning inside a character class; it matches the  backspace  character.
4101         The only metacharacters that are recognized in  character  classes  are         The  sequences  \B,  \N,  \R, and \X are not special inside a character
4102         backslash,  hyphen  (only  where  it can be interpreted as specifying a         class. Like any other unrecognized escape sequences, they  are  treated
4103         range), circumflex (only at the start), opening  square  bracket  (only         as  the literal characters "B", "N", "R", and "X" by default, but cause
4104         when  it can be interpreted as introducing a POSIX class name - see the         an error if the PCRE_EXTRA option is set.
4105         next section), and the terminating  closing  square  bracket.  However,  
4106           A circumflex can conveniently be used with  the  upper  case  character
4107           types  to specify a more restricted set of characters than the matching
4108           lower case type.  For example, the class [^\W_] matches any  letter  or
4109           digit, but not underscore, whereas [\w] includes underscore. A positive
4110           character class should be read as "something OR something OR ..." and a
4111           negative class as "NOT something AND NOT something AND NOT ...".
4112    
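The difference between [^\W_] and [\w] can be seen in a short sketch. It uses Python's re module with the re.ASCII flag as an assumed approximation of PCRE's default (non-UCP) behaviour; the sample list is arbitrary.

```python
import re

letter_or_digit = re.compile(r"[^\W_]", re.ASCII)  # NOT non-word AND NOT "_"
word_char = re.compile(r"[\w]", re.ASCII)          # letter OR digit OR "_"

samples = ["a", "Z", "7", "_", "-", " "]
negated_hits = [s for s in samples if letter_or_digit.fullmatch(s)]
word_hits = [s for s in samples if word_char.fullmatch(s)]
```

Only the underscore separates the two results: it appears in word_hits but not in negated_hits.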
4113           The  only  metacharacters  that are recognized in character classes are
4114           backslash, hyphen (only where it can be  interpreted  as  specifying  a
4115           range),  circumflex  (only  at the start), opening square bracket (only
4116           when it can be interpreted as introducing a POSIX class name - see  the
4117           next  section),  and  the  terminating closing square bracket. However,
4118         escaping other non-alphanumeric characters does no harm.         escaping other non-alphanumeric characters does no harm.
4119    
4120    
4121  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
4122    
4123         Perl supports the POSIX notation for character classes. This uses names         Perl supports the POSIX notation for character classes. This uses names
4124         enclosed by [: and :] within the enclosing square brackets.  PCRE  also         enclosed  by  [: and :] within the enclosing square brackets. PCRE also
4125         supports this notation. For example,         supports this notation. For example,
4126    
4127           [01[:alpha:]%]           [01[:alpha:]%]
# Line 4037  POSIX CHARACTER CLASSES Line 4144  POSIX CHARACTER CLASSES
4144           word     "word" characters (same as \w)           word     "word" characters (same as \w)
4145           xdigit   hexadecimal digits           xdigit   hexadecimal digits
4146    
4147         The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),         The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
4148         and  space  (32). Notice that this list includes the VT character (code         and space (32). Notice that this list includes the VT  character  (code
4149         11). This makes "space" different to \s, which does not include VT (for         11). This makes "space" different to \s, which does not include VT (for
4150         Perl compatibility).         Perl compatibility).
4151    
4152         The  name  "word"  is  a Perl extension, and "blank" is a GNU extension         The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
4153         from Perl 5.8. Another Perl extension is negation, which  is  indicated         from  Perl  5.8. Another Perl extension is negation, which is indicated
4154         by a ^ character after the colon. For example,         by a ^ character after the colon. For example,
4155    
4156           [12[:^digit:]]           [12[:^digit:]]
4157    
4158         matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the         matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
4159         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
4160         these are not supported, and an error is given if they are encountered.         these are not supported, and an error is given if they are encountered.
4161    
4162         By  default,  in UTF-8 mode, characters with values greater than 127 do         By default, in UTF-8 mode, characters with values greater than  127  do
4163         not match any of the POSIX character classes. However, if the  PCRE_UCP         not  match any of the POSIX character classes. However, if the PCRE_UCP
4164         option  is passed to pcre_compile(), some of the classes are changed so         option is passed to pcre_compile(), some of the classes are changed  so
4165         that Unicode character properties are used. This is achieved by replac-         that Unicode character properties are used. This is achieved by replac-
4166         ing the POSIX classes by other sequences, as follows:         ing the POSIX classes by other sequences, as follows:
4167    
# Line 4067  POSIX CHARACTER CLASSES Line 4174  POSIX CHARACTER CLASSES
4174           [:upper:]  becomes  \p{Lu}           [:upper:]  becomes  \p{Lu}
4175           [:word:]   becomes  \p{Xwd}           [:word:]   becomes  \p{Xwd}
4176    
4177         Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other         Negated versions, such as [:^alpha:] use \P instead of  \p.  The  other
4178         POSIX classes are unchanged, and match only characters with code points         POSIX classes are unchanged, and match only characters with code points
4179         less than 128.         less than 128.
4180    
4181    
4182  VERTICAL BAR  VERTICAL BAR
4183    
4184         Vertical  bar characters are used to separate alternative patterns. For         Vertical bar characters are used to separate alternative patterns.  For
4185         example, the pattern         example, the pattern
4186    
4187           gilbert|sullivan           gilbert|sullivan
4188    
4189         matches either "gilbert" or "sullivan". Any number of alternatives  may         matches  either "gilbert" or "sullivan". Any number of alternatives may
4190         appear,  and  an  empty  alternative  is  permitted (matching the empty         appear, and an empty  alternative  is  permitted  (matching  the  empty
4191         string). The matching process tries each alternative in turn, from left         string). The matching process tries each alternative in turn, from left
4192         to  right, and the first one that succeeds is used. If the alternatives         to right, and the first one that succeeds is used. If the  alternatives
4193         are within a subpattern (defined below), "succeeds" means matching  the         are  within a subpattern (defined below), "succeeds" means matching the
4194         rest of the main pattern as well as the alternative in the subpattern.         rest of the main pattern as well as the alternative in the subpattern.
4195    
4196    
4197  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
4198    
4199         The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and         The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
4200         PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from         PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from
4201         within  the  pattern  by  a  sequence  of  Perl option letters enclosed         within the pattern by  a  sequence  of  Perl  option  letters  enclosed
4202         between "(?" and ")".  The option letters are         between "(?" and ")".  The option letters are
4203    
4204           i  for PCRE_CASELESS           i  for PCRE_CASELESS
# Line 4101  INTERNAL OPTION SETTING Line 4208  INTERNAL OPTION SETTING
4208    
4209         For example, (?im) sets caseless, multiline matching. It is also possi-         For example, (?im) sets caseless, multiline matching. It is also possi-
4210         ble to unset these options by preceding the letter with a hyphen, and a         ble to unset these options by preceding the letter with a hyphen, and a
4211         combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-         combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-
4212         LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,         LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,
4213         is also permitted. If a  letter  appears  both  before  and  after  the         is  also  permitted.  If  a  letter  appears  both before and after the
4214         hyphen, the option is unset.         hyphen, the option is unset.
4215    
4216         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA         The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
4217         can be changed in the same way as the Perl-compatible options by  using         can  be changed in the same way as the Perl-compatible options by using
4218         the characters J, U and X respectively.         the characters J, U and X respectively.
4219    
4220         When  one  of  these  option  changes occurs at top level (that is, not         When one of these option changes occurs at  top  level  (that  is,  not
4221         inside subpattern parentheses), the change applies to the remainder  of         inside  subpattern parentheses), the change applies to the remainder of
4222         the pattern that follows. If the change is placed right at the start of         the pattern that follows. If the change is placed right at the start of
4223         a pattern, PCRE extracts it into the global options (and it will there-         a pattern, PCRE extracts it into the global options (and it will there-
4224         fore show up in data extracted by the pcre_fullinfo() function).         fore show up in data extracted by the pcre_fullinfo() function).
4225    
4226         An  option  change  within a subpattern (see below for a description of         An option change within a subpattern (see below for  a  description  of
4227         subpatterns) affects only that part of the current pattern that follows         subpatterns)  affects only that part of the subpattern that follows it,
4228         it, so         so
4229    
4230           (a(?i)b)c           (a(?i)b)c
4231    
4232         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
4233         used).  By this means, options can be made to have  different  settings         used).   By  this means, options can be made to have different settings
4234         in  different parts of the pattern. Any changes made in one alternative         in different parts of the pattern. Any changes made in one  alternative
4235         do carry on into subsequent branches within the  same  subpattern.  For         do  carry  on  into subsequent branches within the same subpattern. For
4236         example,         example,
4237    
4238           (a(?i)b|c)           (a(?i)b|c)
4239    
4240         matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the         matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
4241         first branch is abandoned before the option setting.  This  is  because         first  branch  is  abandoned before the option setting. This is because
4242         the  effects  of option settings happen at compile time. There would be         the effects of option settings happen at compile time. There  would  be
4243         some very weird behaviour otherwise.         some very weird behaviour otherwise.
4244    
4245         Note: There are other PCRE-specific options that  can  be  set  by  the         Note:  There  are  other  PCRE-specific  options that can be set by the
4246         application  when  the  compile  or match functions are called. In some         application when the compile or match functions  are  called.  In  some
4247         cases the pattern can contain special leading sequences such as (*CRLF)         cases the pattern can contain special leading sequences such as (*CRLF)
4248         to  override  what  the application has set or what has been defaulted.         to override what the application has set or what  has  been  defaulted.
4249         Details are given in the section entitled  "Newline  sequences"  above.         Details  are  given  in the section entitled "Newline sequences" above.
4250         There  are  also  the  (*UTF8) and (*UCP) leading sequences that can be         There are also the (*UTF8) and (*UCP) leading  sequences  that  can  be
4251         used to set UTF-8 and Unicode property modes; they  are  equivalent  to         used  to  set  UTF-8 and Unicode property modes; they are equivalent to
4252         setting the PCRE_UTF8 and the PCRE_UCP options, respectively.         setting the PCRE_UTF8 and the PCRE_UCP options, respectively.
4253    
4254    
# Line 4154  SUBPATTERNS Line 4261  SUBPATTERNS
4261    
4262           cat(aract|erpillar|)           cat(aract|erpillar|)
4263    
4264         matches one of the words "cat", "cataract", or  "caterpillar".  Without         matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
4265         the  parentheses,  it  would  match  "cataract", "erpillar" or an empty         it would match "cataract", "erpillar" or an empty string.
        string.  
4266    
4267         2. It sets up the subpattern as  a  capturing  subpattern.  This  means         2. It sets up the subpattern as  a  capturing  subpattern.  This  means
4268         that,  when  the  whole  pattern  matches,  that portion of the subject         that,  when  the  whole  pattern  matches,  that portion of the subject
4269         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
4270         ovector  argument  of pcre_exec(). Opening parentheses are counted from         ovector  argument  of pcre_exec(). Opening parentheses are counted from
4271         left to right (starting from 1) to obtain  numbers  for  the  capturing         left to right (starting from 1) to obtain  numbers  for  the  capturing
4272         subpatterns.         subpatterns.  For  example,  if  the  string  "the red king" is matched
4273           against the pattern
        For  example,  if the string "the red king" is matched against the pat-  
        tern  
4274    
4275           the ((red|white) (king|queen))           the ((red|white) (king|queen))
4276    
# Line 4215  DUPLICATE SUBPATTERN NUMBERS Line 4319  DUPLICATE SUBPATTERN NUMBERS
4319         matched. This construct is useful when you want to  capture  part,  but         matched. This construct is useful when you want to  capture  part,  but
4320         not all, of one of a number of alternatives. Inside a (?| group, paren-         not all, of one of a number of alternatives. Inside a (?| group, paren-
4321         theses are numbered as usual, but the number is reset at the  start  of         theses are numbered as usual, but the number is reset at the  start  of
4322         each  branch. The numbers of any capturing buffers that follow the sub-         each  branch.  The numbers of any capturing parentheses that follow the
4323         pattern start after the highest number used in any branch. The  follow-         subpattern start after the highest number used in any branch. The  fol-
4324         ing  example  is taken from the Perl documentation.  The numbers under-         lowing example is taken from the Perl documentation. The numbers under-
4325         neath show in which buffer the captured content will be stored.         neath show in which buffer the captured content will be stored.
4326    
4327           # before  ---------------branch-reset----------- after           # before  ---------------branch-reset----------- after
# Line 4324  REPETITION Line 4428  REPETITION
4428           the \C escape sequence           the \C escape sequence
4429           the \X escape sequence (in UTF-8 mode with Unicode properties)           the \X escape sequence (in UTF-8 mode with Unicode properties)
4430           the \R escape sequence           the \R escape sequence
4431           an escape such as \d that matches a single character           an escape such as \d or \pL that matches a single character
4432           a character class           a character class
4433           a back reference (see next section)           a back reference (see next section)
4434           a parenthesized subpattern (unless it is an assertion)           a parenthesized subpattern (unless it is an assertion)
# Line 4364  REPETITION Line 4468  REPETITION
4468         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
4469         the previous item and the quantifier were not present. This may be use-         the previous item and the quantifier were not present. This may be use-
4470         ful for subpatterns that are referenced as subroutines  from  elsewhere         ful for subpatterns that are referenced as subroutines  from  elsewhere
4471         in the pattern. Items other than subpatterns that have a {0} quantifier         in the pattern (but see also the section entitled "Defining subpatterns
4472         are omitted from the compiled pattern.         for use by reference only" below). Items other  than  subpatterns  that
4473           have a {0} quantifier are omitted from the compiled pattern.
4474    
4475         For convenience, the three most common quantifiers have  single-charac-         For  convenience, the three most common quantifiers have single-charac-
4476         ter abbreviations:         ter abbreviations:
4477    
4478           *    is equivalent to {0,}           *    is equivalent to {0,}
4479           +    is equivalent to {1,}           +    is equivalent to {1,}
4480           ?    is equivalent to {0,1}           ?    is equivalent to {0,1}
4481    
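The three abbreviations can be compared against their brace forms with a small sketch. It uses Python's re module as an assumed stand-in for PCRE; the helper accepted and the sample subjects are invented for the illustration.

```python
import re

pairs = [
    (r"ab*c", r"ab{0,}c"),
    (r"ab+c", r"ab{1,}c"),
    (r"ab?c", r"ab{0,1}c"),
]
subjects = ["ac", "abc", "abbc", "abbbc"]


def accepted(pattern):
    """Return the subjects that the pattern matches in full."""
    return [s for s in subjects if re.fullmatch(pattern, s)]


# Each short form accepts exactly the same subjects as its brace form.
equivalent = all(accepted(short) == accepted(full) for short, full in pairs)
plus_hits = accepted(r"ab+c")
```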
4482         It  is  possible  to construct infinite loops by following a subpattern         It is possible to construct infinite loops by  following  a  subpattern
4483         that can match no characters with a quantifier that has no upper limit,         that can match no characters with a quantifier that has no upper limit,
4484         for example:         for example:
4485    
4486           (a?)*           (a?)*
4487    
4488         Earlier versions of Perl and PCRE used to give an error at compile time         Earlier versions of Perl and PCRE used to give an error at compile time
4489         for such patterns. However, because there are cases where this  can  be         for  such  patterns. However, because there are cases where this can be
4490         useful,  such  patterns  are now accepted, but if any repetition of the         useful, such patterns are now accepted, but if any  repetition  of  the
4491         subpattern does in fact match no characters, the loop is forcibly  bro-         subpattern  does in fact match no characters, the loop is forcibly bro-
4492         ken.         ken.
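As a rough illustration of the loop being broken, the sketch below runs the same pattern through Python's re module, which is assumed here to handle empty iterations in a comparable way; it is not PCRE itself. The variable names are illustrative only.

```python
import re

# (a?)* can match the empty string on every iteration, so a naive matcher
# would loop forever at the first position where no "a" is available.
m = re.match(r"(a?)*", "aab")
matched = m.group(0)

# With no "a" at all, the group can only ever match the empty string.
empty = re.match(r"(a?)*", "bbb").group(0)

print(repr(matched), repr(empty))
```

Both calls return promptly: the repeat stops as soon as an iteration consumes no characters.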
4493    
4494         By  default,  the quantifiers are "greedy", that is, they match as much         By default, the quantifiers are "greedy", that is, they match  as  much
4495         as possible (up to the maximum  number  of  permitted  times),  without         as  possible  (up  to  the  maximum number of permitted times), without
4496         causing  the  rest of the pattern to fail. The classic example of where         causing the rest of the pattern to fail. The classic example  of  where
4497         this gives problems is in trying to match comments in C programs. These         this gives problems is in trying to match comments in C programs. These
4498         appear  between  /*  and  */ and within the comment, individual * and /         appear between /* and */ and within the comment,  individual  *  and  /
4499         characters may appear. An attempt to match C comments by  applying  the         characters  may  appear. An attempt to match C comments by applying the
4500         pattern         pattern
4501    
4502           /\*.*\*/           /\*.*\*/
# Line 4400  REPETITION Line 4505  REPETITION
4505    
4506           /* first comment */  not comment  /* second comment */           /* first comment */  not comment  /* second comment */
4507    
4508         fails,  because it matches the entire string owing to the greediness of         fails, because it matches the entire string owing to the greediness  of
4509         the .*  item.         the .*  item.
4510    
4511         However, if a quantifier is followed by a question mark, it  ceases  to         However,  if  a quantifier is followed by a question mark, it ceases to
4512         be greedy, and instead matches the minimum number of times possible, so         be greedy, and instead matches the minimum number of times possible, so
4513         the pattern         the pattern
4514    
4515           /\*.*?\*/           /\*.*?\*/
4516    
4517         does the right thing with the C comments. The meaning  of  the  various         does  the  right  thing with the C comments. The meaning of the various
4518         quantifiers  is  not  otherwise  changed,  just the preferred number of         quantifiers is not otherwise changed,  just  the  preferred  number  of
4519         matches.  Do not confuse this use of question mark with its  use  as  a         matches.   Do  not  confuse this use of question mark with its use as a
4520         quantifier  in its own right. Because it has two uses, it can sometimes         quantifier in its own right. Because it has two uses, it can  sometimes
4521         appear doubled, as in         appear doubled, as in
4522    
4523           \d??\d           \d??\d
# Line 4420  REPETITION Line 4525  REPETITION
4525         which matches one digit by preference, but can match two if that is the         which matches one digit by preference, but can match two if that is the
4526         only way the rest of the pattern matches.         only way the rest of the pattern matches.
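The difference between the greedy and minimizing forms can be seen by running both comment patterns over the example subject. This is a minimal sketch using Python's re module, which is assumed to share PCRE's behaviour for these particular quantifiers; the variable names are illustrative.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/" in the subject, swallowing everything.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers a single digit, but takes two when the rest of the
# pattern (here the $ anchor) cannot match otherwise.
one = re.match(r"\d??\d", "12").group(0)
two = re.match(r"\d??\d$", "12").group(0)

print(greedy)
print(lazy)
print(one, two)
```

The greedy pattern yields one match covering the whole subject, while the minimizing pattern yields the two comments separately.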
4527    
4528         If  the PCRE_UNGREEDY option is set (an option that is not available in         If the PCRE_UNGREEDY option is set (an option that is not available  in
4529         Perl), the quantifiers are not greedy by default, but  individual  ones         Perl),  the  quantifiers are not greedy by default, but individual ones
4530         can  be  made  greedy  by following them with a question mark. In other         can be made greedy by following them with a  question  mark.  In  other
4531         words, it inverts the default behaviour.         words, it inverts the default behaviour.
4532    
4533         When a parenthesized subpattern is quantified  with  a  minimum  repeat         When  a  parenthesized  subpattern  is quantified with a minimum repeat
4534         count  that is greater than 1 or with a limited maximum, more memory is         count that is greater than 1 or with a limited maximum, more memory  is
4535         required for the compiled pattern, in proportion to  the  size  of  the         required  for  the  compiled  pattern, in proportion to the size of the
4536         minimum or maximum.         minimum or maximum.
4537    
4538         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
4539         alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,         alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
4540         the  pattern  is  implicitly anchored, because whatever follows will be         the pattern is implicitly anchored, because whatever  follows  will  be
4541         tried against every character position in the subject string, so  there         tried  against every character position in the subject string, so there
4542         is  no  point  in  retrying the overall match at any position after the         is no point in retrying the overall match at  any  position  after  the
4543         first. PCRE normally treats such a pattern as though it  were  preceded         first.  PCRE  normally treats such a pattern as though it were preceded
4544         by \A.         by \A.
4545    
4546         In  cases  where  it  is known that the subject string contains no new-         In cases where it is known that the subject  string  contains  no  new-
4547         lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-         lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
4548         mization, or alternatively using ^ to indicate anchoring explicitly.         mization, or alternatively using ^ to indicate anchoring explicitly.
4549    
4550         However,  there is one situation where the optimization cannot be used.         However, there is one situation where the optimization cannot be  used.
4551         When .*  is inside capturing parentheses that are the subject of a back         When .*  is inside capturing parentheses that are the subject of a back
4552         reference elsewhere in the pattern, a match at the start may fail where         reference elsewhere in the pattern, a match at the start may fail where
4553         a later one succeeds. Consider, for example:         a later one succeeds. Consider, for example:
4554    
4555           (.*)abc\1           (.*)abc\1
4556    
4557         If the subject is "xyz123abc123" the match point is the fourth  charac-         If  the subject is "xyz123abc123" the match point is the fourth charac-
4558         ter. For this reason, such a pattern is not implicitly anchored.         ter. For this reason, such a pattern is not implicitly anchored.
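The match point in this example can be confirmed with a short sketch. Python's re module is used in place of PCRE, on the assumption that greedy repeats and numbered back references behave the same way in both for this pattern.

```python
import re

# Attempts at offsets 0, 1 and 2 fail because the text captured before
# "abc" ("xyz123", "yz123", "z123") does not reappear after it.
m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()      # zero-based offset of the match
captured = m.group(1)  # what (.*) held when the match succeeded

print(start, captured, m.group(0))
```

The zero-based offset 3 corresponds to the fourth character, which is why treating the pattern as anchored at the start would wrongly report no match.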
4559    
4560         When a capturing subpattern is repeated, the value captured is the sub-         When a capturing subpattern is repeated, the value captured is the sub-
# Line 4458  REPETITION Line 4563  REPETITION
4563           (tweedle[dume]{3}\s*)+           (tweedle[dume]{3}\s*)+
4564    
4565         has matched "tweedledum tweedledee" the value of the captured substring         has matched "tweedledum tweedledee" the value of the captured substring
4566         is  "tweedledee".  However,  if there are nested capturing subpatterns,         is "tweedledee". However, if there are  nested  capturing  subpatterns,
4567         the corresponding captured values may have been set in previous  itera-         the  corresponding captured values may have been set in previous itera-
4568         tions. For example, after         tions. For example, after
4569    
4570           /(a|(b))+/           /(a|(b))+/
# Line 4469  REPETITION Line 4574  REPETITION
4574    
4575  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
4576    
4577         With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")         With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
4578         repetition, failure of what follows normally causes the  repeated  item         repetition,  failure  of what follows normally causes the repeated item
4579         to  be  re-evaluated to see if a different number of repeats allows the         to be re-evaluated to see if a different number of repeats  allows  the
4580         rest of the pattern to match. Sometimes it is useful to  prevent  this,         rest  of  the pattern to match. Sometimes it is useful to prevent this,
4581         either  to  change the nature of the match, or to cause it to fail earlier         either to change the nature of the match, or to cause it to fail  earlier
4582         than it otherwise might, when the author of the pattern knows there  is         than  it otherwise might, when the author of the pattern knows there is
4583         no point in carrying on.         no point in carrying on.
4584    
4585         Consider,  for  example, the pattern \d+foo when applied to the subject         Consider, for example, the pattern \d+foo when applied to  the  subject
4586         line         line
4587    
4588           123456bar           123456bar
4589    
4590         After matching all 6 digits and then failing to match "foo", the normal         After matching all 6 digits and then failing to match "foo", the normal
4591         action  of  the matcher is to try again with only 5 digits matching the         action of the matcher is to try again with only 5 digits  matching  the
4592         \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.         \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
4593         "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides         "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
4594         the means for specifying that once a subpattern has matched, it is  not         the  means for specifying that once a subpattern has matched, it is not
4595         to be re-evaluated in this way.         to be re-evaluated in this way.
4596    
4597         If  we  use atomic grouping for the previous example, the matcher gives         If we use atomic grouping for the previous example, the  matcher  gives
4598         up immediately on failing to match "foo" the first time.  The  notation         up  immediately  on failing to match "foo" the first time. The notation
4599         is a kind of special parenthesis, starting with (?> as in this example:         is a kind of special parenthesis, starting with (?> as in this example:
4600    
4601           (?>\d+)foo           (?>\d+)foo
4602    
4603         This  kind  of  parenthesis "locks up" the  part of the pattern it con-         This kind of parenthesis "locks up" the  part of the  pattern  it  con-
4604         tains once it has matched, and a failure further into  the  pattern  is         tains  once  it  has matched, and a failure further into the pattern is
4605         prevented  from  backtracking into it. Backtracking past it to previous         prevented from backtracking into it. Backtracking past it  to  previous
4606         items, however, works as normal.         items, however, works as normal.
4607    
4608         An alternative description is that a subpattern of  this  type  matches         An  alternative  description  is that a subpattern of this type matches
4609         the  string  of  characters  that an identical standalone pattern would         the string of characters that an  identical  standalone  pattern  would
4610         match, if anchored at the current point in the subject string.         match, if anchored at the current point in the subject string.
4611    
4612         Atomic grouping subpatterns are not capturing subpatterns. Simple cases         Atomic grouping subpatterns are not capturing subpatterns. Simple cases
4613         such as the above example can be thought of as a maximizing repeat that         such as the above example can be thought of as a maximizing repeat that
4614         must swallow everything it can. So, while both \d+ and  \d+?  are  pre-         must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
4615         pared  to  adjust  the number of digits they match in order to make the         pared to adjust the number of digits they match in order  to  make  the
4616         rest of the pattern match, (?>\d+) can only match an entire sequence of         rest of the pattern match, (?>\d+) can only match an entire sequence of
4617         digits.         digits.
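The effect of "locking up" a repeat can be sketched in Python. Because Python's re module accepts the (?> syntax only from version 3.11 onwards, this example swaps in a well-known emulation instead: the subpattern is captured inside a lookahead and the capture is then consumed by a back reference. Unlike a real atomic group, the emulation adds a capturing group. The helper name atomic is hypothetical.

```python
import re


def atomic(inner, name):
    """Emulate (?>inner): capture inside a lookahead, then consume the capture."""
    return "(?=(?P<%s>%s))(?P=%s)" % (name, inner, name)


# An ordinary repeat gives back the final digit so that "5" can match.
plain = re.search(r"\d+5", "12345")

# The emulated atomic group keeps every digit, so the trailing "5" never matches.
locked = re.search(atomic(r"\d+", "d") + "5", "12345")

# When no backtracking is needed, the two forms behave alike.
ok = re.search(atomic(r"\d+", "d") + "foo", "123foo")

print(plain.group(0), locked, ok.group(0))
```

The second search fails at every starting position, which illustrates how an atomic group can change the nature of a match rather than merely its speed.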
4618    
4619         Atomic  groups in general can of course contain arbitrarily complicated         Atomic groups in general can of course contain arbitrarily  complicated
4620         subpatterns, and can be nested. However, when  the  subpattern  for  an         subpatterns,  and  can  be  nested. However, when the subpattern for an
4621         atomic group is just a single repeated item, as in the example above, a         atomic group is just a single repeated item, as in the example above, a
4622         simpler notation, called a "possessive quantifier" can  be  used.  This         simpler  notation,  called  a "possessive quantifier" can be used. This
4623         consists  of  an  additional  + character following a quantifier. Using         consists of an additional + character  following  a  quantifier.  Using
4624         this notation, the previous example can be rewritten as         this notation, the previous example can be rewritten as
4625    
4626           \d++foo           \d++foo
# Line 4525  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 4630  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
4630    
4631           (abc|xyz){2,3}+           (abc|xyz){2,3}+
4632    
4633         Possessive   quantifiers   are   always  greedy;  the  setting  of  the         Possessive  quantifiers  are  always  greedy;  the   setting   of   the
4634         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
4635         simpler  forms  of atomic group. However, there is no difference in the         simpler forms of atomic group. However, there is no difference  in  the
4636         meaning of a possessive quantifier and  the  equivalent  atomic  group,         meaning  of  a  possessive  quantifier and the equivalent atomic group,
4637         though  there  may  be a performance difference; possessive quantifiers         though there may be a performance  difference;  possessive  quantifiers
4638         should be slightly faster.         should be slightly faster.
4639    
4640         The possessive quantifier syntax is an extension to the Perl  5.8  syn-         The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
4641         tax.   Jeffrey  Friedl  originated the idea (and the name) in the first         tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
4642         edition of his book. Mike McCloskey liked it, so implemented it when he         edition of his book. Mike McCloskey liked it, so implemented it when he
4643         built  Sun's Java package, and PCRE copied it from there. It ultimately         built Sun's Java package, and PCRE copied it from there. It  ultimately
4644         found its way into Perl at release 5.10.         found its way into Perl at release 5.10.
4645    
4646         PCRE has an optimization that automatically "possessifies" certain sim-         PCRE has an optimization that automatically "possessifies" certain sim-
4647         ple  pattern  constructs.  For  example, the sequence A+B is treated as         ple pattern constructs. For example, the sequence  A+B  is  treated  as
4648         A++B because there is no point in backtracking into a sequence  of  A's         A++B  because  there is no point in backtracking into a sequence of A's
4649         when B must follow.         when B must follow.
4650    
4651         When  a  pattern  contains an unlimited repeat inside a subpattern that         When a pattern contains an unlimited repeat inside  a  subpattern  that
4652         can itself be repeated an unlimited number of  times,  the  use  of  an         can  itself  be  repeated  an  unlimited number of times, the use of an
4653         atomic  group  is  the  only way to avoid some failing matches taking a         atomic group is the only way to avoid some  failing  matches  taking  a
4654         very long time indeed. The pattern         very long time indeed. The pattern
4655    
4656           (\D+|<\d+>)*[!?]           (\D+|<\d+>)*[!?]
4657    
4658         matches an unlimited number of substrings that either consist  of  non-         matches  an  unlimited number of substrings that either consist of non-
4659         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it         digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
4660         matches, it runs quickly. However, if it is applied to         matches, it runs quickly. However, if it is applied to
4661    
4662           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
4663    
4664         it takes a long time before reporting  failure.  This  is  because  the         it  takes  a  long  time  before reporting failure. This is because the
4665         string  can be divided between the internal \D+ repeat and the external         string can be divided between the internal \D+ repeat and the  external
4666         * repeat in a large number of ways, and all  have  to  be  tried.  (The         *  repeat  in  a  large  number of ways, and all have to be tried. (The
4667         example  uses  [!?]  rather than a single character at the end, because         example uses [!?] rather than a single character at  the  end,  because
4668         both PCRE and Perl have an optimization that allows  for  fast  failure         both  PCRE  and  Perl have an optimization that allows for fast failure
4669         when  a single character is used. They remember the last single charac-         when a single character is used. They remember the last single  charac-
4670         ter that is required for a match, and fail early if it is  not  present         ter  that  is required for a match, and fail early if it is not present
4671         in  the  string.)  If  the pattern is changed so that it uses an atomic         in the string.) If the pattern is changed so that  it  uses  an  atomic
4672         group, like this:         group, like this:
4673    
4674           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
# Line 4575  BACK REFERENCES Line 4680  BACK REFERENCES
4680    
4681         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
4682         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
4683         pattern earlier (that is, to its left) in the pattern,  provided  there         pattern  earlier  (that is, to its left) in the pattern, provided there
4684         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
4685    
4686         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
4687         it is always taken as a back reference, and causes  an  error  only  if         it  is  always  taken  as a back reference, and causes an error only if
4688         there  are  not that many capturing left parentheses in the entire pat-         there are not that many capturing left parentheses in the  entire  pat-
4689         tern. In other words, the parentheses that are referenced need  not  be         tern.  In  other words, the parentheses that are referenced need not be
4690         to  the left of the reference for numbers less than 10. A "forward back         to the left of the reference for numbers less than 10. A "forward  back
4691         reference" of this type can make sense when a  repetition  is  involved         reference"  of  this  type can make sense when a repetition is involved
4692         and  the  subpattern to the right has participated in an earlier itera-         and the subpattern to the right has participated in an  earlier  itera-
4693         tion.         tion.
4694    
4695         It is not possible to have a numerical "forward back  reference"  to  a         It  is  not  possible to have a numerical "forward back reference" to a
4696         subpattern  whose  number  is  10  or  more using this syntax because a         subpattern whose number is 10 or  more  using  this  syntax  because  a
4697         sequence such as \50 is interpreted as a character  defined  in  octal.         sequence  such  as  \50 is interpreted as a character defined in octal.
4698         See the subsection entitled "Non-printing characters" above for further         See the subsection entitled "Non-printing characters" above for further
4699         details of the handling of digits following a backslash.  There  is  no         details  of  the  handling of digits following a backslash. There is no
4700         such  problem  when named parentheses are used. A back reference to any         such problem when named parentheses are used. A back reference  to  any
4701         subpattern is possible using named parentheses (see below).         subpattern is possible using named parentheses (see below).
4702    
4703         Another way of avoiding the ambiguity inherent in  the  use  of  digits         Another  way  of  avoiding  the ambiguity inherent in the use of digits
4704         following a backslash is to use the \g escape sequence, which is a fea-         following a backslash is to use the \g  escape  sequence.  This  escape
4705         ture introduced in Perl 5.10.  This  escape  must  be  followed  by  an         must be followed by an unsigned number or a negative number, optionally
4706         unsigned  number  or  a negative number, optionally enclosed in braces.         enclosed in braces. These examples are all identical:
        These examples are all identical:  
4707    
4708           (ring), \1           (ring), \1
4709           (ring), \g1           (ring), \g1
# Line 4613  BACK REFERENCES Line 4717  BACK REFERENCES
4717           (abc(def)ghi)\g{-1}           (abc(def)ghi)\g{-1}
4718    
4719         The sequence \g{-1} is a reference to the most recently started captur-         The sequence \g{-1} is a reference to the most recently started captur-
4720         ing subpattern before \g, that is, it is equivalent to  \2.  Similarly,         ing subpattern before \g, that is, it is equivalent to \2 in this exam-
4721         \g{-2} would be equivalent to \1. The use of relative references can be         ple.   Similarly, \g{-2} would be equivalent to \1. The use of relative
4722         helpful in long patterns, and also in  patterns  that  are  created  by         references can be helpful in long patterns, and also in  patterns  that
4723         joining together fragments that contain references within themselves.         are  created  by  joining  together  fragments  that contain references
4724           within themselves.
4725    
4726         A  back  reference matches whatever actually matched the capturing sub-         A back reference matches whatever actually matched the  capturing  sub-
4727         pattern in the current subject string, rather  than  anything  matching         pattern  in  the  current subject string, rather than anything matching
4728         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
4729         of doing that). So the pattern         of doing that). So the pattern
4730    
4731           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4732    
4733         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
4734         not  "sense and responsibility". If caseful matching is in force at the         not "sense and responsibility". If caseful matching is in force at  the
4735         time of the back reference, the case of letters is relevant. For  exam-         time  of the back reference, the case of letters is relevant. For exam-
4736         ple,         ple,
4737    
4738           ((?i)rah)\s+\1           ((?i)rah)\s+\1
4739    
4740         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
4741         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
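Both examples can be tried with the sketch below, which uses Python's re module rather than PCRE. Python does not accept a (?i) setting in the middle of a pattern, so the scoped form (?i:rah) is substituted; this is assumed to be equivalent here, since in both cases the back reference itself lies outside the caseless region.

```python
import re

# The back reference must repeat the text that was actually captured.
pat = re.compile(r"(sens|respons)e and \1ibility")
same1 = pat.fullmatch("sense and sensibility") is not None
same2 = pat.fullmatch("response and responsibility") is not None
mixed = pat.fullmatch("sense and responsibility") is not None

# The group matches caselessly, but \1 compares case-sensitively.
rah = re.compile(r"((?i:rah))\s+\1")
lower = rah.fullmatch("rah rah") is not None
upper = rah.fullmatch("RAH RAH") is not None
differ = rah.fullmatch("RAH rah") is not None

print(same1, same2, mixed, lower, upper, differ)
```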
4742    
4743         There are several different ways of writing back  references  to  named         There  are  several  different ways of writing back references to named
4744         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or         subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
4745         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's         \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
4746         unified back reference syntax, in which \g can be used for both numeric         unified back reference syntax, in which \g can be used for both numeric
4747         and named references, is also supported. We  could  rewrite  the  above         and  named  references,  is  also supported. We could rewrite the above
4748         example in any of the following ways:         example in any of the following ways:
4749    
4750           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
# Line 4647  BACK REFERENCES Line 4752  BACK REFERENCES
4752           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
4753           (?<p1>(?i)rah)\s+\g{p1}           (?<p1>(?i)rah)\s+\g{p1}
4754    
4755         A  subpattern  that  is  referenced  by  name may appear in the pattern         A subpattern that is referenced by  name  may  appear  in  the  pattern
4756         before or after the reference.         before or after the reference.
4757    
4758         There may be more than one back reference to the same subpattern. If  a         There  may be more than one back reference to the same subpattern. If a
4759         subpattern  has  not actually been used in a particular match, any back         subpattern has not actually been used in a particular match,  any  back
4760         references to it always fail by default. For example, the pattern         references to it always fail by default. For example, the pattern
4761    
4762           (a|(bc))\2           (a|(bc))\2
4763    
4764         always fails if it starts to match "a" rather than  "bc".  However,  if         always  fails  if  it starts to match "a" rather than "bc". However, if
4765         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
4766         ence to an unset value matches an empty string.         ence to an unset value matches an empty string.
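The default behaviour for an unset group can be illustrated with Python's re module, which is assumed to follow the same rule as PCRE without PCRE_JAVASCRIPT_COMPAT: a reference to a group that took no part in the match fails. The JavaScript-compatible behaviour is not shown, as Python has no equivalent option.

```python
import re

pat = re.compile(r"(a|(bc))\2")

# The first alternative matches "a", leaving group 2 unset, so \2 fails.
unset = pat.match("a")
unset_longer = pat.match("abc")

# The second alternative sets group 2 to "bc", which \2 then repeats.
both = pat.match("bcbc")

print(unset, unset_longer, both.group(0))
```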
4767    
4768         Because there may be many capturing parentheses in a pattern, all  dig-         Because  there may be many capturing parentheses in a pattern, all dig-
4769         its  following a backslash are taken as part of a potential back refer-         its following a backslash are taken as part of a potential back  refer-
4770         ence number.  If the pattern continues with  a  digit  character,  some         ence  number.   If  the  pattern continues with a digit character, some
4771         delimiter  must  be  used  to  terminate  the  back  reference.  If the         delimiter must  be  used  to  terminate  the  back  reference.  If  the
4772         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
4773         syntax or an empty comment (see "Comments" below) can be used.         syntax or an empty comment (see "Comments" below) can be used.
4774    
4775     Recursive back references     Recursive back references
4776    
4777         A  back reference that occurs inside the parentheses to which it refers         A back reference that occurs inside the parentheses to which it  refers
4778         fails when the subpattern is first used, so, for example,  (a\1)  never         fails  when  the subpattern is first used, so, for example, (a\1) never
4779         matches.   However,  such references can be useful inside repeated sub-         matches.  However, such references can be useful inside  repeated  sub-
4780         patterns. For example, the pattern         patterns. For example, the pattern
4781    
4782           (a|b\1)+           (a|b\1)+
4783    
4784         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
4785         ation  of  the  subpattern,  the  back  reference matches the character         ation of the subpattern,  the  back  reference  matches  the  character
4786         string corresponding to the previous iteration. In order  for  this  to         string  corresponding  to  the previous iteration. In order for this to
4787         work,  the  pattern must be such that the first iteration does not need         work, the pattern must be such that the first iteration does  not  need
4788         to match the back reference. This can be done using alternation, as  in         to  match the back reference. This can be done using alternation, as in
4789         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
4790    
4791         Back  references of this type cause the group that they reference to be         Back references of this type cause the group that they reference to  be
4792         treated as an atomic group.  Once the whole group has been  matched,  a         treated  as  an atomic group.  Once the whole group has been matched, a
4793         subsequent  matching  failure cannot cause backtracking into the middle         subsequent matching failure cannot cause backtracking into  the  middle
4794         of the group.         of the group.
4795    
4796    
4797  ASSERTIONS  ASSERTIONS
4798    
4799         An assertion is a test on the characters  following  or  preceding  the         An  assertion  is  a  test on the characters following or preceding the
4800         current  matching  point that does not actually consume any characters.         current matching point that does not actually consume  any  characters.
4801         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
4802         described above.         described above.
4803    
4804         More  complicated  assertions  are  coded as subpatterns. There are two         More complicated assertions are coded as  subpatterns.  There  are  two
4805         kinds: those that look ahead of the current  position  in  the  subject         kinds:  those  that  look  ahead of the current position in the subject
4806         string,  and  those  that  look  behind  it. An assertion subpattern is         string, and those that look  behind  it.  An  assertion  subpattern  is
4807         matched in the normal way, except that it does not  cause  the  current         matched  in  the  normal way, except that it does not cause the current
4808         matching position to be changed.         matching position to be changed.
4809    
4810         Assertion  subpatterns  are  not  capturing subpatterns, and may not be         Assertion subpatterns are not capturing subpatterns,  and  may  not  be
4811         repeated, because it makes no sense to assert the  same  thing  several         repeated,  because  it  makes no sense to assert the same thing several
4812         times.  If  any kind of assertion contains capturing subpatterns within         times. If any kind of assertion contains capturing  subpatterns  within
4813         it, these are counted for the purposes of numbering the capturing  sub-         it,  these are counted for the purposes of numbering the capturing sub-
4814         patterns in the whole pattern.  However, substring capturing is carried         patterns in the whole pattern.  However, substring capturing is carried
4815         out only for positive assertions, because it does not  make  sense  for         out  only  for  positive assertions, because it does not make sense for
4816         negative assertions.         negative assertions.
4817    
4818     Lookahead assertions     Lookahead assertions
# Line 4717  ASSERTIONS Line 4822  ASSERTIONS
4822    
4823           \w+(?=;)           \w+(?=;)
4824    
4825         matches a word followed by a semicolon, but does not include the  semi-         matches  a word followed by a semicolon, but does not include the semi-
4826         colon in the match, and         colon in the match, and
4827    
4828           foo(?!bar)           foo(?!bar)
4829    
4830         matches  any  occurrence  of  "foo" that is not followed by "bar". Note         matches any occurrence of "foo" that is not  followed  by  "bar".  Note
4831         that the apparently similar pattern         that the apparently similar pattern
4832    
4833           (?!foo)bar           (?!foo)bar
4834    
4835         does not find an occurrence of "bar"  that  is  preceded  by  something         does  not  find  an  occurrence  of "bar" that is preceded by something
4836         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other than "foo"; it finds any occurrence of "bar" whatsoever,  because
4837         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
4838         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
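
The three lookahead patterns above can be tried out with a minimal sketch. It assumes Python's standard re module as a stand-in engine (it accepts the same lookahead syntax); the subject string and variable names are made up for illustration and are not part of PCRE.

```python
import re

# Python's re module is only a stand-in here; the subject is illustrative.
subject = "foobar foobaz; bar"

# \w+(?=;) : a word followed by a semicolon; the semicolon is not
# included in the matched text.
word_before_semicolon = re.search(r"\w+(?=;)", subject).group()

# foo(?!bar) : start offsets of each "foo" that is not followed by "bar".
foo_not_before_bar = [m.start() for m in re.finditer(r"foo(?!bar)", subject)]

# (?!foo)bar : the assertion is tested at the start of "bar" itself, so it
# always succeeds there and every "bar" is found.
every_bar = [m.start() for m in re.finditer(r"(?!foo)bar", subject)]

print(word_before_semicolon, foo_not_before_bar, every_bar)
```

The last list contains the "bar" that follows "foo" as well, which is the behaviour the text warns about.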
4839    
4840         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
4841         most convenient way to do it is  with  (?!)  because  an  empty  string         most  convenient  way  to  do  it  is with (?!) because an empty string
4842         always  matches, so an assertion that requires there not to be an empty         always matches, so an assertion that requires there not to be an  empty
4843         string must always fail.   The  Perl  5.10  backtracking  control  verb         string must always fail.  The backtracking control verb (*FAIL) or (*F)
4844         (*FAIL) or (*F) is essentially a synonym for (?!).         is a synonym for (?!).
4845    
4846     Lookbehind assertions     Lookbehind assertions
4847    
4848         Lookbehind  assertions start with (?<= for positive assertions and (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
4849         for negative assertions. For example,         for negative assertions. For example,
4850    
4851           (?<!foo)bar           (?<!foo)bar
4852    
4853         does find an occurrence of "bar" that is not  preceded  by  "foo".  The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
4854         contents  of  a  lookbehind  assertion are restricted such that all the         contents of a lookbehind assertion are restricted  such  that  all  the
4855         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4856         eral  top-level  alternatives,  they  do  not all have to have the same         eral top-level alternatives, they do not all  have  to  have  the  same
4857         fixed length. Thus         fixed length. Thus
4858    
4859           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 4757  ASSERTIONS Line 4862  ASSERTIONS
4862    
4863           (?<!dogs?|cats?)           (?<!dogs?|cats?)
4864    
4865         causes an error at compile time. Branches that match  different  length         causes  an  error at compile time. Branches that match different length
4866         strings  are permitted only at the top level of a lookbehind assertion.         strings are permitted only at the top level of a lookbehind  assertion.
4867         This is an extension compared with Perl (5.8 and 5.10), which  requires         This is an extension compared with Perl, which requires all branches to
4868         all branches to match the same length of string. An assertion such as         match the same length of string. An assertion such as
4869    
4870           (?<=ab(c|de))           (?<=ab(c|de))
4871    
4872         is  not  permitted,  because  its single top-level branch can match two         is not permitted, because its single top-level  branch  can  match  two
4873         different lengths, but it is acceptable to PCRE if rewritten to use two         different lengths, but it is acceptable to PCRE if rewritten to use two
4874         top-level branches:         top-level branches:
4875    
4876           (?<=abc|abde)           (?<=abc|abde)
4877    
4878         In some cases, the Perl 5.10 escape sequence \K (see above) can be used         In some cases, the escape sequence \K (see above) can be  used  instead
4879         instead of  a  lookbehind  assertion  to  get  round  the  fixed-length         of a lookbehind assertion to get round the fixed-length restriction.
        restriction.  
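
As a rough illustration of the lookbehind rules, here is a sketch that again assumes Python's re module as a stand-in. Python is stricter than PCRE on this point: it has historically required every alternative in a lookbehind to have the same length, so the (?<=bullock|donkey) form may be rejected there. The workaround shown, one fixed-length lookbehind per alternative, is an illustration only; the trailing "s" and the subject strings are invented for the example.

```python
import re

# (?<!foo)bar : "bar" that is not preceded by "foo".
bar_not_after_foo = [m.start()
                     for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]

# PCRE accepts top-level alternatives of different fixed lengths. Python's
# re may raise re.error for the same pattern, so the outcome is recorded
# rather than assumed.
try:
    re.compile(r"(?<=bullock|donkey)s")
    mixed_lengths_accepted = True
except re.error:
    mixed_lengths_accepted = False

# Portable rewrite: each alternative gets its own fixed-length lookbehind.
per_branch = re.compile(r"(?:(?<=bullock)|(?<=donkey))s")
hits = [m.start() for m in per_branch.finditer("bullocks donkeys cats")]

print(bar_not_after_foo, mixed_lengths_accepted, hits)
```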
4880    
4881         The  implementation  of lookbehind assertions is, for each alternative,         The  implementation  of lookbehind assertions is, for each alternative,
4882         to temporarily move the current position back by the fixed  length  and         to temporarily move the current position back by the fixed  length  and
# Line 4862  CONDITIONAL SUBPATTERNS Line 4966  CONDITIONAL SUBPATTERNS
4966    
4967         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If  the  condition is satisfied, the yes-pattern is used; otherwise the
4968         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
4969         tives in the subpattern, a compile-time error occurs.         tives  in  the subpattern, a compile-time error occurs. Each of the two
4970           alternatives may itself contain nested subpatterns of any form, includ-
4971           ing  conditional  subpatterns;  the  restriction  to  two  alternatives
4972           applies only at the level of the condition. This pattern fragment is an
4973           example where the alternatives are complex:
4974    
4975             (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
4976    
4977    
4978         There  are  four  kinds of condition: references to subpatterns, refer-         There  are  four  kinds of condition: references to subpatterns, refer-
4979         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
# Line 4873  CONDITIONAL SUBPATTERNS Line 4984  CONDITIONAL SUBPATTERNS
4984         the condition is true if a capturing subpattern of that number has pre-         the condition is true if a capturing subpattern of that number has pre-
4985         viously matched. If there is more than one  capturing  subpattern  with         viously matched. If there is more than one  capturing  subpattern  with
4986         the  same  number  (see  the earlier section about duplicate subpattern         the  same  number  (see  the earlier section about duplicate subpattern
4987         numbers), the condition is true if any of them have been set. An alter-         numbers), the condition is true if any of them have matched. An  alter-
4988         native  notation is to precede the digits with a plus or minus sign. In         native  notation is to precede the digits with a plus or minus sign. In
4989         this case, the subpattern number is relative rather than absolute.  The         this case, the subpattern number is relative rather than absolute.  The
4990         most  recently opened parentheses can be referenced by (?(-1), the next         most  recently opened parentheses can be referenced by (?(-1), the next
4991         most recent by (?(-2), and so on. In looping  constructs  it  can  also         most recent by (?(-2), and so on. Inside loops it can also  make  sense
4992         make  sense  to  refer  to  subsequent  groups  with constructs such as         to refer to subsequent groups. The next parentheses to be opened can be
4993         (?(+2).         referenced as (?(+1), and so on. (The value zero in any of these  forms
4994           is not used; it provokes a compile-time error.)
4995    
4996         Consider the following pattern, which  contains  non-significant  white         Consider  the  following  pattern, which contains non-significant white
4997         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
4998         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
4999    
5000           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
5001    
5002         The first part matches an optional opening  parenthesis,  and  if  that         The  first  part  matches  an optional opening parenthesis, and if that
5003         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
5004         ond part matches one or more characters that are not  parentheses.  The         ond  part  matches one or more characters that are not parentheses. The
5005         third part is a conditional subpattern that tests whether the first set         third part is a conditional subpattern that tests whether  or  not  the
5006         of parentheses matched or not. If they did, that is, if subject started         first  set  of  parentheses  matched.  If they did, that is, if subject
5007         with an opening parenthesis, the condition is true, and so the yes-pat-         started with an opening parenthesis, the condition is true, and so  the
5008         tern is executed and a  closing  parenthesis  is  required.  Otherwise,         yes-pattern  is  executed and a closing parenthesis is required. Other-
5009         since  no-pattern  is  not  present, the subpattern matches nothing. In         wise, since no-pattern is not present, the subpattern matches  nothing.
5010         other words,  this  pattern  matches  a  sequence  of  non-parentheses,         In  other  words,  this  pattern matches a sequence of non-parentheses,
5011         optionally enclosed in parentheses.         optionally enclosed in parentheses.
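
The behaviour of this conditional pattern can be checked with a short sketch. It assumes Python's re module, which supports the same (?(1)...) condition, and uses re.VERBOSE in the role that PCRE_EXTENDED plays in the text. The helper name is_whole_match and the test strings are hypothetical.

```python
import re

# re.VERBOSE ignores the white space in the pattern, as PCRE_EXTENDED does.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def is_whole_match(text):
    """True if the entire text is matched by the pattern."""
    return optional_parens.fullmatch(text) is not None


# A closing parenthesis is required only when the opening one was captured.
results = {s: is_whole_match(s) for s in ["abc", "(abc)", "(abc", "abc)"]}
print(results)
```

Only the balanced and the bare forms match as a whole; the two half-parenthesized strings do not.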
5012    
5013         If  you  were  embedding  this pattern in a larger one, you could use a         If you were embedding this pattern in a larger one,  you  could  use  a
5014         relative reference:         relative reference:
5015    
5016           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
5017    
5018         This makes the fragment independent of the parentheses  in  the  larger         This  makes  the  fragment independent of the parentheses in the larger
5019         pattern.         pattern.
5020    
5021     Checking for a used subpattern by name     Checking for a used subpattern by name
5022    
5023         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
5024         used subpattern by name. For compatibility  with  earlier  versions  of         used  subpattern  by  name.  For compatibility with earlier versions of
5025         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
5026         also recognized. However, there is a possible ambiguity with this  syn-         also  recognized. However, there is a possible ambiguity with this syn-
5027         tax,  because  subpattern  names  may  consist entirely of digits. PCRE         tax, because subpattern names may  consist  entirely  of  digits.  PCRE
5028         looks first for a named subpattern; if it cannot find one and the  name         looks  first for a named subpattern; if it cannot find one and the name
5029         consists  entirely  of digits, PCRE looks for a subpattern of that num-         consists entirely of digits, PCRE looks for a subpattern of  that  num-
5030         ber, which must be greater than zero. Using subpattern names that  con-         ber,  which must be greater than zero. Using subpattern names that con-
5031         sist entirely of digits is not recommended.         sist entirely of digits is not recommended.
5032    
5033         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
5034    
5035           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
5036    
5037         If  the  name used in a condition of this kind is a duplicate, the test         If the name used in a condition of this kind is a duplicate,  the  test
5038         is applied to all subpatterns of the same name, and is true if any  one         is  applied to all subpatterns of the same name, and is true if any one
5039         of them has matched.         of them has matched.
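
The named form can be sketched in the same way, again assuming Python's re module as a stand-in. Python spells the named group (?P<OPEN>...) and the condition (?(OPEN)...), which corresponds to the older (?(name)...) syntax mentioned above rather than to the (?(<name>)...) form.

```python
import re

# Named variant of the optional-parentheses pattern, in Python's spelling.
named = re.compile(r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE)

with_parens = named.fullmatch("(abc)")
without_parens = named.fullmatch("abc")
unbalanced = named.fullmatch("(abc")

# The named group is set only when the opening parenthesis was present.
print(with_parens.group("OPEN"), without_parens.group("OPEN"), unbalanced)
```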
5040    
5041     Checking for pattern recursion     Checking for pattern recursion
5042    
5043         If the condition is the string (R), and there is no subpattern with the         If the condition is the string (R), and there is no subpattern with the
5044         name R, the condition is true if a recursive call to the whole  pattern         name  R, the condition is true if a recursive call to the whole pattern
5045         or any subpattern has been made. If digits or a name preceded by amper-         or any subpattern has been made. If digits or a name preceded by amper-
5046         sand follow the letter R, for example:         sand follow the letter R, for example:
5047    
# Line 4937  CONDITIONAL SUBPATTERNS Line 5049  CONDITIONAL SUBPATTERNS
5049    
5050         the condition is true if the most recent recursion is into a subpattern         the condition is true if the most recent recursion is into a subpattern
5051         whose number or name is given. This condition does not check the entire         whose number or name is given. This condition does not check the entire
5052         recursion stack. If the name used in a condition  of  this  kind  is  a         recursion  stack.  If  the  name  used in a condition of this kind is a
5053         duplicate, the test is applied to all subpatterns of the same name, and         duplicate, the test is applied to all subpatterns of the same name, and
5054         is true if any one of them is the most recent recursion.         is true if any one of them is the most recent recursion.
5055    
5056         At "top level", all these recursion test  conditions  are  false.   The         At  "top  level",  all  these recursion test conditions are false.  The
5057         syntax for recursive patterns is described below.         syntax for recursive patterns is described below.
5058    
5059     Defining subpatterns for use by reference only     Defining subpatterns for use by reference only
5060    
5061         If  the  condition  is  the string (DEFINE), and there is no subpattern         If the condition is the string (DEFINE), and  there  is  no  subpattern
5062         with the name DEFINE, the condition is  always  false.  In  this  case,         with  the  name  DEFINE,  the  condition is always false. In this case,
5063         there  may  be  only  one  alternative  in the subpattern. It is always         there may be only one alternative  in  the  subpattern.  It  is  always
5064         skipped if control reaches this point  in  the  pattern;  the  idea  of         skipped  if  control  reaches  this  point  in the pattern; the idea of
5065         DEFINE  is that it can be used to define "subroutines" that can be ref-         DEFINE is that it can be used to define "subroutines" that can be  ref-
5066         erenced from elsewhere. (The use of "subroutines" is described  below.)         erenced  from elsewhere. (The use of "subroutines" is described below.)
5067         For  example,  a pattern to match an IPv4 address could be written like         For  example,  a  pattern  to   match   an   IPv4   address   such   as
5068         this (ignore whitespace and line breaks):         "192.168.23.245" could be written like this (ignore whitespace and line
5069           breaks):
5070    
5071           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
5072           \b (?&byte) (\.(?&byte)){3} \b           \b (?&byte) (\.(?&byte)){3} \b
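
For engines without (?(DEFINE)...), the same reuse can be approximated by building the pattern from a shared string. The sketch below assumes Python's re module; the names BYTE and ipv4 are hypothetical, and textual substitution is only an emulation of the "subroutine" idea, not what PCRE does. The repeated group is written as non-capturing here, unlike the capturing group in the pattern above.

```python
import re

# One definition of a decimal byte (0-255), reused by substitution.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# \b byte (\. byte){3} \b, with re.ASCII so that \d means 0-9 only.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)

samples = ["192.168.23.245", "256.1.1.1", "1.2.3"]
accepted = [s for s in samples if ipv4.fullmatch(s)]
print(accepted)
```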
# Line 4987  CONDITIONAL SUBPATTERNS Line 5100  CONDITIONAL SUBPATTERNS
5100    
5101  COMMENTS  COMMENTS
5102    
5103         The  sequence (?# marks the start of a comment that continues up to the         There are two ways of including comments in patterns that are processed
5104         next closing parenthesis. Nested parentheses  are  not  permitted.  The         by PCRE. In both cases, the start of the comment must not be in a char-
5105         characters  that make up a comment play no part in the pattern matching         acter class, nor in the middle of any other sequence of related charac-
5106         at all.         ters such as (?: or a subpattern name or number.  The  characters  that
5107           make up a comment play no part in the pattern matching.
5108    
5109         If the PCRE_EXTENDED option is set, an unescaped # character outside  a         The  sequence (?# marks the start of a comment that continues up to the
5110         character  class  introduces  a  comment  that continues to immediately         next closing parenthesis. Nested parentheses are not permitted. If  the
5111         after the next newline in the pattern.         PCRE_EXTENDED option is set, an unescaped # character also introduces a
5112           comment, which in this case continues to  immediately  after  the  next
5113           newline  character  or character sequence in the pattern. Which charac-
5114           ters are interpreted as newlines is controlled by the options passed to
5115           pcre_compile() or by a special sequence at the start of the pattern, as
5116           described in the section entitled  "Newline  conventions"  above.  Note
5117           that  the  end of this type of comment is a literal newline sequence in
5118           the pattern; escape sequences that happen to represent a newline do not
5119           count.  For  example,  consider this pattern when PCRE_EXTENDED is set,
5120           and the default newline convention is in force:
5121    
5122             abc #comment \n still comment
5123    
5124           On encountering the # character, pcre_compile()  skips  along,  looking
5125           for  a newline in the pattern. The sequence \n is still literal at this
5126           stage, so it does not terminate the comment. Only an  actual  character
5127           with the code value 0x0a (the default newline) does so.
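
The distinction between an escape sequence and a literal newline inside a # comment can be seen in a small sketch. It assumes Python's re module, whose re.VERBOSE comments behave analogously on this point; it says nothing about PCRE's configurable newline conventions. The variable names are illustrative.

```python
import re

# Raw string: the pattern text contains a backslash followed by "n", not a
# newline character, so the comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real newline (0x0a) in the pattern text, so the
# comment ends there and "def" is part of the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# The (?# ... ) form needs no option and ends at the closing parenthesis.
inline = re.compile(r"abc(?# a comment )def")

print(bool(escaped.fullmatch("abc")),
      bool(literal.fullmatch("abcdef")),
      bool(inline.fullmatch("abcdef")))
```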
5128    
5129    
5130  RECURSIVE PATTERNS  RECURSIVE PATTERNS
5131    
5132         Consider the problem of matching a string in parentheses, allowing  for         Consider  the problem of matching a string in parentheses, allowing for
5133         unlimited  nested  parentheses.  Without the use of recursion, the best         unlimited nested parentheses. Without the use of  recursion,  the  best
5134         that can be done is to use a pattern that  matches  up  to  some  fixed         that  can  be  done  is  to use a pattern that matches up to some fixed
5135         depth  of  nesting.  It  is not possible to handle an arbitrary nesting         depth of nesting. It is not possible to  handle  an  arbitrary  nesting
5136         depth.         depth.
5137    
5138         For some time, Perl has provided a facility that allows regular expres-         For some time, Perl has provided a facility that allows regular expres-
5139         sions  to recurse (amongst other things). It does this by interpolating         sions to recurse (amongst other things). It does this by  interpolating
5140         Perl code in the expression at run time, and the code can refer to  the         Perl  code in the expression at run time, and the code can refer to the
5141         expression itself. A Perl pattern using code interpolation to solve the         expression itself. A Perl pattern using code interpolation to solve the
5142         parentheses problem can be created like this:         parentheses problem can be created like this:
5143    
# Line 5017  RECURSIVE PATTERNS Line 5147  RECURSIVE PATTERNS
5147         refers recursively to the pattern in which it appears.         refers recursively to the pattern in which it appears.
5148    
5149         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
5150         it supports special syntax for recursion of  the  entire  pattern,  and         it  supports  special  syntax  for recursion of the entire pattern, and
5151         also  for  individual  subpattern  recursion. After its introduction in         also for individual subpattern recursion.  After  its  introduction  in
5152         PCRE and Python, this kind of  recursion  was  subsequently  introduced         PCRE  and  Python,  this  kind of recursion was subsequently introduced
5153         into Perl at release 5.10.         into Perl at release 5.10.
5154    
5155         A  special  item  that consists of (? followed by a number greater than         A special item that consists of (? followed by a  number  greater  than
5156         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
5157         the  given  number, provided that it occurs inside that subpattern. (If         the given number, provided that it occurs inside that  subpattern.  (If
5158         not, it is a "subroutine" call, which is described  in  the  next  sec-         not,  it  is  a  "subroutine" call, which is described in the next sec-
5159         tion.)  The special item (?R) or (?0) is a recursive call of the entire         tion.) The special item (?R) or (?0) is a recursive call of the  entire
5160         regular expression.         regular expression.
5161    
5162         This PCRE pattern solves the nested  parentheses  problem  (assume  the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
5163         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
5164    
5165           \( ( [^()]++ | (?R) )* \)           \( ( [^()]++ | (?R) )* \)
5166    
5167         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
5168         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
5169         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
5170         sized substring).  Finally there is a closing parenthesis. Note the use         sized substring).  Finally there is a closing parenthesis. Note the use
5171         of a possessive quantifier to avoid backtracking into sequences of non-         of a possessive quantifier to avoid backtracking into sequences of non-
5172         parentheses.         parentheses.
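
To make the structure of the recursive pattern concrete, here is a hand-written Python function that mirrors it step by step. This is only an illustrative emulation under the assumption that the match is anchored at a given offset; it is not how PCRE executes (?R), and the name match_nested is hypothetical. Each branch is labelled with the pattern item it stands for.

```python
def match_nested(subject, pos=0):
    """Return the end offset of a balanced group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1                                # \(
    while pos < len(subject):
        if subject[pos] == ")":             # \)
            return pos + 1
        if subject[pos] == "(":             # (?R): match a nested group
            end = match_nested(subject, pos)
            if end is None:
                return None
            pos = end
        else:                               # [^()]++ : run of non-parentheses
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None                             # ran out of input before \)


print(match_nested("(ab(cd)ef)"), match_nested("(aaa()"))
```

Because the run of non-parentheses is consumed in one step and never given back, the function fails quickly on unbalanced input, which is the effect the possessive quantifier has in the pattern.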
5173    
5174         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
5175         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
5176    
5177           ( \( ( [^()]++ | (?1) )* \) )           ( \( ( [^()]++ | (?1) )* \) )
5178    
5179         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
5180         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
5181    
5182         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
5183         tricky.  This  is made easier by the use of relative references (a Perl         tricky. This is made easier by the use of relative references.  Instead
5184         5.10 feature).  Instead of (?1) in the  pattern  above  you  can  write         of (?1) in the pattern above you can write (?-2) to refer to the second
5185         (?-2) to refer to the second most recently opened parentheses preceding         most recently opened parentheses  preceding  the  recursion.  In  other
5186         the recursion. In other  words,  a  negative  number  counts  capturing         words,  a  negative  number counts capturing parentheses leftwards from
5187         parentheses leftwards from the point at which it is encountered.         the point at which it is encountered.
5188    
5189         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
5190         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
5191         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
5192         enced. They are always "subroutine" calls, as  described  in  the  next         enced.  They  are  always  "subroutine" calls, as described in the next
5193         section.         section.
5194    
5195         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
5196         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
5197         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
5198    
5199           (?<pn> \( ( [^()]++ | (?&pn) )* \) )           (?<pn> \( ( [^()]++ | (?&pn) )* \) )
5200    
5201         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
5202         one is used.         one is used.
5203    
5204         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
5205         nested unlimited repeats, and so the use of a possessive quantifier for         nested unlimited repeats, and so the use of a possessive quantifier for
5206         matching strings of non-parentheses is important when applying the pat-         matching strings of non-parentheses is important when applying the pat-
5207         tern  to  strings  that do not match. For example, when this pattern is         tern to strings that do not match. For example, when  this  pattern  is
5208         applied to         applied to
5209    
5210           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
5211    
5212         it yields "no match" quickly. However, if a  possessive  quantifier  is         it  yields  "no  match" quickly. However, if a possessive quantifier is
5213         not  used, the match runs for a very long time indeed because there are         not used, the match runs for a very long time indeed because there  are
5214         so many different ways the + and * repeats can carve  up  the  subject,         so  many  different  ways the + and * repeats can carve up the subject,
5215         and all have to be tested before failure can be reported.         and all have to be tested before failure can be reported.
5216    
5217         At  the  end  of a match, the values of capturing parentheses are those         At the end of a match, the values of capturing  parentheses  are  those
5218         from the outermost level. If you want to obtain intermediate values,  a         from  the outermost level. If you want to obtain intermediate values, a
5219         callout  function can be used (see below and the pcrecallout documenta-         callout function can be used (see below and the pcrecallout  documenta-
5220         tion). If the pattern above is matched against         tion). If the pattern above is matched against
5221    
5222           (ab(cd)ef)           (ab(cd)ef)
5223    
5224         the value for the inner capturing parentheses  (numbered  2)  is  "ef",         the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
5225         which  is the last value taken on at the top level. If a capturing sub-         which is the last value taken on at the top level. If a capturing  sub-
5226         pattern is not matched at the top level, its final value is unset, even         pattern is not matched at the top level, its final value is unset, even
5227         if it is (temporarily) set at a deeper level.         if it is (temporarily) set at a deeper level.
5228    
5229         If  there are more than 15 capturing parentheses in a pattern, PCRE has         If there are more than 15 capturing parentheses in a pattern, PCRE  has
5230         to obtain extra memory to store data during a recursion, which it  does         to  obtain extra memory to store data during a recursion, which it does
5231         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
5232         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
5233    
5234         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do  not  confuse  the (?R) item with the condition (R), which tests for
5235         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.  Consider this pattern, which matches text in  angle  brack-
5236         ets, allowing for arbitrary nesting. Only digits are allowed in  nested         ets,  allowing for arbitrary nesting. Only digits are allowed in nested
5237         brackets  (that is, when recursing), whereas any characters are permit-         brackets (that is, when recursing), whereas any characters are  permit-
5238         ted at the outer level.         ted at the outer level.
5239    
5240           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
5241    
5242         In this pattern, (?(R) is the start of a conditional  subpattern,  with         In  this  pattern, (?(R) is the start of a conditional subpattern, with
5243         two  different  alternatives for the recursive and non-recursive cases.         two different alternatives for the recursive and  non-recursive  cases.
5244         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
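
       The behaviour described above can be approximated outside PCRE with a
       small hand-written matcher. What follows is only a sketch in Python,
       not PCRE's implementation: the function name match_brackets and its
       "nested" flag are invented for illustration, with the flag standing in
       for the (R) condition. It is anchored at the given index instead of
       searching through the subject.

```python
DIGITS = "0123456789"

def match_brackets(s, i=0, nested=False):
    """Return the index just past the bracket group starting at s[i], or -1.

    nested=False models the outer level, where any character other than
    '<' and '>' is allowed; nested=True models a recursive call, where
    only digits are allowed between the brackets.
    """
    if i >= len(s) or s[i] != "<":
        return -1
    i += 1
    while i < len(s) and s[i] != ">":
        if s[i] == "<":
            # Plays the role of the (?R) recursive call.
            i = match_brackets(s, i, nested=True)
            if i < 0:
                return -1
        elif nested and s[i] not in DIGITS:
            # Inside a recursion only digits are acceptable.
            return -1
        else:
            i += 1
    # A closing '>' must be present at s[i].
    return i + 1 if i < len(s) else -1
```

       With this sketch, "<abc<123>def>" is accepted in full, whereas
       "<abc<12a>def>" is rejected because of the letter inside the nested
       brackets.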
5245    
5246     Recursion difference from Perl     Recursion difference from Perl
5247    
5248         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
5249         always treated as an atomic group. That is, once it has matched some of         always treated as an atomic group. That is, once it has matched some of
5250         the subject string, it is never re-entered, even if it contains untried         the subject string, it is never re-entered, even if it contains untried
5251         alternatives  and  there  is a subsequent matching failure. This can be         alternatives and there is a subsequent matching failure.  This  can  be
5252         illustrated by the following pattern, which purports to match a  palin-         illustrated  by the following pattern, which purports to match a palin-
5253         dromic  string  that contains an odd number of characters (for example,         dromic string that contains an odd number of characters  (for  example,
5254         "a", "aba", "abcba", "abcdcba"):         "a", "aba", "abcba", "abcdcba"):
5255    
5256           ^(.|(.)(?1)\2)$           ^(.|(.)(?1)\2)$
5257    
5258         The idea is that it either matches a single character, or two identical         The idea is that it either matches a single character, or two identical
5259         characters  surrounding  a sub-palindrome. In Perl, this pattern works;         characters surrounding a sub-palindrome. In Perl, this  pattern  works;
5260         in PCRE it does not if the subject is  longer  than  three  characters.         in  PCRE  it  does  not if the subject is longer than three characters.
5261         Consider the subject string "abcba":         Consider the subject string "abcba":
5262    
5263         At  the  top level, the first character is matched, but as it is not at         At the top level, the first character is matched, but as it is  not  at
5264         the end of the string, the first alternative fails; the second alterna-         the end of the string, the first alternative fails; the second alterna-
5265         tive is taken and the recursion kicks in. The recursive call to subpat-         tive is taken and the recursion kicks in. The recursive call to subpat-
5266         tern 1 successfully matches the next character ("b").  (Note  that  the         tern  1  successfully  matches the next character ("b"). (Note that the
5267         beginning and end of line tests are not part of the recursion).         beginning and end of line tests are not part of the recursion).
5268    
5269         Back  at  the top level, the next character ("c") is compared with what         Back at the top level, the next character ("c") is compared  with  what
5270         subpattern 2 matched, which was "a". This fails. Because the  recursion         subpattern  2 matched, which was "a". This fails. Because the recursion
5271         is  treated  as  an atomic group, there are now no backtracking points,         is treated as an atomic group, there are now  no  backtracking  points,
5272         and so the entire match fails. (Perl is able, at  this  point,  to  re-         and  so  the  entire  match fails. (Perl is able, at this point, to re-
5273         enter  the  recursion  and try the second alternative.) However, if the         enter the recursion and try the second alternative.)  However,  if  the
5274         pattern is written with the alternatives in the other order, things are         pattern is written with the alternatives in the other order, things are
5275         different:         different:
5276    
5277           ^((.)(?1)\2|.)$           ^((.)(?1)\2|.)$
5278    
5279         This  time,  the recursing alternative is tried first, and continues to         This time, the recursing alternative is tried first, and  continues  to
5280         recurse until it runs out of characters, at which point  the  recursion         recurse  until  it runs out of characters, at which point the recursion
5281         fails.  But  this  time  we  do  have another alternative to try at the         fails. But this time we do have  another  alternative  to  try  at  the
5282         higher level. That is the big difference:  in  the  previous  case  the         higher  level.  That  is  the  big difference: in the previous case the
5283         remaining alternative is at a deeper recursion level, which PCRE cannot         remaining alternative is at a deeper recursion level, which PCRE cannot
5284         use.         use.
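
       The difference between the two orderings can be explored with a toy
       model of the matcher. The Python sketch below is an illustration under
       stated assumptions, not PCRE or Perl code: group1, matches, and the
       "atomic" and "recurse_first" parameters are names invented here. Each
       generator yields the end positions that subpattern 1 can reach; with
       atomic=True a recursive call hands back only its first result, which
       models PCRE, and with atomic=False every result remains available,
       which models Perl.

```python
def group1(s, i, atomic, recurse_first):
    """Yield end positions reachable by subpattern 1 when started at s[i]."""

    def single():
        # Alternative "." : consume one character.
        if i < len(s):
            yield i + 1

    def wrapped():
        # Alternative "(.)(?1)\2" : a character, a recursive call, then
        # the same character again.
        if i < len(s):
            first = s[i]
            for j in group1(s, i + 1, atomic, recurse_first):
                if j < len(s) and s[j] == first:
                    yield j + 1
                if atomic:
                    # An atomic call is never re-entered for other results.
                    break

    alternatives = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in alternatives:
        yield from alternative()


def matches(s, atomic, recurse_first=False):
    """Model ^(...)$ : some top-level result must reach the end of s."""
    return any(end == len(s) for end in group1(s, 0, atomic, recurse_first))


print(matches("abcba", atomic=True))                      # False
print(matches("abcba", atomic=False))                     # True
print(matches("abcba", atomic=True, recurse_first=True))  # True
```

       Only the recursive calls are cut short in the model; the top-level
       group can still move on to its other alternative, which is why the
       reordered pattern succeeds on "abcba".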
5285    
5286         To change the pattern so that matches all palindromic strings, not just         To  change  the pattern so that it matches all palindromic strings, not
5287         those  with  an  odd number of characters, it is tempting to change the         just those with an odd number of characters, it is tempting  to  change
5288         pattern to this:         the pattern to this:
5289    
5290           ^((.)(?1)\2|.?)$           ^((.)(?1)\2|.?)$
5291    
5292         Again, this works in Perl, but not in PCRE, and for  the  same  reason.         Again,  this  works  in Perl, but not in PCRE, and for the same reason.
5293         When  a  deeper  recursion has matched a single character, it cannot be         When a deeper recursion has matched a single character,  it  cannot  be
5294         entered again in order to match an empty string.  The  solution  is  to         entered  again  in  order  to match an empty string. The solution is to
5295         separate  the two cases, and write out the odd and even cases as alter-         separate the two cases, and write out the odd and even cases as  alter-
5296         natives at the higher level:         natives at the higher level:
5297    
5298           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
5299    
5300         If you want to match typical palindromic phrases, the  pattern  has  to         If  you  want  to match typical palindromic phrases, the pattern has to
5301         ignore all non-word characters, which can be done like this:         ignore all non-word characters, which can be done like this:
5302    
5303           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
5304    
5305         If run with the PCRE_CASELESS option, this pattern matches phrases such         If run with the PCRE_CASELESS option, this pattern matches phrases such
5306         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
5307         Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-         Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
5308         ing into sequences of non-word characters. Without this, PCRE  takes  a         ing  into  sequences of non-word characters. Without this, PCRE takes a
5309         great  deal  longer  (ten  times or more) to match typical phrases, and         great deal longer (ten times or more) to  match  typical  phrases,  and
5310         Perl takes so long that you think it has gone into a loop.         Perl takes so long that you think it has gone into a loop.
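
       As a point of comparison, the set of phrases that the pattern above is
       intended to accept can be described without a regular expression
       engine. The Python sketch below is an assumption-laden illustration,
       not an equivalent of the pattern: the function name
       is_palindromic_phrase is invented here, re.ASCII is used so that \W
       means "not a letter, digit, or underscore", and lower-casing stands in
       for PCRE_CASELESS.

```python
import re

def is_palindromic_phrase(text):
    """True if text reads the same both ways once non-word characters
    are dropped and case is ignored."""
    # Remove every non-word character, then fold case.
    letters = re.sub(r"\W", "", text, flags=re.ASCII).lower()
    return letters == letters[::-1]

print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))  # True
```

       A direct check such as this does no backtracking at all, so it can
       serve as a reference when testing the pattern against sample phrases.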
5311    
5312         WARNING: The palindrome-matching patterns above work only if  the  sub-         WARNING:  The  palindrome-matching patterns above work only if the sub-
5313         ject  string  does not start with a palindrome that is shorter than the         ject string does not start with a palindrome that is shorter  than  the
5314         entire string.  For example, although "abcba" is correctly matched,  if         entire  string.  For example, although "abcba" is correctly matched, if
5315         the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,         the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
5316         then fails at top level because the end of the string does not  follow.         then  fails at top level because the end of the string does not follow.
5317         Once  again, it cannot jump back into the recursion to try other alter-         Once again, it cannot jump back into the recursion to try other  alter-
5318         natives, so the entire match fails.         natives, so the entire match fails.
5319    
5320    
5321  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
5322    
5323         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
5324         by  name)  is used outside the parentheses to which it refers, it oper-         by name) is used outside the parentheses to which it refers,  it  oper-
5325         ates like a subroutine in a programming language. The "called"  subpat-         ates  like a subroutine in a programming language. The "called" subpat-
5326         tern may be defined before or after the reference. A numbered reference         tern may be defined before or after the reference. A numbered reference
5327         can be absolute or relative, as in these examples:         can be absolute or relative, as in these examples:
5328    
# Line 5204  SUBPATTERNS AS SUBROUTINES Line 5334  SUBPATTERNS AS SUBROUTINES
5334    
5335           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
5336    
5337         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
5338         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
5339    
5340           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
5341    
5342         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
5343         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
5344         above.         above.
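
       The distinction between a back reference and a subroutine call can be
       imitated in an engine that lacks the (?1) syntax. The sketch below
       uses Python's re module, which has back references but no subroutine
       calls; as an assumption made purely for illustration, the call is
       emulated by repeating the subpattern's source text in a non-capturing
       group. The names stem, backref and call_like are invented here.

```python
import re

stem = "sens|respons"

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(rf"({stem})e and \1ibility")

# Emulated subroutine call: the subpattern is matched afresh, so it may
# choose a different alternative from the one group 1 used.
call_like = re.compile(rf"({stem})e and (?:{stem})ibility")

for subject in ("sense and sensibility", "sense and responsibility"):
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(call_like.fullmatch(subject)))
```

       The first subject is accepted by both patterns; the second is accepted
       only by the emulated call. Repeating the source text does not
       reproduce the atomic behaviour of a real subroutine call.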
5345    
5346         Like  recursive  subpatterns, a subroutine call is always treated as an         Like recursive subpatterns, a subroutine call is always treated  as  an
5347         atomic group. That is, once it has matched some of the subject  string,         atomic  group. That is, once it has matched some of the subject string,
5348         it  is  never  re-entered, even if it contains untried alternatives and         it is never re-entered, even if it contains  untried  alternatives  and
5349         there is a subsequent matching failure. Any capturing parentheses  that         there  is a subsequent matching failure. Any capturing parentheses that
5350         are  set  during  the  subroutine  call revert to their previous values         are set during the subroutine call  revert  to  their  previous  values
5351         afterwards.         afterwards.
5352    
5353         When a subpattern is used as a subroutine, processing options  such  as         When  a  subpattern is used as a subroutine, processing options such as
5354         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
5355         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
5356    
5357           (abc)(?i:(?-1))           (abc)(?i:(?-1))
5358    
5359         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
5360         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
5361    
5362    
5363  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
5364    
5365         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
5366         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
5367         an  alternative  syntax  for  referencing a subpattern as a subroutine,         an alternative syntax for referencing a  subpattern  as  a  subroutine,
5368         possibly recursively. Here are two of the examples used above,  rewrit-         possibly  recursively. Here are two of the examples used above, rewrit-
5369         ten using this syntax:         ten using this syntax:
5370    
5371           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
5372           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
5373    
5374         PCRE  supports  an extension to Oniguruma: if a number is preceded by a         PCRE supports an extension to Oniguruma: if a number is preceded  by  a
5375         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
5376    
5377           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
5378    
5379         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
5380         synonymous.  The former is a back reference; the latter is a subroutine         synonymous. The former is a back reference; the latter is a  subroutine
5381         call.         call.
5382    
5383    
5384  CALLOUTS  CALLOUTS
5385    
5386         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
5387         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
5388         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
5389         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
5390         tion.         tion.
5391    
5392         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
5393         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
5394         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
5395         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
5396         all calling out.         all calling out.
5397    
5398         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
5399         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
5400         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
5401         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
5402         points:         points:
5403    
5404           (?C1)abc(?C2)def           (?C1)abc(?C2)def
5405    
5406         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
5407         automatically  installed  before each item in the pattern. They are all         automatically installed before each item in the pattern. They  are  all
5408         numbered 255.         numbered 255.
5409    
5410         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
5411         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
5412         of the callout, the position in the pattern, and, optionally, one  item         of  the callout, the position in the pattern, and, optionally, one item
5413         of  data  originally supplied by the caller of pcre_exec(). The callout         of data originally supplied by the caller of pcre_exec().  The  callout
5414         function may cause matching to proceed, to backtrack, or to fail  alto-         function  may cause matching to proceed, to backtrack, or to fail alto-
5415         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
5416         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
5417    
5418    
5419  BACKTRACKING CONTROL  BACKTRACKING CONTROL
5420    
5421         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
5422         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
5423         ject to change or removal in a future version of Perl". It goes  on  to         ject  to  change or removal in a future version of Perl". It goes on to
5424         say:  "Their usage in production code should be noted to avoid problems         say: "Their usage in production code should be noted to avoid  problems
5425         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
5426         in this section.         in this section.
5427    
5428         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
5429         them can be  used  only  when  the  pattern  is  to  be  matched  using         them  can  be  used  only  when  the  pattern  is  to  be matched using
5430         pcre_exec(), which uses a backtracking algorithm. With the exception of         pcre_exec(), which uses a backtracking algorithm. With the exception of
5431         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5432         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5433    
5434         If any of these verbs are used in an assertion or subroutine subpattern         If any of these verbs are used in an assertion or subroutine subpattern
5435         (including recursive subpatterns), their effect  is  confined  to  that         (including  recursive  subpatterns),  their  effect is confined to that
5436         subpattern;  it  does  not extend to the surrounding pattern. Note that         subpattern; it does not extend to the surrounding  pattern.  Note  that
5437         such subpatterns are processed as anchored at the point where they  are         such  subpatterns are processed as anchored at the point where they are
5438         tested.         tested.
5439    
5440         The  new verbs make use of what was previously invalid syntax: an open-         The new verbs make use of what was previously invalid syntax: an  open-
5441         ing parenthesis followed by an asterisk. They are generally of the form         ing parenthesis followed by an asterisk. They are generally of the form
5442         (*VERB)  or (*VERB:NAME). Some may take either form, with differing         (*VERB) or (*VERB:NAME). Some may take either form, with differing
5443         behaviour, depending on whether or not an argument is present. A name is         behaviour, depending on whether or not an argument is present. A name is
5444         a  sequence  of letters, digits, and underscores. If the name is empty,         a sequence of letters, digits, and underscores. If the name  is  empty,
5445         that is, if the closing parenthesis immediately follows the colon,  the         that  is, if the closing parenthesis immediately follows the colon, the
5446         effect is as if the colon were not there. Any number of these verbs may         effect is as if the colon were not there. Any number of these verbs may
5447         occur in a pattern.         occur in a pattern.
5448    
5449         PCRE contains some optimizations that are used to speed up matching  by         PCRE  contains some optimizations that are used to speed up matching by
5450         running some checks at the start of each match attempt. For example, it         running some checks at the start of each match attempt. For example, it
5451         may know the minimum length of matching subject, or that  a  particular         may  know  the minimum length of matching subject, or that a particular
5452         character  must  be present. When one of these optimizations suppresses         character must be present. When one of these  optimizations  suppresses
5453         the running of a match, any included backtracking verbs  will  not,  of         the  running  of  a match, any included backtracking verbs will not, of
5454         course, be processed. You can suppress the start-of-match optimizations         course, be processed. You can suppress the start-of-match optimizations
5455         by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_exec().         by  setting  the  PCRE_NO_START_OPTIMIZE  option when calling pcre_com-
5456           pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
5457    
5458     Verbs that act immediately     Verbs that act immediately
5459    
# Line 5511  BACKTRACKING CONTROL Line 5642  BACKTRACKING CONTROL
5642    
5643           (*THEN) or (*THEN:NAME)           (*THEN) or (*THEN:NAME)
5644    
5645         This verb causes a skip to the next alternation if the rest of the pat-         This  verb  causes  a  skip  to  the  next alternation in the innermost
5646         tern does not match. That is, it cancels pending backtracking, but only         enclosing group if the rest of the pattern does not match. That is,  it
5647         within  the  current  alternation.  Its name comes from the observation         cancels  pending backtracking, but only within the current alternation.
5648         that it can be used for a pattern-based if-then-else block:         Its name comes from the observation that it can be used for a  pattern-
5649           based if-then-else block:
5650    
5651           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
5652    
5653         If the COND1 pattern matches, FOO is tried (and possibly further  items         If  the COND1 pattern matches, FOO is tried (and possibly further items
5654         after  the  end  of  the group if FOO succeeds); on failure the matcher         after the end of the group if FOO succeeds);  on  failure  the  matcher
5655         skips to the second alternative and tries COND2,  without  backtracking         skips  to  the second alternative and tries COND2, without backtracking
5656         into  COND1.  The  behaviour  of  (*THEN:NAME)  is  exactly the same as         into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as
5657         (*MARK:NAME)(*THEN) if the overall  match  fails.  If  (*THEN)  is  not         (*MARK:NAME)(*THEN)  if  the  overall  match  fails.  If (*THEN) is not
5658         directly inside an alternation, it acts like (*PRUNE).         directly inside an alternation, it acts like (*PRUNE).
5659    
5660           The above verbs provide four different "strengths" of control when sub-
5661           sequent  matching  fails. (*THEN) is the weakest, carrying on the match
5662           at the next alternation. (*PRUNE) comes next, failing the match at  the
5663           current  starting position, but allowing an advance to the next charac-
5664           ter (for an unanchored pattern). (*SKIP) is similar,  except  that  the
5665           advance  may  be  more  than one character. (*COMMIT) is the strongest,
5666           causing the entire match to fail.
5667    
5668           If more than one is present in a pattern, the "strongest" one wins. For
5669           example,  consider  this  pattern, where A, B, etc. are complex pattern
5670           fragments:
5671    
5672             (A(*COMMIT)B(*THEN)C|D)
5673    
5674           Once A has matched, PCRE is committed to this  match,  at  the  current
5675           starting  position. If subsequently B matches, but C does not, the nor-
5676           mal (*THEN) action of trying the next alternation (that is, D) does not
5677           happen because (*COMMIT) overrides.
5678    
5679    
5680  SEE ALSO  SEE ALSO
5681    
# Line 5540  AUTHOR Line 5691  AUTHOR
5691    
5692  REVISION  REVISION
5693    
5694         Last updated: 18 May 2010         Last updated: 21 November 2010
5695         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
5696  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5697    
# Line 5568  QUOTING Line 5719  QUOTING
5719  CHARACTERS  CHARACTERS
5720    
5721           \a         alarm, that is, the BEL character (hex 07)           \a         alarm, that is, the BEL character (hex 07)
5722           \cx        "control-x", where x is any character           \cx        "control-x", where x is any ASCII character
5723           \e         escape (hex 1B)           \e         escape (hex 1B)
5724           \f         formfeed (hex 0C)           \f         formfeed (hex 0C)
5725           \n         newline (hex 0A)           \n         newline (hex 0A)
# Line 5787  OPTION SETTING Line 5938  OPTION SETTING
5938         The following are recognized only at the start of a  pattern  or  after         The following are recognized only at the start of a  pattern  or  after
5939         one of the newline-setting options with similar syntax:         one of the newline-setting options with similar syntax:
5940    
5941             (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
5942           (*UTF8)         set UTF-8 mode (PCRE_UTF8)           (*UTF8)         set UTF-8 mode (PCRE_UTF8)
5943           (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)           (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
5944