Diff of /code/trunk/doc/pcre.txt

revision 567 by ph10, Sat Nov 6 17:10:00 2010 UTC revision 572 by ph10, Wed Nov 17 17:55:57 2010 UTC
# Line 26  INTRODUCTION Line 26  INTRODUCTION
26         give better JavaScript compatibility.         give better JavaScript compatibility.
27    
28         The  current implementation of PCRE corresponds approximately with Perl         The  current implementation of PCRE corresponds approximately with Perl
29         5.10/5.11, including support for UTF-8 encoded strings and Unicode gen-         5.12, including support for UTF-8 encoded strings and  Unicode  general
30         eral  category properties. However, UTF-8 and Unicode support has to be         category  properties.  However,  UTF-8  and  Unicode  support has to be
31         explicitly enabled; it is not the default. The  Unicode  tables  corre-         explicitly enabled; it is not the default. The  Unicode  tables  corre-
32         spond to Unicode release 5.2.0.         spond to Unicode release 5.2.0.
33    
# Line 238  UTF-8 AND UNICODE PROPERTY SUPPORT Line 238  UTF-8 AND UNICODE PROPERTY SUPPORT
238         7.  Similarly,  characters that match the POSIX named character classes         7.  Similarly,  characters that match the POSIX named character classes
239         are all low-valued characters, unless the PCRE_UCP option is set.         are all low-valued characters, unless the PCRE_UCP option is set.
240    
241         8. However, the Perl 5.10 horizontal and vertical  whitespace  matching         8. However, the horizontal and  vertical  whitespace  matching  escapes
242         escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-         (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
243         acters, whether or not PCRE_UCP is set.         whether or not PCRE_UCP is set.
244    
245         9. Case-insensitive matching applies only to  characters  whose  values         9. Case-insensitive matching applies only to  characters  whose  values
246         are  less than 128, unless PCRE is built with Unicode property support.         are  less than 128, unless PCRE is built with Unicode property support.
247         Even when Unicode property support is available, PCRE  still  uses  its         Even when Unicode property support is available, PCRE  still  uses  its
248         own  character  tables when checking the case of low-valued characters,         own  character  tables when checking the case of low-valued characters,
249         so as not to degrade performance.  The Unicode property information  is         so as not to degrade performance.  The Unicode property information  is
250         used only for characters with higher values. Even when Unicode property         used only for characters with higher values. Furthermore, PCRE supports
251         support is available, PCRE supports case-insensitive matching only when         case-insensitive matching only  when  there  is  a  one-to-one  mapping
252         there  is  a  one-to-one  mapping between a letter's cases. There are a         between  a letter's cases. There are a small number of many-to-one map-
253         small number of many-to-one mappings in Unicode;  these  are  not  sup-         pings in Unicode; these are not supported by PCRE.
        ported by PCRE.  
254    
255    
256  AUTHOR  AUTHOR
# Line 260  AUTHOR Line 259  AUTHOR
259         University Computing Service         University Computing Service
260         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
261    
262         Putting  an actual email address here seems to have been a spam magnet,         Putting an actual email address here seems to have been a spam  magnet,
263         so I've taken it away. If you want to email me, use  my  two  initials,         so  I've  taken  it away. If you want to email me, use my two initials,
264         followed by the two digits 10, at the domain cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
265    
266    
267  REVISION  REVISION
268    
269         Last updated: 22 October 2010         Last updated: 13 November 2010
270         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
271  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
272    
# Line 697  THE ALTERNATIVE MATCHING ALGORITHM Line 696  THE ALTERNATIVE MATCHING ALGORITHM
696         represent the different matching possibilities (if there are none,  the         represent the different matching possibilities (if there are none,  the
697         match  has  failed).   Thus,  if there is more than one possible match,         match  has  failed).   Thus,  if there is more than one possible match,
698         this algorithm finds all of them, and in particular, it finds the long-         this algorithm finds all of them, and in particular, it finds the long-
699         est.  There  is  an  option to stop the algorithm after the first match         est.  The  matches are returned in decreasing order of length. There is
700         (which is necessarily the shortest) is found.         an option to stop the algorithm after the first match (which is  neces-
701           sarily the shortest) is found.
702    
703         Note that all the matches that are found start at the same point in the         Note that all the matches that are found start at the same point in the
704         subject. If the pattern         subject. If the pattern
705    
706           cat(er(pillar)?)           cat(er(pillar)?)?
707    
708         is  matched  against the string "the caterpillar catchment", the result         is matched against the string "the caterpillar catchment",  the  result
709         will be the three strings "cat", "cater", and "caterpillar" that  start         will  be the three strings "caterpillar", "cater", and "cat" that start
710         at the fourth character of the subject. The algorithm does not automat-         at the fifth character of the subject. The algorithm does not automati-
711         ically move on to find matches that start at later positions.         cally move on to find matches that start at later positions.
712    
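The "caterpillar" result above can be illustrated without calling PCRE at all. The following is only a hand-written sketch: it does not use pcre_dfa_exec(), the function name all_matches_at_first_start() is hypothetical, and the three candidate strings are simply the texts that cat(er(pillar)?)? can match, listed longest first. It reports every match at the first position where any match exists, in decreasing order of length, and does not move on to later start positions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The three strings that cat(er(pillar)?)? can match, longest first. */
static const char *const candidates[] = { "caterpillar", "cater", "cat" };

/* Hypothetical stand-in for the alternative algorithm's result on this one
   pattern. Returns the 0-based offset of the first position at which any
   candidate matches, or -1 if there is none. The lengths of all matches at
   that position are stored in lengths[], longest first, and *count is set
   to the number of matches. Later start positions are never examined. */
int all_matches_at_first_start(const char *subject, size_t lengths[3],
                               int *count)
{
    assert(subject != NULL);
    size_t n = strlen(subject);
    *count = 0;
    for (size_t start = 0; start < n; start++) {
        for (int i = 0; i < 3; i++) {
            size_t len = strlen(candidates[i]);
            /* strncmp stops at the subject's terminating NUL, so this
               never reads past the end of the string. */
            if (strncmp(subject + start, candidates[i], len) == 0)
                lengths[(*count)++] = len;
        }
        if (*count > 0)
            return (int)start;
    }
    return -1;
}
```

For the subject "the caterpillar catchment" this yields offset 4 (the fifth character) and lengths 11, 5, and 3. The later "cat" in "catchment" is not reported.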
713         There are a number of features of PCRE regular expressions that are not         There are a number of features of PCRE regular expressions that are not
714         supported by the alternative matching algorithm. They are as follows:         supported by the alternative matching algorithm. They are as follows:
715    
716         1.  Because  the  algorithm  finds  all possible matches, the greedy or         1. Because the algorithm finds all  possible  matches,  the  greedy  or
717         ungreedy nature of repetition quantifiers is not relevant.  Greedy  and         ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
718         ungreedy quantifiers are treated in exactly the same way. However, pos-         ungreedy quantifiers are treated in exactly the same way. However, pos-
719         sessive quantifiers can make a difference when what follows could  also         sessive  quantifiers can make a difference when what follows could also
720         match what is quantified, for example in a pattern like this:         match what is quantified, for example in a pattern like this:
721    
722           ^a++\w!           ^a++\w!
723    
724         This  pattern matches "aaab!" but not "aaa!", which would be matched by         This pattern matches "aaab!" but not "aaa!", which would be matched  by
725         a non-possessive quantifier. Similarly, if an atomic group is  present,         a  non-possessive quantifier. Similarly, if an atomic group is present,
726         it  is matched as if it were a standalone pattern at the current point,         it is matched as if it were a standalone pattern at the current  point,
727         and the longest match is then "locked in" for the rest of  the  overall         and  the  longest match is then "locked in" for the rest of the overall
728         pattern.         pattern.
729    
730         2. When dealing with multiple paths through the tree simultaneously, it         2. When dealing with multiple paths through the tree simultaneously, it
731         is not straightforward to keep track of  captured  substrings  for  the         is  not  straightforward  to  keep track of captured substrings for the
732         different  matching  possibilities,  and  PCRE's implementation of this         different matching possibilities, and  PCRE's  implementation  of  this
733         algorithm does not attempt to do this. This means that no captured sub-         algorithm does not attempt to do this. This means that no captured sub-
734         strings are available.         strings are available.
735    
736         3.  Because no substrings are captured, back references within the pat-         3. Because no substrings are captured, back references within the  pat-
737         tern are not supported, and cause errors if encountered.         tern are not supported, and cause errors if encountered.
738    
739         4. For the same reason, conditional expressions that use  a  backrefer-         4.  For  the same reason, conditional expressions that use a backrefer-
740         ence  as  the  condition or test for a specific group recursion are not         ence as the condition or test for a specific group  recursion  are  not
741         supported.         supported.
742    
743         5. Because many paths through the tree may be  active,  the  \K  escape         5.  Because  many  paths  through the tree may be active, the \K escape
744         sequence, which resets the start of the match when encountered (but may         sequence, which resets the start of the match when encountered (but may
745         be on some paths and not on others), is not  supported.  It  causes  an         be  on  some  paths  and not on others), is not supported. It causes an
746         error if encountered.         error if encountered.
747    
748         6.  Callouts  are  supported, but the value of the capture_top field is         6. Callouts are supported, but the value of the  capture_top  field  is
749         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
750    
751         7. The \C escape sequence, which (in the standard algorithm) matches  a         7.  The \C escape sequence, which (in the standard algorithm) matches a
752         single  byte, even in UTF-8 mode, is not supported because the alterna-         single byte, even in UTF-8 mode, is not supported because the  alterna-
753         tive algorithm moves through the subject  string  one  character  at  a         tive  algorithm  moves  through  the  subject string one character at a
754         time, for all active paths through the tree.         time, for all active paths through the tree.
755    
756         8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
757         are not supported. (*FAIL) is supported, and  behaves  like  a  failing         are  not  supported.  (*FAIL)  is supported, and behaves like a failing
758         negative assertion.         negative assertion.
759    
760    
761  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
762    
763         Using  the alternative matching algorithm provides the following advan-         Using the alternative matching algorithm provides the following  advan-
764         tages:         tages:
765    
766         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
767         ically  found,  and  in particular, the longest match is found. To find         ically found, and in particular, the longest match is  found.  To  find
768         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
769         things with callouts.         things with callouts.
770    
771         2.  Because  the  alternative  algorithm  scans the subject string just         2. Because the alternative algorithm  scans  the  subject  string  just
772         once, and never needs to backtrack, it is possible to  pass  very  long         once,  and  never  needs to backtrack, it is possible to pass very long
773         subject  strings  to  the matching function in several pieces, checking         subject strings to the matching function in  several  pieces,  checking
774         for partial matching each time. It  is  possible  to  do  multi-segment         for  partial  matching  each time. Although it is possible to do multi-
775         matching using pcre_exec() (by retaining partially matched substrings),         segment matching using the standard algorithm (pcre_exec()), by retain-
776         but it is more complicated. The pcrepartial documentation gives details         ing  partially matched substrings, it is more complicated. The pcrepar-
777         of partial matching and discusses multi-segment matching.         tial documentation gives details  of  partial  matching  and  discusses
778           multi-segment matching.
779    
780    
781  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
# Line 800  AUTHOR Line 801  AUTHOR
801    
802  REVISION  REVISION
803    
804         Last updated: 22 October 2010         Last updated: 17 November 2010
805         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
806  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
807    
# Line 1171  COMPILING A PATTERN Line 1172  COMPILING A PATTERN
1172         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1173         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
1174         sage. This is a static string that is part of the library. You must not         sage. This is a static string that is part of the library. You must not
1175         try to free it. The byte offset from the start of the  pattern  to  the         try to free it. The offset from the start of the pattern  to  the  byte
1176         character  that  was  being  processed when the error was discovered is         that was being processed when the error was discovered is placed in the
1177         placed in the variable pointed to by erroffset, which must not be NULL.         variable pointed to by erroffset, which must not be NULL. If it is,  an
1178         If  it  is,  an  immediate error is given. Some errors are not detected         immediate error is given. Some errors are not detected until checks are
1179         until checks are carried out when the whole pattern has  been  scanned;         carried out when the whole pattern has been scanned; in this  case  the
1180         in this case the offset is set to the end of the pattern.         offset is set to the end of the pattern.
1181    
1182           Note  that  the offset is in bytes, not characters, even in UTF-8 mode.
1183           It may point into the middle of a UTF-8 character  (for  example,  when
1184           PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
1185    
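Because erroffset counts bytes, a caller that wants to report a character position in UTF-8 mode has to convert it. The sketch below is one possible way to do that, assuming the pattern is a NUL-terminated string and that offset does not exceed its length. The helper names are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the byte at `offset` is a UTF-8 continuation byte
   (bit pattern 10xxxxxx), meaning the offset points into the middle
   of a multi-byte character. */
int offset_is_mid_character(const char *pattern, size_t offset)
{
    assert(offset <= strlen(pattern));
    return ((unsigned char)pattern[offset] & 0xC0) == 0x80;
}

/* Returns the 0-based index of the character that contains the byte at
   `offset`. An offset equal to the pattern length (the "end of pattern"
   case) yields the number of characters in the pattern. */
size_t byte_offset_to_char_index(const char *pattern, size_t offset)
{
    assert(offset <= strlen(pattern));
    size_t starts = 0;
    /* Every byte that is not a continuation byte starts a character. */
    for (size_t i = 0; i <= offset; i++)
        if (((unsigned char)pattern[i] & 0xC0) != 0x80)
            starts++;
    /* Guard against malformed input that begins with continuation bytes. */
    return starts > 0 ? starts - 1 : 0;
}
```

A caller would pass the value that pcre_compile() stored in erroffset. Checking offset_is_mid_character() first is useful when the error concerns an invalid UTF-8 string, since the offset may then fall inside a character.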
1186         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
1187         codeptr argument is not NULL, a non-zero error code number is  returned         codeptr argument is not NULL, a non-zero error code number is  returned
# Line 1254  COMPILING A PATTERN Line 1259  COMPILING A PATTERN
1259    
1260           PCRE_DOTALL           PCRE_DOTALL
1261    
1262         If this bit is set, a dot metacharater in the pattern matches all char-         If this bit is set, a dot metacharacter in the pattern matches a  char-
1263         acters,  including  those that indicate newline. Without it, a dot does         acter of any value, including one that indicates a newline. However, it
1264         not match when the current position is at a  newline.  This  option  is         only ever matches one character, even if newlines are  coded  as  CRLF.
1265         equivalent  to Perl's /s option, and it can be changed within a pattern         Without  this option, a dot does not match when the current position is
1266         by a (?s) option setting. A negative class such as [^a] always  matches         at a newline. This option is equivalent to Perl's /s option, and it can
1267         newline characters, independent of the setting of this option.         be  changed within a pattern by a (?s) option setting. A negative class
1268           such as [^a] always matches newline characters, independent of the set-
1269           ting of this option.
1270    
1271           PCRE_DUPNAMES           PCRE_DUPNAMES
1272    
# Line 1279  COMPILING A PATTERN Line 1286  COMPILING A PATTERN
1286         option, and it can be changed within a pattern by a  (?x)  option  set-         option, and it can be changed within a pattern by a  (?x)  option  set-
1287         ting.         ting.
1288    
1289         This  option  makes  it possible to include comments inside complicated         Which  characters  are  interpreted  as  newlines  is controlled by the
1290         patterns.  Note, however, that this applies only  to  data  characters.         options passed to pcre_compile() or by a special sequence at the  start
1291         Whitespace   characters  may  never  appear  within  special  character         of  the  pattern, as described in the section entitled "Newline conven-
1292         sequences in a pattern, for  example  within  the  sequence  (?(  which         tions" in the pcrepattern documentation. Note that the end of this type
1293         introduces a conditional subpattern.         of  comment  is  a  literal  newline  sequence  in  the pattern; escape
1294           sequences that happen to represent a newline do not count.
1295    
1296           This option makes it possible to include  comments  inside  complicated
1297           patterns.   Note,  however,  that this applies only to data characters.
1298           Whitespace  characters  may  never  appear  within  special   character
1299           sequences in a pattern, for example within the sequence (?( that intro-
1300           duces a conditional subpattern.
1301    
1302           PCRE_EXTRA           PCRE_EXTRA
1303    
1304         This  option  was invented in order to turn on additional functionality         This option was invented in order to turn on  additional  functionality
1305         of PCRE that is incompatible with Perl, but it  is  currently  of  very         of  PCRE  that  is  incompatible with Perl, but it is currently of very
1306         little  use. When set, any backslash in a pattern that is followed by a         little use. When set, any backslash in a pattern that is followed by  a
1307         letter that has no special meaning  causes  an  error,  thus  reserving         letter  that  has  no  special  meaning causes an error, thus reserving
1308         these  combinations  for  future  expansion.  By default, as in Perl, a         these combinations for future expansion. By  default,  as  in  Perl,  a
1309         backslash followed by a letter with no special meaning is treated as  a         backslash  followed by a letter with no special meaning is treated as a
1310         literal. (Perl can, however, be persuaded to give an error for this, by         literal. (Perl can, however, be persuaded to give an error for this, by
1311         running it with the -w option.) There are at present no other  features         running  it with the -w option.) There are at present no other features
1312         controlled  by this option. It can also be set by a (?X) option setting         controlled by this option. It can also be set by a (?X) option  setting
1313         within a pattern.         within a pattern.
1314    
1315           PCRE_FIRSTLINE           PCRE_FIRSTLINE
1316    
1317         If this option is set, an  unanchored  pattern  is  required  to  match         If  this  option  is  set,  an  unanchored pattern is required to match
1318         before  or  at  the  first  newline  in  the subject string, though the         before or at the first  newline  in  the  subject  string,  though  the
1319         matched text may continue over the newline.         matched text may continue over the newline.
1320    
1321           PCRE_JAVASCRIPT_COMPAT           PCRE_JAVASCRIPT_COMPAT
1322    
1323         If this option is set, PCRE's behaviour is changed in some ways so that         If this option is set, PCRE's behaviour is changed in some ways so that
1324         it  is  compatible with JavaScript rather than Perl. The changes are as         it is compatible with JavaScript rather than Perl. The changes  are  as
1325         follows:         follows:
1326    
1327         (1) A lone closing square bracket in a pattern  causes  a  compile-time         (1)  A  lone  closing square bracket in a pattern causes a compile-time
1328         error,  because this is illegal in JavaScript (by default it is treated         error, because this is illegal in JavaScript (by default it is  treated
1329         as a data character). Thus, the pattern AB]CD becomes illegal when this         as a data character). Thus, the pattern AB]CD becomes illegal when this
1330         option is set.         option is set.
1331    
1332         (2)  At run time, a back reference to an unset subpattern group matches         (2) At run time, a back reference to an unset subpattern group  matches
1333         an empty string (by default this causes the current  matching  alterna-         an  empty  string (by default this causes the current matching alterna-
1334         tive  to  fail). A pattern such as (\1)(a) succeeds when this option is         tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
1335         set (assuming it can find an "a" in the subject), whereas it  fails  by         set  (assuming  it can find an "a" in the subject), whereas it fails by
1336         default, for Perl compatibility.         default, for Perl compatibility.
1337    
1338           PCRE_MULTILINE           PCRE_MULTILINE
1339    
1340         By  default,  PCRE  treats the subject string as consisting of a single         By default, PCRE treats the subject string as consisting  of  a  single
1341         line of characters (even if it actually contains newlines). The  "start         line  of characters (even if it actually contains newlines). The "start
1342         of  line"  metacharacter  (^)  matches only at the start of the string,         of line" metacharacter (^) matches only at the  start  of  the  string,
1343         while the "end of line" metacharacter ($) matches only at  the  end  of         while  the  "end  of line" metacharacter ($) matches only at the end of
1344         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1345         is set). This is the same as Perl.         is set). This is the same as Perl.
1346    
1347         When PCRE_MULTILINE is set, the "start of line" and  "end  of  line"         When  PCRE_MULTILINE  is set, the "start of line" and "end of line"
1348         constructs  match  immediately following or immediately before internal         constructs match immediately following or immediately  before  internal
1349         newlines in the subject string, respectively, as well as  at  the  very         newlines  in  the  subject string, respectively, as well as at the very
1350         start  and  end.  This is equivalent to Perl's /m option, and it can be         start and end. This is equivalent to Perl's /m option, and  it  can  be
1351         changed within a pattern by a (?m) option setting. If there are no new-         changed within a pattern by a (?m) option setting. If there are no new-
1352         lines  in  a  subject string, or no occurrences of ^ or $ in a pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1353         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
1354    
1355           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 1344  COMPILING A PATTERN Line 1358  COMPILING A PATTERN
1358           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
1359           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1360    
1361         These options override the default newline definition that  was  chosen         These  options  override the default newline definition that was chosen
1362         when  PCRE  was built. Setting the first or the second specifies that a         when PCRE was built. Setting the first or the second specifies  that  a
1363         newline is indicated by a single character (CR  or  LF,  respectively).         newline  is  indicated  by a single character (CR or LF, respectively).
1364         Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1365         two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1366         that any of the three preceding sequences should be recognized. Setting         that any of the three preceding sequences should be recognized. Setting
1367         PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1368         recognized. The Unicode newline sequences are the three just mentioned,         recognized. The Unicode newline sequences are the three just mentioned,
1369         plus the single characters VT (vertical  tab,  U+000B),  FF  (formfeed,         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1370         U+000C),  NEL  (next line, U+0085), LS (line separator, U+2028), and PS         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1371         (paragraph separator, U+2029). The last  two  are  recognized  only  in         (paragraph  separator,  U+2029).  The  last  two are recognized only in
1372         UTF-8 mode.         UTF-8 mode.
1373    
1374         The  newline  setting  in  the  options  word  uses three bits that are         The newline setting in the  options  word  uses  three  bits  that  are
1375         treated as a number, giving eight possibilities. Currently only six are         treated as a number, giving eight possibilities. Currently only six are
1376         used  (default  plus the five values above). This means that if you set         used (default plus the five values above). This means that if  you  set
1377         more than one newline option, the combination may or may not be  sensi-         more  than one newline option, the combination may or may not be sensi-
1378         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1379         PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1380         cause an error.         cause an error.
1381    
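The three-bit arithmetic described above can be sketched as follows. The EX_ constants are assumptions: they mirror the values I believe pcre.h used for the PCRE_NEWLINE_xxx options in PCRE 8.x, and should be checked against the header of the installed version. The helper functions are hypothetical and are not part of PCRE.

```c
#include <assert.h>

/* Assumed values, mirroring PCRE_NEWLINE_xxx in pcre.h of PCRE 8.x. */
#define EX_NEWLINE_CR       0x00100000
#define EX_NEWLINE_LF       0x00200000
#define EX_NEWLINE_CRLF     0x00300000
#define EX_NEWLINE_ANY      0x00400000
#define EX_NEWLINE_ANYCRLF  0x00500000
#define EX_NEWLINE_MASK     0x00700000  /* the three newline bits */

/* Extracts the newline bits from an options word as a number 0..7. */
int newline_code(unsigned long options)
{
    assert(EX_NEWLINE_MASK == (EX_NEWLINE_CR | EX_NEWLINE_LF | EX_NEWLINE_ANY));
    return (int)((options & EX_NEWLINE_MASK) >> 20);
}

/* Returns 1 if the newline bits name one of the six settings in use
   (0 = build-time default, 1..5 = the five options), and 0 for the
   two unused numbers. */
int newline_code_is_used(unsigned long options)
{
    return newline_code(options) <= 5;
}
```

Under these assumed values, OR-ing CR with LF gives the same number as CRLF, while OR-ing LF with ANY gives 6, which is one of the unused numbers.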
1382         The  only time that a line break is specially recognized when compiling         The only time that a line break in a pattern  is  specially  recognized
1383         a pattern is if PCRE_EXTENDED is set, and  an  unescaped  #  outside  a         when  compiling  is when PCRE_EXTENDED is set. CR and LF are whitespace
1384         character  class  is  encountered.  This indicates a comment that lasts         characters, and so are ignored in this mode. Also, an unescaped #  out-
1385         until after the next line break sequence. In other circumstances,  line         side  a  character class indicates a comment that lasts until after the
1386         break   sequences   are   treated  as  literal  data,  except  that  in         next line break sequence. In other circumstances, line break  sequences
1387         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters         in patterns are treated as literal data.
        and are therefore ignored.  
1388    
1389         The newline option that is set at compile time becomes the default that         The newline option that is set at compile time becomes the default that
1390         is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.         is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
# Line 1386  COMPILING A PATTERN Line 1399  COMPILING A PATTERN
1399    
1400           PCRE_UCP           PCRE_UCP
1401    
1402         This option changes the way PCRE processes \b, \d, \s, \w, and some  of         This option changes the way PCRE processes \B, \b, \D, \d, \S, \s,  \W,
1403         the POSIX character classes. By default, only ASCII characters are rec-         \w,  and  some  of  the POSIX character classes. By default, only ASCII
1404         ognized, but if PCRE_UCP is set, Unicode properties are used instead to         characters are recognized, but if PCRE_UCP is set,  Unicode  properties
1405         classify  characters.  More details are given in the section on generic         are  used instead to classify characters. More details are given in the
1406         character types in the pcrepattern page. If you set PCRE_UCP,  matching         section on generic character types in the pcrepattern page. If you  set
1407         one  of the items it affects takes much longer. The option is available         PCRE_UCP,  matching  one of the items it affects takes much longer. The
1408         only if PCRE has been compiled with Unicode property support.         option is available only if PCRE has been compiled with  Unicode  prop-
1409           erty support.
1410    
1411           PCRE_UNGREEDY           PCRE_UNGREEDY
1412    
1413         This option inverts the "greediness" of the quantifiers  so  that  they         This  option  inverts  the "greediness" of the quantifiers so that they
1414         are  not greedy by default, but become greedy if followed by "?". It is         are not greedy by default, but become greedy if followed by "?". It  is
1415         not compatible with Perl. It can also be set by a (?U)  option  setting         not  compatible  with Perl. It can also be set by a (?U) option setting
1416         within the pattern.         within the pattern.
1417    
1418           PCRE_UTF8           PCRE_UTF8
1419    
1420         This  option  causes PCRE to regard both the pattern and the subject as         This option causes PCRE to regard both the pattern and the  subject  as
1421         strings of UTF-8 characters instead of single-byte  character  strings.         strings  of  UTF-8 characters instead of single-byte character strings.
1422         However,  it is available only when PCRE is built to include UTF-8 sup-         However, it is available only when PCRE is built to include UTF-8  sup-
1423         port. If not, the use of this option provokes an error. Details of  how         port.  If not, the use of this option provokes an error. Details of how
1424         this  option  changes the behaviour of PCRE are given in the section on         this option changes the behaviour of PCRE are given in the  section  on
1425         UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
1426    
1427           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1428    
1429         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1430         automatically  checked.  There  is  a  discussion about the validity of         automatically checked. There is a  discussion  about  the  validity  of
1431         UTF-8 strings in the main pcre page. If an invalid  UTF-8  sequence  of         UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
1432         bytes  is  found,  pcre_compile() returns an error. If you already know         bytes is found, pcre_compile() returns an error. If  you  already  know
1433         that your pattern is valid, and you want to skip this check for perfor-         that your pattern is valid, and you want to skip this check for perfor-
1434         mance  reasons,  you  can set the PCRE_NO_UTF8_CHECK option. When it is         mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
1435         set, the effect of passing an invalid UTF-8  string  as  a  pattern  is         set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
1436         undefined.  It  may  cause your program to crash. Note that this option         undefined. It may cause your program to crash. Note  that  this  option
1437         can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress  the         can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
1438         UTF-8 validity checking of subject strings.         UTF-8 validity checking of subject strings.
1439    
1440    
1441  COMPILATION ERROR CODES  COMPILATION ERROR CODES
1442    
1443         The  following  table  lists  the  error  codes that may be returned by         The following table lists the error  codes  that  may  be  returned  by
1444         pcre_compile2(), along with the error messages that may be returned  by         pcre_compile2(),  along with the error messages that may be returned by
1445         both  compiling functions. As PCRE has developed, some error codes have         both compiling functions. As PCRE has developed, some error codes  have
1446         fallen out of use. To avoid confusion, they have not been re-used.         fallen out of use. To avoid confusion, they have not been re-used.
1447    
1448            0  no error            0  no error
# Line 1503  COMPILATION ERROR CODES Line 1517  COMPILATION ERROR CODES
1517           66  (*MARK) must have an argument           66  (*MARK) must have an argument
1518           67  this version of PCRE is not compiled with PCRE_UCP support           67  this version of PCRE is not compiled with PCRE_UCP support
1519    
1520         The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different         The  numbers  32  and 10000 in errors 48 and 49 are defaults; different
1521         values may be used if the limits were changed when PCRE was built.         values may be used if the limits were changed when PCRE was built.
1522    
1523    
# Line 1512  STUDYING A PATTERN Line 1526  STUDYING A PATTERN
1526         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1527              const char **errptr);              const char **errptr);
1528    
1529         If  a  compiled  pattern is going to be used several times, it is worth         If a compiled pattern is going to be used several times,  it  is  worth
1530         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
1531         matching.  The function pcre_study() takes a pointer to a compiled pat-         matching. The function pcre_study() takes a pointer to a compiled  pat-
1532         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
1533         information  that  will  help speed up matching, pcre_study() returns a         information that will help speed up matching,  pcre_study()  returns  a
1534         pointer to a pcre_extra block, in which the study_data field points  to         pointer  to a pcre_extra block, in which the study_data field points to
1535         the results of the study.         the results of the study.
1536    
1537         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1538         pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-         pcre_exec()  or  pcre_dfa_exec(). However, a pcre_extra block also con-
1539         tains  other  fields  that can be set by the caller before the block is         tains other fields that can be set by the caller before  the  block  is
1540         passed; these are described below in the section on matching a pattern.         passed; these are described below in the section on matching a pattern.
1541    
1542         If studying the  pattern  does  not  produce  any  useful  information,         If  studying  the  pattern  does  not  produce  any useful information,
1543         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study() returns NULL. In that circumstance, if the calling program
1544         wants  to  pass  any  of   the   other   fields   to   pcre_exec()   or         wants   to   pass   any   of   the   other  fields  to  pcre_exec()  or
1545         pcre_dfa_exec(), it must set up its own pcre_extra block.         pcre_dfa_exec(), it must set up its own pcre_extra block.
1546    
1547         The  second  argument of pcre_study() contains option bits. At present,         The second argument of pcre_study() contains option bits.  At  present,
1548         no options are defined, and this argument should always be zero.         no options are defined, and this argument should always be zero.
1549    
1550         The third argument for pcre_study() is a pointer for an error  message.         The  third argument for pcre_study() is a pointer for an error message.
1551         If  studying  succeeds  (even  if no data is returned), the variable it         If studying succeeds (even if no data is  returned),  the  variable  it
1552         points to is set to NULL. Otherwise it is set to  point  to  a  textual         points  to  is  set  to NULL. Otherwise it is set to point to a textual
1553         error message. This is a static string that is part of the library. You         error message. This is a static string that is part of the library. You
1554         must not try to free it. You should test the  error  pointer  for  NULL         must  not  try  to  free it. You should test the error pointer for NULL
1555         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
1556    
1557         This is a typical call to pcre_study():         This is a typical call to pcre_study():
# Line 1551  STUDYING A PATTERN Line 1565  STUDYING A PATTERN
1565         Studying a pattern does two things: first, a lower bound for the length         Studying a pattern does two things: first, a lower bound for the length
1566         of subject string that is needed to match the pattern is computed. This         of subject string that is needed to match the pattern is computed. This
1567         does not mean that there are any strings of that length that match, but         does not mean that there are any strings of that length that match, but
1568         it does guarantee that no shorter strings match. The value is  used  by         it  does  guarantee that no shorter strings match. The value is used by
1569         pcre_exec()  and  pcre_dfa_exec()  to  avoid  wasting time by trying to         pcre_exec() and pcre_dfa_exec() to avoid  wasting  time  by  trying  to
1570         match strings that are shorter than the lower bound. You can  find  out         match  strings  that are shorter than the lower bound. You can find out
1571         the value in a calling program via the pcre_fullinfo() function.         the value in a calling program via the pcre_fullinfo() function.
1572    
1573         Studying a pattern is also useful for non-anchored patterns that do not         Studying a pattern is also useful for non-anchored patterns that do not
1574         have a single fixed starting character. A bitmap of  possible  starting         have  a  single fixed starting character. A bitmap of possible starting
1575         bytes  is  created. This speeds up finding a position in the subject at         bytes is created. This speeds up finding a position in the  subject  at
1576         which to start matching.         which to start matching.
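
The starting-byte bitmap described above can be pictured with a minimal sketch. It assumes a 256-bit table stored as 32 bytes, with the bit for byte value c held in byte c/8 at bit position c&7; that layout, and the names start_bitmap, set_start_byte(), is_start_byte() and find_start(), are hypothetical and chosen only for illustration, not taken from the PCRE sources.

```c
#include <stddef.h>

/* Hypothetical 256-bit table: one bit per possible byte value. */
typedef struct { unsigned char bits[32]; } start_bitmap;

/* Mark byte value c as a possible first byte of a match. */
void set_start_byte(start_bitmap *map, unsigned char c)
{
  map->bits[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Return non-zero if byte value c is marked as a possible first byte. */
int is_start_byte(const start_bitmap *map, unsigned char c)
{
  return (map->bits[c / 8] & (1u << (c & 7))) != 0;
}

/* Return the offset of the first byte at or after "from" that could
   start a match, or "length" if there is no such byte. */
size_t find_start(const start_bitmap *map, const char *subject,
                  size_t length, size_t from)
{
  size_t i;
  for (i = from; i < length; i++)
    if (is_start_byte(map, (unsigned char)subject[i])) return i;
  return length;
}
```

A matcher that has such a table can skip straight to the offset returned by find_start() instead of attempting a full match at every position.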
1577    
1578         The two optimizations just described can be  disabled  by  setting  the         The  two  optimizations  just  described can be disabled by setting the
1579         PCRE_NO_START_OPTIMIZE    option    when    calling    pcre_exec()   or         PCRE_NO_START_OPTIMIZE   option    when    calling    pcre_exec()    or
1580         pcre_dfa_exec(). You might want to do this  if  your  pattern  contains         pcre_dfa_exec().  You  might  want  to do this if your pattern contains
1581         callouts,  or  make  use of (*MARK), and you make use of these in cases         callouts or (*MARK), and you want to make use of  these  facilities  in
1582         where matching fails.  See  the  discussion  of  PCRE_NO_START_OPTIMIZE         cases  where  matching fails. See the discussion of PCRE_NO_START_OPTI-
1583         below.         MIZE below.
1584    
1585    
1586  LOCALE SUPPORT  LOCALE SUPPORT
1587    
1588         PCRE  handles  caseless matching, and determines whether characters are         PCRE handles caseless matching, and determines whether  characters  are
1589         letters, digits, or whatever, by reference to a set of tables,  indexed         letters,  digits, or whatever, by reference to a set of tables, indexed
1590         by  character  value.  When running in UTF-8 mode, this applies only to         by character value. When running in UTF-8 mode, this  applies  only  to
1591         characters with codes less than 128. By  default,  higher-valued  codes         characters  with  codes  less than 128. By default, higher-valued codes
1592         never match escapes such as \w or \d, but they can be tested with \p if         never match escapes such as \w or \d, but they can be tested with \p if
1593         PCRE is built with Unicode character property  support.  Alternatively,         PCRE  is  built with Unicode character property support. Alternatively,
1594         the  PCRE_UCP  option  can  be  set at compile time; this causes \w and         the PCRE_UCP option can be set at compile  time;  this  causes  \w  and
1595         friends to use Unicode property support instead of built-in tables. The         friends to use Unicode property support instead of built-in tables. The
1596         use of locales with Unicode is discouraged. If you are handling charac-         use of locales with Unicode is discouraged. If you are handling charac-
1597         ters with codes greater than 128, you should either use UTF-8 and  Uni-         ters  with codes greater than 128, you should either use UTF-8 and Uni-
1598         code, or use locales, but not try to mix the two.         code, or use locales, but not try to mix the two.
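
To illustrate what "a set of tables, indexed by character value" can mean in practice, here is a minimal sketch that builds one such table (a lower-casing map) from the current C locale and uses it for a caseless byte comparison. The function names build_lower_table() and caseless_equal() are hypothetical, and the sketch makes no claim about the actual layout of PCRE's tables.

```c
#include <ctype.h>

/* Fill a 256-entry table that maps each byte value to its lower-case
   form according to the current locale. */
void build_lower_table(unsigned char table[256])
{
  int c;
  for (c = 0; c < 256; c++) table[c] = (unsigned char)tolower(c);
}

/* Compare two bytes caselessly by looking both up in the table. */
int caseless_equal(const unsigned char table[256],
                   unsigned char a, unsigned char b)
{
  return table[a] == table[b];
}
```

Because the table is filled once, its contents depend on the locale that was in force when build_lower_table() ran, which is why building and using tables in different locales gives different results.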
1599    
1600         PCRE  contains  an  internal set of tables that are used when the final         PCRE contains an internal set of tables that are used  when  the  final
1601         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1602         applications.  Normally, the internal tables recognize only ASCII char-         applications.  Normally, the internal tables recognize only ASCII char-
1603         acters. However, when PCRE is built, it is possible to cause the inter-         acters. However, when PCRE is built, it is possible to cause the inter-
1604         nal tables to be rebuilt in the default "C" locale of the local system,         nal tables to be rebuilt in the default "C" locale of the local system,
1605         which may cause them to be different.         which may cause them to be different.
1606    
1607         The internal tables can always be overridden by tables supplied by  the         The  internal tables can always be overridden by tables supplied by the
1608         application that calls PCRE. These may be created in a different locale         application that calls PCRE. These may be created in a different locale
1609         from the default. As more and more applications change  to  using  Uni-         from  the  default.  As more and more applications change to using Uni-
1610         code, the need for this locale support is expected to die away.         code, the need for this locale support is expected to die away.
1611    
1612         External  tables  are  built by calling the pcre_maketables() function,         External tables are built by calling  the  pcre_maketables()  function,
1613         which has no arguments, in the relevant locale. The result can then  be         which  has no arguments, in the relevant locale. The result can then be
1614         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For         passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
1615         example, to build and use tables that are appropriate  for  the  French         example,  to  build  and use tables that are appropriate for the French
1616         locale  (where  accented  characters  with  values greater than 128 are         locale (where accented characters with  values  greater  than  128  are
1617         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1618    
1619           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1620           tables = pcre_maketables();           tables = pcre_maketables();
1621           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1622    
1623         The locale name "fr_FR" is used on Linux and other  Unix-like  systems;         The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1624         if you are using Windows, the name for the French locale is "french".         if you are using Windows, the name for the French locale is "french".
1625    
1626         When  pcre_maketables()  runs,  the  tables are built in memory that is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1627         obtained via pcre_malloc. It is the caller's responsibility  to  ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1628         that  the memory containing the tables remains available for as long as         that the memory containing the tables remains available for as long  as
1629         it is needed.         it is needed.
1630    
1631         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
1632         pattern,  and the same tables are used via this pointer by pcre_study()         pattern, and the same tables are used via this pointer by  pcre_study()
1633         and normally also by pcre_exec(). Thus, by default, for any single pat-         and normally also by pcre_exec(). Thus, by default, for any single pat-
1634         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
1635         but different patterns can be compiled in different locales.         but different patterns can be compiled in different locales.
1636    
1637         It is possible to pass a table pointer or NULL (indicating the  use  of         It  is  possible to pass a table pointer or NULL (indicating the use of
1638         the  internal  tables)  to  pcre_exec(). Although not intended for this         the internal tables) to pcre_exec(). Although  not  intended  for  this
1639         purpose, this facility could be used to match a pattern in a  different         purpose,  this facility could be used to match a pattern in a different
1640         locale from the one in which it was compiled. Passing table pointers at         locale from the one in which it was compiled. Passing table pointers at
1641         run time is discussed below in the section on matching a pattern.         run time is discussed below in the section on matching a pattern.
1642    
# Line 1632  INFORMATION ABOUT A PATTERN Line 1646  INFORMATION ABOUT A PATTERN
1646         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1647              int what, void *where);              int what, void *where);
1648    
1649         The pcre_fullinfo() function returns information about a compiled  pat-         The  pcre_fullinfo() function returns information about a compiled pat-
1650         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1651         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1652    
1653         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled
1654         pattern.  The second argument is the result of pcre_study(), or NULL if         pattern. The second argument is the result of pcre_study(), or NULL  if
1655         the pattern was not studied. The third argument specifies  which  piece         the  pattern  was not studied. The third argument specifies which piece
1656         of  information  is required, and the fourth argument is a pointer to a         of information is required, and the fourth argument is a pointer  to  a
1657         variable to receive the data. The yield of the  function  is  zero  for         variable  to  receive  the  data. The yield of the function is zero for
1658         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1659    
1660           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 1648  INFORMATION ABOUT A PATTERN Line 1662  INFORMATION ABOUT A PATTERN
1662           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1663           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1664    
1665         The  "magic  number" is placed at the start of each compiled pattern as         The "magic number" is placed at the start of each compiled  pattern  as
1666         a simple check against passing an arbitrary memory pointer. Here is  a         a  simple check against passing an arbitrary memory pointer. Here is a
1667         typical  call  of pcre_fullinfo(), to obtain the length of the compiled         typical call of pcre_fullinfo(), to obtain the length of  the  compiled
1668         pattern:         pattern:
1669    
1670           int rc;           int rc;
# Line 1661  INFORMATION ABOUT A PATTERN Line 1675  INFORMATION ABOUT A PATTERN
1675             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1676             &length);         /* where to put the data */             &length);         /* where to put the data */
1677    
1678         The possible values for the third argument are defined in  pcre.h,  and         The  possible  values for the third argument are defined in pcre.h, and
1679         are as follows:         are as follows:
1680    
1681           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1682    
1683         Return  the  number  of  the highest back reference in the pattern. The         Return the number of the highest back reference  in  the  pattern.  The
1684         fourth argument should point to an int variable. Zero  is  returned  if         fourth  argument  should  point to an int variable. Zero is returned if
1685         there are no back references.         there are no back references.
1686    
1687           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1688    
1689         Return  the  number of capturing subpatterns in the pattern. The fourth         Return the number of capturing subpatterns in the pattern.  The  fourth
1690         argument should point to an int variable.         argument should point to an int variable.
1691    
1692           PCRE_INFO_DEFAULT_TABLES           PCRE_INFO_DEFAULT_TABLES
1693    
1694         Return a pointer to the internal default character tables within  PCRE.         Return  a pointer to the internal default character tables within PCRE.
1695         The  fourth  argument should point to an unsigned char * variable. This         The fourth argument should point to an unsigned char *  variable.  This
1696         information call is provided for internal use by the pcre_study() func-         information call is provided for internal use by the pcre_study() func-
1697         tion.  External  callers  can  cause PCRE to use its internal tables by         tion. External callers can cause PCRE to use  its  internal  tables  by
1698         passing a NULL table pointer.         passing a NULL table pointer.
1699    
1700           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1701    
1702         Return information about the first byte of any matched  string,  for  a         Return  information  about  the first byte of any matched string, for a
1703         non-anchored  pattern. The fourth argument should point to an int vari-         non-anchored pattern. The fourth argument should point to an int  vari-
1704         able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
1705         is still recognized for backwards compatibility.)         is still recognized for backwards compatibility.)
1706    
1707         If  there  is  a  fixed first byte, for example, from a pattern such as         If there is a fixed first byte, for example, from  a  pattern  such  as
1708         (cat|cow|coyote), its value is returned. Otherwise, if either         (cat|cow|coyote), its value is returned. Otherwise, if either
1709    
1710         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
1711         branch starts with "^", or         branch starts with "^", or
1712    
1713         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1714         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1715    
1716         -1 is returned, indicating that the pattern matches only at  the  start         -1  is  returned, indicating that the pattern matches only at the start
1717         of  a  subject string or after any newline within the string. Otherwise         of a subject string or after any newline within the  string.  Otherwise
1718         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
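
The three kinds of result described above can be told apart with a small helper. This is only a sketch: the enum first_kind and the function classify_first_byte() are hypothetical names, and the argument is assumed to be the int value that was obtained for PCRE_INFO_FIRSTBYTE.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
typedef enum {
  FIRST_FIXED,       /* value >= 0: every match starts with this byte */
  FIRST_LINE_START,  /* -1: matches only at the start of the subject
                        or after a newline within it */
  FIRST_UNKNOWN      /* -2: no first-byte information, or anchored */
} first_kind;

first_kind classify_first_byte(int value)
{
  if (value >= 0) return FIRST_FIXED;
  if (value == -1) return FIRST_LINE_START;
  return FIRST_UNKNOWN;
}
```

For a pattern such as (cat|cow|coyote) the value would be the code for "c", so the helper reports FIRST_FIXED.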
1719    
1720           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1721    
1722         If the pattern was studied, and this resulted in the construction of  a         If  the pattern was studied, and this resulted in the construction of a
1723         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1724         matching string, a pointer to the table is returned. Otherwise NULL  is         matching  string, a pointer to the table is returned. Otherwise NULL is
1725         returned.  The fourth argument should point to an unsigned char * vari-         returned. The fourth argument should point to an unsigned char *  vari-
1726         able.         able.
1727    
1728           PCRE_INFO_HASCRORLF           PCRE_INFO_HASCRORLF
1729    
1730         Return 1 if the pattern contains any explicit  matches  for  CR  or  LF         Return  1  if  the  pattern  contains any explicit matches for CR or LF
1731         characters,  otherwise  0.  The  fourth argument should point to an int         characters, otherwise 0. The fourth argument should  point  to  an  int
1732         variable. An explicit match is either a literal CR or LF character,  or         variable.  An explicit match is either a literal CR or LF character, or
1733         \r or \n.         \r or \n.
1734    
1735           PCRE_INFO_JCHANGED           PCRE_INFO_JCHANGED
1736    
1737         Return  1  if  the (?J) or (?-J) option setting is used in the pattern,         Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
1738         otherwise 0. The fourth argument should point to an int variable.  (?J)         otherwise  0. The fourth argument should point to an int variable. (?J)
1739         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1740    
1741           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1742    
1743         Return  the  value of the rightmost literal byte that must exist in any         Return the value of the rightmost literal byte that must exist  in  any
1744         matched string, other than at its  start,  if  such  a  byte  has  been         matched  string,  other  than  at  its  start,  if such a byte has been
1745         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
1746         is no such byte, -1 is returned. For anchored patterns, a last  literal         is  no such byte, -1 is returned. For anchored patterns, a last literal
1747         byte  is  recorded only if it follows something of variable length. For         byte is recorded only if it follows something of variable  length.  For
1748         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1749         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
1750    
1751           PCRE_INFO_MINLENGTH           PCRE_INFO_MINLENGTH
1752    
1753         If  the  pattern  was studied and a minimum length for matching subject         If the pattern was studied and a minimum length  for  matching  subject
1754         strings was computed, its value is  returned.  Otherwise  the  returned         strings  was  computed,  its  value is returned. Otherwise the returned
1755         value  is  -1. The value is a number of characters, not bytes (this may         value is -1. The value is a number of characters, not bytes  (this  may
1756         be relevant in UTF-8 mode). The fourth argument should point to an  int         be  relevant in UTF-8 mode). The fourth argument should point to an int
1757         variable.  A  non-negative  value is a lower bound to the length of any         variable. A non-negative value is a lower bound to the  length  of  any
1758         matching string. There may not be any strings of that  length  that  do         matching  string.  There  may not be any strings of that length that do
1759         actually match, but every string that does match is at least that long.         actually match, but every string that does match is at least that long.
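
Because the minimum length is counted in characters rather than bytes, a caller that wants to reject short subjects early has to count characters in UTF-8 mode. The following sketch shows one way to do that; utf8_char_count() and worth_matching() are hypothetical helpers, the subject is assumed to be valid UTF-8, and min_length is assumed to be the int value obtained for PCRE_INFO_MINLENGTH.

```c
#include <stddef.h>

/* Count the characters in a valid UTF-8 string by counting the bytes
   that are not continuation bytes (those of the form 10xxxxxx). */
size_t utf8_char_count(const char *s, size_t nbytes)
{
  size_t i, count = 0;
  for (i = 0; i < nbytes; i++)
    if (((unsigned char)s[i] & 0xC0) != 0x80) count++;
  return count;
}

/* Return 0 if the subject is certainly too short to match, or 1 if a
   match attempt is still needed. A min_length of -1 means that no
   lower bound is known. */
int worth_matching(int min_length, const char *subject, size_t nbytes,
                   int utf8)
{
  size_t nchars;
  if (min_length < 0) return 1;
  nchars = utf8 ? utf8_char_count(subject, nbytes) : nbytes;
  return nchars >= (size_t)min_length;
}
```

A result of 1 says only that the subject is long enough; as the text notes, there may be no string of exactly the minimum length that matches.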
1760    
1761           PCRE_INFO_NAMECOUNT           PCRE_INFO_NAMECOUNT
1762           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
1763           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
1764    
1765         PCRE  supports the use of named as well as numbered capturing parenthe-         PCRE supports the use of named as well as numbered capturing  parenthe-
1766         ses. The names are just an additional way of identifying the  parenthe-         ses.  The names are just an additional way of identifying the parenthe-
1767         ses, which still acquire numbers. Several convenience functions such as         ses, which still acquire numbers. Several convenience functions such as
1768         pcre_get_named_substring() are provided for  extracting  captured  sub-         pcre_get_named_substring()  are  provided  for extracting captured sub-
1769         strings  by  name. It is also possible to extract the data directly, by         strings by name. It is also possible to extract the data  directly,  by
1770         first converting the name to a number in order to  access  the  correct         first  converting  the  name to a number in order to access the correct
1771         pointers in the output vector (described with pcre_exec() below). To do         pointers in the output vector (described with pcre_exec() below). To do
1772         the conversion, you need  to  use  the  name-to-number  map,  which  is         the  conversion,  you  need  to  use  the  name-to-number map, which is
1773         described by these three values.         described by these three values.
1774    
1775         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1776         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1777         of  each  entry;  both  of  these  return  an int value. The entry size         of each entry; both of these  return  an  int  value.  The  entry  size
1778         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns
1779         a  pointer  to  the  first  entry of the table (a pointer to char). The         a pointer to the first entry of the table  (a  pointer  to  char).  The
1780         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1781         sis,  most  significant byte first. The rest of the entry is the corre-         sis, most significant byte first. The rest of the entry is  the  corre-
1782         sponding name, zero terminated.         sponding name, zero terminated.
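
The entry layout just described (a two-byte group number, most significant byte first, followed by a zero-terminated name) can be walked with a short loop. This is a minimal sketch, not part of the PCRE API: lookup_group_number() is a hypothetical helper, and its arguments are assumed to be the values obtained for PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <stddef.h>
#include <string.h>

/* Search a name table for "name" and return the number of the
   corresponding capturing parenthesis, or -1 if the name is absent.
   If a name is duplicated, the first entry found is used. */
int lookup_group_number(const char *table, int name_count,
                        int entry_size, const char *name)
{
  int i;
  for (i = 0; i < name_count; i++)
    {
    const unsigned char *entry =
      (const unsigned char *)table + (size_t)i * (size_t)entry_size;
    /* Bytes 0 and 1 hold the number; the name starts at byte 2. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
    }
  return -1;
}
```

The returned number can then be used to index the output vector in the same way as for a numbered subpattern. The sketch uses a linear scan for simplicity.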
1783    
1784         The names are in alphabetical order. Duplicate names may appear if  (?|         The  names are in alphabetical order. Duplicate names may appear if (?|
1785         is used to create multiple groups with the same number, as described in         is used to create multiple groups with the same number, as described in
1786         the section on duplicate subpattern numbers in  the  pcrepattern  page.         the  section  on  duplicate subpattern numbers in the pcrepattern page.
1787         Duplicate  names  for  subpatterns with different numbers are permitted         Duplicate names for subpatterns with different  numbers  are  permitted
1788         only if PCRE_DUPNAMES is set. In all cases  of  duplicate  names,  they         only  if  PCRE_DUPNAMES  is  set. In all cases of duplicate names, they
1789         appear  in  the table in the order in which they were found in the pat-         appear in the table in the order in which they were found in  the  pat-
1790         tern. In the absence of (?| this is the  order  of  increasing  number;         tern.  In  the  absence  of (?| this is the order of increasing number;
1791         when (?| is used this is not necessarily the case because later subpat-         when (?| is used this is not necessarily the case because later subpat-
1792         terns may have lower numbers.         terns may have lower numbers.
1793    
1794         As a simple example of the name/number table,  consider  the  following         As  a  simple  example of the name/number table, consider the following
1795         pattern  (assume  PCRE_EXTENDED is set, so white space - including new-         pattern (assume PCRE_EXTENDED is set, so white space -  including  new-
1796         lines - is ignored):         lines - is ignored):
1797    
1798           (?<date> (?<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
1799           (?<month>\d\d) - (?<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
1800    
1801         There are four named subpatterns, so the table has  four  entries,  and         There  are  four  named subpatterns, so the table has four entries, and
1802         each  entry  in the table is eight bytes long. The table is as follows,         each entry in the table is eight bytes long. The table is  as  follows,
1803         with non-printing bytes shown in hexadecimal, and undefined bytes shown         with non-printing bytes shown in hexadecimal, and undefined bytes shown
1804         as ??:         as ??:
1805    
# Line 1794  INFORMATION ABOUT A PATTERN Line 1808  INFORMATION ABOUT A PATTERN
1808           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1809           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1810    
1811         When  writing  code  to  extract  data from named subpatterns using the         When writing code to extract data  from  named  subpatterns  using  the
1812         name-to-number map, remember that the length of the entries  is  likely         name-to-number  map,  remember that the length of the entries is likely
1813         to be different for each compiled pattern.         to be different for each compiled pattern.
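
A minimal sketch of the lookup described above, in C. The helper name name_to_number() and the date_table array are hypothetical, introduced only for illustration. The table bytes are worked out by hand from the date pattern above (date=1, year=2, month=4, day=5), with the undefined bytes arbitrarily set to zero. In a real program the three values would come from pcre_fullinfo(), and the library's pcre_get_stringnumber() function performs this conversion for you.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan a name table laid out as described above.
   Each entry is entry_size bytes: a two-byte group number, most
   significant byte first, followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not present. */
static int name_to_number(const unsigned char *table, int name_count,
                          int entry_size, const char *name)
{
    int i;
    assert(entry_size > 2);   /* room for the number plus at least a NUL */
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Hand-built table for the date pattern: four entries of eight bytes,
   in alphabetical order. Undefined trailing bytes are zero here. */
static const unsigned char date_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};
```

With these definitions, name_to_number(date_table, 4, 8, "month") yields 4. The sketch uses a linear scan for simplicity; because the names are in alphabetical order, a binary search would also work. When duplicate names are present, a scan like this one returns only the first matching entry.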
1814    
1815           PCRE_INFO_OKPARTIAL           PCRE_INFO_OKPARTIAL
1816    
1817         Return  1  if  the  pattern  can  be  used  for  partial  matching with         Return 1  if  the  pattern  can  be  used  for  partial  matching  with
1818         pcre_exec(), otherwise 0. The fourth argument should point  to  an  int         pcre_exec(),  otherwise  0.  The fourth argument should point to an int
1819         variable.  From  release  8.00,  this  always  returns  1,  because the         variable. From  release  8.00,  this  always  returns  1,  because  the
1820         restrictions that previously applied  to  partial  matching  have  been         restrictions  that  previously  applied  to  partial matching have been
1821         lifted.  The  pcrepartial documentation gives details of partial match-         lifted. The pcrepartial documentation gives details of  partial  match-
1822         ing.         ing.
1823    
1824           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1825    
1826         Return a copy of the options with which the pattern was  compiled.  The         Return  a  copy of the options with which the pattern was compiled. The
1827         fourth  argument  should  point to an unsigned long int variable. These         fourth argument should point to an unsigned long  int  variable.  These
1828         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1829         by any top-level option settings at the start of the pattern itself. In         by any top-level option settings at the start of the pattern itself. In
1830         other words, they are the options that will be in force  when  matching         other  words,  they are the options that will be in force when matching
1831         starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with         starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with
1832         the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,         the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
1833         and PCRE_EXTENDED.         and PCRE_EXTENDED.
1834    
1835         A  pattern  is  automatically  anchored by PCRE if all of its top-level         A pattern is automatically anchored by PCRE if  all  of  its  top-level
1836         alternatives begin with one of the following:         alternatives begin with one of the following:
1837    
1838           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1832  INFORMATION ABOUT A PATTERN Line 1846  INFORMATION ABOUT A PATTERN
1846    
1847           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1848    
1849         Return  the  size  of the compiled pattern, that is, the value that was         Return the size of the compiled pattern, that is, the  value  that  was
1850         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1851         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1852         size_t variable.         size_t variable.
# Line 1840  INFORMATION ABOUT A PATTERN Line 1854  INFORMATION ABOUT A PATTERN
1854           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1855    
1856         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1857         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to
1858         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1859         created  by  pcre_study().  If pcre_extra is NULL, or there is no study         created by pcre_study(). If pcre_extra is NULL, or there  is  no  study
1860         data, zero is returned. The fourth argument should point  to  a  size_t         data,  zero  is  returned. The fourth argument should point to a size_t
1861         variable.         variable.
1862    
1863    
# Line 1851  OBSOLETE INFO FUNCTION Line 1865  OBSOLETE INFO FUNCTION
1865    
1866         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1867    
1868         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1869         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1870         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1871         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1872         lowing negative numbers:         lowing negative numbers:
1873    
1874           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1875           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1876    
1877         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1878         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1879         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1880    
1881         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1882         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1883         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1884    
1885    
# Line 1873  REFERENCE COUNTS Line 1887  REFERENCE COUNTS
1887    
1888         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1889    
1890         The  pcre_refcount()  function is used to maintain a reference count in         The pcre_refcount() function is used to maintain a reference  count  in
1891         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1892         benefit  of  applications  that  operate  in an object-oriented manner,         benefit of applications that  operate  in  an  object-oriented  manner,
1893         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1894         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1895    
1896         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1897         zero.  It is changed only by calling this function, whose action is  to         zero.   It is changed only by calling this function, whose action is to
1898         add  the  adjust  value  (which may be positive or negative) to it. The         add the adjust value (which may be positive or  negative)  to  it.  The
1899         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1900         is  constrained to lie between 0 and 65535, inclusive. If the new value         is constrained to lie between 0 and 65535, inclusive. If the new  value
1901         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
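
The clamping rule can be modelled in a few lines of C. This is an illustrative model of the documented behaviour only, not PCRE's source code. The function model_refcount() is a hypothetical name, and it works on a plain int rather than on a compiled pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the documented rule: add adjust to the count, force the
   result into the range 0..65535, store it, and return the new value. */
static int model_refcount(int *count, int adjust)
{
    long long v;
    assert(count != NULL);
    v = (long long)*count + adjust;   /* wide type avoids int overflow */
    if (v < 0) v = 0;
    if (v > 65535) v = 65535;
    *count = (int)v;
    return *count;
}
```

Starting from zero, an adjustment of 1 gives 1, a following adjustment of -5 gives 0 rather than -4, and an adjustment of 70000 gives 65535. An application would typically add 1 when a new owner takes a reference, add -1 when an owner is done, and free the block when the yield is zero.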
1902    
1903         Except when it is zero, the reference count is not correctly  preserved         Except  when it is zero, the reference count is not correctly preserved
1904         if  a  pattern  is  compiled on one host and then transferred to a host         if a pattern is compiled on one host and then  transferred  to  a  host
1905         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
1906    
1907    
# Line 1897  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1911  MATCHING A PATTERN: THE TRADITIONAL FUNC
1911              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1912              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1913    
1914         The function pcre_exec() is called to match a subject string against  a         The  function pcre_exec() is called to match a subject string against a
1915         compiled  pattern, which is passed in the code argument. If the pattern         compiled pattern, which is passed in the code argument. If the  pattern
1916         was studied, the result of the study should  be  passed  in  the  extra         was  studied,  the  result  of  the study should be passed in the extra
1917         argument.  This  function is the main matching facility of the library,         argument. This function is the main matching facility of  the  library,
1918         and it operates in a Perl-like manner. For specialist use there is also         and it operates in a Perl-like manner. For specialist use there is also
1919         an  alternative matching function, which is described below in the sec-         an alternative matching function, which is described below in the  sec-
1920         tion about the pcre_dfa_exec() function.         tion about the pcre_dfa_exec() function.
1921    
1922         In most applications, the pattern will have been compiled (and  option-         In  most applications, the pattern will have been compiled (and option-
1923         ally  studied)  in the same process that calls pcre_exec(). However, it         ally studied) in the same process that calls pcre_exec().  However,  it
1924         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1925         later  in  different processes, possibly even on different hosts. For a         later in different processes, possibly even on different hosts.  For  a
1926         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1927    
1928         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1927  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1941  MATCHING A PATTERN: THE TRADITIONAL FUNC
1941    
1942     Extra data for pcre_exec()     Extra data for pcre_exec()
1943    
1944         If the extra argument is not NULL, it must point to a  pcre_extra  data         If  the  extra argument is not NULL, it must point to a pcre_extra data
1945         block.  The pcre_study() function returns such a block (when it doesn't         block. The pcre_study() function returns such a block (when it  doesn't
1946         return NULL), but you can also create one for yourself, and pass  addi-         return  NULL), but you can also create one for yourself, and pass addi-
1947         tional  information  in it. The pcre_extra block contains the following         tional information in it. The pcre_extra block contains  the  following
1948         fields (not necessarily in this order):         fields (not necessarily in this order):
1949    
1950           unsigned long int flags;           unsigned long int flags;
# Line 1941  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1955  MATCHING A PATTERN: THE TRADITIONAL FUNC
1955           const unsigned char *tables;           const unsigned char *tables;
1956           unsigned char **mark;           unsigned char **mark;
1957    
1958         The flags field is a bitmap that specifies which of  the  other  fields         The  flags  field  is a bitmap that specifies which of the other fields
1959         are set. The flag bits are:         are set. The flag bits are:
1960    
1961           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
# Line 1951  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1965  MATCHING A PATTERN: THE TRADITIONAL FUNC
1965           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1966           PCRE_EXTRA_MARK           PCRE_EXTRA_MARK
1967    
1968         Other  flag  bits should be set to zero. The study_data field is set in         Other flag bits should be set to zero. The study_data field is  set  in
1969         the pcre_extra block that is returned by  pcre_study(),  together  with         the  pcre_extra  block  that is returned by pcre_study(), together with
1970         the appropriate flag bit. You should not set this yourself, but you may         the appropriate flag bit. You should not set this yourself, but you may
1971         add to the block by setting the other fields  and  their  corresponding         add  to  the  block by setting the other fields and their corresponding
1972         flag bits.         flag bits.
1973    
1974         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1975         a vast amount of resources when running patterns that are not going  to         a  vast amount of resources when running patterns that are not going to
1976         match,  but  which  have  a very large number of possibilities in their         match, but which have a very large number  of  possibilities  in  their
1977         search trees. The classic example is a pattern that uses nested  unlim-         search  trees. The classic example is a pattern that uses nested unlim-
1978         ited repeats.         ited repeats.
1979    
1980         Internally,  PCRE uses a function called match() which it calls repeat-         Internally, PCRE uses a function called match() which it calls  repeat-
1981         edly (sometimes recursively). The limit set by match_limit  is  imposed         edly  (sometimes  recursively). The limit set by match_limit is imposed
1982         on  the  number  of times this function is called during a match, which         on the number of times this function is called during  a  match,  which
1983         has the effect of limiting the amount of  backtracking  that  can  take         has  the  effect  of  limiting the amount of backtracking that can take
1984         place. For patterns that are not anchored, the count restarts from zero         place. For patterns that are not anchored, the count restarts from zero
1985         for each position in the subject string.         for each position in the subject string.
1986    
1987         The default value for the limit can be set  when  PCRE  is  built;  the         The  default  value  for  the  limit can be set when PCRE is built; the
1988         built-in default is 10 million, which handles all but the most extreme         built-in default is 10 million, which handles all but the  most  extreme
1989         cases. You can override the default  by  supplying  pcre_exec()  with  a         cases.  You  can  override  the  default by supplying pcre_exec() with a
1990         pcre_extra     block    in    which    match_limit    is    set,    and         pcre_extra    block    in    which    match_limit    is    set,     and
1991         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
1992         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1993    
1994         The  match_limit_recursion field is similar to match_limit, but instead         The match_limit_recursion field is similar to match_limit, but  instead
1995         of limiting the total number of times that match() is called, it limits         of limiting the total number of times that match() is called, it limits
1996         the  depth  of  recursion. The recursion depth is a smaller number than         the depth of recursion. The recursion depth is a  smaller  number  than
1997         the total number of calls, because not all calls to match() are  recur-         the  total number of calls, because not all calls to match() are recur-
1998         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
1999    
2000         Limiting  the  recursion  depth  limits the amount of stack that can be         Limiting the recursion depth limits the amount of  stack  that  can  be
2001         used, or, when PCRE has been compiled to use memory on the heap instead         used, or, when PCRE has been compiled to use memory on the heap instead
2002         of the stack, the amount of heap memory that can be used.         of the stack, the amount of heap memory that can be used.
2003    
2004         The  default  value  for  match_limit_recursion can be set when PCRE is         The default value for match_limit_recursion can be  set  when  PCRE  is
2005         built; the built-in default  is  the  same  value  as  the  default  for         built;  the  built-in  default  is  the  same  value  as the default for
2006         match_limit.  You can override the default by supplying pcre_exec() with         match_limit. You can override the default by supplying pcre_exec()  with
2007         a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and         a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
2008         PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
2009         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
2010    
2011         The callout_data field is used in conjunction with the  "callout"  fea-         The  callout_data  field is used in conjunction with the "callout" fea-
2012         ture, and is described in the pcrecallout documentation.         ture, and is described in the pcrecallout documentation.
2013    
2014         The  tables  field  is  used  to  pass  a  character  tables pointer to         The tables field  is  used  to  pass  a  character  tables  pointer  to
2015         pcre_exec(); this overrides the value that is stored with the  compiled         pcre_exec();  this overrides the value that is stored with the compiled
2016         pattern.  A  non-NULL value is stored with the compiled pattern only if         pattern. A non-NULL value is stored with the compiled pattern  only  if
2017         custom tables were supplied to pcre_compile() via  its  tableptr  argu-         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
2018         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
2019         PCRE's internal tables to be used. This facility is  helpful  when  re-         PCRE's  internal  tables  to be used. This facility is helpful when re-
2020         using  patterns  that  have been saved after compiling with an external         using patterns that have been saved after compiling  with  an  external
2021         set of tables, because the external tables  might  be  at  a  different         set  of  tables,  because  the  external tables might be at a different
2022         address  when  pcre_exec() is called. See the pcreprecompile documenta-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
2023         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
2024    
2025         If PCRE_EXTRA_MARK is set in the flags field, the mark  field  must  be         If  PCRE_EXTRA_MARK  is  set in the flags field, the mark field must be
2026         set  to  point  to a char * variable. If the pattern contains any back-         set to point to a char * variable. If the pattern  contains  any  back-
2027         tracking control verbs such as (*MARK:NAME), and the execution ends  up         tracking  control verbs such as (*MARK:NAME), and the execution ends up
2028         with  a  name  to  pass back, a pointer to the name string (zero termi-         with a name to pass back, a pointer to the  name  string  (zero  termi-
2029         nated) is placed in the variable pointed to  by  the  mark  field.  The         nated)  is  placed  in  the  variable pointed to by the mark field. The
2030         names  are  within  the  compiled pattern; if you wish to retain such a         names are within the compiled pattern; if you wish  to  retain  such  a
2031         name you must copy it before freeing the memory of a compiled  pattern.         name  you must copy it before freeing the memory of a compiled pattern.
2032         If  there  is no name to pass back, the variable pointed to by the mark         If there is no name to pass back, the variable pointed to by  the  mark
2033         field is set to NULL. For details of the backtracking control  verbs,  see         field  is set  to NULL. For details of the backtracking control verbs, see
2034         the section entitled "Backtracking control" in the pcrepattern documen-         the section entitled "Backtracking control" in the pcrepattern documen-
2035         tation.         tation.
2036    
2037     Option bits for pcre_exec()     Option bits for pcre_exec()
2038    
2039         The unused bits of the options argument for pcre_exec() must  be  zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
2040         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
2041         PCRE_NOTBOL,   PCRE_NOTEOL,    PCRE_NOTEMPTY,    PCRE_NOTEMPTY_ATSTART,         PCRE_NOTBOL,    PCRE_NOTEOL,    PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,
2042         PCRE_NO_START_OPTIMIZE,   PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_SOFT,  and         PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_SOFT,   and
2043         PCRE_PARTIAL_HARD.         PCRE_PARTIAL_HARD.
2044    
2045           PCRE_ANCHORED           PCRE_ANCHORED
2046    
2047         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first
2048         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or
2049         turned out to be anchored by virtue of its contents, it cannot be  made         turned  out to be anchored by virtue of its contents, it cannot be made
2050         unanchored at matching time.         unanchored at matching time.
2051    
2052           PCRE_BSR_ANYCRLF           PCRE_BSR_ANYCRLF
2053           PCRE_BSR_UNICODE           PCRE_BSR_UNICODE
2054    
2055         These options (which are mutually exclusive) control what the \R escape         These options (which are mutually exclusive) control what the \R escape
2056         sequence matches. The choice is either to match only CR, LF,  or  CRLF,         sequence  matches.  The choice is either to match only CR, LF, or CRLF,
2057         or  to  match  any Unicode newline sequence. These options override the         or to match any Unicode newline sequence. These  options  override  the
2058         choice that was made or defaulted when the pattern was compiled.         choice that was made or defaulted when the pattern was compiled.
2059    
2060           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 2049  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2063  MATCHING A PATTERN: THE TRADITIONAL FUNC
2063           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
2064           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
2065    
2066         These options override  the  newline  definition  that  was  chosen  or         These  options  override  the  newline  definition  that  was chosen or
2067         defaulted  when the pattern was compiled. For details, see the descrip-         defaulted when the pattern was compiled. For details, see the  descrip-
2068         tion of pcre_compile()  above.  During  matching,  the  newline  choice         tion  of  pcre_compile()  above.  During  matching,  the newline choice
2069         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
2070         ters. It may also alter the way the match position is advanced after  a         ters.  It may also alter the way the match position is advanced after a
2071         match failure for an unanchored pattern.         match failure for an unanchored pattern.
2072    
2073         When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is         When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY  is
2074         set, and a match attempt for an unanchored pattern fails when the  cur-         set,  and a match attempt for an unanchored pattern fails when the cur-
2075         rent  position  is  at  a  CRLF  sequence,  and the pattern contains no         rent position is at a  CRLF  sequence,  and  the  pattern  contains  no
2076         explicit matches for  CR  or  LF  characters,  the  match  position  is         explicit  matches  for  CR  or  LF  characters,  the  match position is
2077         advanced by two characters instead of one, in other words, to after the         advanced by two characters instead of one, in other words, to after the
2078         CRLF.         CRLF.
2079    
2080         The above rule is a compromise that makes the most common cases work as         The above rule is a compromise that makes the most common cases work as
2081         expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL         expected. For example, if the  pattern  is  .+A  (and  the  PCRE_DOTALL
2082         option is not set), it does not match the string "\r\nA" because, after         option is not set), it does not match the string "\r\nA" because, after
2083         failing  at the start, it skips both the CR and the LF before retrying.         failing at the start, it skips both the CR and the LF before  retrying.
2084         However, the pattern [\r\n]A does match that string,  because  it  con-         However,  the  pattern  [\r\n]A does match that string, because it con-
2085         tains an explicit CR or LF reference, and so advances only by one char-         tains an explicit CR or LF reference, and so advances only by one char-
2086         acter after the first failure.         acter after the first failure.
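
The advance rule and the two examples above can be sketched as a small C model. This is not PCRE's implementation. The enum, the function model_advance(), and the pattern_has_explicit_crlf flag are hypothetical names chosen for illustration, and the model assumes one byte per character (no UTF-8).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the newline conventions. */
enum model_newline {
    MODEL_NL_CR, MODEL_NL_LF, MODEL_NL_CRLF, MODEL_NL_ANYCRLF, MODEL_NL_ANY
};

/* After a failed attempt at pos in an unanchored match, return the next
   starting position: skip a whole CRLF when CRLF is a valid newline and
   the pattern has no explicit CR or LF match, otherwise advance by one. */
static size_t model_advance(const char *subject, size_t length, size_t pos,
                            enum model_newline nl,
                            int pattern_has_explicit_crlf)
{
    int crlf_is_newline = (nl == MODEL_NL_CRLF ||
                           nl == MODEL_NL_ANYCRLF ||
                           nl == MODEL_NL_ANY);
    assert(pos < length);
    if (crlf_is_newline && !pattern_has_explicit_crlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;
    return pos + 1;
}
```

For the subject "\r\nA", a pattern such as .+A (no explicit CR or LF) moves from position 0 straight to position 2, where only "A" remains and .+A cannot match. A pattern such as [\r\n]A moves to position 1, where "\nA" matches.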
2087    
2088         An explicit match for CR or LF is either a literal appearance of one of         An explicit match for CR or LF is either a literal appearance of one of
2089         those  characters,  or  one  of the \r or \n escape sequences. Implicit         those characters, or one of the \r or  \n  escape  sequences.  Implicit
2090         matches such as [^X] do not count, nor does \s (which includes  CR  and         matches  such  as [^X] do not count, nor does \s (which includes CR and
2091         LF in the characters that it matches).         LF in the characters that it matches).
2092    
2093         Notwithstanding  the above, anomalous effects may still occur when CRLF         Notwithstanding the above, anomalous effects may still occur when  CRLF
2094         is a valid newline sequence and explicit \r or \n escapes appear in the         is a valid newline sequence and explicit \r or \n escapes appear in the
2095         pattern.         pattern.
2096    
2097           PCRE_NOTBOL           PCRE_NOTBOL
2098    
2099         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
2100         the beginning of a line, so the  circumflex  metacharacter  should  not         the  beginning  of  a  line, so the circumflex metacharacter should not
2101         match  before it. Setting this without PCRE_MULTILINE (at compile time)         match before it. Setting this without PCRE_MULTILINE (at compile  time)
2102         causes circumflex never to match. This option affects only  the  behav-         causes  circumflex  never to match. This option affects only the behav-
2103         iour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
2104    
2105           PCRE_NOTEOL           PCRE_NOTEOL
2106    
2107         This option specifies that the end of the subject string is not the end         This option specifies that the end of the subject string is not the end
2108         of a line, so the dollar metacharacter should not match it nor  (except         of  a line, so the dollar metacharacter should not match it nor (except
2109         in  multiline mode) a newline immediately before it. Setting this with-         in multiline mode) a newline immediately before it. Setting this  with-
2110         out PCRE_MULTILINE (at compile time) causes dollar never to match. This         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
2111         option  affects only the behaviour of the dollar metacharacter. It does         option affects only the behaviour of the dollar metacharacter. It  does
2112         not affect \Z or \z.         not affect \Z or \z.
2113    
2114           PCRE_NOTEMPTY           PCRE_NOTEMPTY
2115    
2116         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
2117         set.  If  there are alternatives in the pattern, they are tried. If all         set. If there are alternatives in the pattern, they are tried.  If  all
2118         the alternatives match the empty string, the entire  match  fails.  For         the  alternatives  match  the empty string, the entire match fails. For
2119         example, if the pattern         example, if the pattern
2120    
2121           a?b?           a?b?
2122    
2123         is  applied  to  a  string not beginning with "a" or "b", it matches an         is applied to a string not beginning with "a" or  "b",  it  matches  an
2124         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
2125         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
2126         rences of "a" or "b".         rences of "a" or "b".
2127    
2128           PCRE_NOTEMPTY_ATSTART           PCRE_NOTEMPTY_ATSTART
2129    
2130         This is like PCRE_NOTEMPTY, except that an empty string match  that  is         This  is  like PCRE_NOTEMPTY, except that an empty string match that is
2131         not  at  the  start  of  the  subject  is  permitted. If the pattern is         not at the start of  the  subject  is  permitted.  If  the  pattern  is
2132         anchored, such a match can occur only if the pattern contains \K.         anchored, such a match can occur only if the pattern contains \K.
2133    
2134         Perl    has    no    direct    equivalent    of    PCRE_NOTEMPTY     or         Perl     has    no    direct    equivalent    of    PCRE_NOTEMPTY    or
2135         PCRE_NOTEMPTY_ATSTART,  but  it  does  make a special case of a pattern         PCRE_NOTEMPTY_ATSTART, but it does make a special  case  of  a  pattern
2136         match of the empty string within its split() function, and  when  using         match  of  the empty string within its split() function, and when using
2137         the  /g  modifier.  It  is  possible  to emulate Perl's behaviour after         the /g modifier. It is  possible  to  emulate  Perl's  behaviour  after
2138         matching a null string by first trying the match again at the same off-         matching a null string by first trying the match again at the same off-
2139         set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that         set with PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED,  and  then  if  that
2140         fails, by advancing the starting offset (see below) and trying an ordi-         fails, by advancing the starting offset (see below) and trying an ordi-
2141         nary  match  again. There is some code that demonstrates how to do this         nary match again. There is some code that demonstrates how to  do  this
2142         in the pcredemo sample program. In the most general case, you  have  to         in  the  pcredemo sample program. In the most general case, you have to
2143         check  to  see  if the newline convention recognizes CRLF as a newline,         check to see if the newline convention recognizes CRLF  as  a  newline,
2144         and if so, and the current character is CR followed by LF, advance  the         and  if so, and the current character is CR followed by LF, advance the
2145         starting offset by two characters instead of one.         starting offset by two characters instead of one.
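
The offset-advancing step described above can be sketched as a small helper. This is only an illustration under stated assumptions: next_start_after_empty() and its crlf_is_newline flag are hypothetical names, not part of the PCRE API, and the sketch works in bytes, so it does not cover UTF-8 mode, where the offset would have to move past a whole character. It is meant to be called only after the anchored PCRE_NOTEMPTY_ATSTART retry has failed and while the offset is still inside the subject.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. The previous match was empty at
   byte offset pos, and the retry at the same offset failed. Return the offset
   at which the next ordinary match attempt should start. */
size_t next_start_after_empty(const char *subject, size_t length,
                              size_t pos, int crlf_is_newline)
{
    assert(subject != NULL && pos < length);

    /* If the newline convention recognizes CRLF and we are sitting on a
       CR that is followed by LF, step over both characters. */
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;

    /* Otherwise advance by a single byte (non-UTF-8 assumption). */
    return pos + 1;
}
```

The caller is assumed to stop looping once the returned offset reaches the subject length and a further empty match has been reported there.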
2146    
2147           PCRE_NO_START_OPTIMIZE           PCRE_NO_START_OPTIMIZE
2148    
2149         There  are a number of optimizations that pcre_exec() uses at the start         There are a number of optimizations that pcre_exec() uses at the  start
2150         of a match, in order to speed up the process. For  example,  if  it  is         of  a  match,  in  order to speed up the process. For example, if it is
2151         known that an unanchored match must start with a specific character, it         known that an unanchored match must start with a specific character, it
2152         searches the subject for that character, and fails  immediately  if  it         searches  the  subject  for that character, and fails immediately if it
2153         cannot  find  it,  without actually running the main matching function.         cannot find it, without actually running the  main  matching  function.
2154         This means that a special item such as (*COMMIT) at the start of a pat-         This means that a special item such as (*COMMIT) at the start of a pat-
2155         tern  is  not  considered until after a suitable starting point for the         tern is not considered until after a suitable starting  point  for  the
2156         match has been found. When callouts or (*MARK) items are in use,  these         match  has been found. When callouts or (*MARK) items are in use, these
2157         "start-up" optimizations can cause them to be skipped if the pattern is         "start-up" optimizations can cause them to be skipped if the pattern is
2158         never actually used. The start-up optimizations are in  effect  a  pre-         never  actually  used.  The start-up optimizations are in effect a pre-
2159         scan of the subject that takes place before the pattern is run.         scan of the subject that takes place before the pattern is run.
2160    
2161         The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,         The PCRE_NO_START_OPTIMIZE option disables the start-up  optimizations,
2162         possibly causing performance to suffer,  but  ensuring  that  in  cases         possibly  causing  performance  to  suffer,  but ensuring that in cases
2163         where  the  result is "no match", the callouts do occur, and that items         where the result is "no match", the callouts do occur, and  that  items
2164         such as (*COMMIT) and (*MARK) are considered at every possible starting         such as (*COMMIT) and (*MARK) are considered at every possible starting
2165         position  in  the  subject  string.  Setting PCRE_NO_START_OPTIMIZE can         position in the subject  string.   Setting  PCRE_NO_START_OPTIMIZE  can
2166         change the outcome of a matching operation.  Consider the pattern         change the outcome of a matching operation.  Consider the pattern
2167    
2168           (*COMMIT)ABC           (*COMMIT)ABC
2169    
2170         When this is compiled, PCRE records the fact that a  match  must  start         When  this  is  compiled, PCRE records the fact that a match must start
2171         with  the  character  "A".  Suppose the subject string is "DEFABC". The         with the character "A". Suppose the subject  string  is  "DEFABC".  The
2172         start-up optimization scans along the subject, finds "A" and  runs  the         start-up  optimization  scans along the subject, finds "A" and runs the
2173         first  match attempt from there. The (*COMMIT) item means that the pat-         first match attempt from there. The (*COMMIT) item means that the  pat-
2174         tern must match the current starting position, which in this  case,  it         tern  must  match the current starting position, which in this case, it
2175         does.  However,  if  the  same match is run with PCRE_NO_START_OPTIMIZE         does. However, if the same match  is  run  with  PCRE_NO_START_OPTIMIZE
2176         set, the initial scan along the subject string  does  not  happen.  The         set,  the  initial  scan  along the subject string does not happen. The
2177         first  match  attempt  is  run  starting  from "D" and when this fails,         first match attempt is run starting  from  "D"  and  when  this  fails,
2178         (*COMMIT) prevents any further matches  being  tried,  so  the  overall         (*COMMIT)  prevents  any  further  matches  being tried, so the overall
2179         result  is  "no  match". If the pattern is studied, more start-up opti-         result is "no match". If the pattern is studied,  more  start-up  opti-
2180         mizations may be used. For example, a minimum length  for  the  subject         mizations  may  be  used. For example, a minimum length for the subject
2181         may be recorded. Consider the pattern         may be recorded. Consider the pattern
2182    
2183           (*MARK:A)(X|Y)           (*MARK:A)(X|Y)
2184    
2185         The  minimum  length  for  a  match is one character. If the subject is         The minimum length for a match is one  character.  If  the  subject  is
2186         "ABC", there will be attempts to  match  "ABC",  "BC",  "C",  and  then         "ABC",  there  will  be  attempts  to  match "ABC", "BC", "C", and then
2187         finally  an empty string.  If the pattern is studied, the final attempt         finally an empty string.  If the pattern is studied, the final  attempt
2188         does not take place, because PCRE knows that the subject is too  short,         does  not take place, because PCRE knows that the subject is too short,
2189         and  so  the  (*MARK) is never encountered.  In this case, studying the         and so the (*MARK) is never encountered.  In this  case,  studying  the
2190         pattern does not affect the overall match result, which  is  still  "no         pattern  does  not  affect the overall match result, which is still "no
2191         match", but it does affect the auxiliary information that is returned.         match", but it does affect the auxiliary information that is returned.
2192    
2193           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
2194    
2195         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
2196         UTF-8 string is automatically checked when pcre_exec() is  subsequently         UTF-8  string is automatically checked when pcre_exec() is subsequently
2197         called.   The  value  of  startoffset is also checked to ensure that it         called.  The value of startoffset is also checked  to  ensure  that  it
2198         points to the start of a UTF-8 character. There is a  discussion  about         points  to  the start of a UTF-8 character. There is a discussion about
2199         the  validity  of  UTF-8 strings in the section on UTF-8 support in the         the validity of UTF-8 strings in the section on UTF-8  support  in  the
2200         main pcre page. If  an  invalid  UTF-8  sequence  of  bytes  is  found,         main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
2201         pcre_exec()  returns  the error PCRE_ERROR_BADUTF8. If startoffset con-         pcre_exec() returns  the  error  PCRE_ERROR_BADUTF8  or,  if  PCRE_PAR-
2202         tains a value that does not point to the start of a UTF-8 character (or         TIAL_HARD  is set and the problem is a truncated UTF-8 character at the
2203         to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.         end of the subject, PCRE_ERROR_SHORTUTF8.  If  startoffset  contains  a
2204           value  that does not point to the start of a UTF-8 character (or to the
2205         If  you  already  know that your subject is valid, and you want to skip         end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
2206         these   checks   for   performance   reasons,   you   can    set    the  
2207         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to         If you already know that your subject is valid, and you  want  to  skip
2208         do this for the second and subsequent calls to pcre_exec() if  you  are         these    checks    for   performance   reasons,   you   can   set   the
2209         making  repeated  calls  to  find  all  the matches in a single subject         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
2210         string. However, you should be  sure  that  the  value  of  startoffset         do  this  for the second and subsequent calls to pcre_exec() if you are
2211         points  to  the start of a UTF-8 character (or the end of the subject).         making repeated calls to find all  the  matches  in  a  single  subject
2212         When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid  UTF-8         string.  However,  you  should  be  sure  that the value of startoffset
2213         string  as  a  subject or an invalid value of startoffset is undefined.         points to the start of a UTF-8 character (or the end of  the  subject).
2214           When  PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8
2215           string as a subject or an invalid value of  startoffset  is  undefined.
2216         Your program may crash.         Your program may crash.
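
When PCRE_NO_UTF8_CHECK is in use, the caller takes over the startoffset check. A minimal sketch of such a check follows; is_utf8_char_boundary() is a hypothetical helper name, and the code assumes the subject itself is already known to be valid UTF-8. It relies on the fact that continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Returns 1 if offset points to the
   start of a UTF-8 character or to the end of the subject, 0 otherwise.
   Assumes subject[0..length) is valid UTF-8. */
int is_utf8_char_boundary(const unsigned char *subject, size_t length,
                          size_t offset)
{
    assert(subject != NULL);

    if (offset > length)
        return 0;                 /* past the end: never acceptable */
    if (offset == length)
        return 1;                 /* the end of the subject is allowed */

    /* A byte of the form 10xxxxxx is a continuation byte, so it cannot be
       the first byte of a character. */
    return (subject[offset] & 0xC0) != 0x80;
}
```

A caller might run this on each new starting offset before passing PCRE_NO_UTF8_CHECK on the second and subsequent calls.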
2217    
2218           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2219           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
2220    
2221         These options turn on the partial matching feature. For backwards  com-         These  options turn on the partial matching feature. For backwards com-
2222         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial         patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A  partial
2223         match occurs if the end of the subject string is reached  successfully,         match  occurs if the end of the subject string is reached successfully,
2224         but  there  are not enough subject characters to complete the match. If         but there are not enough subject characters to complete the  match.  If
2225         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
2226         matching  continues  by  testing any remaining alternatives. Only if no         matching continues by testing any remaining alternatives.  Only  if  no
2227         complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of         complete  match  can be found is PCRE_ERROR_PARTIAL returned instead of
2228         PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the         PCRE_ERROR_NOMATCH. In other words,  PCRE_PARTIAL_SOFT  says  that  the
2229         caller is prepared to handle a partial match, but only if  no  complete         caller  is  prepared to handle a partial match, but only if no complete
2230         match can be found.         match can be found.
2231    
2232         If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this         If PCRE_PARTIAL_HARD is set, it overrides  PCRE_PARTIAL_SOFT.  In  this
2233         case, if a partial match  is  found,  pcre_exec()  immediately  returns         case,  if  a  partial  match  is found, pcre_exec() immediately returns
2234         PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In         PCRE_ERROR_PARTIAL, without  considering  any  other  alternatives.  In
2235         other words, when PCRE_PARTIAL_HARD is set, a partial match is  consid-         other  words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
2236         ered to be more important than an alternative complete match.         ered to be more important than an alternative complete match.
2237    
2238         In  both  cases,  the portion of the string that was inspected when the         In both cases, the portion of the string that was  inspected  when  the
2239         partial match was found is set as the first matching string. There is a         partial match was found is set as the first matching string. There is a
2240         more  detailed  discussion  of partial and multi-segment matching, with         more detailed discussion of partial and  multi-segment  matching,  with
2241         examples, in the pcrepartial documentation.         examples, in the pcrepartial documentation.
2242    
2243     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2244    
2245         The subject string is passed to pcre_exec() as a pointer in subject,  a         The  subject string is passed to pcre_exec() as a pointer in subject, a
2246         length (in bytes) in length, and a starting byte offset in startoffset.         length (in bytes) in length, and a starting byte offset in startoffset.
2247         If this is  negative  or  greater  than  the  length  of  the  subject,         If  this  is  negative  or  greater  than  the  length  of the subject,
2248         pcre_exec() returns PCRE_ERROR_BADOFFSET.         pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting  offset  is
2249           zero,  the  search  for a match starts at the beginning of the subject,
2250         In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-         and this is by far the most common case. In UTF-8 mode, the byte offset
2251         acter (or the end of the subject). Unlike the pattern string, the  sub-         must  point  to  the start of a UTF-8 character (or the end of the sub-
2252         ject  may  contain binary zero bytes. When the starting offset is zero,         ject). Unlike the pattern string, the subject may contain  binary  zero
2253         the search for a match starts at the beginning of the subject, and this         bytes.
        is by far the most common case.  
2254    
2255         A  non-zero  starting offset is useful when searching for another match         A  non-zero  starting offset is useful when searching for another match
2256         in the same subject by calling pcre_exec() again after a previous  suc-         in the same subject by calling pcre_exec() again after a previous  suc-
# Line 2339  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2354  MATCHING A PATTERN: THE TRADITIONAL FUNC
2354         expression are also set to -1. For example,  if  the  string  "abc"  is         expression are also set to -1. For example,  if  the  string  "abc"  is
2355         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
2356         matched. The return from the function is 2, because  the  highest  used         matched. The return from the function is 2, because  the  highest  used
2357         capturing subpattern number is 1. However, you can refer to the offsets         capturing  subpattern  number  is 1, and the offsets for the second
2358         for the second and third capturing subpatterns if  you  wish  (assuming         and third capturing subpatterns (assuming the vector is  large  enough,
2359         the vector is large enough, of course).         of course) are set to -1.
2360    
2361           Note: Elements of ovector that do not correspond to capturing parenthe-
2362           ses in the pattern are never changed. That is, if a pattern contains  n
2363           capturing parentheses, no more than ovector[0] to ovector[2n+1] are set
2364           by pcre_exec(). The other elements retain whatever values  they  previ-
2365           ously had.
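
As an illustration of how the offsets might be read, the sketch below copies one captured substring out of the subject. copy_group() is a hypothetical helper written for this example; PCRE's own convenience functions are the documented way to do this. The sketch treats any group numbered at or above the pcre_exec() return value as unset without reading its offsets, and also treats an offset of -1 as unset, so it does not depend on what the untouched elements of ovector happen to contain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. Copy capture group n into buf as
   a zero-terminated string. rc is the positive value returned by the matching
   call. Returns the substring length, or -1 if the group is unset, out of
   range, or does not fit in buf. */
int copy_group(const char *subject, const int *ovector, int rc, int n,
               char *buf, size_t bufsize)
{
    assert(subject != NULL && ovector != NULL && buf != NULL);

    /* Groups numbered rc or higher did not take part in the match. */
    if (n < 0 || n >= rc)
        return -1;

    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];

    /* A lower-numbered group can still be unset, signalled by -1. */
    if (start < 0 || end < start)
        return -1;

    size_t len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -1;

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

For the "abc" example above, with a return value of 2, group 1 would yield "abc" and groups 2 and 3 would be reported as unset.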
2366    
2367         Some  convenience  functions  are  provided for extracting the captured         Some  convenience  functions  are  provided for extracting the captured
2368         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
# Line 2411  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2432  MATCHING A PATTERN: THE TRADITIONAL FUNC
2432           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
2433    
2434         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A string that contains an invalid UTF-8 byte sequence was passed  as  a
2435         subject.         subject.   However,  if  PCRE_PARTIAL_HARD  is set and the problem is a
2436           truncated UTF-8 character at the end of the subject,  PCRE_ERROR_SHORT-
2437           UTF8 is used instead.
2438    
2439           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
2440    
2441         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
2442         value of startoffset did not point to the beginning of a UTF-8  charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
2443         ter.         ter or the end of the subject.
2444    
2445           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
2446    
# Line 2455  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2478  MATCHING A PATTERN: THE TRADITIONAL FUNC
2478         The value of startoffset was negative or greater than the length of the         The value of startoffset was negative or greater than the length of the
2479         subject, that is, the value in length.         subject, that is, the value in length.
2480    
2481             PCRE_ERROR_SHORTUTF8      (-25)
2482    
2483           The  subject  string ended with an incomplete (truncated) UTF-8 charac-
2484           ter, and the PCRE_PARTIAL_HARD option was  set.  Without  this  option,
2485           PCRE_ERROR_BADUTF8 is returned in this situation.
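
One way a caller might sort these return values into actions is sketched below. The DEMO_ constants and the demo_action type are hypothetical names defined locally so that the example stands alone; the numeric values for the UTF-8 and partial-match errors are the ones listed in this section, and -1 is assumed for "no match". Real code would use the PCRE_ERROR_ names from pcre.h instead.

```c
#include <stddef.h>

/* Local stand-ins for the pcre.h error codes (assumption: real code would
   include pcre.h and use the PCRE_ERROR_ names directly). */
enum {
    DEMO_NOMATCH        = -1,
    DEMO_BADUTF8        = -10,
    DEMO_BADUTF8_OFFSET = -11,
    DEMO_PARTIAL        = -12,
    DEMO_SHORTUTF8      = -25
};

typedef enum {
    ACT_MATCHED,         /* a complete match was found */
    ACT_NO_MATCH,        /* nothing matched */
    ACT_NEED_MORE_DATA,  /* partial match or truncated final character */
    ACT_BAD_INPUT,       /* invalid UTF-8 subject or start offset */
    ACT_OTHER_ERROR      /* any other negative return */
} demo_action;

/* Hypothetical helper: map a pcre_exec() return value to an action. */
demo_action classify_exec_result(int rc)
{
    if (rc >= 0)
        return ACT_MATCHED;

    switch (rc) {
    case DEMO_NOMATCH:
        return ACT_NO_MATCH;
    case DEMO_PARTIAL:
    case DEMO_SHORTUTF8:    /* only returned when PCRE_PARTIAL_HARD is set */
        return ACT_NEED_MORE_DATA;
    case DEMO_BADUTF8:
    case DEMO_BADUTF8_OFFSET:
        return ACT_BAD_INPUT;
    default:
        return ACT_OTHER_ERROR;
    }
}
```

Grouping the truncated-character error with the partial match reflects the multi-segment use case, where more subject data may complete the character; that grouping is a design choice of this sketch rather than something PCRE prescribes.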
2486    
2487         Error numbers -16 to -20 and -22 are not used by pcre_exec().         Error numbers -16 to -20 and -22 are not used by pcre_exec().
2488    
2489    
# Line 2833  AUTHOR Line 2862  AUTHOR
2862    
2863  REVISION  REVISION
2864    
2865         Last updated: 06 November 2010         Last updated: 13 November 2010
2866         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
2867  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2868    
# Line 3533  BACKSLASH Line 3562  BACKSLASH
3562         affects  \b,  and  \B  because  they are defined in terms of \w and \W.         affects  \b,  and  \B  because  they are defined in terms of \w and \W.
3563         Matching these sequences is noticeably slower when PCRE_UCP is set.         Matching these sequences is noticeably slower when PCRE_UCP is set.
3564    
3565         The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to         The sequences \h, \H, \v, and \V are features that were added  to  Perl
3566         the  other  sequences,  which  match  only ASCII characters by default,         at  release  5.10. In contrast to the other sequences, which match only
3567         these always  match  certain  high-valued  codepoints  in  UTF-8  mode,         ASCII characters by default, these  always  match  certain  high-valued
3568         whether or not PCRE_UCP is set. The horizontal space characters are:         codepoints  in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-
3569           tal space characters are:
3570    
3571           U+0009     Horizontal tab           U+0009     Horizontal tab
3572           U+0020     Space           U+0020     Space
# Line 3570  BACKSLASH Line 3600  BACKSLASH
3600    
3601     Newline sequences     Newline sequences
3602    
3603         Outside  a  character class, by default, the escape sequence \R matches         Outside a character class, by default, the escape sequence  \R  matches
3604         any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8         any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
3605         mode \R is equivalent to the following:         following:
3606    
3607           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
3608    
3609         This  is  an  example  of an "atomic group", details of which are given         This is an example of an "atomic group", details  of  which  are  given
3610         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
3611         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
3612         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3613         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
3614         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
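
The behaviour of the group above can be mimicked in a few lines of C. This is a sketch for non-UTF-8 mode only, and bsr_length() is a hypothetical name, not a PCRE function. It returns the number of bytes that \R would consume at a given position, taking CR followed by LF as one two-byte unit in the same way that the atomic group never gives back the LF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Returns the length in bytes of
   the newline sequence starting at s[pos], using the non-UTF-8 definition
   (?>\r\n|\n|\x0b|\f|\r|\x85), or 0 if there is none. */
size_t bsr_length(const unsigned char *s, size_t length, size_t pos)
{
    assert(s != NULL && pos <= length);

    if (pos == length)
        return 0;

    switch (s[pos]) {
    case '\r':
        /* CRLF is consumed as a single unit whenever LF follows. */
        return (pos + 1 < length && s[pos + 1] == '\n') ? 2 : 1;
    case '\n':      /* LF,  U+000A */
    case 0x0b:      /* VT,  U+000B */
    case '\f':      /* FF,  U+000C */
    case 0x85:      /* NEL, U+0085 */
        return 1;
    default:
        return 0;
    }
}
```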
3615    
3616         In UTF-8 mode, two additional characters whose codepoints  are  greater         In  UTF-8  mode, two additional characters whose codepoints are greater
3617         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3618         rator, U+2029).  Unicode character property support is not  needed  for         rator,  U+2029).   Unicode character property support is not needed for
3619         these characters to be recognized.         these characters to be recognized.
3620    
3621         It is possible to restrict \R to match only CR, LF, or CRLF (instead of         It is possible to restrict \R to match only CR, LF, or CRLF (instead of
3622         the complete set  of  Unicode  line  endings)  by  setting  the  option         the  complete  set  of  Unicode  line  endings)  by  setting the option
3623         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
3624         (BSR is an abbreviation for "backslash R".) This can be made the default         (BSR is an abbreviation for "backslash R".) This can be made the default
3625         when  PCRE  is  built;  if this is the case, the other behaviour can be         when PCRE is built; if this is the case, the  other  behaviour  can  be
3626         requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to         requested  via  the  PCRE_BSR_UNICODE  option.   It is also possible to
3627         specify  these  settings  by  starting a pattern string with one of the         specify these settings by starting a pattern string  with  one  of  the
3628         following sequences:         following sequences:
3629    
3630           (*BSR_ANYCRLF)   CR, LF, or CRLF only           (*BSR_ANYCRLF)   CR, LF, or CRLF only
3631           (*BSR_UNICODE)   any Unicode newline sequence           (*BSR_UNICODE)   any Unicode newline sequence
3632    
3633         These override the default and the options given to  pcre_compile()  or         These  override  the default and the options given to pcre_compile() or
3634         pcre_compile2(),  but  they  can  be  overridden  by  options  given to         pcre_compile2(), but  they  can  be  overridden  by  options  given  to
3635         pcre_exec() or pcre_dfa_exec(). Note that these special settings, which         pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
3636         are  not  Perl-compatible,  are  recognized only at the very start of a         are not Perl-compatible, are recognized only at the  very  start  of  a
3637         pattern, and that they must be in upper case. If more than one of  them         pattern,  and that they must be in upper case. If more than one of them
3638         is present, the last one is used. They can be combined with a change of         is present, the last one is used. They can be combined with a change of
3639         newline convention; for example, a pattern can start with:         newline convention; for example, a pattern can start with:
3640    
3641           (*ANY)(*BSR_ANYCRLF)           (*ANY)(*BSR_ANYCRLF)
3642    
3643         They can also be combined with the (*UTF8) or (*UCP) special sequences.         They can also be combined with the (*UTF8) or (*UCP) special sequences.
3644         Inside  a  character  class,  \R  is  treated as an unrecognized escape         Inside a character class, \R  is  treated  as  an  unrecognized  escape
3645         sequence, and so matches the letter "R" by default, but causes an error         sequence, and so matches the letter "R" by default, but causes an error
3646         if PCRE_EXTRA is set.         if PCRE_EXTRA is set.
3647    
3648     Unicode character properties     Unicode character properties
3649    
3650         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
3651         tional escape sequences that match characters with specific  properties         tional  escape sequences that match characters with specific properties
3652         are  available.   When not in UTF-8 mode, these sequences are of course         are available.  When not in UTF-8 mode, these sequences are  of  course
3653         limited to testing characters whose codepoints are less than  256,  but         limited  to  testing characters whose codepoints are less than 256, but
3654         they do work in this mode.  The extra escape sequences are:         they do work in this mode.  The extra escape sequences are:
3655    
3656           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
3657           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
3658           \X       an extended Unicode sequence           \X       an extended Unicode sequence
3659    
3660         The  property  names represented by xx above are limited to the Unicode         The property names represented by xx above are limited to  the  Unicode
3661         script names, the general category properties, "Any", which matches any         script names, the general category properties, "Any", which matches any
3662         character   (including  newline),  and  some  special  PCRE  properties         character  (including  newline),  and  some  special  PCRE   properties
3663         (described in the next section).  Other Perl properties such as  "InMu-         (described  in the next section).  Other Perl properties such as "InMu-
3664         sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}         sicalSymbols" are not currently supported by PCRE.  Note  that  \P{Any}
3665         does not match any characters, so always causes a match failure.         does not match any characters, so always causes a match failure.
3666    
3667         Sets of Unicode characters are defined as belonging to certain scripts.         Sets of Unicode characters are defined as belonging to certain scripts.
3668         A  character from one of these sets can be matched using a script name.         A character from one of these sets can be matched using a script  name.
3669         For example:         For example:
3670    
3671           \p{Greek}           \p{Greek}
3672           \P{Han}           \P{Han}
3673    
3674         Those that are not part of an identified script are lumped together  as         Those  that are not part of an identified script are lumped together as
3675         "Common". The current list of scripts is:         "Common". The current list of scripts is:
3676    
3677         Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,         Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
3678         Buginese, Buhid, Canadian_Aboriginal, Carian, Cham,  Cherokee,  Common,         Buginese,  Buhid,  Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
3679         Coptic,   Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,  Egyp-         Coptic,  Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,   Egyp-
3680         tian_Hieroglyphs,  Ethiopic,  Georgian,  Glagolitic,   Gothic,   Greek,         tian_Hieroglyphs,   Ethiopic,   Georgian,  Glagolitic,  Gothic,  Greek,
3681         Gujarati,  Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana, Impe-         Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana,  Impe-
3682         rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,         rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
3683         Javanese,  Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,         Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer,  Lao,
3684         Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,         Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,
3685         Meetei_Mayek,  Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,         Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham,  Old_Italic,
3686         Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki,  Oriya,  Osmanya,         Old_Persian,  Old_South_Arabian,  Old_Turkic, Ol_Chiki, Oriya, Osmanya,
3687         Phags_Pa,  Phoenician,  Rejang,  Runic, Samaritan, Saurashtra, Shavian,         Phags_Pa, Phoenician, Rejang, Runic,  Samaritan,  Saurashtra,  Shavian,
3688         Sinhala, Sundanese, Syloti_Nagri, Syriac,  Tagalog,  Tagbanwa,  Tai_Le,         Sinhala,  Sundanese,  Syloti_Nagri,  Syriac, Tagalog, Tagbanwa, Tai_Le,
3689         Tai_Tham,  Tai_Viet,  Tamil,  Telugu,  Thaana, Thai, Tibetan, Tifinagh,         Tai_Tham, Tai_Viet, Tamil, Telugu,  Thaana,  Thai,  Tibetan,  Tifinagh,
3690         Ugaritic, Vai, Yi.         Ugaritic, Vai, Yi.
3691    
3692         Each character has exactly one Unicode general category property, spec-         Each character has exactly one Unicode general category property, spec-
3693         ified  by a two-letter abbreviation. For compatibility with Perl, nega-         ified by a two-letter abbreviation. For compatibility with Perl,  nega-
3694         tion can be specified by including a  circumflex  between  the  opening         tion  can  be  specified  by including a circumflex between the opening
3695         brace  and  the  property  name.  For  example,  \p{^Lu} is the same as         brace and the property name.  For  example,  \p{^Lu}  is  the  same  as
3696         \P{Lu}.         \P{Lu}.
3697    
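The general category test can be illustrated outside PCRE. The following is a minimal sketch in Python that uses the standard unicodedata module, not the PCRE library; the helper names has_property and lacks_property are hypothetical and exist only for this illustration. It assumes a two-letter category name, and Python's Unicode tables may be newer than the Unicode 5.2.0 tables mentioned above.

```python
import unicodedata

def has_property(ch, prop):
    # Mimics \p{prop} for a two-letter general category such as "Lu".
    return unicodedata.category(ch) == prop

def lacks_property(ch, prop):
    # Mimics \P{prop}, which is the same test as \p{^prop}.
    return not has_property(ch, prop)

print(has_property("A", "Lu"))    # "A" is an upper case letter
print(lacks_property("a", "Lu"))  # "a" is Ll, so it lacks Lu
```

Because each character has exactly one general category, the negated test is simply the complement of the positive one.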
3698         If only one letter is specified with \p or \P, it includes all the gen-         If only one letter is specified with \p or \P, it includes all the gen-
3699         eral  category properties that start with that letter. In this case, in         eral category properties that start with that letter. In this case,  in
3700         the absence of negation, the curly brackets in the escape sequence  are         the  absence of negation, the curly brackets in the escape sequence are
3701         optional; these two examples have the same effect:         optional; these two examples have the same effect:
3702    
3703           \p{L}           \p{L}
# Line 3719  BACKSLASH Line 3749  BACKSLASH
3749           Zp    Paragraph separator           Zp    Paragraph separator
3750           Zs    Space separator           Zs    Space separator
3751    
3752         The  special property L& is also supported: it matches a character that         The special property L& is also supported: it matches a character  that
3753         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
3754         classified as a modifier or "other".         classified as a modifier or "other".
3755    
3756         The  Cs  (Surrogate)  property  applies only to characters in the range         The Cs (Surrogate) property applies only to  characters  in  the  range
3757         U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see         U+D800  to  U+DFFF. Such characters are not valid in UTF-8 strings (see
3758         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
3759         ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in         ing  has  been  turned off (see the discussion of PCRE_NO_UTF8_CHECK in
3760         the pcreapi page). Perl does not support the Cs property.         the pcreapi page). Perl does not support the Cs property.
3761    
3762         The  long  synonyms  for  property  names  that  Perl supports (such as         The long synonyms for  property  names  that  Perl  supports  (such  as
3763         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
3764         any of these properties with "Is".         any of these properties with "Is".
3765    
3766         No character that is in the Unicode table has the Cn (unassigned) prop-         No character that is in the Unicode table has the Cn (unassigned) prop-
3767         erty.  Instead, this property is assumed for any code point that is not         erty.  Instead, this property is assumed for any code point that is not
3768         in the Unicode table.         in the Unicode table.
3769    
3770         Specifying  caseless  matching  does not affect these escape sequences.         Specifying caseless matching does not affect  these  escape  sequences.
3771         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
3772    
3773         The \X escape matches any number of Unicode  characters  that  form  an         The  \X  escape  matches  any number of Unicode characters that form an
3774         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
3775    
3776           (?>\PM\pM*)           (?>\PM\pM*)
3777    
3778         That  is,  it matches a character without the "mark" property, followed         That is, it matches a character without the "mark"  property,  followed
3779         by zero or more characters with the "mark"  property,  and  treats  the         by  zero  or  more  characters with the "mark" property, and treats the
3780         sequence  as  an  atomic group (see below).  Characters with the "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
3781         property are typically accents that  affect  the  preceding  character.         property  are  typically  accents  that affect the preceding character.
3782         None  of  them  have  codepoints less than 256, so in non-UTF-8 mode \X         None of them have codepoints less than 256, so  in  non-UTF-8  mode  \X
3783         matches any one character.         matches any one character.
3784    
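The equivalence between \X and (?>\PM\pM*) can be sketched procedurally. The Python code below is an illustration only, built on the standard unicodedata module rather than on PCRE; the function name match_extended_sequence is hypothetical. It assumes that "mark" means any general category beginning with M (Mn, Mc, Me).

```python
import unicodedata

def is_mark(ch):
    # General categories Mn, Mc and Me all start with "M".
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(subject, start=0):
    # Return the end offset of a \PM\pM* match at start, or None.
    if start >= len(subject) or is_mark(subject[start]):
        return None  # \PM needs one character that is not a mark
    end = start + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1     # \pM* takes every following mark and never gives one back
    return end

subject = "e\u0301a"  # "e", COMBINING ACUTE ACCENT, "a"
print(match_extended_sequence(subject, 0))
print(match_extended_sequence(subject, 2))
```

The loop never returns a shorter match once marks have been consumed, which is the effect the atomic group has in the pattern form.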
3785         Matching characters by Unicode property is not fast, because  PCRE  has         Matching  characters  by Unicode property is not fast, because PCRE has
3786         to  search  a  structure  that  contains data for over fifteen thousand         to search a structure that contains  data  for  over  fifteen  thousand
3787         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
3788         \w  do  not  use  Unicode properties in PCRE by default, though you can         \w do not use Unicode properties in PCRE by  default,  though  you  can
3789         make them do so by setting the PCRE_UCP option for pcre_compile() or by         make them do so by setting the PCRE_UCP option for pcre_compile() or by
3790         starting the pattern with (*UCP).         starting the pattern with (*UCP).
3791    
3792     PCRE's additional properties     PCRE's additional properties
3793    
3794         As  well  as  the standard Unicode properties described in the previous         As well as the standard Unicode properties described  in  the  previous
3795         section, PCRE supports four more that make it possible to convert  tra-         section,  PCRE supports four more that make it possible to convert tra-
3796         ditional escape sequences such as \w and \s and POSIX character classes         ditional escape sequences such as \w and \s and POSIX character classes
3797         to use Unicode properties. PCRE uses these non-standard, non-Perl prop-         to use Unicode properties. PCRE uses these non-standard, non-Perl prop-
3798         erties internally when PCRE_UCP is set. They are:         erties internally when PCRE_UCP is set. They are:
# Line 3772  BACKSLASH Line 3802  BACKSLASH
3802           Xsp   Any Perl space character           Xsp   Any Perl space character
3803           Xwd   Any Perl "word" character           Xwd   Any Perl "word" character
3804    
3805         Xan  matches  characters that have either the L (letter) or the N (num-         Xan matches characters that have either the L (letter) or the  N  (num-
3806         ber) property. Xps matches the characters tab, linefeed, vertical  tab,         ber)  property. Xps matches the characters tab, linefeed, vertical tab,
3807         formfeed,  or  carriage  return, and any other character that has the Z         formfeed, or carriage return, and any other character that  has  the  Z
3808         (separator) property.  Xsp is the same as Xps, except that vertical tab         (separator) property.  Xsp is the same as Xps, except that vertical tab
3809         is excluded. Xwd matches the same characters as Xan, plus underscore.         is excluded. Xwd matches the same characters as Xan, plus underscore.
3810    
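The four definitions above can be restated as predicates. This is a hedged sketch in Python using the standard unicodedata module, not PCRE's internal tables; the function names are hypothetical, and each function assumes it is given exactly one character.

```python
import unicodedata

def is_xan(ch):
    # Xan: the L (letter) or N (number) general category.
    return unicodedata.category(ch)[0] in "LN"

def is_xps(ch):
    # Xps: tab, linefeed, vertical tab, formfeed, carriage return,
    # or any character with the Z (separator) property.
    return ch in "\t\n\v\f\r" or unicodedata.category(ch)[0] == "Z"

def is_xsp(ch):
    # Xsp: the same as Xps, except that vertical tab is excluded.
    return ch != "\v" and is_xps(ch)

def is_xwd(ch):
    # Xwd: the same characters as Xan, plus underscore.
    return ch == "_" or is_xan(ch)

print(is_xps("\v"), is_xsp("\v"))
print(is_xan("_"), is_xwd("_"))
```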
3811     Resetting the match start     Resetting the match start
3812    
3813         The escape sequence \K, which is a Perl 5.10 feature, causes any previ-         The escape sequence \K causes any previously matched characters not  to
3814         ously matched characters not  to  be  included  in  the  final  matched         be included in the final matched sequence. For example, the pattern:
        sequence. For example, the pattern:  
3815    
3816           foo\Kbar           foo\Kbar
3817    
# Line 3938  FULL STOP (PERIOD, DOT) AND \N Line 3967  FULL STOP (PERIOD, DOT) AND \N
3967         flex and dollar, the only relationship being  that  they  both  involve         flex and dollar, the only relationship being  that  they  both  involve
3968         newlines. Dot has no special meaning in a character class.         newlines. Dot has no special meaning in a character class.
3969    
3970         The escape sequence \N always behaves as a dot does when PCRE_DOTALL is         The  escape  sequence  \N  behaves  like  a  dot, except that it is not
3971         not set. In other words, it matches any one character except  one  that         affected by the PCRE_DOTALL option. In  other  words,  it  matches  any
3972         signifies the end of a line.         character except one that signifies the end of a line.
3973    
3974    
3975  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
# Line 3949  MATCHING A SINGLE BYTE Line 3978  MATCHING A SINGLE BYTE
3978         both in and out of UTF-8 mode. Unlike a  dot,  it  always  matches  any         both in and out of UTF-8 mode. Unlike a  dot,  it  always  matches  any
3979         line-ending  characters.  The  feature  is provided in Perl in order to         line-ending  characters.  The  feature  is provided in Perl in order to
3980         match individual bytes in UTF-8 mode. Because it breaks up UTF-8  char-         match individual bytes in UTF-8 mode. Because it breaks up UTF-8  char-
3981         acters  into individual bytes, what remains in the string may be a mal-         acters  into  individual bytes, the rest of the string may start with a
3982         formed UTF-8 string. For this reason, the \C escape  sequence  is  best         malformed UTF-8 character. For this reason, the \C escape  sequence  is
3983         avoided.         best avoided.
3984    
3985         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE  does  not  allow \C to appear in lookbehind assertions (described
3986         below), because in UTF-8 mode this would make it impossible  to  calcu-         below), because in UTF-8 mode this would make it impossible  to  calcu-
# Line 4157  INTERNAL OPTION SETTING Line 4186  INTERNAL OPTION SETTING
4186         fore show up in data extracted by the pcre_fullinfo() function).         fore show up in data extracted by the pcre_fullinfo() function).
4187    
4188         An  option  change  within a subpattern (see below for a description of         An  option  change  within a subpattern (see below for a description of
4189         subpatterns) affects only that part of the current pattern that follows         subpatterns) affects only that part of the subpattern that follows  it,
4190         it, so         so
4191    
4192           (a(?i)b)c           (a(?i)b)c
4193    
# Line 4194  SUBPATTERNS Line 4223  SUBPATTERNS
4223    
4224           cat(aract|erpillar|)           cat(aract|erpillar|)
4225    
4226         matches one of the words "cat", "cataract", or  "caterpillar".  Without         matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
4227         the  parentheses,  it  would  match  "cataract", "erpillar" or an empty         it would match "cataract", "erpillar" or an empty string.
        string.  
4228    
4229         2. It sets up the subpattern as  a  capturing  subpattern.  This  means         2.  It  sets  up  the  subpattern as a capturing subpattern. This means
4230         that,  when  the  whole  pattern  matches,  that portion of the subject         that, when the whole pattern  matches,  that  portion  of  the  subject
4231         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
4232         ovector  argument  of pcre_exec(). Opening parentheses are counted from         ovector argument of pcre_exec(). Opening parentheses are  counted  from
4233         left to right (starting from 1) to obtain  numbers  for  the  capturing         left  to  right  (starting  from 1) to obtain numbers for the capturing
4234         subpatterns.         subpatterns. For example, if the  string  "the  red  king"  is  matched
4235           against the pattern
        For  example,  if the string "the red king" is matched against the pat-  
        tern  
4236    
4237           the ((red|white) (king|queen))           the ((red|white) (king|queen))
4238    
4239         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
4240         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
4241    
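Both functions of parentheses can be tried with any regex engine that shares this syntax. The sketch below uses Python's standard re module as a stand-in for PCRE, on the assumption that the two agree for these simple patterns; captured substrings come back through the match object rather than through the ovector argument of pcre_exec().

```python
import re

# 1. Localizing alternatives: the group limits what "|" applies to.
grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")
for word in ("cat", "cataract", "caterpillar"):
    print(word, bool(grouped.fullmatch(word)), bool(ungrouped.fullmatch(word)))

# 2. Capturing: opening parentheses are counted from left to right.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(m.group(1), "|", m.group(2), "|", m.group(3))
```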
4242         The  fact  that  plain  parentheses  fulfil two functions is not always         The fact that plain parentheses fulfil  two  functions  is  not  always
4243         helpful.  There are often times when a grouping subpattern is  required         helpful.   There are often times when a grouping subpattern is required
4244         without  a capturing requirement. If an opening parenthesis is followed         without a capturing requirement. If an opening parenthesis is  followed
4245         by a question mark and a colon, the subpattern does not do any  captur-         by  a question mark and a colon, the subpattern does not do any captur-
4246         ing,  and  is  not  counted when computing the number of any subsequent         ing, and is not counted when computing the  number  of  any  subsequent
4247         capturing subpatterns. For example, if the string "the white queen"  is         capturing  subpatterns. For example, if the string "the white queen" is
4248         matched against the pattern         matched against the pattern
4249    
4250           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
# Line 4226  SUBPATTERNS Line 4252  SUBPATTERNS
4252         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
4253         1 and 2. The maximum number of capturing subpatterns is 65535.         1 and 2. The maximum number of capturing subpatterns is 65535.
4254    
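The effect of (?: on numbering can be checked in the same way. This is again a sketch with Python's standard re module standing in for PCRE, assuming identical behaviour for this pattern.

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
m = pattern.fullmatch("the white queen")

print(pattern.groups)  # number of capturing subpatterns in the pattern
print(m.groups())      # the non-capturing group contributes nothing here
```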
4255         As a convenient shorthand, if any option settings are required  at  the         As  a  convenient shorthand, if any option settings are required at the
4256         start  of  a  non-capturing  subpattern,  the option letters may appear         start of a non-capturing subpattern,  the  option  letters  may  appear
4257         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
4258    
4259           (?i:saturday|sunday)           (?i:saturday|sunday)
4260           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
4261    
4262         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
4263         tried  from  left  to right, and options are not reset until the end of         tried from left to right, and options are not reset until  the  end  of
4264         the subpattern is reached, an option setting in one branch does  affect         the  subpattern is reached, an option setting in one branch does affect
4265         subsequent  branches,  so  the above patterns match "SUNDAY" as well as         subsequent branches, so the above patterns match "SUNDAY"  as  well  as
4266         "Saturday".         "Saturday".
4267    
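The shorthand form can be exercised with Python's standard re module, used here only as a stand-in for PCRE. Only the (?i:...) form is shown: Python handles an option setting in the middle of a pattern differently from PCRE, and recent Python versions reject it, so the second pattern above is not portable to this sketch.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")

# The option covers every branch of the group, not just the first.
for text in ("Saturday", "SUNDAY", "sunday", "monday"):
    print(text, bool(days.fullmatch(text)))
```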
4268    
4269  DUPLICATE SUBPATTERN NUMBERS  DUPLICATE SUBPATTERN NUMBERS
4270    
4271         Perl 5.10 introduced a feature whereby each alternative in a subpattern         Perl 5.10 introduced a feature whereby each alternative in a subpattern
4272         uses  the same numbers for its capturing parentheses. Such a subpattern         uses the same numbers for its capturing parentheses. Such a  subpattern
4273         starts with (?| and is itself a non-capturing subpattern. For  example,         starts  with (?| and is itself a non-capturing subpattern. For example,
4274         consider this pattern:         consider this pattern:
4275    
4276           (?|(Sat)ur|(Sun))day           (?|(Sat)ur|(Sun))day
4277    
4278         Because  the two alternatives are inside a (?| group, both sets of cap-         Because the two alternatives are inside a (?| group, both sets of  cap-
4279         turing parentheses are numbered one. Thus, when  the  pattern  matches,         turing  parentheses  are  numbered one. Thus, when the pattern matches,
4280         you  can  look  at captured substring number one, whichever alternative         you can look at captured substring number  one,  whichever  alternative
4281         matched. This construct is useful when you want to  capture  part,  but         matched.  This  construct  is useful when you want to capture part, but
4282         not all, of one of a number of alternatives. Inside a (?| group, paren-         not all, of one of a number of alternatives. Inside a (?| group, paren-
4283         theses are numbered as usual, but the number is reset at the  start  of         theses  are  numbered as usual, but the number is reset at the start of
4284         each  branch. The numbers of any capturing buffers that follow the sub-         each branch. The numbers of any capturing parentheses that  follow  the
4285         pattern start after the highest number used in any branch. The  follow-         subpattern  start after the highest number used in any branch. The fol-
4286         ing  example  is taken from the Perl documentation.  The numbers under-         lowing example is taken from the Perl documentation. The numbers under-
4287         neath show in which buffer the captured content will be stored.         neath show in which buffer the captured content will be stored.
4288    
4289           # before  ---------------branch-reset----------- after           # before  ---------------branch-reset----------- after
4290           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
4291           # 1            2         2  3        2     3     4           # 1            2         2  3        2     3     4
4292    
4293         A back reference to a numbered subpattern uses the  most  recent  value         A  back  reference  to a numbered subpattern uses the most recent value
4294         that  is  set  for that number by any subpattern. The following pattern         that is set for that number by any subpattern.  The  following  pattern
4295         matches "abcabc" or "defdef":         matches "abcabc" or "defdef":
4296    
4297           /(?|(abc)|(def))\1/           /(?|(abc)|(def))\1/
4298    
4299         In contrast, a recursive or "subroutine" call to a numbered  subpattern         In  contrast, a recursive or "subroutine" call to a numbered subpattern
4300         always  refers  to  the first one in the pattern with the given number.         always refers to the first one in the pattern with  the  given  number.
4301         The following pattern matches "abcabc" or "defabc":         The following pattern matches "abcabc" or "defabc":
4302    
4303           /(?|(abc)|(def))(?1)/           /(?|(abc)|(def))(?1)/
4304    
4305         If a condition test for a subpattern's having matched refers to a  non-         If  a condition test for a subpattern's having matched refers to a non-
4306         unique  number, the test is true if any of the subpatterns of that num-         unique number, the test is true if any of the subpatterns of that  num-
4307         ber have matched.         ber have matched.
4308    
4309         An alternative approach to using this "branch reset" feature is to  use         An  alternative approach to using this "branch reset" feature is to use
4310         duplicate named subpatterns, as described in the next section.         duplicate named subpatterns, as described in the next section.
4311    
4312    
4313  NAMED SUBPATTERNS  NAMED SUBPATTERNS
4314    
4315         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying capturing parentheses by number is simple, but  it  can  be
4316         very hard to keep track of the numbers in complicated  regular  expres-         very  hard  to keep track of the numbers in complicated regular expres-
4317         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions. Furthermore, if an  expression  is  modified,  the  numbers  may
4318         change. To help with this difficulty, PCRE supports the naming of  sub-         change.  To help with this difficulty, PCRE supports the naming of sub-
4319         patterns. This feature was not added to Perl until release 5.10. Python         patterns. This feature was not added to Perl until release 5.10. Python
4320         had the feature earlier, and PCRE introduced it at release  4.0,  using         had  the  feature earlier, and PCRE introduced it at release 4.0, using
4321         the  Python syntax. PCRE now supports both the Perl and the Python syn-         the Python syntax. PCRE now supports both the Perl and the Python  syn-
4322         tax. Perl allows identically numbered  subpatterns  to  have  different         tax.  Perl  allows  identically  numbered subpatterns to have different
4323         names, but PCRE does not.         names, but PCRE does not.
4324    
4325         In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)         In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
4326         or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References         or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
4327         to  capturing parentheses from other parts of the pattern, such as back         to capturing parentheses from other parts of the pattern, such as  back
4328         references, recursion, and conditions, can be made by name as  well  as         references,  recursion,  and conditions, can be made by name as well as
4329         by number.         by number.
4330    
4331         Names  consist  of  up  to  32 alphanumeric characters and underscores.         Names consist of up to  32  alphanumeric  characters  and  underscores.
4332         Named capturing parentheses are still  allocated  numbers  as  well  as         Named  capturing  parentheses  are  still  allocated numbers as well as
4333         names,  exactly as if the names were not present. The PCRE API provides         names, exactly as if the names were not present. The PCRE API  provides
4334         function calls for extracting the name-to-number translation table from         function calls for extracting the name-to-number translation table from
4335         a compiled pattern. There is also a convenience function for extracting         a compiled pattern. There is also a convenience function for extracting
4336         a captured substring by name.         a captured substring by name.
4337    
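The relationship between names and numbers can be seen in a short sketch. It uses the Python syntax (?P<name>...) with Python's standard re module, not the PCRE API; groupindex plays the role of the name-to-number translation table here, and the pattern and subject string are made up for the illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")
m = date.fullmatch("2010-11-17")

print(dict(date.groupindex))         # name-to-number table
print(m.group("year"), m.group(1))   # same substring, by name and by number
print(m.group(3))                    # the unnamed group is still number 3
```

The named groups take numbers 1 and 2 exactly as they would without names, so the unnamed group that follows them is number 3.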
4338         By default, a name must be unique within a pattern, but it is  possible         By  default, a name must be unique within a pattern, but it is possible
4339         to relax this constraint by setting the PCRE_DUPNAMES option at compile         to relax this constraint by setting the PCRE_DUPNAMES option at compile
4340         time. (Duplicate names are also always permitted for  subpatterns  with         time.  (Duplicate  names are also always permitted for subpatterns with
4341         the  same  number, set up as described in the previous section.) Dupli-         the same number, set up as described in the previous  section.)  Dupli-
4342         cate names can be useful for patterns where only one  instance  of  the         cate  names  can  be useful for patterns where only one instance of the
4343         named  parentheses  can  match. Suppose you want to match the name of a         named parentheses can match. Suppose you want to match the  name  of  a
4344         weekday, either as a 3-letter abbreviation or as the full name, and  in         weekday,  either as a 3-letter abbreviation or as the full name, and in
4345         both cases you want to extract the abbreviation. This pattern (ignoring         both cases you want to extract the abbreviation. This pattern (ignoring
4346         the line breaks) does the job:         the line breaks) does the job:
4347    
# Line 4325  NAMED SUBPATTERNS Line 4351  NAMED SUBPATTERNS
4351           (?<DN>Thu)(?:rsday)?|           (?<DN>Thu)(?:rsday)?|
4352           (?<DN>Sat)(?:urday)?           (?<DN>Sat)(?:urday)?
4353    
4354         There are five capturing substrings, but only one is ever set  after  a         There  are  five capturing substrings, but only one is ever set after a
4355         match.  (An alternative way of solving this problem is to use a "branch         match.  (An alternative way of solving this problem is to use a "branch
4356         reset" subpattern, as described in the previous section.)         reset" subpattern, as described in the previous section.)
4357    
4358         The convenience function for extracting the data by  name  returns  the         The  convenience  function  for extracting the data by name returns the
4359         substring  for  the first (and in this example, the only) subpattern of         substring for the first (and in this example, the only)  subpattern  of
4360         that name that matched. This saves searching  to  find  which  numbered         that  name  that  matched.  This saves searching to find which numbered
4361         subpattern it was.         subpattern it was.
4362    
4363         If  you  make  a  back  reference to a non-unique named subpattern from         If you make a back reference to  a  non-unique  named  subpattern  from
4364         elsewhere in the pattern, the one that corresponds to the first  occur-         elsewhere  in the pattern, the one that corresponds to the first occur-
4365         rence of the name is used. In the absence of duplicate numbers (see the         rence of the name is used. In the absence of duplicate numbers (see the
4366         previous section) this is the one with the lowest number. If you use  a         previous  section) this is the one with the lowest number. If you use a
4367         named  reference  in a condition test (see the section about conditions         named reference in a condition test (see the section  about  conditions
4368         below), either to check whether a subpattern has matched, or  to  check         below),  either  to check whether a subpattern has matched, or to check
4369         for  recursion,  all  subpatterns with the same name are tested. If the         for recursion, all subpatterns with the same name are  tested.  If  the
4370         condition is true for any one of them, the overall condition  is  true.         condition  is  true for any one of them, the overall condition is true.
4371         This is the same behaviour as testing by number. For further details of         This is the same behaviour as testing by number. For further details of
4372         the interfaces for handling named subpatterns, see the pcreapi documen-         the interfaces for handling named subpatterns, see the pcreapi documen-
4373         tation.         tation.
4374    
4375         Warning: You cannot use different names to distinguish between two sub-         Warning: You cannot use different names to distinguish between two sub-
4376         patterns with the same number because PCRE uses only the  numbers  when         patterns  with  the same number because PCRE uses only the numbers when
4377         matching. For this reason, an error is given at compile time if differ-         matching. For this reason, an error is given at compile time if differ-
4378         ent names are given to subpatterns with the same number.  However,  you         ent  names  are given to subpatterns with the same number. However, you
4379         can  give  the same name to subpatterns with the same number, even when         can give the same name to subpatterns with the same number,  even  when
4380         PCRE_DUPNAMES is not set.         PCRE_DUPNAMES is not set.
4381    
4382    
4383  REPETITION  REPETITION
4384    
4385         Repetition is specified by quantifiers, which can  follow  any  of  the         Repetition  is  specified  by  quantifiers, which can follow any of the
4386         following items:         following items:
4387    
4388           a literal data character           a literal data character
# Line 4364  REPETITION Line 4390  REPETITION
4390           the \C escape sequence           the \C escape sequence
4391           the \X escape sequence (in UTF-8 mode with Unicode properties)           the \X escape sequence (in UTF-8 mode with Unicode properties)
4392           the \R escape sequence           the \R escape sequence
4393           an escape such as \d that matches a single character           an escape such as \d or \pL that matches a single character
4394           a character class           a character class
4395           a back reference (see next section)           a back reference (see next section)
4396           a parenthesized subpattern (unless it is an assertion)           a parenthesized subpattern (unless it is an assertion)
4397           a recursive or "subroutine" call to a subpattern           a recursive or "subroutine" call to a subpattern
4398    
4399         The  general repetition quantifier specifies a minimum and maximum num-         The general repetition quantifier specifies a minimum and maximum  num-
4400         ber of permitted matches, by giving the two numbers in  curly  brackets         ber  of  permitted matches, by giving the two numbers in curly brackets
4401         (braces),  separated  by  a comma. The numbers must be less than 65536,         (braces), separated by a comma. The numbers must be  less  than  65536,
4402         and the first must be less than or equal to the second. For example:         and the first must be less than or equal to the second. For example:
4403    
4404           z{2,4}           z{2,4}
4405    
4406         matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a         matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
4407         special  character.  If  the second number is omitted, but the comma is         special character. If the second number is omitted, but  the  comma  is
4408         present, there is no upper limit; if the second number  and  the  comma         present,  there  is  no upper limit; if the second number and the comma
4409         are  both omitted, the quantifier specifies an exact number of required         are both omitted, the quantifier specifies an exact number of  required
4410         matches. Thus         matches. Thus
4411    
4412           [aeiou]{3,}           [aeiou]{3,}
# Line 4389  REPETITION Line 4415  REPETITION
4415    
4416           \d{8}           \d{8}
4417    
4418         matches exactly 8 digits. An opening curly bracket that  appears  in  a         matches  exactly  8  digits. An opening curly bracket that appears in a
4419         position  where a quantifier is not allowed, or one that does not match         position where a quantifier is not allowed, or one that does not  match
4420         the syntax of a quantifier, is taken as a literal character. For  exam-         the  syntax of a quantifier, is taken as a literal character. For exam-
4421         ple, {,6} is not a quantifier, but a literal string of four characters.         ple, {,6} is not a quantifier, but a literal string of four characters.
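
The bounded, open-ended, and exact quantifier forms described above can be tried out with a short sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption made purely for illustration: the two engines agree on these three forms but differ elsewhere (Python, for instance, reads {,6} as a quantifier with a minimum of zero, whereas PCRE as described here takes it literally). The helper name full is hypothetical.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four "z" characters.
bounded = [full(r"z{2,4}", s) for s in ("z", "zz", "zzzz", "zzzzz")]

# [aeiou]{3,}: three or more vowels, with no upper limit.
open_ended = [full(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]

# \d{8}: exactly eight digits.
exact = [full(r"\d{8}", s) for s in ("1234567", "12345678", "123456789")]

print(bounded, open_ended, exact)
```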
4422    
4423         In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to         In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
4424         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
4425         acters, each of which is represented by a two-byte sequence. Similarly,         acters, each of which is represented by a two-byte sequence. Similarly,
4426         when Unicode property support is available, \X{3} matches three Unicode         when Unicode property support is available, \X{3} matches three Unicode
4427         extended  sequences,  each of which may be several bytes long (and they         extended sequences, each of which may be several bytes long  (and  they
4428         may be of different lengths).         may be of different lengths).
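
As a rough illustration of quantifiers counting characters rather than bytes, the sketch below again uses Python's re module rather than PCRE itself (an assumption for illustration). Python string patterns always work on characters, and Python writes the code point as \u0100 instead of PCRE's \x{100}.

```python
import re

# Two copies of U+0100, each of which occupies two bytes in UTF-8.
subject = "\u0100\u0100"

# {2} applies to the whole character, not to one of its bytes.
match = re.fullmatch(r"\u0100{2}", subject)

chars_matched = match.end() - match.start()
bytes_in_subject = len(subject.encode("utf-8"))
print(chars_matched, bytes_in_subject)
```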
4429    
4430         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
4431         the previous item and the quantifier were not present. This may be use-         the previous item and the quantifier were not present. This may be use-
4432         ful for subpatterns that are referenced as subroutines  from  elsewhere         ful  for  subpatterns that are referenced as subroutines from elsewhere
4433         in the pattern. Items other than subpatterns that have a {0} quantifier         in the pattern (but see also the section entitled "Defining subpatterns
4434         are omitted from the compiled pattern.         for  use  by  reference only" below). Items other than subpatterns that
4435           have a {0} quantifier are omitted from the compiled pattern.
4436    
4437         For convenience, the three most common quantifiers have  single-charac-         For convenience, the three most common quantifiers have  single-charac-
4438         ter abbreviations:         ter abbreviations:
# Line 4636  BACK REFERENCES Line 4663  BACK REFERENCES
4663         subpattern is possible using named parentheses (see below).         subpattern is possible using named parentheses (see below).
4664    
4665         Another way of avoiding the ambiguity inherent in  the  use  of  digits         Another way of avoiding the ambiguity inherent in  the  use  of  digits
4666         following a backslash is to use the \g escape sequence, which is a fea-         following  a  backslash  is  to use the \g escape sequence. This escape
4667         ture introduced in Perl 5.10.  This  escape  must  be  followed  by  an         must be followed by an unsigned number or a negative number, optionally
4668         unsigned  number  or  a negative number, optionally enclosed in braces.         enclosed in braces. These examples are all identical:
        These examples are all identical:  
4669    
4670           (ring), \1           (ring), \1
4671           (ring), \g1           (ring), \g1
4672           (ring), \g{1}           (ring), \g{1}
4673    
4674         An unsigned number specifies an absolute reference without the  ambigu-         An  unsigned number specifies an absolute reference without the ambigu-
4675         ity that is present in the older syntax. It is also useful when literal         ity that is present in the older syntax. It is also useful when literal
4676         digits follow the reference. A negative number is a relative reference.         digits follow the reference. A negative number is a relative reference.
4677         Consider this example:         Consider this example:
# Line 4653  BACK REFERENCES Line 4679  BACK REFERENCES
4679           (abc(def)ghi)\g{-1}           (abc(def)ghi)\g{-1}
4680    
4681         The sequence \g{-1} is a reference to the most recently started         The sequence \g{-1} is a reference to the most recently started
4682         capturing subpattern before \g, that is, it is equivalent to \2. Similarly,         capturing subpattern before \g, that is, it is equivalent to \2. Similarly,
4683         \g{-2} would be equivalent to \1. The use of relative references can be         \g{-2} would be equivalent to \1. The use of relative references can be
4684         helpful in long patterns, and also in  patterns  that  are  created  by         helpful  in  long  patterns,  and  also in patterns that are created by
4685         joining together fragments that contain references within themselves.         joining together fragments that contain references within themselves.
4686    
4687         A  back  reference matches whatever actually matched the capturing sub-         A back reference matches whatever actually matched the  capturing  sub-
4688         pattern in the current subject string, rather  than  anything  matching         pattern  in  the  current subject string, rather than anything matching
4689         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
4690         of doing that). So the pattern         of doing that). So the pattern
4691    
4692           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4693    
4694         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
4695         not  "sense and responsibility". If caseful matching is in force at the         not "sense and responsibility". If caseful matching is in force at  the
4696         time of the back reference, the case of letters is relevant. For  exam-         time  of the back reference, the case of letters is relevant. For exam-
4697         ple,         ple,
4698    
4699           ((?i)rah)\s+\1           ((?i)rah)\s+\1
4700    
4701         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
4702         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
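
The two back reference examples above can be checked with the following sketch, which uses Python's re module as a stand-in for PCRE (an assumption for illustration). Recent Python versions reject an inline (?i) that is not at the very start of the pattern, so the scoped form (?i:rah) is substituted here; it is intended to have the same effect as ((?i)rah) in the text.

```python
import re

# The back reference repeats the text that group 1 actually matched.
sense = re.compile(r"(sens|respons)e and \1ibility")
sense_results = [
    sense.fullmatch(s) is not None
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
]

# Group 1 is matched caselessly, but \1 is compared casefully.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [
    rah.fullmatch(s) is not None for s in ("rah rah", "RAH RAH", "RAH rah")
]

print(sense_results, rah_results)
```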
4703    
4704         There are several different ways of writing back  references  to  named         There  are  several  different ways of writing back references to named
4705         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or         subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
4706         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's         \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
4707         unified back reference syntax, in which \g can be used for both numeric         unified back reference syntax, in which \g can be used for both numeric
4708         and named references, is also supported. We  could  rewrite  the  above         and  named  references,  is  also supported. We could rewrite the above
4709         example in any of the following ways:         example in any of the following ways:
4710    
4711           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
# Line 4687  BACK REFERENCES Line 4713  BACK REFERENCES
4713           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
4714           (?<p1>(?i)rah)\s+\g{p1}           (?<p1>(?i)rah)\s+\g{p1}
4715    
4716         A  subpattern  that  is  referenced  by  name may appear in the pattern         A subpattern that is referenced by  name  may  appear  in  the  pattern
4717         before or after the reference.         before or after the reference.
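
Of the named back reference syntaxes listed above, Python's re module accepts only the Python form, so the sketch below exercises just (?P<p1>...) with (?P=p1). Using Python in place of PCRE is an assumption made for illustration, and (?i:rah) again stands in for (?i)rah.

```python
import re

# Named group p1, referenced later by name instead of by number.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

hit = named.fullmatch("RAH RAH")
miss = named.fullmatch("RAH rah")

# The captured text is also available by name on the match object.
captured = hit.group("p1")
print(captured, miss)
```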
4718    
4719         There may be more than one back reference to the same subpattern. If  a         There  may be more than one back reference to the same subpattern. If a
4720         subpattern  has  not actually been used in a particular match, any back         subpattern has not actually been used in a particular match,  any  back
4721         references to it always fail by default. For example, the pattern         references to it always fail by default. For example, the pattern
4722    
4723           (a|(bc))\2           (a|(bc))\2
4724    
4725         always fails if it starts to match "a" rather than  "bc".  However,  if         always  fails  if  it starts to match "a" rather than "bc". However, if
4726         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
4727         ence to an unset value matches an empty string.         ence to an unset value matches an empty string.
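
The default behaviour for a back reference to an unset group can be seen in this sketch. It uses Python's re module, which also fails such a reference; treating it as a model of PCRE's default is an assumption, and nothing here reproduces PCRE_JAVASCRIPT_COMPAT.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set only when the "bc" alternative is taken.
via_bc = pattern.fullmatch("bcbc")

# Taking the "a" alternative leaves group 2 unset, so \2 cannot match.
via_a = pattern.search("a")

print(via_bc is not None, via_a is None)
```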
4728    
4729         Because there may be many capturing parentheses in a pattern, all  dig-         Because  there may be many capturing parentheses in a pattern, all dig-
4730         its  following a backslash are taken as part of a potential back refer-         its following a backslash are taken as part of a potential back  refer-
4731         ence number.  If the pattern continues with  a  digit  character,  some         ence  number.   If  the  pattern continues with a digit character, some
4732         delimiter  must  be  used  to  terminate  the  back  reference.  If the         delimiter must  be  used  to  terminate  the  back  reference.  If  the
4733         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
4734         syntax or an empty comment (see "Comments" below) can be used.         syntax or an empty comment (see "Comments" below) can be used.
4735    
4736     Recursive back references     Recursive back references
4737    
4738         A  back reference that occurs inside the parentheses to which it refers         A back reference that occurs inside the parentheses to which it  refers
4739         fails when the subpattern is first used, so, for example,  (a\1)  never         fails  when  the subpattern is first used, so, for example, (a\1) never
4740         matches.   However,  such references can be useful inside repeated sub-         matches.  However, such references can be useful inside  repeated  sub-
4741         patterns. For example, the pattern         patterns. For example, the pattern
4742    
4743           (a|b\1)+           (a|b\1)+
4744    
4745         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
4746         ation  of  the  subpattern,  the  back  reference matches the character         ation of the subpattern,  the  back  reference  matches  the  character
4747         string corresponding to the previous iteration. In order  for  this  to         string  corresponding  to  the previous iteration. In order for this to
4748         work,  the  pattern must be such that the first iteration does not need         work, the pattern must be such that the first iteration does  not  need
4749         to match the back reference. This can be done using alternation, as  in         to  match the back reference. This can be done using alternation, as in
4750         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
4751    
4752         Back  references of this type cause the group that they reference to be         Back references of this type cause the group that they reference to  be
4753         treated as an atomic group.  Once the whole group has been  matched,  a         treated  as  an atomic group.  Once the whole group has been matched, a
4754         subsequent  matching  failure cannot cause backtracking into the middle         subsequent matching failure cannot cause backtracking into  the  middle
4755         of the group.         of the group.
4756    
4757    
4758  ASSERTIONS  ASSERTIONS
4759    
4760         An assertion is a test on the characters  following  or  preceding  the         An  assertion  is  a  test on the characters following or preceding the
4761         current  matching  point that does not actually consume any characters.         current matching point that does not actually consume  any  characters.
4762         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
4763         described above.         described above.
4764    
4765         More  complicated  assertions  are  coded as subpatterns. There are two         More complicated assertions are coded as  subpatterns.  There  are  two
4766         kinds: those that look ahead of the current  position  in  the  subject         kinds:  those  that  look  ahead of the current position in the subject
4767         string,  and  those  that  look  behind  it. An assertion subpattern is         string, and those that look  behind  it.  An  assertion  subpattern  is
4768         matched in the normal way, except that it does not  cause  the  current         matched  in  the  normal way, except that it does not cause the current
4769         matching position to be changed.         matching position to be changed.
4770    
4771         Assertion  subpatterns  are  not  capturing subpatterns, and may not be         Assertion subpatterns are not capturing subpatterns,  and  may  not  be
4772         repeated, because it makes no sense to assert the  same  thing  several         repeated,  because  it  makes no sense to assert the same thing several
4773         times.  If  any kind of assertion contains capturing subpatterns within         times. If any kind of assertion contains capturing  subpatterns  within
4774         it, these are counted for the purposes of numbering the capturing  sub-         it,  these are counted for the purposes of numbering the capturing sub-
4775         patterns in the whole pattern.  However, substring capturing is carried         patterns in the whole pattern.  However, substring capturing is carried
4776         out only for positive assertions, because it does not  make  sense  for         out  only  for  positive assertions, because it does not make sense for
4777         negative assertions.         negative assertions.
4778    
4779     Lookahead assertions     Lookahead assertions
# Line 4757  ASSERTIONS Line 4783  ASSERTIONS
4783    
4784           \w+(?=;)           \w+(?=;)
4785    
4786         matches a word followed by a semicolon, but does not include the  semi-         matches  a word followed by a semicolon, but does not include the semi-
4787         colon in the match, and         colon in the match, and
4788    
4789           foo(?!bar)           foo(?!bar)
4790    
4791         matches  any  occurrence  of  "foo" that is not followed by "bar". Note         matches any occurrence of "foo" that is not  followed  by  "bar".  Note
4792         that the apparently similar pattern         that the apparently similar pattern
4793    
4794           (?!foo)bar           (?!foo)bar
4795    
4796         does not find an occurrence of "bar"  that  is  preceded  by  something         does  not  find  an  occurrence  of "bar" that is preceded by something
4797         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other than "foo"; it finds any occurrence of "bar" whatsoever,  because
4798         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
4799         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
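
The three lookahead patterns above behave as follows in a short sketch. Python's re module is used in place of PCRE (an assumption for illustration), and the subject strings are invented for the example.

```python
import re

# \w+(?=;) matches the word but leaves the semicolon out of the match.
word = re.match(r"\w+(?=;)", "total;").group()

# foo(?!bar) skips the "foo" in "foobar" and finds the one in "foobaz".
foo_start = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo)bar still finds "bar" inside "foobar": at the point where the
# assertion is tested, the next three characters are "bar", not "foo".
bar_span = re.search(r"(?!foo)bar", "foobar").span()

print(word, foo_start, bar_span)
```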
4800    
4801         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
4802         most convenient way to do it is  with  (?!)  because  an  empty  string         most  convenient  way  to  do  it  is with (?!) because an empty string
4803         always  matches, so an assertion that requires there not to be an empty         always matches, so an assertion that requires there not to be an  empty
4804         string must always fail.   The  Perl  5.10  backtracking  control  verb         string must always fail.  The backtracking control verb (*FAIL) or (*F)
4805         (*FAIL) or (*F) is essentially a synonym for (?!).         is essentially a synonym for (?!).
4806    
4807     Lookbehind assertions     Lookbehind assertions
4808    
4809         Lookbehind  assertions start with (?<= for positive assertions and (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
4810         for negative assertions. For example,         for negative assertions. For example,
4811    
4812           (?<!foo)bar           (?<!foo)bar
4813    
4814         does find an occurrence of "bar" that is not  preceded  by  "foo".  The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
4815         contents  of  a  lookbehind  assertion are restricted such that all the         contents of a lookbehind assertion are restricted  such  that  all  the
4816         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4817         eral  top-level  alternatives,  they  do  not all have to have the same         eral top-level alternatives, they do not all  have  to  have  the  same
4818         fixed length. Thus         fixed length. Thus
4819    
4820           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 4797  ASSERTIONS Line 4823  ASSERTIONS
4823    
4824           (?<!dogs?|cats?)           (?<!dogs?|cats?)
4825    
4826         causes an error at compile time. Branches that match  different  length         causes  an  error at compile time. Branches that match different length
4827         strings  are permitted only at the top level of a lookbehind assertion.         strings are permitted only at the top level of a lookbehind  assertion.
4828         This is an extension compared with Perl (5.8 and 5.10), which  requires         This is an extension compared with Perl, which requires all branches to
4829         all branches to match the same length of string. An assertion such as         match the same length of string. An assertion such as
4830    
4831           (?<=ab(c|de))           (?<=ab(c|de))
4832    
4833         is  not  permitted,  because  its single top-level branch can match two         is not permitted, because its single top-level  branch  can  match  two
4834         different lengths, but it is acceptable to PCRE if rewritten to use two         different lengths, but it is acceptable to PCRE if rewritten to use two
4835         top-level branches:         top-level branches:
4836    
4837           (?<=abc|abde)           (?<=abc|abde)
4838    
4839         In some cases, the Perl 5.10 escape sequence \K (see above) can be used         In some cases, the escape sequence \K (see above) can be  used  instead
4840         instead of  a  lookbehind  assertion  to  get  round  the  fixed-length         of a lookbehind assertion to get round the fixed-length restriction.
        restriction.  
4841    
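
A negative lookbehind gives the effect that (?!foo)bar could not. The sketch below uses Python's re module as a stand-in for PCRE (an assumption for illustration). Python is stricter than PCRE about lookbehind alternatives of different lengths, so the (?<=bullock|donkey) example is not reproduced here.

```python
import re

subject = "foobar bazbar"

# (?<!foo)bar rejects the "bar" at offset 3, which is preceded by "foo",
# and accepts the one at offset 10, which is preceded by "baz".
starts = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]

# For comparison, a plain search for "bar" finds both occurrences.
all_starts = [m.start() for m in re.finditer(r"bar", subject)]

print(starts, all_starts)
```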
4842         The  implementation  of lookbehind assertions is, for each alternative,         The  implementation  of lookbehind assertions is, for each alternative,
4843         to temporarily move the current position back by the fixed  length  and         to temporarily move the current position back by the fixed  length  and
# Line 5048  COMMENTS Line 5073  COMMENTS
5073         ters are interpreted as newlines is controlled by the options passed to         ters are interpreted as newlines is controlled by the options passed to
5074         pcre_compile() or by a special sequence at the start of the pattern, as         pcre_compile() or by a special sequence at the start of the pattern, as
5075         described in the section entitled  "Newline  conventions"  above.  Note         described in the section entitled  "Newline  conventions"  above.  Note
5076         that  end  of this type of comment is a literal newline sequence in the         that  the  end of this type of comment is a literal newline sequence in
5077         pattern; escape sequences that happen to represent  a  newline  do  not         the pattern; escape sequences that happen to represent a newline do not
5078         count.   For  example, consider this pattern when PCRE_EXTENDED is set,         count.  For  example,  consider this pattern when PCRE_EXTENDED is set,
5079         and the default newline convention is in force:         and the default newline convention is in force:
5080    
5081           abc #comment \n still comment           abc #comment \n still comment
# Line 5114  RECURSIVE PATTERNS Line 5139  RECURSIVE PATTERNS
5139         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
5140    
5141         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
5142         tricky. This is made easier by the use of relative references  (a  Perl         tricky. This is made easier by the use of relative references.  Instead
5143         5.10  feature).   Instead  of  (?1)  in the pattern above you can write         of (?1) in the pattern above you can write (?-2) to refer to the second
5144         (?-2) to refer to the second most recently opened parentheses preceding         most recently opened parentheses  preceding  the  recursion.  In  other
5145         the  recursion.  In  other  words,  a  negative number counts capturing         words,  a  negative  number counts capturing parentheses leftwards from
5146         parentheses leftwards from the point at which it is encountered.         the point at which it is encountered.
5147    
5148         It is also possible to refer to  subsequently  opened  parentheses,  by         It is also possible to refer to  subsequently  opened  parentheses,  by
5149         writing  references  such  as (?+2). However, these cannot be recursive         writing  references  such  as (?+2). However, these cannot be recursive
# Line 5624  AUTHOR Line 5649  AUTHOR
5649    
5650  REVISION  REVISION
5651    
5652         Last updated: 31 October 2010         Last updated: 17 November 2010
5653         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
5654  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5655    
# Line 6117  PARTIAL MATCHING USING pcre_exec() Line 6142  PARTIAL MATCHING USING pcre_exec()
6142         or  $  are  encountered  at  the  end  of  the  subject,  the result is         or  $  are  encountered  at  the  end  of  the  subject,  the result is
6143         PCRE_ERROR_PARTIAL.         PCRE_ERROR_PARTIAL.
6144    
6145           Setting PCRE_PARTIAL_HARD also affects the way pcre_exec() checks UTF-8
6146           subject  strings  for  validity.  Normally,  an  invalid UTF-8 sequence
6147           causes the error PCRE_ERROR_BADUTF8. However, in the special case of  a
6148           truncated  UTF-8 character at the end of the subject, PCRE_ERROR_SHORT-
6149           UTF8 is returned when PCRE_PARTIAL_HARD is set.
6150    
6151     Comparing hard and soft partial matching     Comparing hard and soft partial matching
6152    
6153         The difference between the two partial matching options can  be  illus-         The difference between the two partial matching options can  be  illus-
# Line 6361  ISSUES WITH MULTI-SEGMENT MATCHING Line 6392  ISSUES WITH MULTI-SEGMENT MATCHING
6392           data> gsb\R\P\P\D           data> gsb\R\P\P\D
6393           Partial match: gsb           Partial match: gsb
6394    
   
6395         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
6396         start  with  the  same  pattern  item  may  not  work  as expected when         start  with  the  same  pattern  item  may  not  work  as expected when
6397         PCRE_DFA_RESTART is used with pcre_dfa_exec().  For  example,  consider         PCRE_DFA_RESTART is used with pcre_dfa_exec().  For  example,  consider
# Line 6408  AUTHOR Line 6438  AUTHOR
6438    
6439  REVISION  REVISION
6440    
6441         Last updated: 22 October 2010         Last updated: 07 November 2010
6442         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6443  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6444    

