Diff of /code/trunk/doc/pcre.txt

revision 43 by nigel, Sat Feb 24 21:39:21 2007 UTC revision 47 by nigel, Sat Feb 24 21:39:29 2007 UTC
# Line 353  INFORMATION ABOUT A PATTERN Line 353  INFORMATION ABOUT A PATTERN
353       Return information about the first character of any  matched       Return information about the first character of any  matched
354       string,  for  a  non-anchored  pattern.  If there is a fixed       string,  for  a  non-anchored  pattern.  If there is a fixed
355       first   character,   e.g.   from   a   pattern    such    as       first   character,   e.g.   from   a   pattern    such    as
356       (cat|cow|coyote), then it is returned in the integer pointed       (cat|cow|coyote),  it  is returned in the integer pointed to
357       to by where. Otherwise, if either       by where. Otherwise, if either
358    
359       (a) the pattern was compiled with the PCRE_MULTILINE option,       (a) the pattern was compiled with the PCRE_MULTILINE option,
360       and every branch starts with "^", or       and every branch starts with "^", or
# Line 363  INFORMATION ABOUT A PATTERN Line 363  INFORMATION ABOUT A PATTERN
363       PCRE_DOTALL is not set (if it were set, the pattern would be       PCRE_DOTALL is not set (if it were set, the pattern would be
364       anchored),       anchored),
365    
366       then -1 is returned, indicating  that  the  pattern  matches       -1 is returned, indicating that the pattern matches only  at
367       only  at  the  start  of  a subject string or after any "\n"       the  start  of a subject string or after any "\n" within the
368       within the string. Otherwise -2 is  returned.  For  anchored       string. Otherwise -2 is returned.  For anchored patterns, -2
369       patterns, -2 is returned.       is returned.
370    
371         PCRE_INFO_FIRSTTABLE         PCRE_INFO_FIRSTTABLE
372    
# Line 622  EXTRACTING CAPTURED SUBSTRINGS Line 622  EXTRACTING CAPTURED SUBSTRINGS
622       entire regular expression. This is  the  value  returned  by       entire regular expression. This is  the  value  returned  by
623       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()
624       returned zero, indicating that it ran out of space in  ovec-       returned zero, indicating that it ran out of space in  ovec-
625       tor, then the value passed as stringcount should be the size       tor,  the  value passed as stringcount should be the size of
626       of the vector divided by three.       the vector divided by three.
627    
628       The functions pcre_copy_substring() and pcre_get_substring()       The functions pcre_copy_substring() and pcre_get_substring()
629       extract a single substring, whose number is given as string-       extract a single substring, whose number is given as string-
# Line 739  DIFFERENCES FROM PERL Line 739  DIFFERENCES FROM PERL
739       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value
740       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2
741       unset.    However,    if   the   pattern   is   changed   to       unset.    However,    if   the   pattern   is   changed   to
742       /^(aa(b(b))?)+$/ then $2 (and $3) get set.       /^(aa(b(b))?)+$/ then $2 (and $3) are set.
743    
744       In Perl 5.004 $2 is set in both cases, and that is also true       In Perl 5.004 $2 is set in both cases, and that is also true
745       of PCRE. If in the future Perl changes to a consistent state       of PCRE. If in the future Perl changes to a consistent state
# Line 1056  FULL STOP (PERIOD, DOT) Line 1056  FULL STOP (PERIOD, DOT)
1056       Outside a character class, a dot in the pattern matches  any       Outside a character class, a dot in the pattern matches  any
1057       one character in the subject, including a non-printing char-       one character in the subject, including a non-printing char-
1058       acter, but not (by default)  newline.   If  the  PCRE_DOTALL       acter, but not (by default)  newline.   If  the  PCRE_DOTALL
1059       option  is  set,  then dots match newlines as well. The han-       option  is set, dots match newlines as well. The handling of
1060       dling of dot is entirely independent of the handling of cir-       dot is entirely independent of the  handling  of  circumflex
1061       cumflex  and  dollar,  the only relationship being that they       and  dollar,  the  only  relationship  being  that they both
1062       both involve newline characters.  Dot has no special meaning       involve newline characters. Dot has no special meaning in  a
1063       in a character class.       character class.
1064    
1065    
1066    
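The dot behaviour described in this hunk can be tried out in a short sketch. It uses Python's re module rather than PCRE itself, an assumption made only to keep the example self-contained; re.DOTALL stands in for the PCRE_DOTALL option, and the subject strings are illustrative.

```python
import re

subject = "a\nb"

# By default a dot does not match a newline.
default_match = re.search(r"a.b", subject)

# With DOTALL set, a dot matches a newline as well.
dotall_match = re.search(r"a.b", subject, re.DOTALL)

# Inside a character class a dot has no special meaning:
# it matches only a literal ".".
class_match = re.search(r"a[.]b", "a.b")
class_miss = re.search(r"a[.]b", "axb")

print("default:", default_match)
print("dotall:", dotall_match)
print("class, literal dot:", class_match)
print("class, other character:", class_miss)
```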
# Line 1406  REPETITION Line 1406  REPETITION
1406       fails, because it matches  the  entire  string  due  to  the       fails, because it matches  the  entire  string  due  to  the
1407       greediness of the .*  item.       greediness of the .*  item.
1408    
1409       However, if a quantifier is followed  by  a  question  mark,       However, if a quantifier is followed by a question mark,  it
1410       then it ceases to be greedy, and instead matches the minimum       ceases  to be greedy, and instead matches the minimum number
1411       number of times possible, so the pattern       of times possible, so the pattern
1412    
1413         /\*.*?\*/         /\*.*?\*/
1414    
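As a rough illustration of the difference between the greedy and the minimizing form of the quantifier, the sketch below runs both patterns over a string containing two C comments. It uses Python's re module as a stand-in for PCRE (an assumption; the two agree on this syntax), and the subject string is made up for the example.

```python
import re

# Illustrative subject containing two C comments.
subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* takes as much as possible, so one match spans both comments.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? takes as little as possible, so each comment
# is matched on its own.
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy)
print(lazy)
```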
# Line 1425  REPETITION Line 1425  REPETITION
1425       that is the only way the rest of the pattern matches.       that is the only way the rest of the pattern matches.
1426    
1427       If the PCRE_UNGREEDY option is set (an option which  is  not       If the PCRE_UNGREEDY option is set (an option which  is  not
1428       available  in  Perl)  then the quantifiers are not greedy by       available  in  Perl),  the  quantifiers  are  not  greedy by
1429       default, but individual ones can be made greedy by following       default, but individual ones can be made greedy by following
1430       them  with  a  question mark. In other words, it inverts the       them  with  a  question mark. In other words, it inverts the
1431       default behaviour.       default behaviour.
# Line 1437  REPETITION Line 1437  REPETITION
1437    
1438       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
1439       option (equivalent to Perl's /s) is set, thus allowing the .       option (equivalent to Perl's /s) is set, thus allowing the .
1440       to match newlines, then the pattern is implicitly  anchored,       to match  newlines,  the  pattern  is  implicitly  anchored,
1441       because whatever follows will be tried against every charac-       because whatever follows will be tried against every charac-
1442       ter position in the subject string, so there is no point  in       ter position in the subject string, so there is no point  in
1443       retrying  the overall match at any position after the first.       retrying  the overall match at any position after the first.
# Line 1490  BACK REFERENCES Line 1490  BACK REFERENCES
1490    
1491       matches "sense and sensibility" and "response and  responsi-       matches "sense and sensibility" and "response and  responsi-
1492       bility",  but  not  "sense  and  responsibility". If caseful       bility",  but  not  "sense  and  responsibility". If caseful
1493       matching is in force at the time of the back reference, then       matching is in force at the time of the back reference,  the
1494       the case of letters is relevant. For example,       case of letters is relevant. For example,
1495    
1496         ((?i)rah)\s+\1         ((?i)rah)\s+\1
1497    
# Line 1501  BACK REFERENCES Line 1501  BACK REFERENCES
1501    
1502       There may be more than one back reference to the  same  sub-       There may be more than one back reference to the  same  sub-
1503       pattern.  If  a  subpattern  has not actually been used in a       pattern.  If  a  subpattern  has not actually been used in a
1504       particular match, then any  back  references  to  it  always       particular match, any back references to it always fail. For
1505       fail. For example, the pattern       example, the pattern
1506    
1507         (a|(bc))\2         (a|(bc))\2
1508    
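The rule that a back reference to an unused subpattern always fails can be checked with a small sketch. It relies on Python's re module, which is assumed here to behave like PCRE on this point; the subject strings are illustrative.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# The second alternative is taken, so group 2 is set to "bc"
# and the back reference \2 can match the following "bc".
used = pattern.fullmatch("bcbc")

# The first alternative is taken, group 2 is never set,
# so the back reference \2 fails at every starting position.
unused = pattern.search("aa")

print("group 2 used:", used)
print("group 2 unused:", unused)
```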
# Line 1510  BACK REFERENCES Line 1510  BACK REFERENCES
1510       Because  there  may  be up to 99 back references, all digits       Because  there  may  be up to 99 back references, all digits
1511       following the backslash are taken as  part  of  a  potential       following the backslash are taken as  part  of  a  potential
1512       back reference number. If the pattern continues with a digit       back reference number. If the pattern continues with a digit
1513       character, then some delimiter must be used to terminate the       character, some delimiter must be used to terminate the back
1514       back reference. If the PCRE_EXTENDED option is set, this can       reference.   If the PCRE_EXTENDED option is set, this can be
1515       be whitespace.  Otherwise an empty comment can be used.       whitespace. Otherwise an empty comment can be used.
1516    
1517       A back reference that occurs inside the parentheses to which       A back reference that occurs inside the parentheses to which
1518       it  refers  fails when the subpattern is first used, so, for       it  refers  fails when the subpattern is first used, so, for
# Line 1612  ASSERTIONS Line 1612  ASSERTIONS
1612       matches "foo" preceded by three digits that are  not  "999".       matches "foo" preceded by three digits that are  not  "999".
1613       Notice  that each of the assertions is applied independently       Notice  that each of the assertions is applied independently
1614       at the same point in the subject string. First  there  is  a       at the same point in the subject string. First  there  is  a
1615       check  that  the  previous  three characters are all digits,       check that the previous three characters are all digits, and
1616       then there is a check that the same three characters are not       then there is a check that the same three characters are not
1617       "999".   This  pattern  does not match "foo" preceded by six       "999".   This  pattern  does not match "foo" preceded by six
1618       characters, the first of which are digits and the last three       characters, the first of which are digits and the last three
# Line 1713  ONCE-ONLY SUBPATTERNS Line 1713  ONCE-ONLY SUBPATTERNS
1713    
1714         ^.*abcd$         ^.*abcd$
1715    
1716       then the initial .* matches the entire string at first,  but       the initial .* matches the entire string at first, but  when
1717       when  this  fails  (because  there  is no following "a"), it       this  fails  (because  there  is no following "a"), it back-
1718       backtracks to match all but the last character, then all but       tracks to match all but the last character, then all but the
1719       the  last  two  characters, and so on. Once again the search       last  two  characters,  and so on. Once again the search for
1720       for "a" covers the entire string, from right to left, so  we       "a" covers the entire string, from right to left, so we  are
1721       are no better off. However, if the pattern is written as       no better off. However, if the pattern is written as
1722    
1723         ^(?>.*)(?<=abcd)         ^(?>.*)(?<=abcd)
1724    
1725       then there can be no backtracking for the .*  item;  it  can       there can be no backtracking for the .* item; it  can  match
1726       match  only  the  entire  string.  The subsequent lookbehind       only  the entire string. The subsequent lookbehind assertion
1727       assertion does a single test on the last four characters. If       does a single test on the last four characters. If it fails,
1728       it  fails,  the  match  fails immediately. For long strings,       the match fails immediately. For long strings, this approach
1729       this approach makes a significant difference to the process-       makes a significant difference to the processing time.
      ing time.  
1730    
1731       When a pattern contains an unlimited repeat inside a subpat-       When a pattern contains an unlimited repeat inside a subpat-
1732       tern  that  can  itself  be  repeated an unlimited number of       tern  that  can  itself  be  repeated an unlimited number of
# Line 1777  CONDITIONAL SUBPATTERNS Line 1776  CONDITIONAL SUBPATTERNS
1776       error occurs.       error occurs.
1777    
1778       There are two kinds of condition. If the  text  between  the       There are two kinds of condition. If the  text  between  the
1779       parentheses  consists  of  a  sequence  of  digits, then the       parentheses  consists of a sequence of digits, the condition
1780       condition is satisfied if the capturing subpattern  of  that       is satisfied if the capturing subpattern of that number  has
1781       number  has  previously matched. Consider the following pat-       previously  matched.  Consider  the following pattern, which
1782       tern, which contains non-significant white space to make  it       contains non-significant white space to make it  more  read-
1783       more  readable  (assume  the  PCRE_EXTENDED  option)  and to       able (assume the PCRE_EXTENDED option) and to divide it into
1784       divide it into three parts for ease of discussion:       three parts for ease of discussion:
1785    
1786         ( \( )?    [^()]+    (?(1) \) )         ( \( )?    [^()]+    (?(1) \) )
1787    
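The three-part pattern above can be exercised with the following sketch. It uses Python's re module with re.VERBOSE in place of PCRE with PCRE_EXTENDED (an assumption; both accept this conditional syntax). fullmatch is used so that the whole subject has to be consumed, and the subject strings are illustrative.

```python
import re

# Part 1: optional opening parenthesis, captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: a closing parenthesis is required only if group 1 matched.
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

balanced = pattern.fullmatch("(abc)")
bare = pattern.fullmatch("abc")
missing_close = pattern.fullmatch("(abc")
stray_close = pattern.fullmatch("abc)")

for name, result in [("(abc)", balanced), ("abc", bare),
                     ("(abc", missing_close), ("abc)", stray_close)]:
    print(name, "matches" if result else "does not match")
```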
# Line 1888  RECURSIVE PATTERNS Line 1887  RECURSIVE PATTERNS
1887    
1888         \( ( ( (?>[^()]+) | (?R) )* ) \)         \( ( ( (?>[^()]+) | (?R) )* ) \)
1889            ^                        ^            ^                        ^
1890            ^                        ^ then the string they capture            ^                        ^ the string they  capture  is
1891       is "ab(cd)ef", the contents of the top level parentheses. If       "ab(cd)ef",  the  contents  of the top level parentheses. If
1892       there are more than 15 capturing parentheses in  a  pattern,       there are more than 15 capturing parentheses in  a  pattern,
1893       PCRE  has  to  obtain  extra  memory  to store data during a       PCRE  has  to  obtain  extra  memory  to store data during a
1894       recursion, which it does by using  pcre_malloc,  freeing  it       recursion, which it does by using  pcre_malloc,  freeing  it
