/[pcre]/code/trunk/doc/html/pcreapi.html

Diff of /code/trunk/doc/html/pcreapi.html


revision 571 by ph10, Sat Nov 6 17:10:00 2010 UTC revision 572 by ph10, Wed Nov 17 17:55:57 2010 UTC
# Line 443  If <i>errptr</i> is NULL, <b>pcre_compil Line 443  If <i>errptr</i> is NULL, <b>pcre_compil
443  Otherwise, if compilation of a pattern fails, <b>pcre_compile()</b> returns  Otherwise, if compilation of a pattern fails, <b>pcre_compile()</b> returns
444  NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual  NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual
445  error message. This is a static string that is part of the library. You must  error message. This is a static string that is part of the library. You must
446  not try to free it. The byte offset from the start of the pattern to the  not try to free it. The offset from the start of the pattern to the byte that
447  character that was being processed when the error was discovered is placed in  was being processed when the error was discovered is placed in the variable
448  the variable pointed to by <i>erroffset</i>, which must not be NULL. If it is,  pointed to by <i>erroffset</i>, which must not be NULL. If it is, an immediate
449  an immediate error is given. Some errors are not detected until checks are  error is given. Some errors are not detected until checks are carried out when
450  carried out when the whole pattern has been scanned; in this case the offset is  the whole pattern has been scanned; in this case the offset is set to the end
451  set to the end of the pattern.  of the pattern.
452    </P>
453    <P>
454    Note that the offset is in bytes, not characters, even in UTF-8 mode. It may
455    point into the middle of a UTF-8 character (for example, when
456    PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
457  </P>  </P>
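Because the offset is counted in bytes, a caller that wants to show the error position to a user in UTF-8 mode has to convert it to a character index. The following is a minimal sketch of one way to do that. The helper name `byte_offset_to_char_index` is hypothetical and not part of the PCRE API; it assumes a zero-terminated pattern and treats any byte of the form 10xxxxxx as a continuation byte.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of PCRE): convert the byte offset that
 * pcre_compile() stores in *erroffset into a zero-based character index.
 * If the offset points into the middle of a UTF-8 character, the index of
 * that character is returned.
 */
static size_t byte_offset_to_char_index(const char *pattern, size_t byte_offset)
{
    size_t chars = 0;   /* character starts seen before the offset */
    size_t i;

    for (i = 0; i < byte_offset && pattern[i] != '\0'; i++) {
        /* Every byte that is not a continuation byte starts a character. */
        if (((unsigned char)pattern[i] & 0xC0) != 0x80)
            chars++;
    }

    /* Offset lands on a continuation byte: step back to the owning character. */
    if (((unsigned char)pattern[i] & 0xC0) == 0x80 && chars > 0)
        chars--;

    return chars;
}
```

For the pattern bytes `a`, `0xC3`, `0xA9`, `z`, byte offsets 1 and 2 both map to character index 1, because both lie inside the same two-byte character.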
458  <P>  <P>
459  If <b>pcre_compile2()</b> is used instead of <b>pcre_compile()</b>, and the  If <b>pcre_compile2()</b> is used instead of <b>pcre_compile()</b>, and the
# Line 528  pattern. Line 533  pattern.
533  <pre>  <pre>
534    PCRE_DOTALL    PCRE_DOTALL
535  </pre>  </pre>
536  If this bit is set, a dot metacharater in the pattern matches all characters,  If this bit is set, a dot metacharacter in the pattern matches a character of
537  including those that indicate newline. Without it, a dot does not match when  any value, including one that indicates a newline. However, it only ever
538  the current position is at a newline. This option is equivalent to Perl's /s  matches one character, even if newlines are coded as CRLF. Without this option,
539  option, and it can be changed within a pattern by a (?s) option setting. A  a dot does not match when the current position is at a newline. This option is
540  negative class such as [^a] always matches newline characters, independent of  equivalent to Perl's /s option, and it can be changed within a pattern by a
541  the setting of this option.  (?s) option setting. A negative class such as [^a] always matches newline
542    characters, independent of the setting of this option.
543  <pre>  <pre>
544    PCRE_DUPNAMES    PCRE_DUPNAMES
545  </pre>  </pre>
# Line 554  ignored. This is equivalent to Perl's /x Line 560  ignored. This is equivalent to Perl's /x
560  pattern by a (?x) option setting.  pattern by a (?x) option setting.
561  </P>  </P>
562  <P>  <P>
563    Which characters are interpreted as newlines
564    is controlled by the options passed to <b>pcre_compile()</b> or by a special
565    sequence at the start of the pattern, as described in the section entitled
566    <a href="pcrepattern.html#newlines">"Newline conventions"</a>
567    in the <b>pcrepattern</b> documentation. Note that the end of this type of
568    comment is a literal newline sequence in the pattern; escape sequences that
569    happen to represent a newline do not count.
570    </P>
571    <P>
572  This option makes it possible to include comments inside complicated patterns.  This option makes it possible to include comments inside complicated patterns.
573  Note, however, that this applies only to data characters. Whitespace characters  Note, however, that this applies only to data characters. Whitespace characters
574  may never appear within special character sequences in a pattern, for example  may never appear within special character sequences in a pattern, for example
575  within the sequence (?( which introduces a conditional subpattern.  within the sequence (?( that introduces a conditional subpattern.
576  <pre>  <pre>
577    PCRE_EXTRA    PCRE_EXTRA
578  </pre>  </pre>
# Line 637  PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is Line 652  PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is
652  other combinations may yield unused numbers and cause an error.  other combinations may yield unused numbers and cause an error.
653  </P>  </P>
654  <P>  <P>
655  The only time that a line break is specially recognized when compiling a  The only time that a line break in a pattern is specially recognized when
656  pattern is if PCRE_EXTENDED is set, and an unescaped # outside a character  compiling is when PCRE_EXTENDED is set. CR and LF are whitespace characters,
657  class is encountered. This indicates a comment that lasts until after the next  and so are ignored in this mode. Also, an unescaped # outside a character class
658  line break sequence. In other circumstances, line break sequences are treated  indicates a comment that lasts until after the next line break sequence. In
659  as literal data, except that in PCRE_EXTENDED mode, both CR and LF are treated  other circumstances, line break sequences in patterns are treated as literal
660  as whitespace characters and are therefore ignored.  data.
661  </P>  </P>
662  <P>  <P>
663  The newline option that is set at compile time becomes the default that is used  The newline option that is set at compile time becomes the default that is used
# Line 658  in Perl. Line 673  in Perl.
673  <pre>  <pre>
674    PCRE_UCP    PCRE_UCP
675  </pre>  </pre>
676  This option changes the way PCRE processes \b, \d, \s, \w, and some of the  This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
677  POSIX character classes. By default, only ASCII characters are recognized, but  \w, and some of the POSIX character classes. By default, only ASCII characters
678  if PCRE_UCP is set, Unicode properties are used instead to classify characters.  are recognized, but if PCRE_UCP is set, Unicode properties are used instead to
679  More details are given in the section on  classify characters. More details are given in the section on
680  <a href="pcre.html#genericchartypes">generic character types</a>  <a href="pcre.html#genericchartypes">generic character types</a>
681  in the  in the
682  <a href="pcrepattern.html"><b>pcrepattern</b></a>  <a href="pcrepattern.html"><b>pcrepattern</b></a>
# Line 851  matching. Line 866  matching.
866  The two optimizations just described can be disabled by setting the  The two optimizations just described can be disabled by setting the
867  PCRE_NO_START_OPTIMIZE option when calling <b>pcre_exec()</b> or  PCRE_NO_START_OPTIMIZE option when calling <b>pcre_exec()</b> or
868  <b>pcre_dfa_exec()</b>. You might want to do this if your pattern contains  <b>pcre_dfa_exec()</b>. You might want to do this if your pattern contains
869  callouts, or make use of (*MARK), and you make use of these in cases where  callouts or (*MARK), and you want to make use of these facilities in cases
870  matching fails. See the discussion of PCRE_NO_START_OPTIMIZE  where matching fails. See the discussion of PCRE_NO_START_OPTIMIZE
871  <a href="#execoptions">below.</a>  <a href="#execoptions">below.</a>
872  <a name="localesupport"></a></P>  <a name="localesupport"></a></P>
873  <br><a name="SEC10" href="#TOC1">LOCALE SUPPORT</a><br>  <br><a name="SEC10" href="#TOC1">LOCALE SUPPORT</a><br>
# Line 1443  if that fails, by advancing the starting Line 1458  if that fails, by advancing the starting
1458  ordinary match again. There is some code that demonstrates how to do this in  ordinary match again. There is some code that demonstrates how to do this in
1459  the  the
1460  <a href="pcredemo.html"><b>pcredemo</b></a>  <a href="pcredemo.html"><b>pcredemo</b></a>
1461  sample program. In the most general case, you have to check to see if the  sample program. In the most general case, you have to check to see if the
1462  newline convention recognizes CRLF as a newline, and if so, and the current  newline convention recognizes CRLF as a newline, and if so, and the current
1463  character is CR followed by LF, advance the starting offset by two characters  character is CR followed by LF, advance the starting offset by two characters
1464  instead of one.  instead of one.
1465  <pre>  <pre>
# Line 1504  strings in the Line 1519  strings in the
1519  in the main  in the main
1520  <a href="pcre.html"><b>pcre</b></a>  <a href="pcre.html"><b>pcre</b></a>
1521  page. If an invalid UTF-8 sequence of bytes is found, <b>pcre_exec()</b> returns  page. If an invalid UTF-8 sequence of bytes is found, <b>pcre_exec()</b> returns
1522  the error PCRE_ERROR_BADUTF8. If <i>startoffset</i> contains a value that does  the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is
1523  not point to the start of a UTF-8 character (or to the end of the subject),  a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. If
1524  PCRE_ERROR_BADUTF8_OFFSET is returned.  <i>startoffset</i> contains a value that does not point to the start of a UTF-8
1525    character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is
1526    returned.
1527  </P>  </P>
1528  <P>  <P>
1529  If you already know that your subject is valid, and you want to skip these  If you already know that your subject is valid, and you want to skip these
# Line 1536  but only if no complete match can be fou Line 1553  but only if no complete match can be fou
1553  If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a  If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a
1554  partial match is found, <b>pcre_exec()</b> immediately returns  partial match is found, <b>pcre_exec()</b> immediately returns
1555  PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words,  PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words,
1556  when PCRE_PARTIAL_HARD is set, a partial match is considered to be more  when PCRE_PARTIAL_HARD is set, a partial match is considered to be more
1557  important than an alternative complete match.  important than an alternative complete match.
1558  </P>  </P>
1559  <P>  <P>
# Line 1552  The string to be matched by <b>pcre_exec Line 1569  The string to be matched by <b>pcre_exec
1569  <P>  <P>
1570  The subject string is passed to <b>pcre_exec()</b> as a pointer in  The subject string is passed to <b>pcre_exec()</b> as a pointer in
1571  <i>subject</i>, a length (in bytes) in <i>length</i>, and a starting byte offset  <i>subject</i>, a length (in bytes) in <i>length</i>, and a starting byte offset
1572  in <i>startoffset</i>. If this is negative or greater than the length of the  in <i>startoffset</i>. If this is negative or greater than the length of the
1573  subject, <b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET.  subject, <b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET. When the starting
1574  </P>  offset is zero, the search for a match starts at the beginning of the subject,
1575  <P>  and this is by far the most common case. In UTF-8 mode, the byte offset must
1576  In UTF-8 mode, the byte offset must point to the start of a UTF-8 character (or  point to the start of a UTF-8 character (or the end of the subject). Unlike the
1577  the end of the subject). Unlike the pattern string, the subject may contain  pattern string, the subject may contain binary zero bytes.
 binary zero bytes. When the starting offset is zero, the search for a match  
 starts at the beginning of the subject, and this is by far the most common  
 case.  
1578  </P>  </P>
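The rules for <i>startoffset</i> described above can be checked by the caller before calling <b>pcre_exec()</b>. The sketch below only mirrors the documented conditions; it is not PCRE's internal code. The function name `check_start_offset` and the local constant names are hypothetical, while the numeric values -11 and -24 are the ones this page gives for PCRE_ERROR_BADUTF8_OFFSET and PCRE_ERROR_BADOFFSET.

```c
/* Local stand-ins for the documented error numbers (names are hypothetical). */
enum {
    OFFSET_OK          = 0,
    ERR_BADUTF8_OFFSET = -11,  /* value listed for PCRE_ERROR_BADUTF8_OFFSET */
    ERR_BADOFFSET      = -24   /* value listed for PCRE_ERROR_BADOFFSET */
};

/*
 * Check a starting offset against the documented rules:
 *  - it must not be negative or greater than the subject length;
 *  - in UTF-8 mode it must point to the start of a character or to the
 *    end of the subject.
 * The utf8 argument is nonzero when the pattern was compiled in UTF-8 mode.
 */
static int check_start_offset(const char *subject, int length,
                              int startoffset, int utf8)
{
    if (startoffset < 0 || startoffset > length)
        return ERR_BADOFFSET;

    /* A byte of the form 10xxxxxx is a continuation byte, not a start. */
    if (utf8 && startoffset < length &&
        ((unsigned char)subject[startoffset] & 0xC0) == 0x80)
        return ERR_BADUTF8_OFFSET;

    return OFFSET_OK;
}
```

An offset equal to <i>length</i> is accepted in both modes, since the end of the subject is a valid starting point.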
1579  <P>  <P>
1580  A non-zero starting offset is useful when searching for another match in the  A non-zero starting offset is useful when searching for another match in the
# Line 1589  PCRE_ANCHORED options, and then if that Line 1603  PCRE_ANCHORED options, and then if that
1603  and trying an ordinary match again. There is some code that demonstrates how to  and trying an ordinary match again. There is some code that demonstrates how to
1604  do this in the  do this in the
1605  <a href="pcredemo.html"><b>pcredemo</b></a>  <a href="pcredemo.html"><b>pcredemo</b></a>
1606  sample program. In the most general case, you have to check to see if the  sample program. In the most general case, you have to check to see if the
1607  newline convention recognizes CRLF as a newline, and if so, and the current  newline convention recognizes CRLF as a newline, and if so, and the current
1608  character is CR followed by LF, advance the starting offset by two characters  character is CR followed by LF, advance the starting offset by two characters
1609  instead of one.  instead of one.
1610  </P>  </P>
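The advance step described above might be sketched as follows. This is an illustration, not the code from <b>pcredemo</b>: the function name and its `crlf_is_newline` and `utf8` flags are hypothetical, and the caller is assumed to have already worked out whether the newline convention in force recognizes CRLF. Skipping UTF-8 continuation bytes is an addition made here so that the new offset still points to the start of a character.

```c
/*
 * Hypothetical helper: compute the next starting offset after an empty
 * match at 'offset' could not be extended. Returns -1 when the offset is
 * already at (or past) the end of the subject, so there is nothing left
 * to scan.
 */
static int next_offset_after_empty_match(const char *subject, int length,
                                         int offset, int crlf_is_newline,
                                         int utf8)
{
    if (offset >= length)
        return -1;

    /* CR followed by LF counts as one newline: step over both bytes. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;

    /* In UTF-8 mode, move on to the start of the next character. */
    if (utf8) {
        while (offset < length &&
               ((unsigned char)subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

A matching loop would call this after an anchored, non-empty retry fails, and stop when it returns -1.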
# Line 1675  Offset values that correspond to unused Line 1689  Offset values that correspond to unused
1689  expression are also set to -1. For example, if the string "abc" is matched  expression are also set to -1. For example, if the string "abc" is matched
1690  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The
1691  return from the function is 2, because the highest used capturing subpattern  return from the function is 2, because the highest used capturing subpattern
1692  number is 1. However, you can refer to the offsets for the second and third  number is 1, and the offsets for the second and third capturing subpatterns
1693  capturing subpatterns if you wish (assuming the vector is large enough, of  (assuming the vector is large enough, of course) are set to -1.
1694  course).  </P>
1695    <P>
1696    <b>Note</b>: Elements of <i>ovector</i> that do not correspond to capturing
1697    parentheses in the pattern are never changed. That is, if a pattern contains
1698    <i>n</i> capturing parentheses, no more than <i>ovector[0]</i> to
1699    <i>ovector[2n+1]</i> are set by <b>pcre_exec()</b>. The other elements retain
1700    whatever values they previously had.
1701  </P>  </P>
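One consequence is that a caller should test the offsets themselves, rather than rely only on the return value, when deciding whether a given subpattern took part in the match. A minimal sketch follows; the helper name `capture_length` is hypothetical, and it assumes the usual layout in which only the first two-thirds of the vector hold offset pairs.

```c
/*
 * Hypothetical helper: return the length in bytes of the substring captured
 * by subpattern n (0 is the whole match), or -1 if that subpattern is unset
 * or n lies outside the part of the vector that holds offset pairs.
 * ovecsize is the number of ints in ovector, as passed to pcre_exec().
 */
static int capture_length(const int *ovector, int ovecsize, int n)
{
    int pairs = ovecsize / 3;   /* first two-thirds hold (start, end) pairs */

    if (n < 0 || n >= pairs)
        return -1;

    /* Unused subpatterns have both offsets set to -1. */
    if (ovector[2 * n] < 0 || ovector[2 * n + 1] < 0)
        return -1;

    return ovector[2 * n + 1] - ovector[2 * n];
}
```

With the vector contents described above for "abc" matched against (abc)(x(yz)?)?, that is 0, 3, 0, 3, -1, -1, -1, -1, the helper returns 3 for subpatterns 0 and 1 and -1 for subpatterns 2 and 3. Elements beyond those for the pattern's own parentheses keep whatever the caller stored there, so they should not be queried.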
1702  <P>  <P>
1703  Some convenience functions are provided for extracting the captured substrings  Some convenience functions are provided for extracting the captured substrings
# Line 1752  documentation for details. Line 1772  documentation for details.
1772    PCRE_ERROR_BADUTF8        (-10)    PCRE_ERROR_BADUTF8        (-10)
1773  </pre>  </pre>
1774  A string that contains an invalid UTF-8 byte sequence was passed as a subject.  A string that contains an invalid UTF-8 byte sequence was passed as a subject.
1775    However, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8
1776    character at the end of the subject, PCRE_ERROR_SHORTUTF8 is used instead.
1777  <pre>  <pre>
1778    PCRE_ERROR_BADUTF8_OFFSET (-11)    PCRE_ERROR_BADUTF8_OFFSET (-11)
1779  </pre>  </pre>
1780  The UTF-8 byte sequence that was passed as a subject was valid, but the value  The UTF-8 byte sequence that was passed as a subject was valid, but the value
1781  of <i>startoffset</i> did not point to the beginning of a UTF-8 character.  of <i>startoffset</i> did not point to the beginning of a UTF-8 character or the
1782    end of the subject.
1783  <pre>  <pre>
1784    PCRE_ERROR_PARTIAL        (-12)    PCRE_ERROR_PARTIAL        (-12)
1785  </pre>  </pre>
# Line 1792  An invalid combination of PCRE_NEWLINE_< Line 1815  An invalid combination of PCRE_NEWLINE_<
1815  <pre>  <pre>
1816    PCRE_ERROR_BADOFFSET      (-24)    PCRE_ERROR_BADOFFSET      (-24)
1817  </pre>  </pre>
1818  The value of <i>startoffset</i> was negative or greater than the length of the  The value of <i>startoffset</i> was negative or greater than the length of the
1819  subject, that is, the value in <i>length</i>.  subject, that is, the value in <i>length</i>.
1820    <pre>
1821      PCRE_ERROR_SHORTUTF8      (-25)
1822    </pre>
1823    The subject string ended with an incomplete (truncated) UTF-8 character, and
1824    the PCRE_PARTIAL_HARD option was set. Without this option, PCRE_ERROR_BADUTF8
1825    is returned in this situation.
1826  </P>  </P>
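When feeding a subject to <b>pcre_exec()</b> in pieces, a caller may want to know in advance whether a buffer ends in the middle of a UTF-8 character. The sketch below shows one way to detect that. It is not how PCRE performs the check internally; the function name is hypothetical, and it assumes UTF-8 sequences of at most four bytes.

```c
/*
 * Hypothetical helper: if the subject ends with an incomplete (truncated)
 * UTF-8 character, return the number of bytes still missing; otherwise
 * return 0. Only the final character is examined; this is not a full
 * UTF-8 validity check.
 */
static int utf8_missing_at_end(const char *subject, int length)
{
    int i = length - 1;
    int cont = 0;       /* continuation bytes found at the end */
    int need;           /* continuation bytes the lead byte requires */
    unsigned char c;

    /* Step back over at most three trailing continuation bytes. */
    while (i >= 0 && cont < 3 &&
           ((unsigned char)subject[i] & 0xC0) == 0x80) {
        i--;
        cont++;
    }
    if (i < 0)
        return 0;       /* empty subject, or no lead byte found */

    c = (unsigned char)subject[i];
    if (c < 0x80)                need = 0;   /* ASCII */
    else if ((c & 0xE0) == 0xC0) need = 1;   /* two-byte character */
    else if ((c & 0xF0) == 0xE0) need = 2;   /* three-byte character */
    else if ((c & 0xF8) == 0xF0) need = 3;   /* four-byte character */
    else return 0;                           /* not a lead byte handled here */

    return need > cont ? need - cont : 0;
}
```

A nonzero result suggests waiting for more data before matching; a subject that is invalid for some other reason still has to be caught by the normal UTF-8 check.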
1827  <P>  <P>
1828  Error numbers -16 to -20 and -22 are not used by <b>pcre_exec()</b>.  Error numbers -16 to -20 and -22 are not used by <b>pcre_exec()</b>.
# Line 2203  Cambridge CB2 3QH, England. Line 2232  Cambridge CB2 3QH, England.
2232  </P>  </P>
2233  <br><a name="SEC22" href="#TOC1">REVISION</a><br>  <br><a name="SEC22" href="#TOC1">REVISION</a><br>
2234  <P>  <P>
2235  Last updated: 06 November 2010  Last updated: 13 November 2010
2236  <br>  <br>
2237  Copyright &copy; 1997-2010 University of Cambridge.  Copyright &copy; 1997-2010 University of Cambridge.
2238  <br>  <br>

