Diff of /code/trunk/doc/pcre.txt

revision 63 by nigel, Sat Feb 24 21:40:03 2007 UTC revision 71 by nigel, Sat Feb 24 21:40:24 2007 UTC
# Line 118  UTF-8 SUPPORT Line 118  UTF-8 SUPPORT
118       The following comments apply when PCRE is running  in  UTF-8       The following comments apply when PCRE is running  in  UTF-8
119       mode:       mode:
120    
121       1. PCRE assumes that the strings it is given  contain  valid       1. When you set the PCRE_UTF8 flag, the  strings  passed  as
122       UTF-8  codes. It does not diagnose invalid UTF-8 strings. If       patterns  and  subjects are checked for validity on entry to
123       you pass invalid UTF-8 strings  to  PCRE,  the  results  are       the relevant  functions.  If  an  invalid  UTF-8  string  is
124       undefined.       passed,  an  error  return is given. In some situations, you
125         may already know that your strings are valid, and  therefore
126         want  to  skip these checks in order to improve performance.
127         If you set the PCRE_NO_UTF8_CHECK flag at compile time or at
128         run  time,  PCRE  assumes  that the pattern or subject it is
129         given (respectively) contains only  valid  UTF-8  codes.  In
130         this  case, it does not diagnose an invalid UTF-8 string. If
131         you  pass   an   invalid   UTF-8   string   to   PCRE   when
132         PCRE_NO_UTF8_CHECK  is  set, the results are undefined. Your
133         program may crash.
134    
135       2. In a pattern, the escape sequence \x{...}, where the con-       2. In a pattern, the escape sequence \x{...}, where the con-
136       tents  of  the  braces is a string of hexadecimal digits, is       tents  of  the  braces is a string of hexadecimal digits, is
# Line 164  AUTHOR Line 173  AUTHOR
173       Cambridge CB2 3QG, England.       Cambridge CB2 3QG, England.
174       Phone: +44 1223 334714       Phone: +44 1223 334714
175    
176  Last updated: 04 February 2003  Last updated: 20 August 2003
177  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
178  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
179    
# Line 654  COMPILING A PATTERN Line 663  COMPILING A PATTERN
663       option  changes  the behaviour of PCRE are given in the sec-       option  changes  the behaviour of PCRE are given in the sec-
664       tion on UTF-8 support in the main pcre page.       tion on UTF-8 support in the main pcre page.
665    
666           PCRE_NO_UTF8_CHECK
667    
668         When PCRE_UTF8 is set, the validity  of  the  pattern  as  a
669         UTF-8  string  is automatically checked. If an invalid UTF-8
670         sequence of bytes is found, pcre_compile() returns an error.
671         If you already know that your pattern is valid, and you want
672         to skip this check for performance reasons, you can set  the
673         PCRE_NO_UTF8_CHECK  option.  When  it  is set, the effect of
674         passing an invalid UTF-8 string as a pattern  is  undefined.
675         It  may  cause  your program to crash.  Note that there is a
676         similar option  for  suppressing  the  checking  of  subject
677         strings passed to pcre_exec().
678    
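The added text says PCRE_NO_UTF8_CHECK is only safe when the caller already knows the string is valid. One way a caller might establish that is to validate the bytes once, up front. The following is a minimal sketch under stated assumptions: is_valid_utf8() is a hypothetical helper, not a PCRE function, and it applies the RFC 3629 rules (at most four bytes per character, no overlong forms, no surrogates), which may differ from the rules PCRE's own check applies in this revision.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of PCRE. Returns 1 if the first len bytes
   of s are well-formed UTF-8 under RFC 3629 (an assumption of this sketch),
   and 0 otherwise. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;        /* continuation bytes expected after the lead byte */
        size_t k;
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest code point legal for this length */

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < extra) return 0;               /* truncated sequence */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;     /* not a continuation */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min) return 0;                          /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                     /* beyond Unicode range */

        i += extra + 1;
    }
    return 1;
}
```

A caller that has run such a check on a string it then reuses many times could reasonably pass PCRE_NO_UTF8_CHECK on later calls. If the string comes from an untrusted source and has not been checked, leaving PCRE's own check enabled is the safer choice.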
679    
680    
681  STUDYING A PATTERN  STUDYING A PATTERN
682    
# Line 747  INFORMATION ABOUT A PATTERN Line 770  INFORMATION ABOUT A PATTERN
770       compiled pattern. It replaces the obsolete pcre_info() func-       compiled pattern. It replaces the obsolete pcre_info() func-
771       tion, which is nevertheless retained for backwards compat-       tion, which is nevertheless retained for backwards compat-
772       ibility (and is documented below).       ibility (and is documented below).
   
773       The first argument for pcre_fullinfo() is a pointer  to  the       The first argument for pcre_fullinfo() is a pointer  to  the
774       compiled  pattern.  The  second  argument  is  the result of       compiled  pattern.  The  second  argument  is  the result of
775       pcre_study(), or NULL if the pattern was  not  studied.  The       pcre_study(), or NULL if the pattern was  not  studied.  The
# Line 819  INFORMATION ABOUT A PATTERN Line 841  INFORMATION ABOUT A PATTERN
841    
842         PCRE_INFO_LASTLITERAL         PCRE_INFO_LASTLITERAL
843    
844       For a non-anchored pattern, return the value of  the  right-       Return the value of the rightmost  literal  byte  that  must
845       most  literal  byte  which must exist in any matched string,       exist  in  any  matched  string, other than at its start, if
846       other than at its start. The fourth argument should point to       such a byte has been recorded. The  fourth  argument  should
847       an int variable. If there is no such byte, or if the pattern       point  to  an  int variable. If there is no such byte, -1 is
848       is anchored, -1 is returned. For example,  for  the  pattern       returned. For anchored patterns,  a  last  literal  byte  is
849       /a\d+z\d+/ the returned value is 'z'.       recorded  only  if  it follows something of variable length.
850         For example, for the pattern /^a\d+z\d+/ the returned  value
851         is "z", but for /^a\dz\d/ the returned value is -1.
852    
853         PCRE_INFO_NAMECOUNT         PCRE_INFO_NAMECOUNT
854         PCRE_INFO_NAMEENTRYSIZE         PCRE_INFO_NAMEENTRYSIZE
# Line 1012  MATCHING A PATTERN Line 1036  MATCHING A PATTERN
1036       turned out to be anchored by virtue of its contents, it can-       turned out to be anchored by virtue of its contents, it can-
1037       not be made unanchored at matching time.       not be made unanchored at matching time.
1038    
1039         When PCRE_UTF8 was set at compile time, the validity of  the
1040         subject  as  a  UTF-8 string is automatically checked. If an
1041         invalid  UTF-8  sequence  of  bytes  is  found,  pcre_exec()
1042         returns  the  error  PCRE_ERROR_BADUTF8. If you already know
1043         that your subject is valid, and you want to skip this  check
1044         for  performance reasons, you can set the PCRE_NO_UTF8_CHECK
1045         option when calling pcre_exec(). When this  option  is  set,
1046         the  effect  of passing an invalid UTF-8 string as a subject
1047         is undefined. It may cause your program to crash.
1048    
1049       There are also three further options that can be set only at       There are also three further options that can be set only at
1050       matching time:       matching time:
1051    
# Line 1101  MATCHING A PATTERN Line 1135  MATCHING A PATTERN
1135       used for a fragment of a pattern that picks out a substring.       used for a fragment of a pattern that picks out a substring.
1136       PCRE supports several other kinds of  parenthesized  subpat-       PCRE supports several other kinds of  parenthesized  subpat-
1137       tern that do not cause substrings to be captured.       tern that do not cause substrings to be captured.
   
1138       Captured substrings are returned to the caller via a  vector       Captured substrings are returned to the caller via a  vector
1139       of  integer  offsets whose address is passed in ovector. The       of  integer  offsets whose address is passed in ovector. The
1140       number of elements in the vector is passed in ovecsize.  The       number of elements in the vector is passed in ovecsize.  The
# Line 1127  MATCHING A PATTERN Line 1160  MATCHING A PATTERN
1160       there  are no capturing subpatterns, the return value from a       there  are no capturing subpatterns, the return value from a
1161       successful match is 1, indicating that just the  first  pair       successful match is 1, indicating that just the  first  pair
1162       of offsets has been set.       of offsets has been set.
1163    
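To make the offset pairs concrete, here is a minimal sketch of pulling one captured substring out of the subject by hand. It assumes the usual layout, in which pair n occupies ovector[2*n] (start offset) and ovector[2*n+1] (offset just past the end), and in which an unset group has negative offsets. copy_capture() is a hypothetical name chosen for this illustration; it is not the library's pcre_copy_substring().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT a PCRE function. Copies captured substring n
   into buf as a zero-terminated string.
     subject - the string that was matched
     ovector - the offset vector as filled in by pcre_exec()
     rc      - the (positive) return value of pcre_exec()
   Returns the substring length, -1 if n is out of range or the group is
   unset, or -2 if buf is too small. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, int bufsize)
{
    int start, end, len;

    if (n < 0 || n >= rc) return -1;       /* no such pair was set */

    start = ovector[2 * n];
    end   = ovector[2 * n + 1];
    if (start < 0 || end < start) return -1;   /* group did not participate */

    len = end - start;
    if (len >= bufsize) return -2;         /* need room for the terminator */

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

With a return value of 1 from pcre_exec(), only n = 0 (the whole match) is available, which matches the statement above that just the first pair of offsets has been set.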
1164       Some convenience functions are provided for  extracting  the       Some convenience functions are provided for  extracting  the
1165       captured substrings as separate strings. These are described       captured substrings as separate strings. These are described
1166       in the following section.       in the following section.
# Line 1216  MATCHING A PATTERN Line 1250  MATCHING A PATTERN
1250       distinctive error code. See  the  pcrecallout  documentation       distinctive error code. See  the  pcrecallout  documentation
1251       for details.       for details.
1252    
1253           PCRE_ERROR_BADUTF8       (-10)
1254    
1255         A string that contains an invalid UTF-8  byte  sequence  was
1256         passed as a subject.
1257    
1258    
1259  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1260    
# Line 1230  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1269  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1269       int pcre_get_substring_list(const char *subject,       int pcre_get_substring_list(const char *subject,
1270            int *ovector, int stringcount, const char ***listptr);            int *ovector, int stringcount, const char ***listptr);
1271    
   
1272       Captured substrings can be accessed directly  by  using  the       Captured substrings can be accessed directly  by  using  the
1273       offsets returned by pcre_exec() in ovector. For convenience,       offsets returned by pcre_exec() in ovector. For convenience,
1274       the functions  pcre_copy_substring(),  pcre_get_substring(),       the functions  pcre_copy_substring(),  pcre_get_substring(),
# Line 1253  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1291  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1291       returned zero, indicating that it ran out of space in  ovec-       returned zero, indicating that it ran out of space in  ovec-
1292       tor,  the  value passed as stringcount should be the size of       tor,  the  value passed as stringcount should be the size of
1293       the vector divided by three.       the vector divided by three.
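The rule for choosing stringcount can be written down as a small helper. This is only a sketch: effective_stringcount() is a hypothetical name, and treating a negative return value as "no substrings" is an assumption of this example rather than something the text specifies.

```c
/* Hypothetical helper, NOT a PCRE function. Given the return value of
   pcre_exec() and the ovecsize that was passed to it, returns the value
   to pass as stringcount to the substring extraction functions. */
int effective_stringcount(int rc, int ovecsize)
{
    if (rc > 0) return rc;             /* normal case: use the return value */
    if (rc == 0) return ovecsize / 3;  /* ovector was too small */
    return 0;                          /* error return: nothing to extract */
}
```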
   
1294       The functions pcre_copy_substring() and pcre_get_substring()       The functions pcre_copy_substring() and pcre_get_substring()
1295       extract a single substring, whose number is given as string-       extract a single substring, whose number is given as string-
1296       number. A value of zero extracts the substring that  matched       number. A value of zero extracts the substring that  matched
# Line 1350  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 1387  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1387       succeeds,    they   then   call   pcre_copy_substring()   or       succeeds,    they   then   call   pcre_copy_substring()   or
1388       pcre_get_substring(), as appropriate.       pcre_get_substring(), as appropriate.
1389    
1390  Last updated: 03 February 2003  Last updated: 20 August 2003
1391  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
1392  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
1393    
# Line 1418  PCRE CALLOUTS Line 1455  PCRE CALLOUTS
1455       The current_position field contains the  offset  within  the       The current_position field contains the  offset  within  the
1456       subject of the current match pointer.       subject of the current match pointer.
1457    
1458       The capture_top field contains the  number  of  the  highest       The capture_top field contains one more than the  number  of
1459       captured substring so far.       the  highest  numbered captured substring so far. If no sub-
1460         strings have been captured, the value of capture_top is one.
1461    
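Because capture_top is one more than the highest numbered captured substring, a loop over the captures made so far runs from 1 up to, but not including, capture_top. The sketch below counts the groups that are currently set. It takes the offset vector and capture_top as plain parameters so that it stands alone; count_set_captures() is a hypothetical name, and the convention that an unset group has a negative start offset is an assumption of this example.

```c
/* Hypothetical helper, NOT a PCRE function. Counts the capturing groups
   that are set, given the offset vector and the capture_top value from a
   callout block. Group i uses offset_vector[2*i] and offset_vector[2*i+1]. */
int count_set_captures(const int *offset_vector, int capture_top)
{
    int i;
    int count = 0;

    /* capture_top == 1 means nothing has been captured: the loop is empty. */
    for (i = 1; i < capture_top; i++) {
        if (offset_vector[2 * i] >= 0)
            count++;
    }
    return count;
}
```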
1462       The capture_last field  contains  the  number  of  the  most       The capture_last field  contains  the  number  of  the  most
1463       recently captured substring.       recently captured substring.
# Line 3088  DESCRIPTION Line 3126  DESCRIPTION
3126       that is POSIX-like in style. The syntax and semantics of the       that is POSIX-like in style. The syntax and semantics of the
3127       regular expressions themselves are still those of Perl, sub-       regular expressions themselves are still those of Perl, sub-
3128       ject  to  the  setting of various PCRE options, as described       ject  to  the  setting of various PCRE options, as described
3129       below.       below. "POSIX-like in style" means that the API approximates
3130         to  the  POSIX definition; it is not fully POSIX-compatible,
3131         and in multi-byte encoding domains it is probably even  less
3132         compatible.
3133    
3134       The header for these functions is supplied as pcreposix.h to       The header for these functions is supplied as pcreposix.h to
3135       avoid  any  potential  clash  with other POSIX libraries. It       avoid  any  potential  clash  with other POSIX libraries. It
