Diff of /code/trunk/pcre.3

revision 23 by nigel, Sat Feb 24 21:38:41 2007 UTC revision 29 by nigel, Sat Feb 24 21:38:53 2007 UTC
# Line 8  pcre - Perl-compatible regular expressio Line 8  pcre - Perl-compatible regular expressio
8  .br  .br
9  .B pcre *pcre_compile(const char *\fIpattern\fR, int \fIoptions\fR,  .B pcre *pcre_compile(const char *\fIpattern\fR, int \fIoptions\fR,
10  .ti +5n  .ti +5n
11  .B const char **\fIerrptr\fR, int *\fIerroffset\fR);  .B const char **\fIerrptr\fR, int *\fIerroffset\fR,
12    .ti +5n
13    .B const unsigned char *\fItableptr\fR);
14  .PP  .PP
15  .br  .br
16  .B pcre_extra *pcre_study(const pcre *\fIcode\fR, int \fIoptions\fR,  .B pcre_extra *pcre_study(const pcre *\fIcode\fR, int \fIoptions\fR,
# Line 23  pcre - Perl-compatible regular expressio Line 25  pcre - Perl-compatible regular expressio
25  .B int *\fIovector\fR, int \fIovecsize\fR);  .B int *\fIovector\fR, int \fIovecsize\fR);
26  .PP  .PP
27  .br  .br
28  .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int  .B int pcre_copy_substring(const char *\fIsubject\fR, int *\fIovector\fR,
29  .B *\fIfirstcharptr\fR);  .ti +5n
30    .B int \fIstringcount\fR, int \fIstringnumber\fR, char *\fIbuffer\fR,
31    .ti +5n
32    .B int \fIbuffersize\fR);
33  .PP  .PP
34  .br  .br
35  .B char *pcre_version(void);  .B int pcre_get_substring(const char *\fIsubject\fR, int *\fIovector\fR,
36    .ti +5n
37    .B int \fIstringcount\fR, int \fIstringnumber\fR,
38    .ti +5n
39    .B const char **\fIstringptr\fR);
40  .PP  .PP
41  .br  .br
42  .B void *(*pcre_malloc)(size_t);  .B int pcre_get_substring_list(const char *\fIsubject\fR,
43    .ti +5n
44    .B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);"
45  .PP  .PP
46  .br  .br
47  .B void (*pcre_free)(void *);  .B const unsigned char *pcre_maketables(void);
48  .PP  .PP
49  .br  .br
50  .B unsigned char *pcre_cbits[128];  .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int
51    .B *\fIfirstcharptr\fR);
52  .PP  .PP
53  .br  .br
54  .B unsigned char *pcre_ctypes[256];  .B char *pcre_version(void);
55  .PP  .PP
56  .br  .br
57  .B unsigned char *pcre_fcc[256];  .B void *(*pcre_malloc)(size_t);
58  .PP  .PP
59  .br  .br
60  .B unsigned char *pcre_lcc[256];  .B void (*pcre_free)(void *);
61    
62    
63    
# Line 58  PCRE has its own native API, which is de Line 70  PCRE has its own native API, which is de
70  a set of wrapper functions that correspond to the POSIX API. See  a set of wrapper functions that correspond to the POSIX API. See
71  \fBpcreposix (3)\fR.  \fBpcreposix (3)\fR.
72    
73  The three functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
74  \fBpcre_exec()\fR are used for compiling and matching regular expressions. The  are used for compiling and matching regular expressions, while
75  function \fBpcre_info()\fR is used to find out information about a compiled  \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
76    \fBpcre_get_substring_list()\fR are convenience functions for extracting
77    captured substrings from a matched subject string. The function
78    \fBpcre_maketables()\fR is used (optionally) to build a set of character tables
79    in the current locale for passing to \fBpcre_compile()\fR.
80    
81    The function \fBpcre_info()\fR is used to find out information about a compiled
82  pattern, while the function \fBpcre_version()\fR returns a pointer to a string  pattern, while the function \fBpcre_version()\fR returns a pointer to a string
83  containing the version of PCRE and its date of release.  containing the version of PCRE and its date of release.
84    
# Line 70  respectively. PCRE calls the memory mana Line 88  respectively. PCRE calls the memory mana
88  so a calling program can replace them if it wishes to intercept the calls. This  so a calling program can replace them if it wishes to intercept the calls. This
89  should be done before calling any PCRE functions.  should be done before calling any PCRE functions.
90    
 The other global variables are character tables. They are initialized when PCRE  
 is compiled, from source that is generated by reference to the C character type  
 functions, but which a user of PCRE is free to modify. In principle the tables  
 could also be modified at run time. See PCRE's README file for more details.  
   
91    
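The interception described above can be sketched as follows. This is only an illustration under stated assumptions: demo_malloc and demo_free are hypothetical stand-ins declared with the same types as pcre_malloc and pcre_free, so that the fragment compiles without the PCRE header. In a real program the two assignments would presumably target the PCRE variables themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins with the same types as pcre_malloc and pcre_free. */
static void *(*demo_malloc)(size_t) = malloc;
static void (*demo_free)(void *) = free;

static long live_blocks = 0;   /* blocks obtained and not yet released */

/* Count each successful allocation, then hand back the block. */
static void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Count each release of a non-NULL block, then free it. */
static void counting_free(void *p)
{
    if (p != NULL) {
        assert(live_blocks > 0);   /* guards against an unbalanced free */
        live_blocks--;
    }
    free(p);
}

/* Install the wrappers; the text advises doing this before any other call. */
static void install_counting_hooks(void)
{
    demo_malloc = counting_malloc;
    demo_free = counting_free;
}
```

After install_counting_hooks() has run, every allocation made through the two pointers is counted, and a non-zero live_blocks at exit suggests that a compiled pattern or extracted substring was never freed.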
92  .SH MULTI-THREADING  .SH MULTI-THREADING
93  The PCRE functions can be used in multi-threading applications, with the  The PCRE functions can be used in multi-threading applications, with the
94  proviso that the character tables and the memory management functions pointed  proviso that the memory management functions pointed to by \fBpcre_malloc\fR
95  to by \fBpcre_malloc\fR and \fBpcre_free\fR are shared by all threads.  and \fBpcre_free\fR are shared by all threads.
96    
97  The compiled form of a regular expression is not altered during matching, so  The compiled form of a regular expression is not altered during matching, so
98  the same compiled pattern can safely be used by several threads at once.  the same compiled pattern can safely be used by several threads at once.
# Line 88  the same compiled pattern can safely be Line 101  the same compiled pattern can safely be
101  .SH COMPILING A PATTERN  .SH COMPILING A PATTERN
102  The function \fBpcre_compile()\fR is called to compile a pattern into an  The function \fBpcre_compile()\fR is called to compile a pattern into an
103  internal form. The pattern is a C string terminated by a binary zero, and  internal form. The pattern is a C string terminated by a binary zero, and
104  is passed in the argument \fIpattern\fR. A pointer to the compiled code block  is passed in the argument \fIpattern\fR. A pointer to a single block of memory
105  is returned. The \fBpcre\fR type is defined for this for convenience, but in  that is obtained via \fBpcre_malloc\fR is returned. This contains the
106  fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the contents of the  compiled code and related data. The \fBpcre\fR type is defined for this for
107  block are not defined.  convenience, but in fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the
108    contents of the block are not externally defined. It is up to the caller to
109    free the memory when it is no longer required.
110  .PP  .PP
111  The size of a compiled pattern is roughly proportional to the length of the  The size of a compiled pattern is roughly proportional to the length of the
112  pattern string, except that each character class (other than those containing  pattern string, except that each character class (other than those containing
# Line 111  time. Line 126  time.
126  If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately.  If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately.
127  Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns  Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns
128  NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual  NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual
129  error message.  error message. The offset from the start of the pattern to the character where
130    the error was discovered is placed in the variable pointed to by
131  The offset from the start of the pattern to the character where the error was  \fIerroffset\fR, which must not be NULL. If it is, an immediate error is given.
132  discovered is placed in the variable pointed to by \fIerroffset\fR, which must  .PP
133  not be NULL. If it is, an immediate error is given.  If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of
134    character tables which are built when it is compiled, using the default C
135    locale. Otherwise, \fItableptr\fR must be the result of a call to
136    \fBpcre_maketables()\fR. See the section on locale support below.
137  .PP  .PP
138  The following option bits are defined in the header file:  The following option bits are defined in the header file:
139    
# Line 210  not have a single fixed starting charact Line 228  not have a single fixed starting charact
228  characters is created.  characters is created.
229    
230    
231    .SH LOCALE SUPPORT
232    PCRE handles caseless matching, and determines whether characters are letters,
233    digits, or whatever, by reference to a set of tables. The library contains a
234    default set of tables which is created in the default C locale when PCRE is
235    compiled. This is used when the final argument of \fBpcre_compile()\fR is NULL,
236    and is sufficient for many applications.
237    
238    An alternative set of tables can, however, be supplied. Such tables are built
239    by calling the \fBpcre_maketables()\fR function, which has no arguments, in the
240    relevant locale. The result can then be passed to \fBpcre_compile()\fR as often
241    as necessary. For example, to build and use tables that are appropriate for the
242    French locale (where accented characters with codes greater than 128 are
243    treated as letters), the following code could be used:
244    
245      setlocale(LC_CTYPE, "fr");
246      tables = pcre_maketables();
247      re = pcre_compile(..., tables);
248    
249    The tables are built in memory that is obtained via \fBpcre_malloc\fR. The
250    pointer that is passed to \fBpcre_compile\fR is saved with the compiled
251    pattern, and the same tables are used via this pointer by \fBpcre_study()\fR
252    and \fBpcre_exec()\fR. Thus for any single pattern, compilation, studying and
253    matching all happen in the same locale, but different patterns can be compiled
254    in different locales. It is the caller's responsibility to ensure that the
255    memory containing the tables remains available for as long as it is needed.
256    
257    
258    .SH INFORMATION ABOUT A PATTERN
259    The \fBpcre_info()\fR function returns information about a compiled pattern.
260    Its yield is the number of capturing subpatterns, or one of the following
261    negative numbers:
262    
263      PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
264      PCRE_ERROR_BADMAGIC   the "magic number" was not found
265    
266    If the \fIoptptr\fR argument is not NULL, a copy of the options with which the
267    pattern was compiled is placed in the integer it points to.
268    
269    If the \fIfirstcharptr\fR argument is not NULL, it is used to pass back
270    information about the first character of any matched string. If there is a
271    fixed first character, e.g. from a pattern such as (cat|cow|coyote), then it is
272    returned in the integer pointed to by \fIfirstcharptr\fR. Otherwise, if the
273    pattern was compiled with the PCRE_MULTILINE option, and every branch started
274    with "^", then -1 is returned, indicating that the pattern will match at the
275    start of a subject string or after any "\\n" within the string. Otherwise -2 is
276    returned.
277    
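The three kinds of value passed back via firstcharptr can be told apart with a small helper such as the one below. The enumeration and the function name are assumptions made for this sketch; only the values -1 and -2 and their meanings come from the text above.

```c
#include <assert.h>

/* Hypothetical names for the three cases described in the text. */
enum first_kind {
    FIRST_FIXED,           /* value is the fixed first character */
    FIRST_AT_LINE_START,   /* -1: start of subject or after a newline */
    FIRST_UNKNOWN          /* -2: nothing useful is known */
};

/* Classify the integer that pcre_info() stores via firstcharptr. */
static enum first_kind classify_first_char(int firstchar)
{
    assert(firstchar >= -2);   /* only the documented values are expected */
    if (firstchar >= 0)
        return FIRST_FIXED;
    if (firstchar == -1)
        return FIRST_AT_LINE_START;
    return FIRST_UNKNOWN;
}
```

For the pattern (cat|cow|coyote) the value passed back would be the code for "c", which this helper reports as FIRST_FIXED.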
278    
279  .SH MATCHING A PATTERN  .SH MATCHING A PATTERN
280  The function \fBpcre_exec()\fR is called to match a subject string against a  The function \fBpcre_exec()\fR is called to match a subject string against a
281  pre-compiled pattern, which is passed in the \fIcode\fR argument. If the  pre-compiled pattern, which is passed in the \fIcode\fR argument. If the
# Line 267  is the number of pairs that have been se Line 333  is the number of pairs that have been se
333  subpatterns, the return value from a successful match is 1, indicating that  subpatterns, the return value from a successful match is 1, indicating that
334  just the first pair of offsets has been set.  just the first pair of offsets has been set.
335    
336    Some convenience functions are provided for extracting the captured substrings
337    as separate strings. These are described in the following section.
338    
339  It is possible for a capturing subpattern number \fIn+1\fR to match some  It is possible for a capturing subpattern number \fIn+1\fR to match some
340  part of the subject when subpattern \fIn\fR has not been used at all. For  part of the subject when subpattern \fIn\fR has not been used at all. For
341  example, if the string "abc" is matched against the pattern (a|(z))(bc)  example, if the string "abc" is matched against the pattern (a|(z))(bc)
# Line 327  call via \fBpcre_malloc()\fR fails, this Line 396  call via \fBpcre_malloc()\fR fails, this
396  the end of matching.  the end of matching.
397    
398    
399  .SH INFORMATION ABOUT A PATTERN  .SH EXTRACTING CAPTURED SUBSTRINGS
400  The \fBpcre_info()\fR function returns information about a compiled pattern.  Captured substrings can be accessed directly by using the offsets returned by
401  Its yield is the number of capturing subpatterns, or one of the following  \fBpcre_exec()\fR in \fIovector\fR. For convenience, the functions
402  negative numbers:  \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
403    \fBpcre_get_substring_list()\fR are provided for extracting captured substrings
404    as new, separate, zero-terminated strings. A substring that contains a binary
405    zero is correctly extracted and has a further zero added on the end, but the
406    result does not, of course, function as a C string.
407    
408    The first three arguments are the same for all three functions: \fIsubject\fR
409    is the subject string which has just been successfully matched, \fIovector\fR
410    is a pointer to the vector of integer offsets that was passed to
411    \fBpcre_exec()\fR, and \fIstringcount\fR is the number of substrings that
412    were captured by the match, including the substring that matched the entire
413    regular expression. This is the value returned by \fBpcre_exec\fR if it
414    is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it
415    ran out of space in \fIovector\fR, then the value passed as
416    \fIstringcount\fR should be the size of the vector divided by three.
417    
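The rule for choosing stringcount can be written as a small helper. The function name is hypothetical, and returning zero for a negative result (no match, or an error) is a choice made for this sketch rather than something the text specifies.

```c
#include <assert.h>

/* Turn a pcre_exec() return value into a stringcount argument.
   rc > 0   the number of substrings that were set;
   rc == 0  ovector was too small, so use one third of its size;
   rc < 0   no match or an error, so there is nothing to extract. */
static int stringcount_from_exec(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 3;
    return 0;   /* assumption of this sketch: treat failure as "no strings" */
}
```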
418    The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR
419    extract a single substring, whose number is given as \fIstringnumber\fR. A
420    value of zero extracts the substring that matched the entire pattern, while
421    higher values extract the captured substrings. For \fBpcre_copy_substring()\fR,
422    the string is placed in \fIbuffer\fR, whose length is given by
423    \fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is
424    obtained via \fBpcre_malloc\fR, and its address is returned via
425    \fIstringptr\fR. The yield of the function is the length of the string, not
426    including the terminating zero, or one of
427    
428    PCRE_ERROR_NULL       the argument \fIcode\fR was NULL    PCRE_ERROR_NOMEMORY       (-6)
   PCRE_ERROR_BADMAGIC   the "magic number" was not found  
429    
430  If the \fIoptptr\fR argument is not NULL, a copy of the options with which the  The buffer was too small for \fBpcre_copy_substring()\fR, or the attempt to get
431  pattern was compiled is placed in the integer it points to.  memory failed for \fBpcre_get_substring()\fR.
432    
433      PCRE_ERROR_NOSUBSTRING    (-7)
434    
435    There is no substring whose number is \fIstringnumber\fR.
436    
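The behaviour described for \fBpcre_copy_substring()\fR can be approximated in a few lines of C. This is a sketch, not the library's implementation: the demo_ names are hypothetical, and it assumes that each pair in \fIovector\fR holds the offset of the first character of a substring followed by the offset just past its end, with a negative offset marking an unset substring.

```c
#include <assert.h>
#include <string.h>

#define DEMO_ERROR_NOMEMORY    (-6)   /* values quoted in the text */
#define DEMO_ERROR_NOSUBSTRING (-7)

/* Copy substring number stringnumber into buffer and zero-terminate it.
   Returns the length of the string, or one of the negative codes above. */
static int demo_copy_substring(const char *subject, const int *ovector,
                               int stringcount, int stringnumber,
                               char *buffer, int buffersize)
{
    int start, length;

    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (stringnumber < 0 || stringnumber >= stringcount)
        return DEMO_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    /* An unset substring has negative offsets and yields an empty string. */
    length = (start < 0) ? 0 : ovector[2 * stringnumber + 1] - start;

    if (buffersize < length + 1)
        return DEMO_ERROR_NOMEMORY;   /* no room for text plus the zero */
    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

Because the copy uses the recorded length rather than searching for a terminator, a substring containing a binary zero is copied in full, as the text requires.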
437    The \fBpcre_get_substring_list()\fR function extracts all available substrings
438    and builds a list of pointers to them. All this is done in a single block of
439    memory which is obtained via \fBpcre_malloc\fR. The address of the memory block
440    is returned via \fIlistptr\fR, which is also the start of the list of string
441    pointers. The end of the list is marked by a NULL pointer. The yield of the
442    function is zero if all went well, or
443    
444      PCRE_ERROR_NOMEMORY       (-6)
445    
446    if the attempt to get the memory block failed.
447    
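The single-block layout described above, with the pointer list at the start of the block and the string data after it, can be sketched as follows. The function name is hypothetical, plain malloc() stands in for \fBpcre_malloc\fR, and the \fIovector\fR layout (start offset, then offset just past the end, negative when unset) is assumed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build a NULL-terminated list of copies of all captured substrings in one
   block: the pointer array comes first and the string bytes follow it.
   Returns 0 on success, or -6 if the block cannot be obtained. */
static int demo_get_substring_list(const char *subject, const int *ovector,
                                   int stringcount, const char ***listptr)
{
    size_t size;
    const char **list;
    char *p;
    int i;

    assert(subject != NULL && ovector != NULL && listptr != NULL);
    assert(stringcount >= 0);

    /* Room for the pointers, the NULL marker, and each string plus zero. */
    size = (size_t)(stringcount + 1) * sizeof(char *);
    for (i = 0; i < stringcount; i++) {
        int start = ovector[2 * i];
        size += (size_t)(start < 0 ? 0 : ovector[2 * i + 1] - start) + 1;
    }

    list = malloc(size);
    if (list == NULL)
        return -6;

    p = (char *)(list + stringcount + 1);   /* string area follows the list */
    for (i = 0; i < stringcount; i++) {
        int start = ovector[2 * i];
        int len = (start < 0) ? 0 : ovector[2 * i + 1] - start;
        if (len > 0)
            memcpy(p, subject + start, (size_t)len);
        list[i] = p;
        p += len;
        *p++ = '\0';
    }
    list[stringcount] = NULL;
    *listptr = list;
    return 0;
}
```

Since everything lives in one block, a single free() of the returned list pointer releases both the pointers and the strings.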
448    When any of these functions encounters a substring that is unset, which can
449    happen when capturing subpattern number \fIn+1\fR matches some part of the
450    subject, but subpattern \fIn\fR has not been used at all, they return an empty
451    string. This can be distinguished from a genuine zero-length substring by
452    inspecting the appropriate offset in \fIovector\fR, which is negative for unset
453    substrings.
454    
 If the \fIfirstcharptr\fR argument is not NULL, it is used to pass back  
 information about the first character of any matched string. If there is a  
 fixed first character, e.g. from a pattern such as (cat|cow|coyote), then it is  
 returned in the integer pointed to by \fIfirstcharptr\fR. Otherwise, if the  
 pattern was compiled with the PCRE_MULTILINE option, and every branch started  
 with "^", then -1 is returned, indicating that the pattern will match at the  
 start of a subject string or after any "\\n" within the string. Otherwise -2 is  
 returned.  
455    
456    
457  .SH LIMITATIONS  .SH LIMITATIONS
# Line 579  Each pair of escape sequences partitions Line 685  Each pair of escape sequences partitions
685  two disjoint sets. Any given character matches one, and only one, of each pair.  two disjoint sets. Any given character matches one, and only one, of each pair.
686    
687  A "word" character is any letter or digit or the underscore character, that is,  A "word" character is any letter or digit or the underscore character, that is,
688  any character which can be part of a Perl "word". These character type  any character which can be part of a Perl "word". The definition of letters and
689  sequences can appear both inside and outside character classes. They each match  digits is controlled by PCRE's character tables, and may vary if locale-
690  one character of the appropriate type. If the current matching point is at the  specific matching is taking place (see "Locale support" above). For example, in
691  end of the subject string, all of them fail, since there is no character to  the "fr" (French) locale, some character codes greater than 128 are used for
692  match.  accented letters, and these are matched by \\w.
693    
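As a rough model of the definition above, a "word" character test can be expressed with the C character type functions, which follow the current LC_CTYPE locale. The function name is hypothetical, and PCRE consults its own tables rather than calling isalnum() at match time, so this is an illustration of the definition only.

```c
#include <assert.h>
#include <ctype.h>

/* Return 1 if c is a letter, a digit, or the underscore character,
   as judged by the locale currently in force. */
static int is_word_char(int c)
{
    assert(c >= 0 && c <= 255);   /* ctype functions need this range */
    return isalnum(c) || c == '_';
}
```

In the default C locale only ASCII letters and digits qualify; after a call to setlocale() for a suitable locale, some codes above 127 may qualify as well.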
694    These character type sequences can appear both inside and outside character
695    classes. They each match one character of the appropriate type. If the current
696    matching point is at the end of the subject string, all of them fail, since
697    there is no character to match.
698    
699  The fourth use of backslash is for certain simple assertions. An assertion  The fourth use of backslash is for certain simple assertions. An assertion
700  specifies a condition that has to be met at a particular point in a match,  specifies a condition that has to be met at a particular point in a match,
# Line 682  are in the class by enumerating those th Line 793  are in the class by enumerating those th
793  still consumes a character from the subject string, and fails if the current  still consumes a character from the subject string, and fails if the current
794  pointer is at the end of the string.  pointer is at the end of the string.
795    
796  When PCRE_CASELESS is set, any letters in a class represent both their upper  When caseless matching is set, any letters in a class represent both their
797  case and lower case versions, so for example, a caseless [aeiou] matches "A" as  upper case and lower case versions, so for example, a caseless [aeiou] matches
798  well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful  "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
799  version would.  caseful version would.
800    
801  The newline character is never treated in any special way in character classes,  The newline character is never treated in any special way in character classes,
802  whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class  whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class
# Line 695  The minus (hyphen) character can be used Line 806  The minus (hyphen) character can be used
806  character class. For example, [d-m] matches any letter between d and m,  character class. For example, [d-m] matches any letter between d and m,
807  inclusive. If a minus character is required in a class, it must be escaped with  inclusive. If a minus character is required in a class, it must be escaped with
808  a backslash or appear in a position where it cannot be interpreted as  a backslash or appear in a position where it cannot be interpreted as
809  indicating a range, typically as the first or last character in the class. It  indicating a range, typically as the first or last character in the class.
810  is not possible to have the character "]" as the end character of a range,  
811  since a sequence such as [w-] is interpreted as a class of two characters. The  It is not possible to have the literal character "]" as the end character of a
812  octal or hexadecimal representation of "]" can, however, be used to end a  range. A pattern such as [W-]46] is interpreted as a class of two characters
813  range.  ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
814    "-46]". However, if the "]" is escaped with a backslash it is interpreted as
815    the end of range, so [W-\\]46] is interpreted as a single class containing a
816    range followed by two separate characters. The octal or hexadecimal
817    representation of "]" can also be used to end a range.
818    
819  Ranges operate in ASCII collating sequence. They can also be used for  Ranges operate in ASCII collating sequence. They can also be used for
820  characters specified numerically, for example [\\000-\\037]. If a range such as  characters specified numerically, for example [\\000-\\037]. If a range that
821  [W-c] is used when PCRE_CASELESS is set, it matches the letters involved in  includes letters is used when caseless matching is set, it matches the letters
822  either case, so is equivalent to [][\\^_`wxyzabc], matched caselessly.  in either case. For example, [W-c] is equivalent to [][\\^_`wxyzabc], matched
823    caselessly, and if character tables for the "fr" locale are in use,
824    [\\xc8-\\xcb] matches accented E characters in both cases.
825    
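The effect of caseless matching on a range can be modelled as "the character itself, or its other case, lies in the range". The helper below is a sketch of that rule under the current locale; its name is hypothetical and it is not how PCRE represents classes internally.

```c
#include <assert.h>
#include <ctype.h>

/* Return 1 if c lies in the range lo..hi when letters are allowed to
   match in either case. All arguments are unsigned character codes. */
static int in_range_caseless(int c, int lo, int hi)
{
    assert(lo <= hi);
    if (c >= lo && c <= hi)
        return 1;
    if (isalpha(c)) {
        int other = islower(c) ? toupper(c) : tolower(c);
        return other >= lo && other <= hi;
    }
    return 0;
}
```

With lo = 'W' and hi = 'c', the letters w, x, y, z, a, b, c are accepted in either case along with the six punctuation characters that lie between "Z" and "a" in ASCII, which is the equivalence stated above.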
826  The character types \\d, \\D, \\s, \\S, \\w, and \\W may also appear in a  The character types \\d, \\D, \\s, \\S, \\w, and \\W may also appear in a
827  character class, and add the characters that they match to the class. For  character class, and add the characters that they match to the class. For
# Line 1060  same length of string. An assertion such Line 1177  same length of string. An assertion such
1177    
1178    (?<=ab(c|de))    (?<=ab(c|de))
1179    
1180  is not permitted, because its single branch can match two different lengths,  is not permitted, because its single top-level branch can match two different
1181  but it is acceptable if rewritten to use two branches:  lengths, but it is acceptable if rewritten to use two top-level branches:
1182    
1183    (?<=abc|abde)    (?<=abc|abde)
1184    
1185  The implementation of lookbehind assertions is, for each alternative, to  The implementation of lookbehind assertions is, for each alternative, to
1186  temporarily move the current position back by the fixed width and then try to  temporarily move the current position back by the fixed width and then try to
1187  match. If there are insufficient characters before the current position, the  match. If there are insufficient characters before the current position, the
1188  match is deemed to fail.  match is deemed to fail. Lookbehinds in conjunction with once-only subpatterns
1189    can be particularly useful for matching at the ends of strings; an example is
1190    given at the end of the section on once-only subpatterns.
1191    
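The step-back-and-compare procedure described above can be sketched for the simple case where every top-level alternative is a literal string, as in (?<=abc|abde). The function name and the restriction to literals are assumptions of this sketch; a real lookbehind branch may contain any fixed-width pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if any of the literal alternatives ends exactly at offset pos
   in subject, which is what a positive lookbehind at pos would require. */
static int lookbehind_any(const char *subject, size_t pos,
                          const char *const *alts, size_t nalts)
{
    size_t i;

    assert(subject != NULL && alts != NULL);
    assert(pos <= strlen(subject));
    for (i = 0; i < nalts; i++) {
        size_t width = strlen(alts[i]);   /* fixed width of this branch */
        if (width > pos)
            continue;                     /* too few characters before pos */
        if (memcmp(subject + pos - width, alts[i], width) == 0)
            return 1;
    }
    return 0;
}
```

Each alternative is tried with its own width, which is why the two-branch form is acceptable while a single branch of varying length is not.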
1192    Several assertions (of any sort) may occur in succession. For example,
1193    
1194  Assertions can be nested in any combination. For example,    (?<=\\d{3})(?<!999)foo
1195    
1196    matches "foo" preceded by three digits that are not "999". Furthermore,
1197    assertions can be nested in any combination. For example,
1198    
1199    (?<=(?<!foo)bar)baz    (?<=(?<!foo)bar)baz
1200    
# Line 1119  of characters that an identical standalo Line 1243  of characters that an identical standalo
1243  the current point in the subject string.  the current point in the subject string.
1244    
1245  Once-only subpatterns are not capturing subpatterns. Simple cases such as the  Once-only subpatterns are not capturing subpatterns. Simple cases such as the
1246  above example can be though of as a maximizing repeat that must swallow  above example can be thought of as a maximizing repeat that must swallow
1247  everything it can. So, while both \\d+ and \\d+? are prepared to adjust the  everything it can. So, while both \\d+ and \\d+? are prepared to adjust the
1248  number of digits they match in order to make the rest of the pattern match,  number of digits they match in order to make the rest of the pattern match,
1249  (?>\\d+) can only match an entire sequence of digits.  (?>\\d+) can only match an entire sequence of digits.
# Line 1127  number of digits they match in order to Line 1251  number of digits they match in order to
1251  This construction can of course contain arbitrarily complicated subpatterns,  This construction can of course contain arbitrarily complicated subpatterns,
1252  and it can be nested.  and it can be nested.
1253    
1254    Once-only subpatterns can be used in conjunction with lookbehind assertions to
1255    specify efficient matching at the end of the subject string. Consider a simple
1256    pattern such as
1257    
1258      abcd$
1259    
1260    when applied to a long string which does not match it. Because matching
1261    proceeds from left to right, PCRE will look for each "a" in the subject and
1262    then see if what follows matches the rest of the pattern. If the pattern is
1263    specified as
1264    
1265      .*abcd$
1266    
1267    then the initial .* matches the entire string at first, but when this fails, it
1268    backtracks to match all but the last character, then all but the last two
1269    characters, and so on. Once again the search for "a" covers the entire string,
1270    from right to left, so we are no better off. However, if the pattern is written
1271    as
1272    
1273      (?>.*)(?<=abcd)
1274    
1275    then there can be no backtracking for the .* item; it can match only the entire
1276    string. The subsequent lookbehind assertion does a single test on the last four
1277    characters. If it fails, the match fails immediately. For long strings, this
1278    approach makes a significant difference to the processing time.
1279    
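In effect, the final form of the pattern reduces the work to one comparison at the end of the subject. The sketch below shows the equivalent test in plain C, assuming the subject contains no newline so that .* really does cover the whole string; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the last strlen(suffix) bytes of subject equal suffix.
   A single comparison is made; the subject is never scanned. */
static int ends_with(const char *subject, size_t length, const char *suffix)
{
    size_t width;

    assert(subject != NULL && suffix != NULL);
    width = strlen(suffix);
    if (width > length)
        return 0;   /* subject is shorter than the suffix */
    return memcmp(subject + length - width, suffix, width) == 0;
}
```

The cost of this test does not grow with the length of the subject, which is the saving the paragraph above describes.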
1280    
1281  .SH CONDITIONAL SUBPATTERNS  .SH CONDITIONAL SUBPATTERNS
1282  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
# Line 1206  Cambridge CB2 3QG, England. Line 1356  Cambridge CB2 3QG, England.
1356  .br  .br
1357  Phone: +44 1223 334714  Phone: +44 1223 334714
1358    
1359  Copyright (c) 1998 University of Cambridge.  Copyright (c) 1997-1999 University of Cambridge.

