--- code/trunk/pcre.3 2007/02/24 21:38:43 24 +++ code/trunk/pcre.3 2007/02/24 21:38:45 25 @@ -8,7 +8,12 @@ .br .B pcre *pcre_compile(const char *\fIpattern\fR, int \fIoptions\fR, .ti +5n -.B const char **\fIerrptr\fR, int *\fIerroffset\fR); +.B const char **\fIerrptr\fR, int *\fIerroffset\fR, +.ti +5n +.B const unsigned char *\fItableptr\fR); +.PP +.br +.B const unsigned char *pcre_maketables(void); .PP .br .B pcre_extra *pcre_study(const pcre *\fIcode\fR, int \fIoptions\fR, @@ -34,18 +39,6 @@ .PP .br .B void (*pcre_free)(void *); -.PP -.br -.B unsigned char *pcre_cbits[128]; -.PP -.br -.B unsigned char *pcre_ctypes[256]; -.PP -.br -.B unsigned char *pcre_fcc[256]; -.PP -.br -.B unsigned char *pcre_lcc[256]; @@ -60,7 +53,10 @@ The three functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR are used for compiling and matching regular expressions. The -function \fBpcre_info()\fR is used to find out information about a compiled +function \fBpcre_maketables()\fR is used (optionally) to build a set of +character tables in the current locale for passing to \fBpcre_compile()\fR. + +The function \fBpcre_info()\fR is used to find out information about a compiled pattern, while the function \fBpcre_version()\fR returns a pointer to a string containing the version of PCRE and its date of release. @@ -70,16 +66,11 @@ so a calling program can replace them if it wishes to intercept the calls. This should be done before calling any PCRE functions. -The other global variables are character tables. They are initialized when PCRE -is compiled, from source that is generated by reference to the C character type -functions, but which a user of PCRE is free to modify. In principle the tables -could also be modified at run time. See PCRE's README file for more details. 
- .SH MULTI-THREADING The PCRE functions can be used in multi-threading applications, with the -proviso that the character tables and the memory management functions pointed -to by \fBpcre_malloc\fR and \fBpcre_free\fR are shared by all threads. +proviso that the memory management functions pointed to by \fBpcre_malloc\fR +and \fBpcre_free\fR are shared by all threads. The compiled form of a regular expression is not altered during matching, so the same compiled pattern can safely be used by several threads at once. @@ -88,10 +79,12 @@ .SH COMPILING A PATTERN The function \fBpcre_compile()\fR is called to compile a pattern into an internal form. The pattern is a C string terminated by a binary zero, and -is passed in the argument \fIpattern\fR. A pointer to the compiled code block -is returned. The \fBpcre\fR type is defined for this for convenience, but in -fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the contents of the -block are not defined. +is passed in the argument \fIpattern\fR. A pointer to a single block of memory +that is obtained via \fBpcre_malloc\fR is returned. This contains the +compiled code and related data. The \fBpcre\fR type is defined for this for +convenience, but in fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the +contents of the block are not externally defined. It is up to the caller to +free the memory when it is no longer required. .PP The size of a compiled pattern is roughly proportional to the length of the pattern string, except that each character class (other than those containing @@ -111,11 +104,14 @@ If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately. Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual -error message. - -The offset from the start of the pattern to the character where the error was -discovered is placed in the variable pointed to by \fIerroffset\fR, which must -not be NULL. 
If it is, an immediate error is given. +error message. The offset from the start of the pattern to the character where +the error was discovered is placed in the variable pointed to by +\fIerroffset\fR, which must not be NULL. If it is, an immediate error is given. +.PP +If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of +character tables which are built when it is compiled, using the default C +locale. Otherwise, \fItableptr\fR must be the result of a call to +\fBpcre_maketables()\fR. See the section on locale support below. .PP The following option bits are defined in the header file: @@ -210,6 +206,33 @@ characters is created. +.SH LOCALE SUPPORT +PCRE handles caseless matching, and determines whether characters are letters, +digits, or whatever, by reference to a set of tables. The library contains a +default set of tables which is created in the default C locale when PCRE is +compiled. This is used when the final argument of \fBpcre_compile()\fR is NULL, +and is sufficient for many applications. + +An alternative set of tables can, however, be supplied. Such tables are built +by calling the \fBpcre_maketables()\fR function, which has no arguments, in the +relevant locale. The result can then be passed to \fBpcre_compile()\fR as often +as necessary. For example, to build and use tables that are appropriate for the +French locale (where accented characters with codes greater than 128 are +treated as letters), the following code could be used: + + setlocale(LC_CTYPE, "fr"); + tables = pcre_maketables(); + re = pcre_compile(..., tables); + +The tables are built in memory that is obtained via \fBpcre_malloc\fR. The +pointer that is passed to \fBpcre_compile\fR is saved with the compiled +pattern, and the same tables are used via this pointer by \fBpcre_study()\fR +and \fBpcre_exec()\fR. Thus for any single pattern, compilation, studying and +matching all happen in the same locale, but different patterns can be compiled +in different locales. 
It is the caller's responsibility to ensure that the +memory containing the tables remains available for as long as it is needed. + + .SH MATCHING A PATTERN The function \fBpcre_exec()\fR is called to match a subject string against a pre-compiled pattern, which is passed in the \fIcode\fR argument. If the @@ -579,11 +602,16 @@ two disjoint sets. Any given character matches one, and only one, of each pair. A "word" character is any letter or digit or the underscore character, that is, -any character which can be part of a Perl "word". These character type -sequences can appear both inside and outside character classes. They each match -one character of the appropriate type. If the current matching point is at the -end of the subject string, all of them fail, since there is no character to -match. +any character which can be part of a Perl "word". The definition of letters and +digits is controlled by PCRE's character tables, and may vary if locale- +specific matching is taking place (see "Locale support" above). For example, in +the "fr" (French) locale, some character codes greater than 128 are used for +accented letters, and these are matched by \\w. + +These character type sequences can appear both inside and outside character +classes. They each match one character of the appropriate type. If the current +matching point is at the end of the subject string, all of them fail, since +there is no character to match. The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, @@ -682,10 +710,10 @@ still consumes a character from the subject string, and fails if the current pointer is at the end of the string. -When PCRE_CASELESS is set, any letters in a class represent both their upper -case and lower case versions, so for example, a caseless [aeiou] matches "A" as -well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful -version would. 
+When caseless matching is set, any letters in a class represent both their +upper case and lower case versions, so for example, a caseless [aeiou] matches +"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a +caseful version would. The newline character is never treated in any special way in character classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class @@ -702,9 +730,11 @@ range. Ranges operate in ASCII collating sequence. They can also be used for -characters specified numerically, for example [\\000-\\037]. If a range such as -[W-c] is used when PCRE_CASELESS is set, it matches the letters involved in -either case, so is equivalent to [][\\^_`wxyzabc], matched caselessly. +characters specified numerically, for example [\\000-\\037]. If a range that +includes letters is used when caseless matching is set, it matches the letters +in either case. For example, [W-c] is equivalent to [][\\^_`wxyzabc], matched +caselessly, and if character tables for the "fr" locale are in use, +[\\xc8-\\xcb] matches accented E characters in both cases. The character types \\d, \\D, \\s, \\S, \\w, and \\W may also appear in a character class, and add the characters that they match to the class. For