.\" Revision 846, Tue Jan 3 13:57:27 2012 UTC, by ph10: Documentation update for 16-bit.
.TH PCRE 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "THE PCRE 16-BIT LIBRARY"
.rs
.sp
Starting with release 8.30, it is possible to compile a PCRE library that
supports 16-bit character strings, including UTF-16 strings, as well as or
instead of the original 8-bit library. The majority of the work to make this
possible was done by Zoltan Herczeg. The two libraries contain identical sets
of functions, used in exactly the same way. Only the names of the functions and
the data types of their string arguments are different. To avoid
over-complication and reduce the documentation maintenance load, most of the
documentation describes the 8-bit library, with only occasional references to
the 16-bit library. This page describes what is different when you use the
16-bit library.
.P
WARNING: A single application can be linked with both libraries, but you must
take care when processing any particular pattern to use functions from just one
library. For example, if you want to study a pattern that was compiled with
\fBpcre16_compile()\fP, you must do so with \fBpcre16_study()\fP, not
\fBpcre_study()\fP, and you must free the study data with
\fBpcre16_free_study()\fP.
.
.
.SH "THE HEADER FILE"
.rs
.sp
There is only one header file, \fBpcre.h\fP. It contains prototypes for all the
functions in both libraries, as well as definitions of flags, error codes, etc.
.
.
.SH "STRING TYPES"
.rs
.sp
In the 8-bit library, strings are passed to PCRE library functions as vectors
of bytes with the C type "char *". In the 16-bit library, strings are passed as
vectors of unsigned 16-bit quantities. The macro PCRE_SCHAR16 specifies an
appropriate data type, and PCRE_SPTR16 is defined as "const PCRE_SCHAR16 *". In
very many environments, "short int" is a 16-bit data type. When PCRE is built,
it defines PCRE_SCHAR16 as "short int", but checks that it really is a 16-bit
data type. If it is not, the build fails with an error message telling the
maintainer to modify the definition appropriately.
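.P
As a rough illustration of the kind of check described above, the following C
sketch refuses to compile if the chosen type is not 16 bits wide, and widens a
zero-terminated 8-bit string into a 16-bit vector. The names
\fBdemo_char16\fP, \fBdemo_char16_width_check\fP, and
\fBdemo_widen_ascii()\fP are hypothetical and are not part of PCRE, and the
use of "unsigned short" is only an assumption about the platform.
.sp
.nf
```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for a 16-bit code unit type; not defined by PCRE. */
typedef unsigned short demo_char16;

/* Compile-time check: the array size is negative, so compilation fails,
   if demo_char16 is not exactly 16 bits wide on this platform. */
typedef char demo_char16_width_check[
  (sizeof(demo_char16) * CHAR_BIT == 16) ? 1 : -1];

/* Widen a zero-terminated 8-bit string into a zero-terminated 16-bit
   vector. Returns the number of 16-bit units written, excluding the
   terminator, or (size_t)-1 if the output vector is too small. */
size_t demo_widen_ascii(const char *in, demo_char16 *out, size_t out_units)
{
  size_t i;
  for (i = 0; in[i] != 0; i++)
    {
    if (i + 1 >= out_units) return (size_t)-1;  /* no room for terminator */
    out[i] = (demo_char16)(unsigned char)in[i];
    }
  if (out_units == 0) return (size_t)-1;
  out[i] = 0;
  return i;
}
```
.fi
.P
A vector filled in this way could then be passed wherever a function in the
16-bit library expects a PCRE_SPTR16 argument, provided that the type chosen
here matches the one used when PCRE was built.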
.
.
.SH "16-BIT FUNCTIONS WITH DIFFERING ARGUMENT TYPES"
.rs
.sp
For every function in the 8-bit library there is a corresponding function in
the 16-bit library with a name that starts with \fBpcre16_\fP instead of
\fBpcre_\fP. All of these functions have the same number of arguments, and
yield the same results. Many of them also have exactly the same argument types.
Those that differ are as follows:

\fBpcre16_compile()\fP and \fBpcre16_compile2()\fP: the type of the first
argument must be PCRE_SPTR16 instead of "const char *".

\fBpcre16_exec()\fP and \fBpcre16_dfa_exec()\fP: the type of the third argument
must be PCRE_SPTR16 instead of "const char *".

\fBpcre16_copy_named_substring()\fP: the type of the second and fifth arguments
must be PCRE_SPTR16 instead of "const char *" and the type of the sixth
argument must be "PCRE_SCHAR16 *" instead of "char *".

\fBpcre16_copy_substring()\fP: the type of the first argument must be
PCRE_SPTR16 instead of "const char *" and the type of the fifth argument must
be "PCRE_SCHAR16 *" instead of "char *".

\fBpcre16_get_named_substring()\fP: the type of the second and fifth arguments
must be PCRE_SPTR16 instead of "const char *" and the type of the sixth
argument must be "PCRE_SPTR16 *" instead of "const char **".

\fBpcre16_get_substring()\fP: the type of the first argument must be
PCRE_SPTR16 instead of "const char *" and the type of the fifth argument must
be "PCRE_SPTR16 *" instead of "const char **".

\fBpcre16_free_substring()\fP: the type of the argument must be PCRE_SPTR16
instead of "const char *".

\fBpcre16_get_substring_list()\fP: the type of the first argument must be
PCRE_SPTR16 instead of "const char *", and the type of the fourth argument must
be "PCRE_SPTR16 **" instead of "const char ***".

\fBpcre16_free_substring_list()\fP: the type of the argument must be
"PCRE_SPTR16 *" instead of "const char **".

\fBpcre16_get_stringnumber()\fP: the type of the second argument must be
PCRE_SPTR16 instead of "const char *".

\fBpcre16_get_stringtable_entries()\fP: the types of the second, third, and
fourth arguments must be PCRE_SPTR16, "PCRE_SCHAR16 **", and "PCRE_SCHAR16 **"
instead of "const char *", "char **", and "char **".
.
.
.SH "SUBJECT STRING OFFSETS"
.rs
.sp
The offsets within subject strings that are returned by the matching functions
are in 16-bit units rather than bytes.
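.P
Code that needs byte positions, for example to address the underlying buffer,
has to scale these offsets itself. The following C sketch shows one way to do
that. The helper names \fBdemo_units_to_bytes()\fP and
\fBdemo_match_length_units()\fP are hypothetical and not part of PCRE, and the
sketch assumes 8-bit bytes and non-negative offsets.
.sp
.nf
```c
#include <stddef.h>

/* Convert an offset counted in 16-bit units into a byte offset.
   Assumes 8-bit bytes, so each 16-bit unit occupies two bytes. */
size_t demo_units_to_bytes(int unit_offset)
{
  return (size_t)unit_offset * 2;
}

/* Length, in 16-bit units, of the substring described by a pair of
   start and end offsets such as a matching function returns. */
int demo_match_length_units(const int *pair)
{
  return pair[1] - pair[0];
}
```
.fi
.P
For a pair of offsets 2 and 5, the matched substring is 3 units long and
starts 4 bytes into the subject buffer.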
.
.
.SH "NAMED SUBPATTERNS"
.rs
.sp
The name-to-number translation table that is maintained for named subpatterns
uses 16-bit characters. The \fBpcre16_get_stringtable_entries()\fP function
returns the length of each entry in the table as the number of 16-bit data
items.
.
.
.SH "OPTION NAMES"
.rs
.sp
There are two new general option names, PCRE_UTF16 and PCRE_NO_UTF16_CHECK,
which correspond to PCRE_UTF8 and PCRE_NO_UTF8_CHECK in the 8-bit library. In
fact, these new options define the same bits in the options word.
.P
For the \fBpcre16_config()\fP function there is an option PCRE_CONFIG_UTF16
that returns 1 if UTF-16 support is configured, otherwise 0. If this option is
given to \fBpcre_config()\fP, or if the PCRE_CONFIG_UTF8 option is given to
\fBpcre16_config()\fP, the result is the PCRE_ERROR_BADOPTION error.
.
.
.SH "CHARACTER CODES"
.rs
.sp
In 16-bit mode, when PCRE_UTF16 is not set, character values are treated in the
same way as in 8-bit, non-UTF-8 mode, except, of course, that they can range
from 0 to 0xFFFF instead of 0 to 0xFF. Character types for characters less than
0xFF can therefore be influenced by the locale in the same way as before.
Characters greater than 0xFF have only one case, and no "type" (such as letter
or digit).
.P
In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10FFFF, with
the exception of values in the range 0xD800 to 0xDFFF because those are
"surrogate" values that are used in pairs to encode values greater than 0xFFFF.
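.P
The way a surrogate pair encodes a value greater than 0xFFFF can be sketched
in C as follows. This is the standard UTF-16 arithmetic, shown only for
illustration. The function names are hypothetical and are not part of PCRE.
.sp
.nf
```c
/* Nonzero if the 16-bit unit is a high (leading) surrogate. */
int demo_is_high_surrogate(unsigned int u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

/* Nonzero if the 16-bit unit is a low (trailing) surrogate. */
int demo_is_low_surrogate(unsigned int u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a high and a low surrogate into one Unicode code point.
   Each surrogate carries 10 bits; the result lies in the range
   0x10000 to 0x10FFFF. The caller must check both units first. */
unsigned long demo_combine_surrogates(unsigned int high, unsigned int low)
{
  return 0x10000UL
    + (((unsigned long)high - 0xD800UL) << 10)
    + ((unsigned long)low - 0xDC00UL);
}
```
.fi
.P
For example, the pair 0xD83D, 0xDE00 combines to the code point 0x1F600, and
the largest pair 0xDBFF, 0xDFFF combines to 0x10FFFF.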
.P
A UTF-16 string can indicate its endianness by a special code known as a BOM
(byte order mark) at its start. The PCRE functions do not handle this. However,
a function called \fBpcre16_utf16_to_host_byte_order()\fP is provided. It
checks the byte order of a UTF-16 string and converts it if necessary,
optionally removing the BOM data. It is documented with all the other functions
in the
.\" HREF
\fBpcreapi\fP
.\"
page.
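.P
To make the idea concrete, the following C sketch shows a much simplified
byte-order conversion. It is not the implementation of
\fBpcre16_utf16_to_host_byte_order()\fP and does not reproduce its argument
list. The names \fBdemo_char16\fP, \fBdemo_swap16()\fP, and
\fBdemo_to_host_order()\fP are hypothetical, and the sketch looks for a BOM
only at the start of the string.
.sp
.nf
```c
#include <stddef.h>

/* Hypothetical 16-bit code unit type; not defined by PCRE. */
typedef unsigned short demo_char16;

/* Swap the two bytes of one 16-bit unit. */
static demo_char16 demo_swap16(demo_char16 u)
{
  return (demo_char16)(((u & 0xFF) << 8) | ((u >> 8) & 0xFF));
}

/* Convert a vector of 16-bit units to host byte order in place.
   A first unit of 0xFFFE is a byte-swapped BOM, so every unit is
   swapped. If the first unit is then the BOM 0xFEFF and keep_bom is
   zero, the BOM is dropped. Returns the new length in units. */
size_t demo_to_host_order(demo_char16 *s, size_t length, int keep_bom)
{
  size_t i;
  if (length == 0) return 0;
  if (s[0] == 0xFFFE)
    for (i = 0; i < length; i++) s[i] = demo_swap16(s[i]);
  if (s[0] == 0xFEFF && !keep_bom)
    {
    for (i = 1; i < length; i++) s[i - 1] = s[i];
    length--;
    }
  return length;
}
```
.fi
.P
A string without a BOM is left unchanged by this sketch, because it has no
way of knowing the byte order in that case.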
.
.
.SH "ERROR NAMES"
.rs
.sp
The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 correspond to
their 8-bit counterparts. The error PCRE_ERROR_BADMODE is given when a compiled
pattern is passed to a function that processes patterns in the other
mode, for example, if a pattern compiled with \fBpcre_compile()\fP is passed to
\fBpcre16_exec()\fP.
.P
There are new error codes whose names begin with PCRE_UTF16_ERR for invalid
UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings.
They are documented in the
.\" HREF
\fBpcreapi\fP
.\"
page.
.
.
.SH "ERROR TEXTS"
.rs
.sp
If there is an error while compiling a pattern, the error text that is passed
back by \fBpcre16_compile()\fP or \fBpcre16_compile2()\fP is still an 8-bit
character string, zero-terminated.
.
.
.SH "CALLOUTS"
.rs
.sp
The \fIsubject\fP and \fImark\fP fields in the callout block that is passed to
a callout function point to 16-bit vectors.
.
.
.SH "TESTING"
.rs
.sp
The \fBpcretest\fP program continues to operate with 8-bit input and output
files, but it can be used for testing the 16-bit library. If it is run with the
command line option \fB-16\fP, patterns and subject strings are converted from
8-bit to 16-bit before being passed to PCRE, and the 16-bit library functions
are used instead of the 8-bit ones. Returned 16-bit strings are converted to
8-bit for output. If the 8-bit library was not compiled, \fBpcretest\fP
defaults to 16-bit and the \fB-16\fP option is ignored.
.P
When PCRE is being built, the \fBRunTest\fP script that is called by "make
check" uses the \fBpcretest\fP \fB-C\fP option to discover which of the 8-bit
and 16-bit libraries has been built, and runs the tests appropriately.
.
.
.SH "NOT SUPPORTED IN 16-BIT MODE"
.rs
.sp
Not all the features of the 8-bit library are available with the 16-bit
library. The C++ and POSIX wrapper functions support only the 8-bit library,
and the \fBpcregrep\fP program is at present 8-bit only.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 03 January 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
