Diff of /code/trunk/doc/pcre16.3

revision 902 by ph10, Mon Jan 9 17:43:54 2012 UTC revision 903 by ph10, Sat Jan 21 16:37:17 2012 UTC
# Line 139  PCRE - Perl-compatible regular expressio Line 139  PCRE - Perl-compatible regular expressio
139  .sp  .sp
140  .B int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *\fIoutput\fP,  .B int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *\fIoutput\fP,
141  .ti +5n  .ti +5n
142  .B PCRE_SPTR16 \fIinput\fP, int \fIlength\fP, int *\fIbyte_order\fP,  .B PCRE_SPTR16 \fIinput\fP, int \fIlength\fP, int *\fIbyte_order\fP,
143  .ti +5n  .ti +5n
144  .B int \fIkeep_boms\fP);  .B int \fIkeep_boms\fP);
145  .  .
# Line 158  PCRE documentation describes the 8-bit l Line 158  PCRE documentation describes the 8-bit l
158  to the 16-bit library. This page describes what is different when you use the  to the 16-bit library. This page describes what is different when you use the
159  16-bit library.  16-bit library.
160  .P  .P
161  WARNING: A single application can be linked with both libraries, but you must  WARNING: A single application can be linked with both libraries, but you must
162  take care when processing any particular pattern to use functions from just one  take care when processing any particular pattern to use functions from just one
163  library. For example, if you want to study a pattern that was compiled with  library. For example, if you want to study a pattern that was compiled with
164  \fBpcre16_compile()\fP, you must do so with \fBpcre16_study()\fP, not  \fBpcre16_compile()\fP, you must do so with \fBpcre16_study()\fP, not
165  \fBpcre_study()\fP, and you must free the study data with  \fBpcre_study()\fP, and you must free the study data with
# Line 169  library. For example, if you want to stu Line 169  library. For example, if you want to stu
169  .SH "THE HEADER FILE"  .SH "THE HEADER FILE"
170  .rs  .rs
171  .sp  .sp
172  There is only one header file, \fBpcre.h\fP. It contains prototypes for all the  There is only one header file, \fBpcre.h\fP. It contains prototypes for all the
173  functions in both libraries, as well as definitions of flags, structures, error  functions in both libraries, as well as definitions of flags, structures, error
174  codes, etc.  codes, etc.
175  .  .
# Line 177  codes, etc. Line 177  codes, etc.
177  .SH "THE LIBRARY NAME"  .SH "THE LIBRARY NAME"
178  .rs  .rs
179  .sp  .sp
180  In Unix-like systems, the 16-bit library is called \fBlibpcre16\fP, and can  In Unix-like systems, the 16-bit library is called \fBlibpcre16\fP, and can
181  normally be accessed by adding \fB-lpcre16\fP to the command for linking an  normally be accessed by adding \fB-lpcre16\fP to the command for linking an
182  application that uses PCRE.  application that uses PCRE.
183  .  .
184  .  .
185  .SH "STRING TYPES"  .SH "STRING TYPES"
186  .rs  .rs
187  .sp  .sp
188  In the 8-bit library, strings are passed to PCRE library functions as vectors  In the 8-bit library, strings are passed to PCRE library functions as vectors
189  of bytes with the C type "char *". In the 16-bit library, strings are passed as  of bytes with the C type "char *". In the 16-bit library, strings are passed as
190  vectors of unsigned 16-bit quantities. The macro PCRE_UCHAR16 specifies an  vectors of unsigned 16-bit quantities. The macro PCRE_UCHAR16 specifies an
191  appropriate data type, and PCRE_SPTR16 is defined as "const PCRE_UCHAR16 *". In  appropriate data type, and PCRE_SPTR16 is defined as "const PCRE_UCHAR16 *". In
192  very many environments, "short int" is a 16-bit data type. When PCRE is built,  very many environments, "short int" is a 16-bit data type. When PCRE is built,
193  it defines PCRE_UCHAR16 as "unsigned short int", but checks that it really is a 16-bit  it defines PCRE_UCHAR16 as "unsigned short int", but checks that it really is a 16-bit
194  data type. If it is not, the build fails with an error message telling the  data type. If it is not, the build fails with an error message telling the
195  maintainer to modify the definition appropriately.  maintainer to modify the definition appropriately.
196  .  .
197  .  .
198  .SH "STRUCTURE TYPES"  .SH "STRUCTURE TYPES"
199  .rs  .rs
200  .sp  .sp
201  The types of the opaque structures that are used for compiled 16-bit patterns  The types of the opaque structures that are used for compiled 16-bit patterns
202  and JIT stacks are \fBpcre16\fP and \fBpcre16_jit_stack\fP respectively. The  and JIT stacks are \fBpcre16\fP and \fBpcre16_jit_stack\fP respectively. The
203  type of the user-accessible structure that is returned by \fBpcre16_study()\fP  type of the user-accessible structure that is returned by \fBpcre16_study()\fP
204  is \fBpcre16_extra\fP, and the type of the structure that is used for passing  is \fBpcre16_extra\fP, and the type of the structure that is used for passing
205  data to a callout function is \fBpcre16_callout_block\fP. These structures  data to a callout function is \fBpcre16_callout_block\fP. These structures
206  contain the same fields, with the same names, as their 8-bit counterparts. The  contain the same fields, with the same names, as their 8-bit counterparts. The
207  only difference is that pointers to character strings are 16-bit instead of  only difference is that pointers to character strings are 16-bit instead of
208  8-bit types.  8-bit types.
209  .  .
210  .  .
# Line 212  only difference is that pointers to char Line 212  only difference is that pointers to char
212  .rs  .rs
213  .sp  .sp
214  For every function in the 8-bit library there is a corresponding function in  For every function in the 8-bit library there is a corresponding function in
215  the 16-bit library with a name that starts with \fBpcre16_\fP instead of  the 16-bit library with a name that starts with \fBpcre16_\fP instead of
216  \fBpcre_\fP. The prototypes are listed above. In addition, there is one extra  \fBpcre_\fP. The prototypes are listed above. In addition, there is one extra
217  function, \fBpcre16_utf16_to_host_byte_order()\fP. This is a utility function  function, \fBpcre16_utf16_to_host_byte_order()\fP. This is a utility function
218  that converts a UTF-16 character string to host byte order if necessary. The  that converts a UTF-16 character string to host byte order if necessary. The
219  other 16-bit functions expect the strings they are passed to be in host byte  other 16-bit functions expect the strings they are passed to be in host byte
220  order.  order.
221  .P  .P
222  The \fIinput\fP and \fIoutput\fP arguments of  The \fIinput\fP and \fIoutput\fP arguments of
223  \fBpcre16_utf16_to_host_byte_order()\fP may point to the same address, that is,  \fBpcre16_utf16_to_host_byte_order()\fP may point to the same address, that is,
224  conversion in place is supported. The output buffer must be at least as long as  conversion in place is supported. The output buffer must be at least as long as
225  the input.  the input.
226  .P  .P
227  The \fIlength\fP argument specifies the number of 16-bit data units in the  The \fIlength\fP argument specifies the number of 16-bit data units in the
228  input string; a negative value specifies a zero-terminated string.  input string; a negative value specifies a zero-terminated string.
229  .P  .P
230  If \fIbyte_order\fP is NULL, it is assumed that the string starts off in host  If \fIbyte_order\fP is NULL, it is assumed that the string starts off in host
231  byte order. This may be changed by byte-order marks (BOMs) anywhere in the  byte order. This may be changed by byte-order marks (BOMs) anywhere in the
232  string (commonly as the first character).  string (commonly as the first character).
233  .P  .P
234  If \fIbyte_order\fP is not NULL, a non-zero value of the integer to which it  If \fIbyte_order\fP is not NULL, a non-zero value of the integer to which it
235  points means that the input starts off in host byte order, otherwise the  points means that the input starts off in host byte order, otherwise the
236  opposite order is assumed. Again, BOMs in the string can change this. The final  opposite order is assumed. Again, BOMs in the string can change this. The final
237  byte order is passed back at the end of processing.  byte order is passed back at the end of processing.
238  .P  .P
239  If \fIkeep_boms\fP is not zero, byte-order mark characters (0xfeff) are copied  If \fIkeep_boms\fP is not zero, byte-order mark characters (0xfeff) are copied
240  into the output string. Otherwise they are discarded.  into the output string. Otherwise they are discarded.
241  .P  .P
242  The result of the function is the number of 16-bit units placed into the output  The result of the function is the number of 16-bit units placed into the output
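The argument rules described in this hunk can be sketched as a small stand-alone C function. This is not the PCRE source: the names sketch_uchar16, swap16 and sketch_utf16_to_host_byte_order are hypothetical, uint16_t stands in for PCRE_UCHAR16, and counting the zero terminator in the result for a negative length is an assumption based on the truncated sentence above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PCRE_UCHAR16 (assumes uint16_t is available). */
typedef uint16_t sketch_uchar16;

/* Swap the two bytes of one 16-bit unit. */
static sketch_uchar16 swap16(sketch_uchar16 u)
{
    return (sketch_uchar16)(((u & 0xffu) << 8) | (u >> 8));
}

/* Illustrative re-statement of the documented behaviour, not PCRE code.
 * Returns the number of 16-bit units written to output. */
int sketch_utf16_to_host_byte_order(sketch_uchar16 *output,
                                    const sketch_uchar16 *input,
                                    int length, int *byte_order,
                                    int keep_boms)
{
    /* NULL byte_order: assume the input starts in host byte order. */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    int written = 0;
    int i;

    assert(output != NULL && input != NULL);

    if (length < 0) {
        /* Zero-terminated input: count the units, terminator included
         * (assumption: the terminator is part of the result). */
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        sketch_uchar16 u = input[i];
        if (u == 0xfeffu || u == 0xfffeu) {
            /* A BOM anywhere in the string resets the current order. */
            host_order = (u == 0xfeffu);
            if (keep_boms) output[written++] = 0xfeffu;
        } else {
            output[written++] = host_order ? u : swap16(u);
        }
    }

    /* Pass the final byte order back to the caller. */
    if (byte_order != NULL) *byte_order = host_order;
    return written;
}
```

Because the write index never overtakes the read index, the sketch also works when input and output point to the same buffer, which matches the in-place conversion the text allows.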
# Line 246  buffer, including the zero terminator if Line 246  buffer, including the zero terminator if
246  .SH "SUBJECT STRING OFFSETS"  .SH "SUBJECT STRING OFFSETS"
247  .rs  .rs
248  .sp  .sp
249  The offsets within subject strings that are returned by the matching functions  The offsets within subject strings that are returned by the matching functions
250  are in 16-bit units rather than bytes.  are in 16-bit units rather than bytes.
251  .  .
252  .  .
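Since offsets are counted in 16-bit units, code that ports from the 8-bit library has to scale them before using them as byte counts. The following is a minimal sketch under that reading; sketch_units_to_bytes and sketch_copy_match are hypothetical helpers, not PCRE functions, and uint16_t stands in for PCRE_UCHAR16.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset that corresponds to an offset measured in 16-bit units. */
size_t sketch_units_to_bytes(int unit_offset)
{
    return (size_t)unit_offset * sizeof(uint16_t);
}

/* Copy the match [start, end), given in 16-bit units, into dst and
 * zero-terminate it. Returns the number of units copied, or -1 if the
 * offsets are invalid or dst (dst_units long) is too small. */
int sketch_copy_match(const uint16_t *subject, int start, int end,
                      uint16_t *dst, int dst_units)
{
    int len = end - start;
    int i;

    if (start < 0 || len < 0 || len + 1 > dst_units) return -1;
    for (i = 0; i < len; i++) dst[i] = subject[start + i];
    dst[len] = 0;
    return len;
}
```

A pair of offsets such as 1 and 3 therefore selects two 16-bit units, which occupy bytes 2 to 5 of the subject buffer.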
253  .SH "NAMED SUBPATTERNS"  .SH "NAMED SUBPATTERNS"
254  .rs  .rs
255  .sp  .sp
256  The name-to-number translation table that is maintained for named subpatterns  The name-to-number translation table that is maintained for named subpatterns
257  uses 16-bit characters. The \fBpcre16_get_stringtable_entries()\fP function  uses 16-bit characters. The \fBpcre16_get_stringtable_entries()\fP function
258  returns the length of each entry in the table as the number of 16-bit data  returns the length of each entry in the table as the number of 16-bit data
259  units.  units.
260  .  .
261  .  .
# Line 266  There are two new general option names, Line 266  There are two new general option names,
266  which correspond to PCRE_UTF8 and PCRE_NO_UTF8_CHECK in the 8-bit library. In  which correspond to PCRE_UTF8 and PCRE_NO_UTF8_CHECK in the 8-bit library. In
267  fact, these new options define the same bits in the options word.  fact, these new options define the same bits in the options word.
268  .P  .P
269  For the \fBpcre16_config()\fP function there is an option PCRE_CONFIG_UTF16  For the \fBpcre16_config()\fP function there is an option PCRE_CONFIG_UTF16
270  that returns 1 if UTF-16 support is configured, otherwise 0. If this option is  that returns 1 if UTF-16 support is configured, otherwise 0. If this option is
271  given to \fBpcre_config()\fP, or if the PCRE_CONFIG_UTF8 option is given to  given to \fBpcre_config()\fP, or if the PCRE_CONFIG_UTF8 option is given to
272  \fBpcre16_config()\fP, the result is the PCRE_ERROR_BADOPTION error.  \fBpcre16_config()\fP, the result is the PCRE_ERROR_BADOPTION error.
# Line 275  given to \fBpcre_config()\fP, or if the Line 275  given to \fBpcre_config()\fP, or if the
275  .SH "CHARACTER CODES"  .SH "CHARACTER CODES"
276  .rs  .rs
277  .sp  .sp
278  In 16-bit mode, when PCRE_UTF16 is not set, character values are treated in the  In 16-bit mode, when PCRE_UTF16 is not set, character values are treated in the
279  same way as in 8-bit, non UTF-8 mode, except, of course, that they can range  same way as in 8-bit, non UTF-8 mode, except, of course, that they can range
280  from 0 to 0xffff instead of 0 to 0xff. Character types for characters less than  from 0 to 0xffff instead of 0 to 0xff. Character types for characters less than
281  0xff can therefore be influenced by the locale in the same way as before.  0xff can therefore be influenced by the locale in the same way as before.
282  Characters greater than 0xff have only one case, and no "type" (such as letter  Characters greater than 0xff have only one case, and no "type" (such as letter
283  or digit).  or digit).
284  .P  .P
285  In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with  In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with
286  the exception of values in the range 0xd800 to 0xdfff because those are  the exception of values in the range 0xd800 to 0xdfff because those are
287  "surrogate" values that are used in pairs to encode values greater than 0xffff.  "surrogate" values that are used in pairs to encode values greater than 0xffff.
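The surrogate mechanism mentioned above follows the standard UTF-16 arithmetic, which can be sketched as below. The function name sketch_decode_utf16 is hypothetical and is not part of PCRE; the sketch assumes the input is already valid UTF-16 in host byte order.

```c
#include <stdint.h>

/* Decode one code point starting at s[0]. *units receives the number of
 * 16-bit units consumed (1 or 2). Assumes valid UTF-16 input, so a high
 * surrogate is always followed by a low surrogate. */
uint32_t sketch_decode_utf16(const uint16_t *s, int *units)
{
    if ((s[0] & 0xfc00u) == 0xd800u) {
        /* High surrogate: combine 10 bits from each unit, then add the
         * 0x10000 offset to reach the range above 0xffff. */
        *units = 2;
        return 0x10000u + (((uint32_t)(s[0] & 0x3ffu) << 10)
                           | (uint32_t)(s[1] & 0x3ffu));
    }
    *units = 1;
    return s[0];
}
```

For example, the pair 0xd83d 0xde00 decodes to 0x1f600 and consumes two units, while 0x0041 decodes to itself and consumes one.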
288  .P  .P
289  A UTF-16 string can indicate its endianness by a special code known as a  A UTF-16 string can indicate its endianness by a special code known as a
290  byte-order mark (BOM). The PCRE functions do not handle this, expecting strings  byte-order mark (BOM). The PCRE functions do not handle this, expecting strings
291  to be in host byte order. A utility function called  to be in host byte order. A utility function called
292  \fBpcre16_utf16_to_host_byte_order()\fP is provided to help with this (see  \fBpcre16_utf16_to_host_byte_order()\fP is provided to help with this (see
# Line 296  above). Line 296  above).
296  .SH "ERROR NAMES"  .SH "ERROR NAMES"
297  .rs  .rs
298  .sp  .sp
299  The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 correspond to  The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 correspond to
300  their 8-bit counterparts. The error PCRE_ERROR_BADMODE is given when a compiled  their 8-bit counterparts. The error PCRE_ERROR_BADMODE is given when a compiled
301  pattern is passed to a function that processes patterns in the other  pattern is passed to a function that processes patterns in the other
302  mode, for example, if a pattern compiled with \fBpcre_compile()\fP is passed to  mode, for example, if a pattern compiled with \fBpcre_compile()\fP is passed to
303  \fBpcre16_exec()\fP.  \fBpcre16_exec()\fP.
304  .P  .P
305  There are new error codes whose names begin with PCRE_UTF16_ERR for invalid  There are new error codes whose names begin with PCRE_UTF16_ERR for invalid
306  UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings that  UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings that
307  are described in the section entitled  are described in the section entitled
308  .\" HTML <a href="pcreapi.html#badutf8reasons">  .\" HTML <a href="pcreapi.html#badutf8reasons">
309  .\" </a>  .\" </a>
310  "Reason codes for invalid UTF-8 strings"  "Reason codes for invalid UTF-8 strings"
311  .\"  .\"
312  in the main  in the main
313  .\" HREF  .\" HREF
314  \fBpcreapi\fP  \fBpcreapi\fP
315  .\"  .\"
# Line 324  page. The UTF-16 errors are: Line 324  page. The UTF-16 errors are:
324  .SH "ERROR TEXTS"  .SH "ERROR TEXTS"
325  .rs  .rs
326  .sp  .sp
327  If there is an error while compiling a pattern, the error text that is passed  If there is an error while compiling a pattern, the error text that is passed
328  back by \fBpcre16_compile()\fP or \fBpcre16_compile2()\fP is still an 8-bit  back by \fBpcre16_compile()\fP or \fBpcre16_compile2()\fP is still an 8-bit
329  character string, zero-terminated.  character string, zero-terminated.
330  .  .
331  .  .
# Line 339  a callout function point to 16-bit vecto Line 339  a callout function point to 16-bit vecto
339  .SH "TESTING"  .SH "TESTING"
340  .rs  .rs
341  .sp  .sp
342  The \fBpcretest\fP program continues to operate with 8-bit input and output  The \fBpcretest\fP program continues to operate with 8-bit input and output
343  files, but it can be used for testing the 16-bit library. If it is run with the  files, but it can be used for testing the 16-bit library. If it is run with the
344  command line option \fB-16\fP, patterns and subject strings are converted from  command line option \fB-16\fP, patterns and subject strings are converted from
345  8-bit to 16-bit before being passed to PCRE, and the 16-bit library functions  8-bit to 16-bit before being passed to PCRE, and the 16-bit library functions
346  are used instead of the 8-bit ones. Returned 16-bit strings are converted to  are used instead of the 8-bit ones. Returned 16-bit strings are converted to
347  8-bit for output. If the 8-bit library was not compiled, \fBpcretest\fP  8-bit for output. If the 8-bit library was not compiled, \fBpcretest\fP
348  defaults to 16-bit and the \fB-16\fP option is ignored.  defaults to 16-bit and the \fB-16\fP option is ignored.
349  .P  .P
350  When PCRE is being built, the \fBRunTest\fP script that is called by "make  When PCRE is being built, the \fBRunTest\fP script that is called by "make
351  check" uses the \fBpcretest\fP \fB-C\fP option to discover which of the 8-bit  check" uses the \fBpcretest\fP \fB-C\fP option to discover which of the 8-bit
352  and 16-bit libraries has been built, and runs the tests appropriately.  and 16-bit libraries has been built, and runs the tests appropriately.
353  .  .
# Line 355  and 16-bit libraries has been built, and Line 355  and 16-bit libraries has been built, and
355  .SH "NOT SUPPORTED IN 16-BIT MODE"  .SH "NOT SUPPORTED IN 16-BIT MODE"
356  .rs  .rs
357  .sp  .sp
358  Not all the features of the 8-bit library are available with the 16-bit  Not all the features of the 8-bit library are available with the 16-bit
359  library. The C++ and POSIX wrapper functions support only the 8-bit library,  library. The C++ and POSIX wrapper functions support only the 8-bit library,
360  and the \fBpcregrep\fP program is at present 8-bit only.  and the \fBpcregrep\fP program is at present 8-bit only.
361  .  .
362  .  .
