Contents of /code/trunk/doc/pcre.txt

Revision 1320 - (show annotations)
Wed May 1 16:39:35 2013 UTC (6 years, 3 months ago) by ph10
File MIME type: text/plain
File size: 488829 byte(s)
Source tidies (trails spaces, html updates) for 8.33-RC1.
1 -----------------------------------------------------------------------------
2 This file contains a concatenation of the PCRE man pages, converted to plain
3 text format for ease of searching with a text editor, or for use on systems
4 that do not have a man page processor. The small individual files that give
5 synopses of each function in the library have not been included. Neither has
6 the pcredemo program. There are separate text files for the pcregrep and
7 pcretest commands.
8 -----------------------------------------------------------------------------
9
10
11 PCRE(3) Library Functions Manual PCRE(3)
12
13
14
15 NAME
16 PCRE - Perl-compatible regular expressions
17
18 INTRODUCTION
19
20 The PCRE library is a set of functions that implement regular expres-
21 sion pattern matching using the same syntax and semantics as Perl, with
22 just a few differences. Some features that appeared in Python and PCRE
23 before they appeared in Perl are also available using the Python syn-
24 tax, there is some support for one or two .NET and Oniguruma syntax
25 items, and there is an option for requesting some minor changes that
26 give better JavaScript compatibility.
27
28 Starting with release 8.30, it is possible to compile two separate PCRE
29 libraries: the original, which supports 8-bit character strings
30 (including UTF-8 strings), and a second library that supports 16-bit
31 character strings (including UTF-16 strings). The build process allows
32 either one or both to be built. The majority of the work to make this
33 possible was done by Zoltan Herczeg.
34
35 Starting with release 8.32 it is possible to compile a third separate
36 PCRE library, which supports 32-bit character strings (including UTF-32
37 strings). Any combination of the 8-, 16- and 32-bit libraries can be
38 built. The work to make this possible was done by Christian Persch.
39
40 The three libraries contain identical sets of functions, except that
41 the names in the 16-bit library start with pcre16_ instead of pcre_,
42 and the names in the 32-bit library start with pcre32_ instead of
43 pcre_. To avoid over-complication and reduce the documentation mainte-
44 nance load, most of the documentation describes the 8-bit library, with
45 the differences for the 16-bit and 32-bit libraries described sepa-
46 rately in the pcre16 and pcre32 pages. References to functions or
47 structures of the form pcre[16|32]_xxx should be read as meaning
48 "pcre_xxx when using the 8-bit library, pcre16_xxx when using the
49 16-bit library, or pcre32_xxx when using the 32-bit library".
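
The naming rule can be illustrated by a small helper that expands the pcre[16|32]_xxx shorthand for a given code unit width. This is a minimal sketch: the helper expand_pcre_name() is hypothetical and is not part of PCRE. It only builds the function name as text.

```c
#include <stdio.h>

/* Hypothetical helper, not part of PCRE: writes the library-specific name
   for the generic function "xxx" into buf. The width argument is the code
   unit width (8, 16 or 32). Returns 0 on success, or -1 for an unknown
   width or a buffer that is too small. */
int expand_pcre_name(int width, const char *xxx, char *buf, size_t size)
{
  const char *prefix;
  int n;

  switch (width)
    {
    case 8:  prefix = "pcre_";   break;
    case 16: prefix = "pcre16_"; break;
    case 32: prefix = "pcre32_"; break;
    default: return -1;
    }

  n = snprintf(buf, size, "%s%s", prefix, xxx);
  return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For example, a width of 16 and the name "exec" give "pcre16_exec".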
50
51 The current implementation of PCRE corresponds approximately with Perl
52 5.12, including support for UTF-8/16/32 encoded strings and Unicode
53 general category properties. However, UTF-8/16/32 and Unicode support
54 has to be explicitly enabled; it is not the default. The Unicode tables
55 correspond to Unicode release 6.2.0.
56
57 In addition to the Perl-compatible matching function, PCRE contains an
58 alternative function that matches the same compiled patterns in a dif-
59 ferent way. In certain circumstances, the alternative function has some
60 advantages. For a discussion of the two matching algorithms, see the
61 pcrematching page.
62
63 PCRE is written in C and released as a C library. A number of people
64 have written wrappers and interfaces of various kinds. In particular,
65 Google Inc. have provided a comprehensive C++ wrapper for the 8-bit
66 library. This is now included as part of the PCRE distribution. The
67 pcrecpp page has details of this interface. Other people's contribu-
68 tions can be found in the Contrib directory at the primary FTP site,
69 which is:
70
71 ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
72
73 Details of exactly which Perl regular expression features are and are
74 not supported by PCRE are given in separate documents. See the pcrepat-
75 tern and pcrecompat pages. There is a syntax summary in the pcresyntax
76 page.
77
78 Some features of PCRE can be included, excluded, or changed when the
79 library is built. The pcre_config() function makes it possible for a
80 client to discover which features are available. The features them-
81 selves are described in the pcrebuild page. Documentation about build-
82 ing PCRE for various operating systems can be found in the README and
83 NON-AUTOTOOLS_BUILD files in the source distribution.
84
85 The libraries contain a number of undocumented internal functions and
86 data tables that are used by more than one of the exported external
87 functions, but which are not intended for use by external callers.
88 Their names all begin with "_pcre_" or "_pcre16_" or "_pcre32_", which
89 hopefully will not provoke any name clashes. In some environments, it
90 is possible to control which external symbols are exported when a
91 shared library is built, and in these cases the undocumented symbols
92 are not exported.
93
94
95 SECURITY CONSIDERATIONS
96
97 If you are using PCRE in a non-UTF application that permits users to
98 supply arbitrary patterns for compilation, you should be aware of a
99 feature that allows users to turn on UTF support from within a pattern,
100 provided that PCRE was built with UTF support. For example, an 8-bit
101 pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
102 which interprets patterns and subjects as strings of UTF-8 characters
103 instead of individual 8-bit characters. This causes both the pattern
104 and any data against which it is matched to be checked for UTF-8
105 validity. If the data string is very long, such a check might use
106 enough resources to noticeably degrade the performance of your
107 application.
108
109 One way of guarding against this possibility is to use the
110 pcre_fullinfo() function to check the compiled pattern's options for
111 UTF. Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF
112 option at compile time. This causes a compile-time error if a pattern
113 contains a UTF-setting sequence.
114
115 If your application is one that supports UTF, be aware that validity
116 checking can take time. If the same data string is to be matched many
117 times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
118 and subsequent matches to save redundant checks.
119
120 Another way that performance can be hit is by running a pattern that
121 has a very large search tree against a string that will never match.
122 Nested unlimited repeats in a pattern are a common example. PCRE pro-
123 vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
124 ture in the pcreapi page.
125
126
127 USER DOCUMENTATION
128
129 The user documentation for PCRE comprises a number of different sec-
130 tions. In the "man" format, each of these is a separate "man page". In
131 the HTML format, each is a separate page, linked from the index page.
132 In the plain text format, all the sections, except the pcredemo sec-
133 tion, are concatenated, for ease of searching. The sections are as fol-
134 lows:
135
136 pcre              this document
137 pcre16            details of the 16-bit library
138 pcre32            details of the 32-bit library
139 pcre-config       show PCRE installation configuration information
140 pcreapi           details of PCRE's native C API
141 pcrebuild         options for building PCRE
142 pcrecallout       details of the callout feature
143 pcrecompat        discussion of Perl compatibility
144 pcrecpp           details of the C++ wrapper for the 8-bit library
145 pcredemo          a demonstration C program that uses PCRE
146 pcregrep          description of the pcregrep command (8-bit only)
147 pcrejit           discussion of the just-in-time optimization support
148 pcrelimits        details of size and other limits
149 pcrematching      discussion of the two matching algorithms
150 pcrepartial       details of the partial matching facility
151 pcrepattern       syntax and semantics of supported
152                   regular expressions
153 pcreperform       discussion of performance issues
154 pcreposix         the POSIX-compatible C API for the 8-bit library
155 pcreprecompile    details of saving and re-using precompiled patterns
156 pcresample        discussion of the pcredemo program
157 pcrestack         discussion of stack usage
158 pcresyntax        quick syntax reference
159 pcretest          description of the pcretest testing command
160 pcreunicode       discussion of Unicode and UTF-8/16/32 support
161
162 In addition, in the "man" and HTML formats, there is a short page for
163 each C library function, listing its arguments and results.
164
165
166 AUTHOR
167
168 Philip Hazel
169 University Computing Service
170 Cambridge CB2 3QH, England.
171
172 Putting an actual email address here seems to have been a spam magnet,
173 so I've taken it away. If you want to email me, use my two initials,
174 followed by the two digits 10, at the domain cam.ac.uk.
175
176
177 REVISION
178
179 Last updated: 26 April 2013
180 Copyright (c) 1997-2013 University of Cambridge.
181 ------------------------------------------------------------------------------
182
183
184 PCRE(3) Library Functions Manual PCRE(3)
185
186
187
188 NAME
189 PCRE - Perl-compatible regular expressions
190
191 #include <pcre.h>
192
193
194 PCRE 16-BIT API BASIC FUNCTIONS
195
196 pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
197 const char **errptr, int *erroffset,
198 const unsigned char *tableptr);
199
200 pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
201 int *errorcodeptr,
202 const char **errptr, int *erroffset,
203 const unsigned char *tableptr);
204
205 pcre16_extra *pcre16_study(const pcre16 *code, int options,
206 const char **errptr);
207
208 void pcre16_free_study(pcre16_extra *extra);
209
210 int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
211 PCRE_SPTR16 subject, int length, int startoffset,
212 int options, int *ovector, int ovecsize);
213
214 int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
215 PCRE_SPTR16 subject, int length, int startoffset,
216 int options, int *ovector, int ovecsize,
217 int *workspace, int wscount);
218
219
220 PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
221
222 int pcre16_copy_named_substring(const pcre16 *code,
223 PCRE_SPTR16 subject, int *ovector,
224 int stringcount, PCRE_SPTR16 stringname,
225 PCRE_UCHAR16 *buffer, int buffersize);
226
227 int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
228 int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
229 int buffersize);
230
231 int pcre16_get_named_substring(const pcre16 *code,
232 PCRE_SPTR16 subject, int *ovector,
233 int stringcount, PCRE_SPTR16 stringname,
234 PCRE_SPTR16 *stringptr);
235
236 int pcre16_get_stringnumber(const pcre16 *code,
237 PCRE_SPTR16 name);
238
239 int pcre16_get_stringtable_entries(const pcre16 *code,
240 PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
241
242 int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
243 int stringcount, int stringnumber,
244 PCRE_SPTR16 *stringptr);
245
246 int pcre16_get_substring_list(PCRE_SPTR16 subject,
247 int *ovector, int stringcount, PCRE_SPTR16 **listptr);
248
249 void pcre16_free_substring(PCRE_SPTR16 stringptr);
250
251 void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
252
253
254 PCRE 16-BIT API AUXILIARY FUNCTIONS
255
256 pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
257
258 void pcre16_jit_stack_free(pcre16_jit_stack *stack);
259
260 void pcre16_assign_jit_stack(pcre16_extra *extra,
261 pcre16_jit_callback callback, void *data);
262
263 const unsigned char *pcre16_maketables(void);
264
265 int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
266 int what, void *where);
267
268 int pcre16_refcount(pcre16 *code, int adjust);
269
270 int pcre16_config(int what, void *where);
271
272 const char *pcre16_version(void);
273
274 int pcre16_pattern_to_host_byte_order(pcre16 *code,
275 pcre16_extra *extra, const unsigned char *tables);
276
277
278 PCRE 16-BIT API INDIRECTED FUNCTIONS
279
280 void *(*pcre16_malloc)(size_t);
281
282 void (*pcre16_free)(void *);
283
284 void *(*pcre16_stack_malloc)(size_t);
285
286 void (*pcre16_stack_free)(void *);
287
288 int (*pcre16_callout)(pcre16_callout_block *);
289
290
291 PCRE 16-BIT API 16-BIT-ONLY FUNCTION
292
293 int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
294 PCRE_SPTR16 input, int length, int *byte_order,
295 int keep_boms);
296
297
298 THE PCRE 16-BIT LIBRARY
299
300 Starting with release 8.30, it is possible to compile a PCRE library
301 that supports 16-bit character strings, including UTF-16 strings, as
302 well as or instead of the original 8-bit library. The majority of the
303 work to make this possible was done by Zoltan Herczeg. The two
304 libraries contain identical sets of functions, used in exactly the same
305 way. Only the names of the functions and the data types of their argu-
306 ments and results are different. To avoid over-complication and reduce
307 the documentation maintenance load, most of the PCRE documentation
308 describes the 8-bit library, with only occasional references to the
309 16-bit library. This page describes what is different when you use the
310 16-bit library.
311
312 WARNING: A single application can be linked with both libraries, but
313 you must take care when processing any particular pattern to use func-
314 tions from just one library. For example, if you want to study a pat-
315 tern that was compiled with pcre16_compile(), you must do so with
316 pcre16_study(), not pcre_study(), and you must free the study data with
317 pcre16_free_study().
318
319
320 THE HEADER FILE
321
322 There is only one header file, pcre.h. It contains prototypes for all
323 the functions in all libraries, as well as definitions of flags, struc-
324 tures, error codes, etc.
325
326
327 THE LIBRARY NAME
328
329 In Unix-like systems, the 16-bit library is called libpcre16, and can
330 normally be accessed by adding -lpcre16 to the command for linking an
331 application that uses PCRE.
332
333
334 STRING TYPES
335
336 In the 8-bit library, strings are passed to PCRE library functions as
337 vectors of bytes with the C type "char *". In the 16-bit library,
338 strings are passed as vectors of unsigned 16-bit quantities. The macro
339 PCRE_UCHAR16 specifies an appropriate data type, and PCRE_SPTR16 is
340 defined as "const PCRE_UCHAR16 *". In very many environments, "short
341 int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
342 as "unsigned short int", but checks that it really is a 16-bit data
343 type. If it is not, the build fails with an error message telling the
344 maintainer to modify the definition appropriately.
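
An application that defines its own 16-bit string type can apply the same kind of width check. The following is a minimal sketch: the names my_uchar16 and my_uchar16_bits() are hypothetical, and the way PCRE's own build performs its check may differ.

```c
#include <limits.h>

/* Hypothetical stand-in for PCRE_UCHAR16. */
typedef unsigned short my_uchar16;

/* Compile-time check: the array size becomes -1, and compilation fails,
   if the type is not exactly 16 bits wide. */
typedef char my_uchar16_must_be_16_bits[
  (sizeof(my_uchar16) * CHAR_BIT == 16) ? 1 : -1];

/* Width of the stand-in type in bits, for run-time reporting. */
int my_uchar16_bits(void)
{
  return (int)(sizeof(my_uchar16) * CHAR_BIT);
}
```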
345
346
347 STRUCTURE TYPES
348
349 The types of the opaque structures that are used for compiled 16-bit
350 patterns and JIT stacks are pcre16 and pcre16_jit_stack respectively.
351 The type of the user-accessible structure that is returned by
352 pcre16_study() is pcre16_extra, and the type of the structure that is
353 used for passing data to a callout function is pcre16_callout_block.
354 These structures contain the same fields, with the same names, as their
355 8-bit counterparts. The only difference is that pointers to character
356 strings are 16-bit instead of 8-bit types.
357
358
359 16-BIT FUNCTIONS
360
361 For every function in the 8-bit library there is a corresponding func-
362 tion in the 16-bit library with a name that starts with pcre16_ instead
363 of pcre_. The prototypes are listed above. In addition, there is one
364 extra function, pcre16_utf16_to_host_byte_order(). This is a utility
365 function that converts a UTF-16 character string to host byte order if
366 necessary. The other 16-bit functions expect the strings they are
367 passed to be in host byte order.
368
369 The input and output arguments of pcre16_utf16_to_host_byte_order() may
370 point to the same address, that is, conversion in place is supported.
371 The output buffer must be at least as long as the input.
372
373 The length argument specifies the number of 16-bit data units in the
374 input string; a negative value specifies a zero-terminated string.
375
376 If byte_order is NULL, it is assumed that the string starts off in host
377 byte order. This may be changed by byte-order marks (BOMs) anywhere in
378 the string (commonly as the first character).
379
380 If byte_order is not NULL, a non-zero value of the integer to which it
381 points means that the input starts off in host byte order, otherwise
382 the opposite order is assumed. Again, BOMs in the string can change
383 this. The final byte order is passed back at the end of processing.
384
385 If keep_boms is not zero, byte-order mark characters (0xfeff) are
386 copied into the output string. Otherwise they are discarded.
387
388 The result of the function is the number of 16-bit units placed into
389 the output buffer, including the zero terminator if the string was
390 zero-terminated.
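
The behaviour described above can be sketched in portable C. This is an illustration written from the description in this section, not the PCRE source: the function name sketch_utf16_to_host() and the type u16 are hypothetical, and u16 is assumed to be exactly 16 bits wide.

```c
#include <stddef.h>

typedef unsigned short u16;   /* assumed to be 16 bits wide */

/* Sketch of the documented semantics: copies input to output in host byte
   order, follows BOMs, and returns the number of units written. */
int sketch_utf16_to_host(u16 *output, const u16 *input, int length,
  int *byte_order, int keep_boms)
{
  /* With no byte_order argument the input is assumed to start off in host
     order; otherwise non-zero means host order. */
  int host_bo = (byte_order != NULL) ? (*byte_order != 0) : 1;
  int written = 0;
  int i;

  if (length < 0)
    {
    /* Zero-terminated input: count the units, including the terminator. */
    length = 0;
    while (input[length] != 0) length++;
    length++;
    }

  for (i = 0; i < length; i++)
    {
    u16 c = input[i];
    if (c == 0xfeff || c == 0xfffe)
      {
      /* A BOM read as 0xfeff is in host order; 0xfffe is byte-swapped. */
      host_bo = (c == 0xfeff);
      if (keep_boms != 0) output[written++] = 0xfeff;
      }
    else
      output[written++] = host_bo ? c : (u16)((c >> 8) | (c << 8));
    }

  /* Pass the final byte order back to the caller. */
  if (byte_order != NULL) *byte_order = host_bo;
  return written;
}
```

Because each unit is read before anything is written at or beyond its position, the sketch also works when output and input point to the same buffer.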
391
392
393 SUBJECT STRING OFFSETS
394
395 The offsets within subject strings that are returned by the matching
396 functions are in 16-bit units rather than bytes.
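
Code that needs byte positions, for example to copy the matched text with memcpy(), has to scale the returned offsets by the size of a 16-bit unit. A minimal sketch follows; the helper match_span_in_bytes() and the type u16 are hypothetical, and the first two ovector entries are taken to be the start and end offsets of the whole match.

```c
#include <stddef.h>

typedef unsigned short u16;   /* stand-in for PCRE_UCHAR16, assumed 16 bits */

/* Given the first two ovector entries from a successful 16-bit match
   (start and end offsets in 16-bit units), compute the byte offset and
   the byte length of the matched text within the subject buffer. */
void match_span_in_bytes(const int *ovector, size_t *byte_start,
  size_t *byte_length)
{
  *byte_start = (size_t)ovector[0] * sizeof(u16);
  *byte_length = (size_t)(ovector[1] - ovector[0]) * sizeof(u16);
}
```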
397
398
399 NAMED SUBPATTERNS
400
401 The name-to-number translation table that is maintained for named sub-
402 patterns uses 16-bit characters. The pcre16_get_stringtable_entries()
403 function returns the length of each entry in the table as the number of
404 16-bit data units.
405
406
407 OPTION NAMES
408
409 There are two new general option names, PCRE_UTF16 and
410 PCRE_NO_UTF16_CHECK, which correspond to PCRE_UTF8 and
411 PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
412 define the same bits in the options word. There is a discussion about
413 the validity of UTF-16 strings in the pcreunicode page.
414
415 For the pcre16_config() function there is an option PCRE_CONFIG_UTF16
416 that returns 1 if UTF-16 support is configured, otherwise 0. If this
417 option is given to pcre_config() or pcre32_config(), or if the
418 PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF32 option is given to pcre16_con-
419 fig(), the result is the PCRE_ERROR_BADOPTION error.
420
421
422 CHARACTER CODES
423
424 In 16-bit mode, when PCRE_UTF16 is not set, character values are
425 treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
426 that they can range from 0 to 0xffff instead of 0 to 0xff. Character
427 types for characters less than 0xff can therefore be influenced by the
428 locale in the same way as before. Characters greater than 0xff have
429 only one case, and no "type" (such as letter or digit).
430
431 In UTF-16 mode, the character code is Unicode, in the range 0 to
432 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
433 because those are "surrogate" values that are used in pairs to encode
434 values greater than 0xffff.
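
The surrogate mechanism can be illustrated with a short sketch of the standard UTF-16 arithmetic. The function names utf16_encode() and utf16_decode_pair() are hypothetical and are not PCRE functions.

```c
typedef unsigned short u16;   /* assumed to be 16 bits wide */
typedef unsigned long  u32;   /* at least 32 bits wide */

/* Encodes one code point as UTF-16. Returns the number of 16-bit units
   written to out (1 or 2), or 0 if cp is a surrogate value or is greater
   than 0x10ffff. */
int utf16_encode(u32 cp, u16 out[2])
{
  if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return 0;
  if (cp <= 0xffff)
    {
    out[0] = (u16)cp;
    return 1;
    }
  cp -= 0x10000;                            /* leaves a 20-bit value */
  out[0] = (u16)(0xd800 + (cp >> 10));      /* high surrogate: top 10 bits */
  out[1] = (u16)(0xdc00 + (cp & 0x3ff));    /* low surrogate: low 10 bits */
  return 2;
}

/* Combines a high and a low surrogate into the code point they encode.
   The caller must have checked that both values are in range. */
u32 utf16_decode_pair(u16 high, u16 low)
{
  return 0x10000 + (((u32)(high - 0xd800)) << 10) + (u32)(low - 0xdc00);
}
```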
435
436 A UTF-16 string can indicate its endianness by a special code known as a
437 byte-order mark (BOM). The PCRE functions do not handle this, expecting
438 strings to be in host byte order. A utility function called
439 pcre16_utf16_to_host_byte_order() is provided to help with this (see
440 above).
441
442
443 ERROR NAMES
444
445 The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
446 spond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is
447 given when a compiled pattern is passed to a function that processes
448 patterns in the other mode, for example, if a pattern compiled with
449 pcre_compile() is passed to pcre16_exec().
450
451 There are new error codes whose names begin with PCRE_UTF16_ERR for
452 invalid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for
453 UTF-8 strings that are described in the section entitled "Reason codes
454 for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
455 are:
456
457 PCRE_UTF16_ERR1 Missing low surrogate at end of string
458 PCRE_UTF16_ERR2 Invalid low surrogate follows high surrogate
459 PCRE_UTF16_ERR3 Isolated low surrogate
460 PCRE_UTF16_ERR4 Non-character
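
The three surrogate conditions in this list can be detected with a single pass over the string. The sketch below is an illustration, not the check that PCRE performs: the function sketch_check_utf16() and its SKETCH_ constants are hypothetical, their values simply follow the order of the list above, and the non-character test is omitted.

```c
#include <stddef.h>

typedef unsigned short u16;   /* assumed to be 16 bits wide */

enum
  {
  SKETCH_OK = 0,
  SKETCH_MISSING_LOW = 1,     /* high surrogate at end of string */
  SKETCH_BAD_LOW = 2,         /* high surrogate not followed by a low one */
  SKETCH_ISOLATED_LOW = 3     /* low surrogate with no preceding high one */
  };

/* Scans length 16-bit units. On failure, stores the offset (in 16-bit
   units) of the offending surrogate in *erroroffset and returns the
   reason; on success returns SKETCH_OK and leaves *erroroffset alone. */
int sketch_check_utf16(const u16 *s, size_t length, size_t *erroroffset)
{
  size_t i;

  for (i = 0; i < length; i++)
    {
    u16 c = s[i];
    int reason = SKETCH_OK;

    if (c < 0xd800 || c > 0xdfff) continue;       /* not a surrogate */

    if (c >= 0xdc00) reason = SKETCH_ISOLATED_LOW;
    else if (i + 1 >= length) reason = SKETCH_MISSING_LOW;
    else if (s[i + 1] < 0xdc00 || s[i + 1] > 0xdfff) reason = SKETCH_BAD_LOW;

    if (reason != SKETCH_OK)
      {
      *erroroffset = i;
      return reason;
      }
    i++;                                          /* skip the low surrogate */
    }

  return SKETCH_OK;
}
```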
461
462
463 ERROR TEXTS
464
465 If there is an error while compiling a pattern, the error text that is
466 passed back by pcre16_compile() or pcre16_compile2() is still an 8-bit
467 character string, zero-terminated.
468
469
470 CALLOUTS
471
472 The subject and mark fields in the callout block that is passed to a
473 callout function point to 16-bit vectors.
474
475
476 TESTING
477
478 The pcretest program continues to operate with 8-bit input and output
479 files, but it can be used for testing the 16-bit library. If it is run
480 with the command line option -16, patterns and subject strings are con-
481 verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
482 library functions are used instead of the 8-bit ones. Returned 16-bit
483 strings are converted to 8-bit for output. If neither the 8-bit nor the
484 32-bit library was compiled, pcretest defaults to 16-bit and the
485 -16 option is ignored.
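
An application that wants to pass plain 8-bit literals to the 16-bit library needs a similar widening step. The sketch below covers only the non-UTF case, where each byte becomes one 16-bit unit. The helper widen_to_16() is hypothetical, and how pcretest performs its own conversion is not described here, so the sketch should not be read as its implementation.

```c
#include <stddef.h>

typedef unsigned short u16;   /* stand-in for PCRE_UCHAR16, assumed 16 bits */

/* Copies a zero-terminated 8-bit string into a 16-bit buffer, one unit per
   byte, and adds a zero terminator. dstsize is the buffer size in 16-bit
   units. Returns the number of units written, excluding the terminator,
   or -1 if the buffer is too small. */
int widen_to_16(const char *src, u16 *dst, size_t dstsize)
{
  size_t i;

  for (i = 0; src[i] != 0; i++)
    {
    if (i + 1 >= dstsize) return -1;        /* keep room for the terminator */
    dst[i] = (u16)(unsigned char)src[i];    /* avoid sign extension */
    }

  if (dstsize == 0) return -1;
  dst[i] = 0;
  return (int)i;
}
```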
486
487 When PCRE is being built, the RunTest script that is called by "make
488 check" uses the pcretest -C option to discover which of the 8-bit,
489 16-bit and 32-bit libraries has been built, and runs the tests appro-
490 priately.
491
492
493 NOT SUPPORTED IN 16-BIT MODE
494
495 Not all the features of the 8-bit library are available with the 16-bit
496 library. The C++ and POSIX wrapper functions support only the 8-bit
497 library, and the pcregrep program is at present 8-bit only.
498
499
500 AUTHOR
501
502 Philip Hazel
503 University Computing Service
504 Cambridge CB2 3QH, England.
505
506
507 REVISION
508
509 Last updated: 08 November 2012
510 Copyright (c) 1997-2012 University of Cambridge.
511 ------------------------------------------------------------------------------
512
513
514 PCRE(3) Library Functions Manual PCRE(3)
515
516
517
518 NAME
519 PCRE - Perl-compatible regular expressions
520
521 #include <pcre.h>
522
523
524 PCRE 32-BIT API BASIC FUNCTIONS
525
526 pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
527 const char **errptr, int *erroffset,
528 const unsigned char *tableptr);
529
530 pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
531 int *errorcodeptr,
532 const char **errptr, int *erroffset,
533 const unsigned char *tableptr);
534
535 pcre32_extra *pcre32_study(const pcre32 *code, int options,
536 const char **errptr);
537
538 void pcre32_free_study(pcre32_extra *extra);
539
540 int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
541 PCRE_SPTR32 subject, int length, int startoffset,
542 int options, int *ovector, int ovecsize);
543
544 int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
545 PCRE_SPTR32 subject, int length, int startoffset,
546 int options, int *ovector, int ovecsize,
547 int *workspace, int wscount);
548
549
550 PCRE 32-BIT API STRING EXTRACTION FUNCTIONS
551
552 int pcre32_copy_named_substring(const pcre32 *code,
553 PCRE_SPTR32 subject, int *ovector,
554 int stringcount, PCRE_SPTR32 stringname,
555 PCRE_UCHAR32 *buffer, int buffersize);
556
557 int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
558 int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
559 int buffersize);
560
561 int pcre32_get_named_substring(const pcre32 *code,
562 PCRE_SPTR32 subject, int *ovector,
563 int stringcount, PCRE_SPTR32 stringname,
564 PCRE_SPTR32 *stringptr);
565
566 int pcre32_get_stringnumber(const pcre32 *code,
567 PCRE_SPTR32 name);
568
569 int pcre32_get_stringtable_entries(const pcre32 *code,
570 PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
571
572 int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
573 int stringcount, int stringnumber,
574 PCRE_SPTR32 *stringptr);
575
576 int pcre32_get_substring_list(PCRE_SPTR32 subject,
577 int *ovector, int stringcount, PCRE_SPTR32 **listptr);
578
579 void pcre32_free_substring(PCRE_SPTR32 stringptr);
580
581 void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);
582
583
584 PCRE 32-BIT API AUXILIARY FUNCTIONS
585
586 pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);
587
588 void pcre32_jit_stack_free(pcre32_jit_stack *stack);
589
590 void pcre32_assign_jit_stack(pcre32_extra *extra,
591 pcre32_jit_callback callback, void *data);
592
593 const unsigned char *pcre32_maketables(void);
594
595 int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
596 int what, void *where);
597
598 int pcre32_refcount(pcre32 *code, int adjust);
599
600 int pcre32_config(int what, void *where);
601
602 const char *pcre32_version(void);
603
604 int pcre32_pattern_to_host_byte_order(pcre32 *code,
605 pcre32_extra *extra, const unsigned char *tables);
606
607
608 PCRE 32-BIT API INDIRECTED FUNCTIONS
609
610 void *(*pcre32_malloc)(size_t);
611
612 void (*pcre32_free)(void *);
613
614 void *(*pcre32_stack_malloc)(size_t);
615
616 void (*pcre32_stack_free)(void *);
617
618 int (*pcre32_callout)(pcre32_callout_block *);
619
620
621 PCRE 32-BIT API 32-BIT-ONLY FUNCTION
622
623 int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
624 PCRE_SPTR32 input, int length, int *byte_order,
625 int keep_boms);
626
627
628 THE PCRE 32-BIT LIBRARY
629
630 Starting with release 8.32, it is possible to compile a PCRE library
631 that supports 32-bit character strings, including UTF-32 strings, as
632 well as or instead of the original 8-bit library. This work was done by
633 Christian Persch, based on the work done by Zoltan Herczeg for the
634 16-bit library. All three libraries contain identical sets of func-
635 tions, used in exactly the same way. Only the names of the functions
636 and the data types of their arguments and results are different. To
637 avoid over-complication and reduce the documentation maintenance load,
638 most of the PCRE documentation describes the 8-bit library, with only
639 occasional references to the 16-bit and 32-bit libraries. This page
640 describes what is different when you use the 32-bit library.
641
642 WARNING: A single application can be linked with all or any of the
643 three libraries, but you must take care when processing any particular
644 pattern to use functions from just one library. For example, if you
645 want to study a pattern that was compiled with pcre32_compile(), you
646 must do so with pcre32_study(), not pcre_study(), and you must free the
647 study data with pcre32_free_study().
648
649
650 THE HEADER FILE
651
652 There is only one header file, pcre.h. It contains prototypes for all
653 the functions in all libraries, as well as definitions of flags, struc-
654 tures, error codes, etc.
655
656
657 THE LIBRARY NAME
658
659 In Unix-like systems, the 32-bit library is called libpcre32, and can
660 normally be accessed by adding -lpcre32 to the command for linking an
661 application that uses PCRE.
662
663
664 STRING TYPES
665
666 In the 8-bit library, strings are passed to PCRE library functions as
667 vectors of bytes with the C type "char *". In the 32-bit library,
668 strings are passed as vectors of unsigned 32-bit quantities. The macro
669 PCRE_UCHAR32 specifies an appropriate data type, and PCRE_SPTR32 is
670 defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
671 int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
672 as "unsigned int", but checks that it really is a 32-bit data type. If
673 it is not, the build fails with an error message telling the maintainer
674 to modify the definition appropriately.
675
676
677 STRUCTURE TYPES
678
679 The types of the opaque structures that are used for compiled 32-bit
680 patterns and JIT stacks are pcre32 and pcre32_jit_stack respectively.
681 The type of the user-accessible structure that is returned by
682 pcre32_study() is pcre32_extra, and the type of the structure that is
683 used for passing data to a callout function is pcre32_callout_block.
684 These structures contain the same fields, with the same names, as their
685 8-bit counterparts. The only difference is that pointers to character
686 strings are 32-bit instead of 8-bit types.
687
688
689 32-BIT FUNCTIONS
690
691 For every function in the 8-bit library there is a corresponding func-
692 tion in the 32-bit library with a name that starts with pcre32_ instead
693 of pcre_. The prototypes are listed above. In addition, there is one
694 extra function, pcre32_utf32_to_host_byte_order(). This is a utility
695 function that converts a UTF-32 character string to host byte order if
696 necessary. The other 32-bit functions expect the strings they are
697 passed to be in host byte order.
698
699 The input and output arguments of pcre32_utf32_to_host_byte_order() may
700 point to the same address, that is, conversion in place is supported.
701 The output buffer must be at least as long as the input.
702
703 The length argument specifies the number of 32-bit data units in the
704 input string; a negative value specifies a zero-terminated string.
705
706 If byte_order is NULL, it is assumed that the string starts off in host
707 byte order. This may be changed by byte-order marks (BOMs) anywhere in
708 the string (commonly as the first character).
709
710 If byte_order is not NULL, a non-zero value of the integer to which it
711 points means that the input starts off in host byte order, otherwise
712 the opposite order is assumed. Again, BOMs in the string can change
713 this. The final byte order is passed back at the end of processing.
714
715 If keep_boms is not zero, byte-order mark characters (0xfeff) are
716 copied into the output string. Otherwise they are discarded.
717
718 The result of the function is the number of 32-bit units placed into
719 the output buffer, including the zero terminator if the string was
720 zero-terminated.
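
The two building blocks of such a conversion are a 32-bit byte swap and a test for a BOM in either byte order. The following is a minimal sketch, not the PCRE source: the names swap_bytes32() and classify_bom32() are hypothetical, and u32 is assumed to be exactly 32 bits wide.

```c
typedef unsigned int u32;   /* stand-in for PCRE_UCHAR32, assumed 32 bits */

/* Reverses the order of the four bytes in a 32-bit unit. */
u32 swap_bytes32(u32 c)
{
  return ((c & 0x000000ffu) << 24) | ((c & 0x0000ff00u) << 8) |
         ((c & 0x00ff0000u) >> 8)  | ((c & 0xff000000u) >> 24);
}

/* Returns 1 if c is a BOM read in host byte order, -1 if it is a BOM read
   in the opposite byte order, and 0 if it is not a BOM. */
int classify_bom32(u32 c)
{
  if (c == 0x0000feffu) return 1;
  if (c == 0xfffe0000u) return -1;
  return 0;
}
```

A converter in the style described above would copy units unchanged after a host-order BOM and pass them through swap_bytes32() after a swapped one.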
721
722
723 SUBJECT STRING OFFSETS
724
725 The offsets within subject strings that are returned by the matching
726 functions are in 32-bit units rather than bytes.
727
728
729 NAMED SUBPATTERNS
730
731 The name-to-number translation table that is maintained for named sub-
732 patterns uses 32-bit characters. The pcre32_get_stringtable_entries()
733 function returns the length of each entry in the table as the number of
734 32-bit data units.
735
736
737 OPTION NAMES
738
739 There are two new general option names, PCRE_UTF32 and
740 PCRE_NO_UTF32_CHECK, which correspond to PCRE_UTF8 and
741 PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
742 define the same bits in the options word. There is a discussion about
743 the validity of UTF-32 strings in the pcreunicode page.
744
745 For the pcre32_config() function there is an option PCRE_CONFIG_UTF32
746 that returns 1 if UTF-32 support is configured, otherwise 0. If this
747 option is given to pcre_config() or pcre16_config(), or if the
748 PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF16 option is given to pcre32_con-
749 fig(), the result is the PCRE_ERROR_BADOPTION error.
750
751
752 CHARACTER CODES
753
754 In 32-bit mode, when PCRE_UTF32 is not set, character values are
755 treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
756 that they can range from 0 to 0x7fffffff instead of 0 to 0xff. Charac-
757 ter types for characters less than 0xff can therefore be influenced by
758 the locale in the same way as before. Characters greater than 0xff
759 have only one case, and no "type" (such as letter or digit).
760
761 In UTF-32 mode, the character code is Unicode, in the range 0 to
762 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
763 because those are "surrogate" values that are ill-formed in UTF-32.
764
765 A UTF-32 string can indicate its endianness by a special code known as a
766 byte-order mark (BOM). The PCRE functions do not handle this, expecting
767 strings to be in host byte order. A utility function called
768 pcre32_utf32_to_host_byte_order() is provided to help with this (see
769 above).
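
As a rough illustration of what converting to host byte order involves,
the sketch below swaps every 32-bit unit when the buffer starts with a
byte-swapped BOM. The helper names swap32() and to_host_order() are
hypothetical; this is not the PCRE utility function, whose arguments
and behaviour are described in its own documentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the four bytes of one 32-bit data unit. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Hypothetical helper, not the PCRE function: U+FEFF read in the wrong
   byte order appears as 0xFFFE0000. If the first unit has that value,
   convert the whole buffer to host order.
   Returns 1 if the buffer was swapped, 0 if it was left alone. */
static int to_host_order(uint32_t *buf, size_t len)
{
    size_t i;
    if (len == 0 || buf[0] != 0xfffe0000u) return 0;
    for (i = 0; i < len; i++) buf[i] = swap32(buf[i]);
    return 1;
}
```

A buffer that already starts with 0x0000FEFF, or has no BOM at all, is
left untouched by this sketch.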
770
771
772 ERROR NAMES
773
774 The error PCRE_ERROR_BADUTF32 corresponds to its 8-bit counterpart.
775 The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
776 to a function that processes patterns in the other mode, for example,
777 if a pattern compiled with pcre_compile() is passed to pcre32_exec().
778
779 There are new error codes whose names begin with PCRE_UTF32_ERR for
780 invalid UTF-32 strings, corresponding to the PCRE_UTF8_ERR codes for
781 UTF-8 strings that are described in the section entitled "Reason codes
782 for invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
783 are:
784
785 PCRE_UTF32_ERR1 Surrogate character (range from 0xd800 to 0xdfff)
786 PCRE_UTF32_ERR2 Non-character
787 PCRE_UTF32_ERR3 Character > 0x10ffff
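
The three conditions listed above can be sketched as a classifier for
a single code point. The enum names and values are assumptions made
for this example, and the function is not PCRE's own validity check;
the non-character test follows the Unicode definition (U+FDD0 to
U+FDEF, and the last two code points of each plane).

```c
#include <stdint.h>

/* Illustrative reason codes; the values are assumptions, not PCRE's. */
enum { U32_OK = 0, U32_SURROGATE = 1, U32_NONCHAR = 2, U32_TOO_BIG = 3 };

/* Classify one 32-bit data unit as a UTF-32 code point. */
static int classify_utf32(uint32_t c)
{
    if (c > 0x10ffffu) return U32_TOO_BIG;
    if (c >= 0xd800u && c <= 0xdfffu) return U32_SURROGATE;
    if ((c >= 0xfdd0u && c <= 0xfdefu) || (c & 0xfffeu) == 0xfffeu)
        return U32_NONCHAR;
    return U32_OK;
}
```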
788
789
790 ERROR TEXTS
791
792 If there is an error while compiling a pattern, the error text that is
793 passed back by pcre32_compile() or pcre32_compile2() is still an 8-bit
794 character string, zero-terminated.
795
796
797 CALLOUTS
798
799 The subject and mark fields in the callout block that is passed to a
800 callout function point to 32-bit vectors.
801
802
803 TESTING
804
805 The pcretest program continues to operate with 8-bit input and output
806 files, but it can be used for testing the 32-bit library. If it is run
807 with the command line option -32, patterns and subject strings are con-
808 verted from 8-bit to 32-bit before being passed to PCRE, and the 32-bit
809 library functions are used instead of the 8-bit ones. Returned 32-bit
810 strings are converted to 8-bit for output. If neither the 8-bit nor the
811 16-bit library was compiled, pcretest defaults to 32-bit and the
812 -32 option is ignored.
813
814 When PCRE is being built, the RunTest script that is called by "make
815 check" uses the pcretest -C option to discover which of the 8-bit,
816 16-bit and 32-bit libraries has been built, and runs the tests appro-
817 priately.
818
819
820 NOT SUPPORTED IN 32-BIT MODE
821
822 Not all the features of the 8-bit library are available with the 32-bit
823 library. The C++ and POSIX wrapper functions support only the 8-bit
824 library, and the pcregrep program is at present 8-bit only.
825
826
827 AUTHOR
828
829 Philip Hazel
830 University Computing Service
831 Cambridge CB2 3QH, England.
832
833
834 REVISION
835
836 Last updated: 08 November 2012
837 Copyright (c) 1997-2012 University of Cambridge.
838 ------------------------------------------------------------------------------
839
840
841 PCREBUILD(3) Library Functions Manual PCREBUILD(3)
842
843
844
845 NAME
846 PCRE - Perl-compatible regular expressions
847
848 PCRE BUILD-TIME OPTIONS
849
850 This document describes the optional features of PCRE that can be
851 selected when the library is compiled. It assumes use of the configure
852 script, where the optional features are selected or deselected by pro-
853 viding options to configure before running the make command. However,
854 the same options can be selected in both Unix-like and non-Unix-like
855 environments using the GUI facility of cmake-gui if you are using CMake
856 instead of configure to build PCRE.
857
858 There is a lot more information about building PCRE without using con-
859 figure (including information about using CMake or building "by hand")
860 in the file called NON-AUTOTOOLS-BUILD, which is part of the PCRE dis-
861 tribution. You should consult this file as well as the README file if
862 you are building in a non-Unix-like environment.
863
864 The complete list of options for configure (which includes the standard
865 ones such as the selection of the installation directory) can be
866 obtained by running
867
868 ./configure --help
869
870 The following sections include descriptions of options whose names
871 begin with --enable or --disable. These settings specify changes to the
872 defaults for the configure command. Because of the way that configure
873 works, --enable and --disable always come in pairs, so the complemen-
874 tary option always exists as well, but as it specifies the default, it
875 is not described.
876
877
878 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
879
880 By default, a library called libpcre is built, containing functions
881 that take string arguments contained in vectors of bytes, either as
882 single-byte characters, or interpreted as UTF-8 strings. You can also
883 build a separate library, called libpcre16, in which strings are con-
884 tained in vectors of 16-bit data units and interpreted either as sin-
885 gle-unit characters or UTF-16 strings, by adding
886
887 --enable-pcre16
888
889 to the configure command. You can also build a separate library, called
890 libpcre32, in which strings are contained in vectors of 32-bit data
891 units and interpreted either as single-unit characters or UTF-32
892 strings, by adding
893
894 --enable-pcre32
895
896 to the configure command. If you do not want the 8-bit library, add
897
898 --disable-pcre8
899
900 as well. At least one of the three libraries must be built. Note that
901 the C++ and POSIX wrappers are for the 8-bit library only, and that
902 pcregrep is an 8-bit program. None of these are built if you select
903 only the 16-bit or 32-bit libraries.
904
905
906 BUILDING SHARED AND STATIC LIBRARIES
907
908 The PCRE building process uses libtool to build both shared and static
909 Unix libraries by default. You can suppress one of these by adding one
910 of
911
912 --disable-shared
913 --disable-static
914
915 to the configure command, as required.
916
917
918 C++ SUPPORT
919
920 By default, if the 8-bit library is being built, the configure script
921 will search for a C++ compiler and C++ header files. If it finds them,
922 it automatically builds the C++ wrapper library (which supports only
923 8-bit strings). You can disable this by adding
924
925 --disable-cpp
926
927 to the configure command.
928
929
930 UTF-8, UTF-16 AND UTF-32 SUPPORT
931
932 To build PCRE with support for UTF Unicode character strings, add
933
934 --enable-utf
935
936 to the configure command. This setting applies to all three libraries,
937 adding support for UTF-8 to the 8-bit library, support for UTF-16 to
938 the 16-bit library, and support for UTF-32 to the 32-bit
939 library. There are no separate options for enabling UTF-8, UTF-16 and
940 UTF-32 independently because that would allow ridiculous settings such
941 as requesting UTF-16 support while building only the 8-bit library. It
942 is not possible to build one library with UTF support and another with-
943 out in the same configuration. (For backwards compatibility, --enable-
944 utf8 is a synonym of --enable-utf.)
945
946 Of itself, this setting does not make PCRE treat strings as UTF-8,
947 UTF-16 or UTF-32. As well as compiling PCRE with this option, you also
948 have to set the PCRE_UTF8, PCRE_UTF16 or PCRE_UTF32 option (as
949 appropriate) when you call one of the pattern compiling functions.
950
951 If you set --enable-utf when compiling in an EBCDIC environment, PCRE
952 expects its input to be either ASCII or UTF-8 (depending on the run-
953 time option). It is not possible to support both EBCDIC and UTF-8 codes
954 in the same version of the library. Consequently, --enable-utf and
955 --enable-ebcdic are mutually exclusive.
956
957
958 UNICODE CHARACTER PROPERTY SUPPORT
959
960 UTF support allows the libraries to process character codepoints up to
961 0x10ffff in the strings that they handle. On its own, however, it does
962 not provide any facilities for accessing the properties of such charac-
963 ters. If you want to be able to use the pattern escapes \P, \p, and \X,
964 which refer to Unicode character properties, you must add
965
966 --enable-unicode-properties
967
968 to the configure command. This implies UTF support, even if you have
969 not explicitly requested it.
970
971 Including Unicode property support adds around 30K of tables to the
972 PCRE library. Only the general category properties such as Lu and Nd
973 are supported. Details are given in the pcrepattern documentation.
974
975
976 JUST-IN-TIME COMPILER SUPPORT
977
978 Just-in-time compiler support is included in the build by specifying
979
980 --enable-jit
981
982 This support is available only for certain hardware architectures. If
983 this option is set for an unsupported architecture, a compile time
984 error occurs. See the pcrejit documentation for a discussion of JIT
985 usage. When JIT support is enabled, pcregrep automatically makes use of
986 it, unless you add
987
988 --disable-pcregrep-jit
989
990 to the "configure" command.
991
992
993 CODE VALUE OF NEWLINE
994
995 By default, PCRE interprets the linefeed (LF) character as indicating
996 the end of a line. This is the normal newline character on Unix-like
997 systems. You can compile PCRE to use carriage return (CR) instead, by
998 adding
999
1000 --enable-newline-is-cr
1001
1002 to the configure command. There is also a --enable-newline-is-lf
1003 option, which explicitly specifies linefeed as the newline character.
1004
1005 Alternatively, you can specify that line endings are to be indicated by
1006 the two character sequence CRLF. If you want this, add
1007
1008 --enable-newline-is-crlf
1009
1010 to the configure command. There is a fourth option, specified by
1011
1012 --enable-newline-is-anycrlf
1013
1014 which causes PCRE to recognize any of the three sequences CR, LF, or
1015 CRLF as indicating a line ending. Finally, a fifth option, specified by
1016
1017 --enable-newline-is-any
1018
1019 causes PCRE to recognize any Unicode newline sequence.
1020
1021 Whatever line ending convention is selected when PCRE is built can be
1022 overridden when the library functions are called. At build time it is
1023 conventional to use the standard for your operating system.
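
To make the difference between the conventions concrete, here is a
minimal sketch of a line-ending test for four of the five settings.
The enum and the function newline_len() are hypothetical names; the
"any Unicode newline" setting is left out, and the code is not taken
from PCRE.

```c
#include <stddef.h>

enum nl_mode { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Return the length of the newline sequence at s[pos], or 0 if the
   selected convention does not recognize a line ending there. */
static size_t newline_len(const char *s, size_t len, size_t pos,
                          enum nl_mode mode)
{
    int cr, lf, crlf;
    if (pos >= len) return 0;
    cr = (s[pos] == '\r');
    lf = (s[pos] == '\n');
    crlf = cr && pos + 1 < len && s[pos + 1] == '\n';
    switch (mode) {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : (cr || lf) ? 1 : 0;
    }
    return 0;
}
```

Note that under the CRLF convention a lone CR or a lone LF is not a
line ending in this sketch, whereas under ANYCRLF either one is.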
1024
1025
1026 WHAT \R MATCHES
1027
1028 By default, the sequence \R in a pattern matches any Unicode newline
1029 sequence, whatever has been selected as the line ending sequence. If
1030 you specify
1031
1032 --enable-bsr-anycrlf
1033
1034 the default is changed so that \R matches only CR, LF, or CRLF. What-
1035 ever is selected when PCRE is built can be overridden when the library
1036 functions are called.
1037
1038
1039 POSIX MALLOC USAGE
1040
1041 When the 8-bit library is called through the POSIX interface (see the
1042 pcreposix documentation), additional working storage is required for
1043 holding the pointers to capturing substrings, because PCRE requires
1044 three integers per substring, whereas the POSIX interface provides only
1045 two. If the number of expected substrings is small, the wrapper func-
1046 tion uses space on the stack, because this is faster than using mal-
1047 loc() for each call. The default threshold above which the stack is no
1048 longer used is 10; it can be changed by adding a setting such as
1049
1050 --with-posix-malloc-threshold=20
1051
1052 to the configure command.
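
The stack-or-heap decision described above might look roughly like
this. The function with_work_vector() and the macro name are
assumptions for illustration; the real wrapper lives in the POSIX
interface code and does considerably more.

```c
#include <stdlib.h>

#define MALLOC_THRESHOLD 10   /* the documented default */

/* Hypothetical wrapper: obtains a work vector of 3 ints per substring,
   from the stack when nmatch is small and from the heap otherwise.
   Returns 1 if malloc() was used, 0 if not, -1 on allocation failure. */
static int with_work_vector(size_t nmatch)
{
    int stack_vec[MALLOC_THRESHOLD * 3];
    int *ovector = stack_vec;
    int used_heap = 0;

    if (nmatch > MALLOC_THRESHOLD) {
        ovector = malloc(nmatch * 3 * sizeof(int));
        if (ovector == NULL) return -1;
        used_heap = 1;
    }
    if (nmatch > 0) ovector[0] = -1;  /* ... the match would run here ... */
    if (used_heap) free(ovector);
    return used_heap;
}
```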
1053
1054
1055 HANDLING VERY LARGE PATTERNS
1056
1057 Within a compiled pattern, offset values are used to point from one
1058 part to another (for example, from an opening parenthesis to an alter-
1059 nation metacharacter). By default, in the 8-bit and 16-bit libraries,
1060 two-byte values are used for these offsets, leading to a maximum size
1061 for a compiled pattern of around 64K. This is sufficient to handle all
1062 but the most gigantic patterns. Nevertheless, some people do want to
1063 process truly enormous patterns, so it is possible to compile PCRE to
1064 use three-byte or four-byte offsets by adding a setting such as
1065
1066 --with-link-size=3
1067
1068 to the configure command. The value given must be 2, 3, or 4. For the
1069 16-bit library, a value of 3 is rounded up to 4. In these libraries,
1070 using longer offsets slows down the operation of PCRE because it has to
1071 load additional data when handling them. For the 32-bit library the
1072 value is always 4 and cannot be overridden; the value of --with-link-
1073 size is ignored.
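
The rules above can be summarized in a short sketch. The function
names are hypothetical, and storing the most significant byte first is
an assumption made for the example rather than a statement about
PCRE's internal layout. The point is that a two-byte link cannot
represent an offset of 65536 or more.

```c
#include <stdint.h>

/* Link size actually used, given the library's data unit width in bits
   (8, 16 or 32) and the requested --with-link-size value (2, 3 or 4). */
static int effective_link_size(int unit_bits, int requested)
{
    if (unit_bits == 32) return 4;                    /* setting ignored */
    if (unit_bits == 16 && requested == 3) return 4;  /* rounded up */
    return requested;
}

/* Store an offset in 'size' bytes, most significant byte first.
   Bits that do not fit are silently lost. */
static void put_link(uint8_t *p, uint32_t value, int size)
{
    int i;
    for (i = size - 1; i >= 0; i--) {
        p[i] = (uint8_t)(value & 0xffu);
        value >>= 8;
    }
}

/* Fetch an offset that was stored by put_link(). */
static uint32_t get_link(const uint8_t *p, int size)
{
    uint32_t v = 0;
    int i;
    for (i = 0; i < size; i++) v = (v << 8) | p[i];
    return v;
}
```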
1074
1075
1076 AVOIDING EXCESSIVE STACK USAGE
1077
1078 When matching with the pcre_exec() function, PCRE implements backtrack-
1079 ing by making recursive calls to an internal function called match().
1080 In environments where the size of the stack is limited, this can se-
1081 verely limit PCRE's operation. (The Unix environment does not usually
1082 suffer from this problem, but it may sometimes be necessary to increase
1083 the maximum stack size. There is a discussion in the pcrestack docu-
1084 mentation.) An alternative approach to recursion that uses memory from
1085 the heap to remember data, instead of using recursive function calls,
1086 has been implemented to work round the problem of limited stack size.
1087 If you want to build a version of PCRE that works this way, add
1088
1089 --disable-stack-for-recursion
1090
1091 to the configure command. With this configuration, PCRE will use the
1092 pcre_stack_malloc and pcre_stack_free variables to call memory manage-
1093 ment functions. By default these point to malloc() and free(), but you
1094 can replace the pointers so that your own functions are used instead.
1095
1096 Separate functions are provided rather than using pcre_malloc and
1097 pcre_free because the usage is very predictable: the block sizes
1098 requested are always the same, and the blocks are always freed in
1099 reverse order. A calling program might be able to implement optimized
1100 functions that perform better than malloc() and free(). PCRE runs
1101 noticeably more slowly when built in this way. This option affects only
1102 the pcre_exec() function; it is not relevant for pcre_dfa_exec().
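
Because the requests are all the same size and are released in reverse
order, a replacement allocator can be little more than a stack of
fixed-size blocks. The following is a sketch under assumed limits
(BLOCK_SIZE and BLOCK_COUNT are invented for the example, as are the
function names); a real program would need to size the pool for its
own patterns and decide what to do when it is exhausted.

```c
#include <stddef.h>

#define BLOCK_SIZE  1024   /* assumed upper bound on the request size */
#define BLOCK_COUNT 64     /* assumed maximum nesting depth */

/* The union members other than 'bytes' exist only to force alignment. */
static union {
    unsigned char bytes[BLOCK_SIZE];
    long double align_ld;
    void *align_p;
} pool[BLOCK_COUNT];
static size_t pool_top = 0;

/* Hand out the next block; requests are assumed to be one fixed size. */
static void *lifo_malloc(size_t size)
{
    if (size > BLOCK_SIZE || pool_top == BLOCK_COUNT) return NULL;
    return pool[pool_top++].bytes;
}

/* Blocks come back in reverse order, so freeing just pops the top. */
static void lifo_free(void *block)
{
    if (pool_top > 0 && block == pool[pool_top - 1].bytes) pool_top--;
}
```

These functions have the same shape as malloc() and free(), so a
program built against a suitably configured PCRE could, in principle,
assign them to pcre_stack_malloc and pcre_stack_free before matching.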
1103
1104
1105 LIMITING PCRE RESOURCE USAGE
1106
1107 Internally, PCRE has a function called match(), which it calls repeat-
1108 edly (sometimes recursively) when matching a pattern with the
1109 pcre_exec() function. By controlling the maximum number of times this
1110 function may be called during a single matching operation, a limit can
1111 be placed on the resources used by a single call to pcre_exec(). The
1112 limit can be changed at run time, as described in the pcreapi documen-
1113 tation. The default is 10 million, but this can be changed by adding a
1114 setting such as
1115
1116 --with-match-limit=500000
1117
1118 to the configure command. This setting has no effect on the
1119 pcre_dfa_exec() matching function.
1120
1121 In some environments it is desirable to limit the depth of recursive
1122 calls of match() more strictly than the total number of calls, in order
1123 to restrict the maximum amount of stack (or heap, if --disable-stack-
1124 for-recursion is specified) that is used. A second limit controls this;
1125 it defaults to the value that is set for --with-match-limit, which
1126 imposes no additional constraints. However, you can set a lower limit
1127 by adding, for example,
1128
1129 --with-match-limit-recursion=10000
1130
1131 to the configure command. This value can also be overridden at run
1132 time.
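
The effect of the two limits can be seen in a toy backtracking matcher
that counts its own calls and tracks its recursion depth. Everything
here (toy_match(), struct limits, ERR_LIMIT) is invented for the
illustration; it is not PCRE's match() function and supports only
literals, '.', and a starred single character.

```c
#define MATCH     1
#define NOMATCH   0
#define ERR_LIMIT (-1)   /* illustrative code, not a PCRE error number */

struct limits {
    unsigned long calls, max_calls;   /* total calls so far, and the cap */
    unsigned depth, max_depth;        /* current nesting, and the cap */
};

/* Toy matcher, anchored at the start of the text. Returns MATCH,
   NOMATCH, or ERR_LIMIT as soon as either limit is exceeded. */
static int toy_match(const char *re, const char *text, struct limits *lim)
{
    int rc;
    if (++lim->calls > lim->max_calls) return ERR_LIMIT;
    if (++lim->depth > lim->max_depth) { lim->depth--; return ERR_LIMIT; }

    if (re[0] == '\0') {
        rc = MATCH;
    } else if (re[1] == '*') {
        /* Try zero repetitions first, then one more each time round. */
        for (;;) {
            rc = toy_match(re + 2, text, lim);
            if (rc != NOMATCH) break;
            if (*text == '\0' || (re[0] != '.' && re[0] != *text)) break;
            text++;
        }
    } else if (*text != '\0' && (re[0] == '.' || re[0] == *text)) {
        rc = toy_match(re + 1, text + 1, lim);
    } else {
        rc = NOMATCH;
    }
    lim->depth--;
    return rc;
}
```

With a pattern such as a*a*a*b and a subject consisting only of "a"
characters, the number of calls grows quickly with the subject length
while the depth stays small, which is why the two limits are separate.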
1133
1134
1135 CREATING CHARACTER TABLES AT BUILD TIME
1136
1137 PCRE uses fixed tables for processing characters whose code values are
1138 less than 256. By default, PCRE is built with a set of tables that are
1139 distributed in the file pcre_chartables.c.dist. These tables are for
1140 ASCII codes only. If you add
1141
1142 --enable-rebuild-chartables
1143
1144 to the configure command, the distributed tables are no longer used.
1145 Instead, a program called dftables is compiled and run. This outputs
1146 the source for a new set of tables, created in the default locale of your
1147 C run-time system. (This method of replacing the tables does not work
1148 if you are cross compiling, because dftables is run on the local host.
1149 If you need to create alternative tables when cross compiling, you will
1150 have to do so "by hand".)
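
The idea of deriving tables from the C run-time system can be sketched
as follows. The structure shown is a simplification invented for this
example; the tables that dftables actually writes have a different
layout and more content.

```c
#include <ctype.h>

/* Simplified table set; the real layout in pcre_chartables.c differs. */
struct char_tables {
    unsigned char lower[256];   /* lower-case equivalent */
    unsigned char flip[256];    /* opposite-case equivalent */
    unsigned char is_word[256]; /* 1 for letters, digits and '_' */
};

/* Fill the tables from the C library's view of the current locale. */
static void build_tables(struct char_tables *t)
{
    int c;
    for (c = 0; c < 256; c++) {
        t->lower[c] = (unsigned char)tolower(c);
        t->flip[c] = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
        t->is_word[c] = (unsigned char)(isalnum(c) || c == '_');
    }
}
```

Because the tables depend on the locale in force when the function
runs, they describe the machine that runs it, which is why this
approach does not carry over to cross compiling.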
1151
1152
1153 USING EBCDIC CODE
1154
1155 PCRE assumes by default that it will run in an environment where the
1156 character code is ASCII (or Unicode, which is a superset of ASCII).
1157 This is the case for most computer operating systems. PCRE can, how-
1158 ever, be compiled to run in an EBCDIC environment by adding
1159
1160 --enable-ebcdic
1161
1162 to the configure command. This setting implies --enable-rebuild-charta-
1163 bles. You should only use it if you know that you are in an EBCDIC
1164 environment (for example, an IBM mainframe operating system). The
1165 --enable-ebcdic option is incompatible with --enable-utf.
1166
1167 The EBCDIC character that corresponds to an ASCII LF is assumed to have
1168 the value 0x15 by default. However, in some EBCDIC environments, 0x25
1169 is used. In such an environment you should use
1170
1171 --enable-ebcdic-nl25
1172
1173 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
1174 has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
1175 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
1176 acter (which, in Unicode, is 0x85).
1177
1178 The options that select newline behaviour, such as --enable-newline-is-
1179 cr, and equivalent run-time options, refer to these character values in
1180 an EBCDIC environment.
1181
1182
1183 PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
1184
1185 By default, pcregrep reads all files as plain text. You can build it so
1186 that it recognizes files whose names end in .gz or .bz2, and reads them
1187 with libz or libbz2, respectively, by adding one or both of
1188
1189 --enable-pcregrep-libz
1190 --enable-pcregrep-libbz2
1191
1192 to the configure command. These options naturally require that the rel-
1193 evant libraries are installed on your system. Configuration will fail
1194 if they are not.
1195
1196
1197 PCREGREP BUFFER SIZE
1198
1199 pcregrep uses an internal buffer to hold a "window" on the file it is
1200 scanning, in order to be able to output "before" and "after" lines when
1201 it finds a match. The size of the buffer is controlled by a parameter
1202 whose default value is 20K. The buffer itself is three times this size,
1203 but because of the way it is used for holding "before" lines, the long-
1204 est line that is guaranteed to be processable is the parameter size.
1205 You can change the default parameter value by adding, for example,
1206
1207 --with-pcregrep-bufsize=50K
1208
1209 to the configure command. The caller of pcregrep can, however, override
1210 this value by specifying a run-time option.
1211
1212
1213 PCRETEST OPTION FOR LIBREADLINE SUPPORT
1214
1215 If you add
1216
1217 --enable-pcretest-libreadline
1218
1219 to the configure command, pcretest is linked with the libreadline
1220 library, and when its input is from a terminal, it reads it using the
1221 readline() function. This provides line-editing and history facilities.
1222 Note that libreadline is GPL-licensed, so if you distribute a binary of
1223 pcretest linked in this way, there may be licensing issues.
1224
1225 Setting this option causes the -lreadline option to be added to the
1226 pcretest build. In many operating environments with a system-installed
1227 libreadline this is sufficient. However, in some environments (e.g. if
1228 an unmodified distribution version of readline is in use), some extra
1229 configuration may be necessary. The INSTALL file for libreadline says
1230 this:
1231
1232 "Readline uses the termcap functions, but does not link with the
1233 termcap or curses library itself, allowing applications which link
1234 with readline the to choose an appropriate library."
1235
1236 If your environment has not been set up so that an appropriate library
1237 is automatically included, you may need to add something like
1238
1239 LIBS="-lncurses"
1240
1241 immediately before the configure command.
1242
1243
1244 DEBUGGING WITH VALGRIND SUPPORT
1245
1246 By adding the
1247
1248 --enable-valgrind
1249
1250 option to the configure command, PCRE will use valgrind annotations
1251 to mark certain memory regions as unaddressable. This allows it to
1252 detect invalid memory accesses, and is mostly useful for debugging PCRE
1253 itself.
1254
1255
1256 CODE COVERAGE REPORTING
1257
1258 If your C compiler is gcc, you can build a version of PCRE that can
1259 generate a code coverage report for its test suite. To enable this, you
1260 must install lcov version 1.6 or above. Then specify
1261
1262 --enable-coverage
1263
1264 to the configure command and build PCRE in the usual way.
1265
1266 Note that using ccache (a caching C compiler) is incompatible with code
1267 coverage reporting. If you have configured ccache to run automatically
1268 on your system, you must set the environment variable
1269
1270 CCACHE_DISABLE=1
1271
1272 before running make to build PCRE, so that ccache is not used.
1273
1274 When --enable-coverage is used, the following additional targets are
1275 added to the Makefile:
1276
1277 make coverage
1278
1279 This creates a fresh coverage report for the PCRE test suite. It is
1280 equivalent to running "make coverage-reset", "make coverage-baseline",
1281 "make check", and then "make coverage-report".
1282
1283 make coverage-reset
1284
1285 This zeroes the coverage counters, but does nothing else.
1286
1287 make coverage-baseline
1288
1289 This captures baseline coverage information.
1290
1291 make coverage-report
1292
1293 This creates the coverage report.
1294
1295 make coverage-clean-report
1296
1297 This removes the generated coverage report without cleaning the cover-
1298 age data itself.
1299
1300 make coverage-clean-data
1301
1302 This removes the captured coverage data without removing the coverage
1303 files created at compile time (*.gcno).
1304
1305 make coverage-clean
1306
1307 This cleans all coverage data including the generated coverage report.
1308 For more information about code coverage, see the gcov and lcov docu-
1309 mentation.
1310
1311
1312 SEE ALSO
1313
1314 pcreapi(3), pcre16, pcre32, pcre_config(3).
1315
1316
1317 AUTHOR
1318
1319 Philip Hazel
1320 University Computing Service
1321 Cambridge CB2 3QH, England.
1322
1323
1324 REVISION
1325
1326 Last updated: 30 October 2012
1327 Copyright (c) 1997-2012 University of Cambridge.
1328 ------------------------------------------------------------------------------
1329
1330
1331 PCREMATCHING(3) Library Functions Manual PCREMATCHING(3)
1332
1333
1334
1335 NAME
1336 PCRE - Perl-compatible regular expressions
1337
1338 PCRE MATCHING ALGORITHMS
1339
1340 This document describes the two different algorithms that are available
1341 in PCRE for matching a compiled regular expression against a given sub-
1342 ject string. The "standard" algorithm is the one provided by the
1343 pcre_exec(), pcre16_exec() and pcre32_exec() functions. These work in
1344 the same way as Perl's matching function, and provide a Perl-compatible
1345 matching operation. The just-in-time (JIT) optimization that is
1346 described in the pcrejit documentation is compatible with these func-
1347 tions.
1348
1349 An alternative algorithm is provided by the pcre_dfa_exec(),
1350 pcre16_dfa_exec() and pcre32_dfa_exec() functions; they operate in a
1351 different way, and are not Perl-compatible. This alternative has advan-
1352 tages and disadvantages compared with the standard algorithm, and these
1353 are described below.
1354
1355 When there is only one possible way in which a given subject string can
1356 match a pattern, the two algorithms give the same answer. A difference
1357 arises, however, when there are multiple possibilities. For example, if
1358 the pattern
1359
1360 ^<.*>
1361
1362 is matched against the string
1363
1364 <something> <something else> <something further>
1365
1366 there are three possible answers. The standard algorithm finds only one
1367 of them, whereas the alternative algorithm finds all three.
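
The three answers can be enumerated by hand. The function below is a
hard-coded stand-in for the pattern ^<.*> that involves no PCRE calls;
its name and interface are assumptions for this example. It records
the length of every match that starts at the beginning of the subject.

```c
#include <stddef.h>
#include <string.h>

/* Hand-coded stand-in for the pattern ^<.*> : a match ends at each '>'
   on the first line. Stores up to 'max' match lengths in 'lengths' and
   returns the number stored. */
static size_t all_matches(const char *s, size_t *lengths, size_t max)
{
    size_t i, n = 0, len = strlen(s);
    if (len == 0 || s[0] != '<') return 0;
    for (i = 1; i < len && s[i] != '\n'; i++)
        if (s[i] == '>' && n < max) lengths[n++] = i + 1;
    return n;
}
```

For the subject shown above, the stored lengths are 11, 28 and 48
characters.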
1368
1369
1370 REGULAR EXPRESSIONS AS TREES
1371
1372 The set of strings that are matched by a regular expression can be rep-
1373 resented as a tree structure. An unlimited repetition in the pattern
1374 makes the tree of infinite size, but it is still a tree. Matching the
1375 pattern to a given subject string (from a given starting point) can be
1376 thought of as a search of the tree. There are two ways to search a
1377 tree: depth-first and breadth-first, and these correspond to the two
1378 matching algorithms provided by PCRE.
1379
1380
1381 THE STANDARD MATCHING ALGORITHM
1382
1383 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
1384 sions", the standard algorithm is an "NFA algorithm". It conducts a
1385 depth-first search of the pattern tree. That is, it proceeds along a
1386 single path through the tree, checking that the subject matches what is
1387 required. When there is a mismatch, the algorithm tries any alterna-
1388 tives at the current point, and if they all fail, it backs up to the
1389 previous branch point in the tree, and tries the next alternative
1390 branch at that level. This often involves backing up (moving to the
1391 left) in the subject string as well. The order in which repetition
1392 branches are tried is controlled by the greedy or ungreedy nature of
1393 the quantifier.
1394
1395 If a leaf node is reached, a matching string has been found, and at
1396 that point the algorithm stops. Thus, if there is more than one possi-
1397 ble match, this algorithm returns the first one that it finds. Whether
1398 this is the shortest, the longest, or some intermediate length depends
1399 on the way the greedy and ungreedy repetition quantifiers are specified
1400 in the pattern.
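
How the quantifier decides which match is found first can be shown
with a hand-coded version of the backtracking for ^<.*> and ^<.*?>.
The function first_match() is an invention for this sketch, and it
assumes a subject without newlines.

```c
#include <stddef.h>
#include <string.h>

/* Hand-coded backtracking for ^<.*> (greedy) or ^<.*?> (ungreedy).
   Returns the length of the first match found, or 0 if there is none. */
static size_t first_match(const char *s, int greedy)
{
    size_t len = strlen(s), n;
    if (len < 2 || s[0] != '<') return 0;
    if (greedy) {
        /* .* takes everything, then gives characters back one by one. */
        for (n = len - 1; n >= 1; n--)
            if (s[n] == '>') return n + 1;
    } else {
        /* .*? takes nothing, then accepts characters one by one. */
        for (n = 1; n < len; n++)
            if (s[n] == '>') return n + 1;
    }
    return 0;
}
```

In both cases the search stops at the first success, so only one of
the possible matches is ever reported.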
1401
1402 Because it ends up with a single path through the tree, it is rela-
1403 tively straightforward for this algorithm to keep track of the sub-
1404 strings that are matched by portions of the pattern in parentheses.
1405 This provides support for capturing parentheses and back references.
1406
1407
1408 THE ALTERNATIVE MATCHING ALGORITHM
1409
1410 This algorithm conducts a breadth-first search of the tree. Starting
1411 from the first matching point in the subject, it scans the subject
1412 string from left to right, once, character by character, and as it does
1413 this, it remembers all the paths through the tree that represent valid
1414 matches. In Friedl's terminology, this is a kind of "DFA algorithm",
1415 though it is not implemented as a traditional finite state machine (it
1416 keeps multiple states active simultaneously).
1417
1418 Although the general principle of this matching algorithm is that it
1419 scans the subject string only once, without backtracking, there is one
1420 exception: when a lookaround assertion is encountered, the characters
1421 following or preceding the current point have to be independently
1422 inspected.
1423
1424 The scan continues until either the end of the subject is reached, or
1425 there are no more unterminated paths. At this point, terminated paths
1426 represent the different matching possibilities (if there are none, the
1427 match has failed). Thus, if there is more than one possible match,
1428 this algorithm finds all of them, and in particular, it finds the long-
1429 est. The matches are returned in decreasing order of length. There is
1430 an option to stop the algorithm after the first match (which is neces-
1431 sarily the shortest) is found.
1432
1433 Note that all the matches that are found start at the same point in the
1434 subject. If the pattern
1435
1436 cat(er(pillar)?)?
1437
1438 is matched against the string "the caterpillar catchment", the result
1439 will be the three strings "caterpillar", "cater", and "cat" that start
1440 at the fifth character of the subject. The algorithm does not automati-
1441 cally move on to find matches that start at later positions.
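
The caterpillar example can be reproduced with a hand-coded stand-in
for the pattern. The function matches_at() is hypothetical and does
not use PCRE; it only illustrates that every reported match starts at
the same offset and that the results are ordered longest first.

```c
#include <stddef.h>
#include <string.h>

/* Hand-coded stand-in for cat(er(pillar)?)? at one starting offset.
   Stores match lengths longest first and returns how many were found;
   it does not move on to later starting positions. */
static size_t matches_at(const char *s, size_t start, size_t lengths[3])
{
    static const char *const parts[3] = { "cat", "er", "pillar" };
    size_t found[3], n = 0, pos = start, i;

    if (start > strlen(s)) return 0;
    for (i = 0; i < 3; i++) {
        size_t plen = strlen(parts[i]);
        if (strncmp(s + pos, parts[i], plen) != 0) break;
        pos += plen;
        found[n++] = pos - start;   /* each prefix is itself a match */
    }
    for (i = 0; i < n; i++) lengths[i] = found[n - 1 - i];
    return n;
}
```

To find the "cat" of "catchment" as well, the caller has to invoke the
function again with a later starting offset.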
1442
1443 There are a number of features of PCRE regular expressions that are not
1444 supported by the alternative matching algorithm. They are as follows:
1445
1446 1. Because the algorithm finds all possible matches, the greedy or
1447 ungreedy nature of repetition quantifiers is not relevant. Greedy and
1448 ungreedy quantifiers are treated in exactly the same way. However, pos-
1449 sessive quantifiers can make a difference when what follows could also
1450 match what is quantified, for example in a pattern like this:
1451
1452 ^a++\w!
1453
1454 This pattern matches "aaab!" but not "aaa!", which would be matched by
1455 a non-possessive quantifier. Similarly, if an atomic group is present,
1456 it is matched as if it were a standalone pattern at the current point,
1457 and the longest match is then "locked in" for the rest of the overall
1458 pattern.
1459
1460 2. When dealing with multiple paths through the tree simultaneously, it
1461 is not straightforward to keep track of captured substrings for the
1462 different matching possibilities, and PCRE's implementation of this
1463 algorithm does not attempt to do this. This means that no captured sub-
1464 strings are available.
1465
1466 3. Because no substrings are captured, back references within the pat-
1467 tern are not supported, and cause errors if encountered.
1468
1469 4. For the same reason, conditional expressions that use a backrefer-
1470 ence as the condition or test for a specific group recursion are not
1471 supported.
1472
1473 5. Because many paths through the tree may be active, the \K escape
1474 sequence, which resets the start of the match when encountered (but may
1475 be on some paths and not on others), is not supported. It causes an
1476 error if encountered.
1477
1478 6. Callouts are supported, but the value of the capture_top field is
1479 always 1, and the value of the capture_last field is always -1.
1480
1481 7. The \C escape sequence, which (in the standard algorithm) always
1482 matches a single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is
1483 not supported in these modes, because the alternative algorithm moves
1484 through the subject string one character (not data unit) at a time, for
1485 all active paths through the tree.
1486
1487 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
1488 are not supported. (*FAIL) is supported, and behaves like a failing
1489 negative assertion.
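
The possessive quantifier example in item 1 can be checked with a
hand-coded matcher. The function below is an assumption-laden sketch
(its name and its simple definition of a word character are invented
here); it compares ^a+\w! with ^a++\w! without calling PCRE.

```c
#include <ctype.h>
#include <stddef.h>

static int is_word(char c) { return isalnum((unsigned char)c) || c == '_'; }

/* Hand-coded stand-in for ^a+\w! (possessive == 0) or ^a++\w!
   (possessive != 0). Returns 1 on a match, 0 otherwise. */
static int match_a_word_bang(const char *s, int possessive)
{
    size_t run = 0, n;
    while (s[run] == 'a') run++;
    if (run == 0) return 0;
    /* Possessive: only the full run may be used. Otherwise the run can
       give back characters, down to a single 'a'. */
    for (n = run; n >= 1; n--) {
        if (is_word(s[n]) && s[n + 1] == '!') return 1;
        if (possessive) break;
    }
    return 0;
}
```

With "aaa!" the non-possessive form succeeds by letting the last "a"
be matched by \w, which the possessive form does not allow.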
1490
1491
1492 ADVANTAGES OF THE ALTERNATIVE ALGORITHM
1493
1494 Using the alternative matching algorithm provides the following advan-
1495 tages:
1496
1497 1. All possible matches (at a single point in the subject) are automat-
1498 ically found, and in particular, the longest match is found. To find
1499 more than one match using the standard algorithm, you have to do kludgy
1500 things with callouts.
1501
1502 2. Because the alternative algorithm scans the subject string just
1503 once, and never needs to backtrack (except for lookbehinds), it is pos-
1504 sible to pass very long subject strings to the matching function in
1505 several pieces, checking for partial matching each time. Although it is
1506 possible to do multi-segment matching using the standard algorithm by
1507 retaining partially matched substrings, it is more complicated. The
1508 pcrepartial documentation gives details of partial matching and dis-
1509 cusses multi-segment matching.
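
The reason a single left-to-right scan suits multi-segment matching is
that all the information needed to continue fits in a small state that
survives between calls. The sketch below shows this for the literal
"cat" only; struct scanner and feed() are invented for the example and
are unrelated to PCRE's partial matching interface, which is described
in the pcrepartial documentation.

```c
#include <stddef.h>

/* Scanner state for finding the literal "cat"; 'matched' counts how
   many of its leading characters the input seen so far ends with. */
struct scanner { int matched; };

/* Feed one segment. Returns 1 as soon as "cat" has been seen (and
   resets the state), else 0; a partial match at the end of the
   segment is kept in the state for the next call. */
static int feed(struct scanner *sc, const char *seg, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        char c = seg[i];
        if (sc->matched == 0)
            sc->matched = (c == 'c') ? 1 : 0;
        else if (sc->matched == 1)
            sc->matched = (c == 'a') ? 2 : (c == 'c') ? 1 : 0;
        else /* matched == 2 */
            sc->matched = (c == 't') ? 3 : (c == 'c') ? 1 : 0;
        if (sc->matched == 3) { sc->matched = 0; return 1; }
    }
    return 0;
}
```

A subject split as "the ca" followed by "terpillar" is handled without
either segment being rescanned.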
1510
1511
1512 DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
1513
1514 The alternative algorithm suffers from a number of disadvantages:
1515
1516 1. It is substantially slower than the standard algorithm. This is
1517 partly because it has to search for all possible matches, but is also
1518 because it is less susceptible to optimization.
1519
1520 2. Capturing parentheses and back references are not supported.
1521
1522 3. Although atomic groups are supported, their use does not provide the
1523 performance advantage that it does for the standard algorithm.
1524
1525
1526 AUTHOR
1527
1528 Philip Hazel
1529 University Computing Service
1530 Cambridge CB2 3QH, England.
1531
1532
1533 REVISION
1534
1535 Last updated: 08 January 2012
1536 Copyright (c) 1997-2012 University of Cambridge.
1537 ------------------------------------------------------------------------------
1538
1539
1540 PCREAPI(3) Library Functions Manual PCREAPI(3)
1541
1542
1543
1544 NAME
1545 PCRE - Perl-compatible regular expressions
1546
1547 #include <pcre.h>
1548
1549
1550 PCRE NATIVE API BASIC FUNCTIONS
1551
1552 pcre *pcre_compile(const char *pattern, int options,
1553 const char **errptr, int *erroffset,
1554 const unsigned char *tableptr);
1555
1556 pcre *pcre_compile2(const char *pattern, int options,
1557 int *errorcodeptr,
1558 const char **errptr, int *erroffset,
1559 const unsigned char *tableptr);
1560
1561 pcre_extra *pcre_study(const pcre *code, int options,
1562 const char **errptr);
1563
1564 void pcre_free_study(pcre_extra *extra);
1565
1566 int pcre_exec(const pcre *code, const pcre_extra *extra,
1567 const char *subject, int length, int startoffset,
1568 int options, int *ovector, int ovecsize);
1569
1570 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1571 const char *subject, int length, int startoffset,
1572 int options, int *ovector, int ovecsize,
1573 int *workspace, int wscount);
1574
1575
1576 PCRE NATIVE API STRING EXTRACTION FUNCTIONS
1577
1578 int pcre_copy_named_substring(const pcre *code,
1579 const char *subject, int *ovector,
1580 int stringcount, const char *stringname,
1581 char *buffer, int buffersize);
1582
1583 int pcre_copy_substring(const char *subject, int *ovector,
1584 int stringcount, int stringnumber, char *buffer,
1585 int buffersize);
1586
1587 int pcre_get_named_substring(const pcre *code,
1588 const char *subject, int *ovector,
1589 int stringcount, const char *stringname,
1590 const char **stringptr);
1591
1592 int pcre_get_stringnumber(const pcre *code,
1593 const char *name);
1594
1595 int pcre_get_stringtable_entries(const pcre *code,
1596 const char *name, char **first, char **last);
1597
1598 int pcre_get_substring(const char *subject, int *ovector,
1599 int stringcount, int stringnumber,
1600 const char **stringptr);
1601
1602 int pcre_get_substring_list(const char *subject,
1603 int *ovector, int stringcount, const char ***listptr);
1604
1605 void pcre_free_substring(const char *stringptr);
1606
1607 void pcre_free_substring_list(const char **stringptr);
1608
1609
1610 PCRE NATIVE API AUXILIARY FUNCTIONS
1611
1612 int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
1613 const char *subject, int length, int startoffset,
1614 int options, int *ovector, int ovecsize,
1615 pcre_jit_stack *jstack);
1616
1617 pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
1618
1619 void pcre_jit_stack_free(pcre_jit_stack *stack);
1620
1621 void pcre_assign_jit_stack(pcre_extra *extra,
1622 pcre_jit_callback callback, void *data);
1623
1624 const unsigned char *pcre_maketables(void);
1625
1626 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1627 int what, void *where);
1628
1629 int pcre_refcount(pcre *code, int adjust);
1630
1631 int pcre_config(int what, void *where);
1632
1633 const char *pcre_version(void);
1634
1635 int pcre_pattern_to_host_byte_order(pcre *code,
1636 pcre_extra *extra, const unsigned char *tables);
1637
1638
1639 PCRE NATIVE API INDIRECTED FUNCTIONS
1640
1641 void *(*pcre_malloc)(size_t);
1642
1643 void (*pcre_free)(void *);
1644
1645 void *(*pcre_stack_malloc)(size_t);
1646
1647 void (*pcre_stack_free)(void *);
1648
1649 int (*pcre_callout)(pcre_callout_block *);
1650
1651
1652 PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
1653
1654 As well as support for 8-bit character strings, PCRE also supports
1655 16-bit strings (from release 8.30) and 32-bit strings (from release
1656 8.32), by means of two additional libraries. They can be built as well
1657 as, or instead of, the 8-bit library. To avoid too much complication,
1658 this document describes the 8-bit versions of the functions, with only
1659 occasional references to the 16-bit and 32-bit libraries.
1660
1661 The 16-bit and 32-bit functions operate in the same way as their 8-bit
1662 counterparts; they just use different data types for their arguments
1663 and results, and their names start with pcre16_ or pcre32_ instead of
1664 pcre_. For every option that has UTF8 in its name (for example,
1665 PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8
1666 replaced by UTF16 or UTF32, respectively. This facility is in fact just
1667 cosmetic; the 16-bit and 32-bit option names define the same bit val-
1668 ues.
1669
1670 References to bytes and UTF-8 in this document should be read as refer-
1671 ences to 16-bit data quantities and UTF-16 when using the 16-bit
1672 library, or 32-bit data quantities and UTF-32 when using the 32-bit
1673 library, unless specified otherwise. More details of the specific dif-
1674 ferences for the 16-bit and 32-bit libraries are given in the pcre16
1675 and pcre32 pages.
1676
1677
1678 PCRE API OVERVIEW
1679
1680 PCRE has its own native API, which is described in this document. There
1681 are also some wrapper functions (for the 8-bit library only) that cor-
1682 respond to the POSIX regular expression API, but they do not give
1683 access to all the functionality. They are described in the pcreposix
1684 documentation. Both of these APIs define a set of C function calls. A
1685 C++ wrapper (again for the 8-bit library only) is also distributed with
1686 PCRE. It is documented in the pcrecpp page.
1687
1688 The native API C function prototypes are defined in the header file
1689 pcre.h, and on Unix-like systems the (8-bit) library itself is called
1690 libpcre. It can normally be accessed by adding -lpcre to the command
1691 for linking an application that uses PCRE. The header file defines the
1692 macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
1693 numbers for the library. Applications can use these to include support
1694 for different releases of PCRE.
1695
1696 In a Windows environment, if you want to statically link an application
1697 program against a non-dll pcre.a file, you must define PCRE_STATIC
1698 before including pcre.h or pcrecpp.h, because otherwise the pcre_mal-
1699 loc() and pcre_free() exported functions will be declared
1700 __declspec(dllimport), with unwanted results.
1701
1702 The functions pcre_compile(), pcre_compile2(), pcre_study(), and
1703 pcre_exec() are used for compiling and matching regular expressions in
1704 a Perl-compatible manner. A sample program that demonstrates the sim-
1705 plest way of using them is provided in the file called pcredemo.c in
1706 the PCRE source distribution. A listing of this program is given in the
1707 pcredemo documentation, and the pcresample documentation describes how
1708 to compile and run it.
1709
1710 Just-in-time compiler support is an optional feature of PCRE that can
1711 be built in appropriate hardware environments. It greatly speeds up the
1712 matching performance of many patterns. Simple programs can easily
1713 request that it be used if available, by setting an option that is
1714 ignored when it is not relevant. More complicated programs might need
1715 to make use of the functions pcre_jit_stack_alloc(),
1716 pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control
1717 the JIT code's memory usage.
1718
1719 From release 8.32 there is also a direct interface for JIT execution,
1720 which gives improved performance. The JIT-specific functions are dis-
1721 cussed in the pcrejit documentation.
1722
1723 A second matching function, pcre_dfa_exec(), which is not Perl-compati-
1724 ble, is also provided. This uses a different algorithm for the match-
1725 ing. The alternative algorithm finds all possible matches (at a given
1726 point in the subject), and scans the subject just once (unless there
1727 are lookbehind assertions). However, this algorithm does not return
1728 captured substrings. A description of the two matching algorithms and
1729 their advantages and disadvantages is given in the pcrematching docu-
1730 mentation.
1731
1732 In addition to the main compiling and matching functions, there are
1733 convenience functions for extracting captured substrings from a subject
1734 string that is matched by pcre_exec(). They are:
1735
1736 pcre_copy_substring()
1737 pcre_copy_named_substring()
1738 pcre_get_substring()
1739 pcre_get_named_substring()
1740 pcre_get_substring_list()
1741 pcre_get_stringnumber()
1742 pcre_get_stringtable_entries()
1743
1744 pcre_free_substring() and pcre_free_substring_list() are also provided,
1745 to free the memory used for extracted strings.
1746
1747 The function pcre_maketables() is used to build a set of character
1748 tables in the current locale for passing to pcre_compile(),
1749 pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
1750 provided for specialist use. Most commonly, no special tables are
1751 passed, in which case internal tables that are generated when PCRE is
1752 built are used.
1753
1754 The function pcre_fullinfo() is used to find out information about a
1755 compiled pattern. The function pcre_version() returns a pointer to a
1756 string containing the version of PCRE and its date of release.
1757
1758 The function pcre_refcount() maintains a reference count in a data
1759 block containing a compiled pattern. This is provided for the benefit
1760 of object-oriented applications.
1761
1762 The global variables pcre_malloc and pcre_free initially contain the
1763 entry points of the standard malloc() and free() functions, respec-
1764 tively. PCRE calls the memory management functions via these variables,
1765 so a calling program can replace them if it wishes to intercept the
1766 calls. This should be done before calling any PCRE functions.
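
For example, a program might count outstanding blocks in order to detect
leaks. The following is a minimal sketch of such wrappers; the names are
hypothetical. So that the fragment compiles without pcre.h, the two
assignments that install the wrappers are shown in a comment.

```c
#include <assert.h>
#include <stdlib.h>

long live_blocks = 0;        /* blocks obtained and not yet released */

/* Same signature as the function that pcre_malloc points to. */
void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) live_blocks++;
    return p;
}

/* Same signature as the function that pcre_free points to. */
void counting_free(void *p)
{
    if (p != NULL) {
        live_blocks--;
        assert(live_blocks >= 0);
    }
    free(p);
}

/* In a program that includes <pcre.h>, install the wrappers before
   calling any other PCRE function:

       pcre_malloc = counting_malloc;
       pcre_free   = counting_free;

   The counter is not protected against concurrent updates, so this
   sketch is suitable for single-threaded use only. */
```

After every compiled pattern has been passed to pcre_free, live_blocks
should be back at zero; a non-zero value suggests a leak.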
1767
1768 The global variables pcre_stack_malloc and pcre_stack_free are also
1769 indirections to memory management functions. These special functions
1770 are used only when PCRE is compiled to use the heap for remembering
1771 data, instead of recursive function calls, when running the pcre_exec()
1772 function. See the pcrebuild documentation for details of how to do
1773 this. It is a non-standard way of building PCRE, for use in environ-
1774 ments that have limited stacks. Because of the greater use of memory
1775 management, it runs more slowly. Separate functions are provided so
1776 that special-purpose external code can be used for this case. When
1777 used, these functions are always called in a stack-like manner (last
1778 obtained, first freed), and always for memory blocks of the same size.
1779 There is a discussion about PCRE's stack usage in the pcrestack docu-
1780 mentation.
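
The stack-like, fixed-size calling pattern described above is what makes
a very simple allocator possible. The sketch below relies only on that
stated pattern and should be checked against the PCRE release in use.
All names and both constants are hypothetical, and the single static
pool makes it unsuitable for multi-threaded programs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SLOTS 64           /* hypothetical limit on nesting depth */
#define POOL_ALIGN 16           /* assumed to satisfy any alignment need */

static char  *pool_base = NULL; /* arena, allocated on first use */
static size_t pool_request = 0; /* block size seen on the first call */
static size_t pool_stride = 0;  /* pool_request rounded up to POOL_ALIGN */
static size_t pool_top = 0;     /* number of blocks currently in use */

/* Candidate for pcre_stack_malloc: hand out the next slot of the arena.
   Returns NULL if the size differs from the first request or if the
   arena is full. */
void *lifo_malloc(size_t size)
{
    assert(size > 0);
    if (pool_base == NULL) {
        pool_request = size;
        pool_stride = (size + POOL_ALIGN - 1) / POOL_ALIGN * POOL_ALIGN;
        pool_base = malloc(pool_stride * POOL_SLOTS);
        if (pool_base == NULL) return NULL;
    }
    if (size != pool_request || pool_top == POOL_SLOTS) return NULL;
    return pool_base + pool_stride * pool_top++;
}

/* Candidate for pcre_stack_free: only the most recently obtained block
   can be released; any other pointer is ignored. */
void lifo_free(void *block)
{
    if (pool_top > 0 && block == pool_base + pool_stride * (pool_top - 1))
        pool_top--;
}
```

Obtaining and releasing a block is then a pointer adjustment rather than
a general-purpose heap operation, which is the kind of special-purpose
code the separate variables are intended to allow.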
1781
1782 The global variable pcre_callout initially contains NULL. It can be set
1783 by the caller to a "callout" function, which PCRE will then call at
1784 specified points during a matching operation. Details are given in the
1785 pcrecallout documentation.
1786
1787
1788 NEWLINES
1789
1790 PCRE supports five different conventions for indicating line breaks in
1791 strings: a single CR (carriage return) character, a single LF (line-
1792 feed) character, the two-character sequence CRLF, any of the three pre-
1793 ceding, or any Unicode newline sequence. The Unicode newline sequences
1794 are the three just mentioned, plus the single characters VT (vertical
1795 tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
1796 separator, U+2028), and PS (paragraph separator, U+2029).
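
To make the five conventions concrete, the following sketch tests
whether a newline starts at a given position. It is not PCRE's own
code: the enum, the function name, and the assumption that the subject
is UTF-8 when "any Unicode newline" is selected are all choices made for
this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the length in bytes of the newline that starts at s[0], or 0 if
   no newline starts there under convention c. len is the number of bytes
   available. For NL_ANY the bytes are assumed to be UTF-8. */
size_t newline_length(const unsigned char *s, size_t len,
                      enum nl_convention c)
{
    int crlf;

    assert(s != NULL);
    if (len == 0) return 0;
    crlf = (len >= 2 && s[0] == '\r' && s[1] == '\n');

    switch (c) {
    case NL_CR:
        return s[0] == '\r' ? 1 : 0;
    case NL_LF:
        return s[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (s[0] == '\r' || s[0] == '\n') ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        if (s[0] == '\r' || s[0] == '\n' ||
            s[0] == 0x0B || s[0] == 0x0C)          /* VT, FF */
            return 1;
        if (len >= 2 && s[0] == 0xC2 && s[1] == 0x85)
            return 2;                              /* NEL, U+0085 */
        if (len >= 3 && s[0] == 0xE2 && s[1] == 0x80 &&
            (s[2] == 0xA8 || s[2] == 0xA9))
            return 3;                              /* LS, PS */
        return 0;
    }
    return 0;
}
```

Note that under the CRLF convention a lone CR or a lone LF is not a
newline, whereas under the "any of the three" convention CRLF counts as
one two-character newline rather than two separate ones.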
1797
1798 Each of the first three conventions is used by at least one operating
1799 system as its standard newline sequence. When PCRE is built, a default
1800 can be specified. If none is given, the default is LF, which is the Unix
1801 standard. When PCRE is run, the default can be overridden, either when a
1802 pattern is compiled, or when it is matched.
1803
1804 At compile time, the newline convention can be specified by the options
1805 argument of pcre_compile(), or it can be specified by special text at
1806 the start of the pattern itself; this overrides any other settings. See
1807 the pcrepattern page for details of the special character sequences.
1808
1809 In the PCRE documentation the word "newline" is used to mean "the char-
1810 acter or pair of characters that indicate a line break". The choice of
1811 newline convention affects the handling of the dot, circumflex, and
1812 dollar metacharacters, the handling of #-comments in /x mode, and, when
1813 CRLF is a recognized line ending sequence, the match position advance-
1814 ment for a non-anchored pattern. There is more detail about this in the
1815 section on pcre_exec() options below.
1816
1817 The choice of newline convention does not affect the interpretation of
1818 the \n or \r escape sequences, nor does it affect what \R matches,
1819 which is controlled in a similar way, but by separate options.
1820
1821
1822 MULTITHREADING
1823
1824 The PCRE functions can be used in multithreaded applications, with
1825 the proviso that the memory management functions pointed to by
1826 pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1827 callout function pointed to by pcre_callout, are shared by all threads.
1828
1829 The compiled form of a regular expression is not altered during match-
1830 ing, so the same compiled pattern can safely be used by several threads
1831 at once.
1832
1833 If the just-in-time optimization feature is being used, it needs sepa-
1834 rate memory stack areas for each thread. See the pcrejit documentation
1835 for more details.
1836
1837
1838 SAVING PRECOMPILED PATTERNS FOR LATER USE
1839
1840 The compiled form of a regular expression can be saved and re-used at a
1841 later time, possibly by a different program, and even on a host other
1842 than the one on which it was compiled. Details are given in the
1843 pcreprecompile documentation, which includes a description of the
1844 pcre_pattern_to_host_byte_order() function. However, compiling a regu-
1845 lar expression with one version of PCRE for use with a different ver-
1846 sion is not guaranteed to work and may cause crashes.
1847
1848
1849 CHECKING BUILD-TIME OPTIONS
1850
1851 int pcre_config(int what, void *where);
1852
1853 The function pcre_config() makes it possible for a PCRE client to dis-
1854 cover which optional features have been compiled into the PCRE library.
1855 The pcrebuild documentation has more details about these optional fea-
1856 tures.
1857
1858 The first argument for pcre_config() is an integer, specifying which
1859 information is required; the second argument is a pointer to a variable
1860 into which the information is placed. The returned value is zero on
1861 success, or the negative error code PCRE_ERROR_BADOPTION if the value
1862 in the first argument is not recognized. The following information is
1863 available:
1864
1865 PCRE_CONFIG_UTF8
1866
1867 The output is an integer that is set to one if UTF-8 support is avail-
1868 able; otherwise it is set to zero. This value should normally be given
1869 to the 8-bit version of this function, pcre_config(). If it is given to
1870 the 16-bit or 32-bit version of this function, the result is
1871 PCRE_ERROR_BADOPTION.
1872
1873 PCRE_CONFIG_UTF16
1874
1875 The output is an integer that is set to one if UTF-16 support is avail-
1876 able; otherwise it is set to zero. This value should normally be given
1877 to the 16-bit version of this function, pcre16_config(). If it is given
1878 to the 8-bit or 32-bit version of this function, the result is
1879 PCRE_ERROR_BADOPTION.
1880
1881 PCRE_CONFIG_UTF32
1882
1883 The output is an integer that is set to one if UTF-32 support is avail-
1884 able; otherwise it is set to zero. This value should normally be given
1885 to the 32-bit version of this function, pcre32_config(). If it is given
1886 to the 8-bit or 16-bit version of this function, the result is
1887 PCRE_ERROR_BADOPTION.
1888
1889 PCRE_CONFIG_UNICODE_PROPERTIES
1890
1891 The output is an integer that is set to one if support for Unicode
1892 character properties is available; otherwise it is set to zero.
1893
1894 PCRE_CONFIG_JIT
1895
1896 The output is an integer that is set to one if support for just-in-time
1897 compiling is available; otherwise it is set to zero.
1898
1899 PCRE_CONFIG_JITTARGET
1900
1901 The output is a pointer to a zero-terminated "const char *" string. If
1902 JIT support is available, the string contains the name of the architec-
1903 ture for which the JIT compiler is configured, for example "x86 32bit
1904 (little endian + unaligned)". If JIT support is not available, the
1905 result is NULL.
1906
1907 PCRE_CONFIG_NEWLINE
1908
1909 The output is an integer whose value specifies the default character
1910 sequence that is recognized as meaning "newline". The values that are
1911 supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
1912 for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
1913 ANYCRLF, and ANY yield the same values. However, the value for LF is
1914 normally 21, though some EBCDIC environments use 37. The corresponding
1915 values for CRLF are 3349 and 3365. The default should normally corre-
1916 spond to the standard sequence for your operating system.
1917
1918 PCRE_CONFIG_BSR
1919
1920 The output is an integer whose value indicates what character sequences
1921 the \R escape sequence matches by default. A value of 0 means that \R
1922 matches any Unicode line ending sequence; a value of 1 means that \R
1923 matches only CR, LF, or CRLF. The default can be overridden when a pat-
1924 tern is compiled or matched.
1925
1926 PCRE_CONFIG_LINK_SIZE
1927
1928 The output is an integer that contains the number of bytes used for
1929 internal linkage in compiled regular expressions. For the 8-bit
1930 library, the value can be 2, 3, or 4. For the 16-bit library, the value
1931 is either 2 or 4 and is still a number of bytes. For the 32-bit
1932 library, the value is either 2 or 4 and is still a number of bytes. The
1933 default value of 2 is sufficient for all but the most massive patterns,
1934 since it allows the compiled pattern to be up to 64K in size. Larger
1935 values allow larger regular expressions to be compiled, at the expense
1936 of slower matching.
1937
1938 PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1939
1940 The output is an integer that contains the threshold above which the
1941 POSIX interface uses malloc() for output vectors. Further details are
1942 given in the pcreposix documentation.
1943
1944 PCRE_CONFIG_MATCH_LIMIT
1945
1946 The output is a long integer that gives the default limit for the num-
1947 ber of internal matching function calls in a pcre_exec() execution.
1948 Further details are given with pcre_exec() below.
1949
1950 PCRE_CONFIG_MATCH_LIMIT_RECURSION
1951
1952 The output is a long integer that gives the default limit for the depth
1953 of recursion when calling the internal matching function in a
1954 pcre_exec() execution. Further details are given with pcre_exec()
1955 below.
1956
1957 PCRE_CONFIG_STACKRECURSE
1958
1959 The output is an integer that is set to one if internal recursion when
1960 running pcre_exec() is implemented by recursive function calls that use
1961 the stack to remember their state. This is the usual way that PCRE is
1962 compiled. The output is zero if PCRE was compiled to use blocks of data
1963 on the heap instead of recursive function calls. In this case,
1964 pcre_stack_malloc and pcre_stack_free are called to manage memory
1965 blocks on the heap, thus avoiding the use of the stack.
1966
1967
1968 COMPILING A PATTERN
1969
1970 pcre *pcre_compile(const char *pattern, int options,
1971 const char **errptr, int *erroffset,
1972 const unsigned char *tableptr);
1973
1974 pcre *pcre_compile2(const char *pattern, int options,
1975 int *errorcodeptr,
1976 const char **errptr, int *erroffset,
1977 const unsigned char *tableptr);
1978
1979 Either of the functions pcre_compile() or pcre_compile2() can be called
1980 to compile a pattern into an internal form. The only difference between
1981 the two interfaces is that pcre_compile2() has an additional argument,
1982 errorcodeptr, via which a numerical error code can be returned. To
1983 avoid too much repetition, we refer just to pcre_compile() below, but
1984 the information applies equally to pcre_compile2().
1985
1986 The pattern is a C string terminated by a binary zero, and is passed in
1987 the pattern argument. A pointer to a single block of memory that is
1988 obtained via pcre_malloc is returned. This contains the compiled code
1989 and related data. The pcre type is defined for the returned block; this
1990 is a typedef for a structure whose contents are not externally defined.
1991 It is up to the caller to free the memory (via pcre_free) when it is no
1992 longer required.
1993
1994 Although the compiled code of a PCRE regex is relocatable, that is, it
1995 does not depend on memory location, the complete pcre data block is not
1996 fully relocatable, because it may contain a copy of the tableptr argu-
1997 ment, which is an address (see below).
1998
1999 The options argument contains various bit settings that affect the com-
2000 pilation. It should be zero if no options are required. The available
2001 options are described below. Some of them (in particular, those that
2002 are compatible with Perl, but some others as well) can also be set and
2003 unset from within the pattern (see the detailed description in the
2004 pcrepattern documentation). For those options that can be different in
2005 different parts of the pattern, the contents of the options argument
2006 specify their settings at the start of compilation and execution. The
2007 PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
2008 PCRE_NO_START_OPTIMIZE options can be set at the time of matching as
2009 well as at compile time.
2010
2011 If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
2012 if compilation of a pattern fails, pcre_compile() returns NULL, and
2013 sets the variable pointed to by errptr to point to a textual error mes-
2014 sage. This is a static string that is part of the library. You must not
2015 try to free it. Normally, the offset from the start of the pattern to
2016 the byte that was being processed when the error was discovered is
2017 placed in the variable pointed to by erroffset, which must not be NULL
2018 (if it is, an immediate error is given). However, for an invalid UTF-8
2019 string, the offset is that of the first byte of the failing character.
2020
2021 Some errors are not detected until the whole pattern has been scanned;
2022 in these cases, the offset passed back is the length of the pattern.
2023 Note that the offset is in bytes, not characters, even in UTF-8 mode.
2024 It may sometimes point into the middle of a UTF-8 character.
2025
2026 If pcre_compile2() is used instead of pcre_compile(), and the error-
2027 codeptr argument is not NULL, a non-zero error code number is returned
2028 via this argument in the event of an error. This is in addition to the
2029 textual error message. Error codes and messages are listed below.
2030
2031 If the final argument, tableptr, is NULL, PCRE uses a default set of
2032 character tables that are built when PCRE is compiled, using the
2033 default C locale. Otherwise, tableptr must be an address that is the
2034 result of a call to pcre_maketables(). This value is stored with the
2035 compiled pattern, and used again by pcre_exec(), unless another table
2036 pointer is passed to it. For more discussion, see the section on locale
2037 support below.
2038
2039 This code fragment shows a typical straightforward call to pcre_com-
2040 pile():
2041
pcre *re;
const char *error;
int erroffset;
re = pcre_compile(
  "^A.*Z",          /* the pattern */
  0,                /* default options */
  &error,           /* for error message */
  &erroffset,       /* for error offset */
  NULL);            /* use default character tables */
2051
2052 The following names for option bits are defined in the pcre.h header
2053 file:
2054
2055 PCRE_ANCHORED
2056
2057 If this bit is set, the pattern is forced to be "anchored", that is, it
2058 is constrained to match only at the first matching point in the string
2059 that is being searched (the "subject string"). This effect can also be
2060 achieved by appropriate constructs in the pattern itself, which is the
2061 only way to do it in Perl.
2062
2063 PCRE_AUTO_CALLOUT
2064
2065 If this bit is set, pcre_compile() automatically inserts callout items,
2066 all with number 255, before each pattern item. For discussion of the
2067 callout facility, see the pcrecallout documentation.
2068
2069 PCRE_BSR_ANYCRLF
2070 PCRE_BSR_UNICODE
2071
2072 These options (which are mutually exclusive) control what the \R escape
2073 sequence matches. The choice is either to match only CR, LF, or CRLF,
2074 or to match any Unicode newline sequence. The default is specified when
2075 PCRE is built. It can be overridden from within the pattern, or by set-
2076 ting an option when a compiled pattern is matched.
2077
2078 PCRE_CASELESS
2079
2080 If this bit is set, letters in the pattern match both upper and lower
2081 case letters. It is equivalent to Perl's /i option, and it can be
2082 changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
2083 always understands the concept of case for characters whose values are
2084 less than 128, so caseless matching is always possible. For characters
2085 with higher values, the concept of case is supported if PCRE is com-
2086 piled with Unicode property support, but not otherwise. If you want to
2087 use caseless matching for characters 128 and above, you must ensure
2088 that PCRE is compiled with Unicode property support as well as with
2089 UTF-8 support.
2090
2091 PCRE_DOLLAR_ENDONLY
2092
2093 If this bit is set, a dollar metacharacter in the pattern matches only
2094 at the end of the subject string. Without this option, a dollar also
2095 matches immediately before a newline at the end of the string (but not
2096 before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
2097 if PCRE_MULTILINE is set. There is no equivalent to this option in
2098 Perl, and no way to set it within a pattern.
2099
2100 PCRE_DOTALL
2101
2102 If this bit is set, a dot metacharacter in the pattern matches a char-
2103 acter of any value, including one that indicates a newline. However, it
2104 only ever matches one character, even if newlines are coded as CRLF.
2105 Without this option, a dot does not match when the current position is
2106 at a newline. This option is equivalent to Perl's /s option, and it can
2107 be changed within a pattern by a (?s) option setting. A negative class
2108 such as [^a] always matches newline characters, independent of the set-
2109 ting of this option.
2110
2111 PCRE_DUPNAMES
2112
2113 If this bit is set, names used to identify capturing subpatterns need
2114 not be unique. This can be helpful for certain types of pattern when it
2115 is known that only one instance of the named subpattern can ever be
2116 matched. There are more details of named subpatterns below; see also
2117 the pcrepattern documentation.
2118
2119 PCRE_EXTENDED
2120
2121 If this bit is set, white space data characters in the pattern are
2122 totally ignored except when escaped or inside a character class. White
2123 space does not include the VT character (code 11). In addition, charac-
2124 ters between an unescaped # outside a character class and the next new-
2125 line, inclusive, are also ignored. This is equivalent to Perl's /x
2126 option, and it can be changed within a pattern by a (?x) option set-
2127 ting.
2128
2129 Which characters are interpreted as newlines is controlled by the
2130 options passed to pcre_compile() or by a special sequence at the start
2131 of the pattern, as described in the section entitled "Newline conven-
2132 tions" in the pcrepattern documentation. Note that the end of this type
2133 of comment is a literal newline sequence in the pattern; escape
2134 sequences that happen to represent a newline do not count.
2135
2136 This option makes it possible to include comments inside complicated
2137 patterns. Note, however, that this applies only to data characters.
2138 White space characters may never appear within special character
2139 sequences in a pattern, for example within the sequence (?( that intro-
2140 duces a conditional subpattern.
2141
2142 PCRE_EXTRA
2143
2144 This option was invented in order to turn on additional functionality
2145 of PCRE that is incompatible with Perl, but it is currently of very
2146 little use. When set, any backslash in a pattern that is followed by a
2147 letter that has no special meaning causes an error, thus reserving
2148 these combinations for future expansion. By default, as in Perl, a
2149 backslash followed by a letter with no special meaning is treated as a
2150 literal. (Perl can, however, be persuaded to give an error for this, by
2151 running it with the -w option.) There are at present no other features
2152 controlled by this option. It can also be set by a (?X) option setting
2153 within a pattern.
2154
2155 PCRE_FIRSTLINE
2156
2157 If this option is set, an unanchored pattern is required to match
2158 before or at the first newline in the subject string, though the
2159 matched text may continue over the newline.
2160
2161 PCRE_JAVASCRIPT_COMPAT
2162
2163 If this option is set, PCRE's behaviour is changed in some ways so that
2164 it is compatible with JavaScript rather than Perl. The changes are as
2165 follows:
2166
2167 (1) A lone closing square bracket in a pattern causes a compile-time
2168 error, because this is illegal in JavaScript (by default it is treated
2169 as a data character). Thus, the pattern AB]CD becomes illegal when this
2170 option is set.
2171
2172 (2) At run time, a back reference to an unset subpattern group matches
2173 an empty string (by default this causes the current matching alterna-
2174 tive to fail). A pattern such as (\1)(a) succeeds when this option is
2175 set (assuming it can find an "a" in the subject), whereas it fails by
2176 default, for Perl compatibility.
2177
2178 (3) \U matches an upper case "U" character; by default \U causes a com-
2179 pile time error (Perl uses \U to upper case subsequent characters).
2180
2181 (4) \u matches a lower case "u" character unless it is followed by four
2182 hexadecimal digits, in which case the hexadecimal number defines the
2183 code point to match. By default, \u causes a compile time error (Perl
2184 uses it to upper case the following character).
2185
2186 (5) \x matches a lower case "x" character unless it is followed by two
2187 hexadecimal digits, in which case the hexadecimal number defines the
2188 code point to match. By default, as in Perl, a hexadecimal number is
2189 always expected after \x, but it may have zero, one, or two digits (so,
2190 for example, \xz matches a binary zero character followed by z).
2191
2192 PCRE_MULTILINE
2193
2194 By default, PCRE treats the subject string as consisting of a single
2195 line of characters (even if it actually contains newlines). The "start
2196 of line" metacharacter (^) matches only at the start of the string,
2197 while the "end of line" metacharacter ($) matches only at the end of
2198 the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
2199 is set). This is the same as Perl.
2200
2201 When PCRE_MULTILINE is set, the "start of line" and "end of line"
2202 constructs match immediately following or immediately before internal
2203 newlines in the subject string, respectively, as well as at the very
2204 start and end. This is equivalent to Perl's /m option, and it can be
2205 changed within a pattern by a (?m) option setting. If there are no new-
2206 lines in a subject string, or no occurrences of ^ or $ in a pattern,
2207 setting PCRE_MULTILINE has no effect.
2208
2209 PCRE_NEVER_UTF
2210
2211 This option locks out interpretation of the pattern as UTF-8 (or UTF-16
2212 or UTF-32 in the 16-bit and 32-bit libraries). In particular, it pre-
2213 vents the creator of the pattern from switching to UTF interpretation
2214 by starting the pattern with (*UTF). This may be useful in applications
2215 that process patterns from external sources. The combination of
2216 PCRE_UTF8 and PCRE_NEVER_UTF also causes an error.
2217
2218 PCRE_NEWLINE_CR
2219 PCRE_NEWLINE_LF
2220 PCRE_NEWLINE_CRLF
2221 PCRE_NEWLINE_ANYCRLF
2222 PCRE_NEWLINE_ANY
2223
2224 These options override the default newline definition that was chosen
2225 when PCRE was built. Setting the first or the second specifies that a
2226 newline is indicated by a single character (CR or LF, respectively).
2227 Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
2228 two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
2229 that any of the three preceding sequences should be recognized. Setting
2230 PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
2231 recognized.
2232
2233 In an ASCII/Unicode environment, the Unicode newline sequences are the
2234 three just mentioned, plus the single characters VT (vertical tab,
2235 U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
2236 arator, U+2028), and PS (paragraph separator, U+2029). For the 8-bit
2237 library, the last two are recognized only in UTF-8 mode.
2238
2239 When PCRE is compiled to run in an EBCDIC (mainframe) environment, the
2240 code for CR is 0x0d, the same as ASCII. However, the character code for
2241 LF is normally 0x15, though in some EBCDIC environments 0x25 is used.
2242 Whichever of these is not LF is made to correspond to Unicode's NEL
2243 character. EBCDIC codes are all less than 256. For more details, see
2244 the pcrebuild documentation.
2245
2246 The newline setting in the options word uses three bits that are
2247 treated as a number, giving eight possibilities. Currently only six are
2248 used (default plus the five values above). This means that if you set
2249 more than one newline option, the combination may or may not be sensi-
2250 ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
2251 PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
2252 cause an error.
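
The arithmetic behind that last paragraph can be sketched in a few lines of C. This is only an illustration: the NL_* names are hypothetical, and their values are assumed to match the PCRE_NEWLINE_* definitions in the pcre.h that ships with PCRE 8.x. Real code should include pcre.h and use the library's own macros.

```c
#include <assert.h>

/* Assumed values, copied from the pcre.h shipped with PCRE 8.x. Real code
   should include <pcre.h> and use the PCRE_NEWLINE_* macros directly. */
#define NL_CR      0x00100000
#define NL_LF      0x00200000
#define NL_CRLF    0x00300000
#define NL_ANY     0x00400000
#define NL_ANYCRLF 0x00500000
#define NL_MASK    0x00700000   /* hypothetical name for the 3-bit field */

/* Return 1 if the newline field of an options word holds one of the six
   numbers currently in use (0 = build-time default), otherwise 0. */
int newline_field_is_defined(unsigned long options)
{
    switch (options & NL_MASK) {
    case 0:
    case NL_CR:
    case NL_LF:
    case NL_CRLF:
    case NL_ANY:
    case NL_ANYCRLF:
        return 1;
    default:            /* 0x00600000 and 0x00700000 are unused */
        return 0;
    }
}

/* ORing two options ORs their numbers, which is why CR with LF gives CRLF. */
void check_newline_combinations(void)
{
    assert((NL_CR | NL_LF) == NL_CRLF);
    assert(newline_field_is_defined(NL_CR | NL_LF));
    assert(!newline_field_is_defined(NL_LF | NL_ANY));   /* 0x00600000 */
}
```

Under these assumed values, a caller that builds an options word from user input could run such a check before calling pcre_compile(), rather than relying on the "inconsistent NEWLINE options" compile error.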
2253
2254 The only time that a line break in a pattern is specially recognized
2255 when compiling is when PCRE_EXTENDED is set. CR and LF are white space
2256 characters, and so are ignored in this mode. Also, an unescaped # out-
2257 side a character class indicates a comment that lasts until after the
2258 next line break sequence. In other circumstances, line break sequences
2259 in patterns are treated as literal data.
2260
2261 The newline option that is set at compile time becomes the default that
2262 is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
2263
2264 PCRE_NO_AUTO_CAPTURE
2265
2266 If this option is set, it disables the use of numbered capturing paren-
2267 theses in the pattern. Any opening parenthesis that is not followed by
2268 ? behaves as if it were followed by ?: but named parentheses can still
2269 be used for capturing (and they acquire numbers in the usual way).
2270 There is no equivalent of this option in Perl.
2271
2272 PCRE_NO_START_OPTIMIZE
2273
2274 This is an option that acts at matching time; that is, it is really an
2275 option for pcre_exec() or pcre_dfa_exec(). If it is set at compile
2276 time, it is remembered with the compiled pattern and assumed at match-
2277 ing time. This is necessary if you want to use JIT execution, because
2278 the JIT compiler needs to know whether or not this option is set. For
2279 details see the discussion of PCRE_NO_START_OPTIMIZE below.
2280
2281 PCRE_UCP
2282
2283 This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
2284 \w, and some of the POSIX character classes. By default, only ASCII
2285 characters are recognized, but if PCRE_UCP is set, Unicode properties
2286 are used instead to classify characters. More details are given in the
2287 section on generic character types in the pcrepattern page. If you set
2288 PCRE_UCP, matching one of the items it affects takes much longer. The
2289 option is available only if PCRE has been compiled with Unicode prop-
2290 erty support.
2291
2292 PCRE_UNGREEDY
2293
2294 This option inverts the "greediness" of the quantifiers so that they
2295 are not greedy by default, but become greedy if followed by "?". It is
2296 not compatible with Perl. It can also be set by a (?U) option setting
2297 within the pattern.
2298
2299 PCRE_UTF8
2300
2301 This option causes PCRE to regard both the pattern and the subject as
2302 strings of UTF-8 characters instead of single-byte strings. However, it
2303 is available only when PCRE is built to include UTF support. If not,
2304 the use of this option provokes an error. Details of how this option
2305 changes the behaviour of PCRE are given in the pcreunicode page.
2306
2307 PCRE_NO_UTF8_CHECK
2308
2309 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
2310 automatically checked. There is a discussion about the validity of
2311 UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence is
2312 found, pcre_compile() returns an error. If you already know that your
2313 pattern is valid, and you want to skip this check for performance rea-
2314 sons, you can set the PCRE_NO_UTF8_CHECK option. When it is set, the
2315 effect of passing an invalid UTF-8 string as a pattern is undefined. It
2316 may cause your program to crash. Note that this option can also be
2317 passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity
2318 checking of subject strings only. If the same string is being matched
2319 many times, the option can be safely set for the second and subsequent
2320 matchings to improve performance.
2321
2322
2323 COMPILATION ERROR CODES
2324
2325 The following table lists the error codes that may be returned by
2326 pcre_compile2(), along with the error messages that may be returned by
2327 both compiling functions. Note that error messages are always 8-bit
2328 ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed,
2329 some error codes have fallen out of use. To avoid confusion, they have
2330 not been re-used.
2331
2332 0 no error
2333 1 \ at end of pattern
2334 2 \c at end of pattern
2335 3 unrecognized character follows \
2336 4 numbers out of order in {} quantifier
2337 5 number too big in {} quantifier
2338 6 missing terminating ] for character class
2339 7 invalid escape sequence in character class
2340 8 range out of order in character class
2341 9 nothing to repeat
2342 10 [this code is not in use]
2343 11 internal error: unexpected repeat
2344 12 unrecognized character after (? or (?-
2345 13 POSIX named classes are supported only within a class
2346 14 missing )
2347 15 reference to non-existent subpattern
2348 16 erroffset passed as NULL
2349 17 unknown option bit(s) set
2350 18 missing ) after comment
2351 19 [this code is not in use]
2352 20 regular expression is too large
2353 21 failed to get memory
2354 22 unmatched parentheses
2355 23 internal error: code overflow
2356 24 unrecognized character after (?<
2357 25 lookbehind assertion is not fixed length
2358 26 malformed number or name after (?(
2359 27 conditional group contains more than two branches
2360 28 assertion expected after (?(
2361 29 (?R or (?[+-]digits must be followed by )
2362 30 unknown POSIX class name
2363 31 POSIX collating elements are not supported
2364 32 this version of PCRE is compiled without UTF support
2365 33 [this code is not in use]
2366 34 character value in \x{...} sequence is too large
2367 35 invalid condition (?(0)
2368 36 \C not allowed in lookbehind assertion
2369 37 PCRE does not support \L, \l, \N{name}, \U, or \u
2370 38 number after (?C is > 255
2371 39 closing ) for (?C expected
2372 40 recursive call could loop indefinitely
2373 41 unrecognized character after (?P
2374 42 syntax error in subpattern name (missing terminator)
2375 43 two named subpatterns have the same name
2376 44 invalid UTF-8 string (specifically UTF-8)
2377 45 support for \P, \p, and \X has not been compiled
2378 46 malformed \P or \p sequence
2379 47 unknown property name after \P or \p
2380 48 subpattern name is too long (maximum 32 characters)
2381 49 too many named subpatterns (maximum 10000)
2382 50 [this code is not in use]
2383 51 octal value is greater than \377 in 8-bit non-UTF-8 mode
2384 52 internal error: overran compiling workspace
2385 53 internal error: previously-checked referenced subpattern
2386 not found
2387 54 DEFINE group contains more than one branch
2388 55 repeating a DEFINE group is not allowed
2389 56 inconsistent NEWLINE options
2390 57 \g is not followed by a braced, angle-bracketed, or quoted
2391 name/number or by a plain number
2392 58 a numbered reference must not be zero
2393 59 an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
2394 60 (*VERB) not recognized or malformed
2395 61 number is too big
2396 62 subpattern name expected
2397 63 digit expected after (?+
2398 64 ] is an invalid data character in JavaScript compatibility mode
2399 65 different names for subpatterns of the same number are
2400 not allowed
2401 66 (*MARK) must have an argument
2402 67 this version of PCRE is not compiled with Unicode property
2403 support
2404 68 \c must be followed by an ASCII character
2405 69 \k is not followed by a braced, angle-bracketed, or quoted name
2406 70 internal error: unknown opcode in find_fixedlength()
2407 71 \N is not supported in a class
2408 72 too many forward references
2409 73 disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
2410 74 invalid UTF-16 string (specifically UTF-16)
2411 75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
2412 76 character value in \u.... sequence is too large
2413 77 invalid UTF-32 string (specifically UTF-32)
2414
2415 The numbers 32 and 10000 in errors 48 and 49 are defaults; different
2416 values may be used if the limits were changed when PCRE was built.
2417
2418
2419 STUDYING A PATTERN
2420
2421 pcre_extra *pcre_study(const pcre *code, int options,
2422 const char **errptr);
2423
2424 If a compiled pattern is going to be used several times, it is worth
2425 spending more time analyzing it in order to speed up the time taken for
2426 matching. The function pcre_study() takes a pointer to a compiled pat-
2427 tern as its first argument. If studying the pattern produces additional
2428 information that will help speed up matching, pcre_study() returns a
2429 pointer to a pcre_extra block, in which the study_data field points to
2430 the results of the study.
2431
2432 The returned value from pcre_study() can be passed directly to
2433 pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con-
2434 tains other fields that can be set by the caller before the block is
2435 passed; these are described below in the section on matching a pattern.
2436
2437 If studying the pattern does not produce any useful information,
2438 pcre_study() returns NULL by default. In that circumstance, if the
2439 calling program wants to pass any of the other fields to pcre_exec() or
2440 pcre_dfa_exec(), it must set up its own pcre_extra block. However, if
2441 pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it
2442 returns a pcre_extra block even if studying did not find any additional
2443 information. It may still return NULL, however, if an error occurs in
2444 pcre_study().
2445
2446 The second argument of pcre_study() contains option bits. There are
2447 three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
2448
2449 PCRE_STUDY_JIT_COMPILE
2450 PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
2451 PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
2452
2453 If any of these are set, and the just-in-time compiler is available,
2454 the pattern is further compiled into machine code that executes much
2455 faster than the pcre_exec() interpretive matching function. If the
2456 just-in-time compiler is not available, these options are ignored. All
2457 undefined bits in the options argument must be zero.
2458
2459 JIT compilation is a heavyweight optimization. It can take some time
2460 for patterns to be analyzed, and for one-off matches and simple pat-
2461 terns the benefit of faster execution might be offset by a much slower
2462 study time. Not all patterns can be optimized by the JIT compiler. For
2463 those that cannot be handled, matching automatically falls back to the
2464 pcre_exec() interpreter. For more details, see the pcrejit documenta-
2465 tion.
2466
2467 The third argument for pcre_study() is a pointer for an error message.
2468 If studying succeeds (even if no data is returned), the variable it
2469 points to is set to NULL. Otherwise it is set to point to a textual
2470 error message. This is a static string that is part of the library. You
2471 must not try to free it. You should test the error pointer for NULL
2472 after calling pcre_study(), to be sure that it has run successfully.
2473
2474 When you are finished with a pattern, you can free the memory used for
2475 the study data by calling pcre_free_study(). This function was added to
2476 the API for release 8.20. For earlier versions, the memory could be
2477 freed with pcre_free(), just like the pattern itself. This will still
2478 work in cases where JIT optimization is not used, but it is advisable
2479 to change to the new function when convenient.
2480
2481 This is a typical way in which pcre_study() is used (except that in a
2482 real application there should be tests for errors):
2483
2484 int rc;
2485 pcre *re;
2486 pcre_extra *sd;
2487 re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
2488 sd = pcre_study(
2489 re, /* result of pcre_compile() */
2490 0, /* no options */
2491 &error); /* set to NULL or points to a message */
2492 rc = pcre_exec( /* see below for details of pcre_exec() options */
2493 re, sd, "subject", 7, 0, 0, ovector, 30);
2494 ...
2495 pcre_free_study(sd);
2496 pcre_free(re);
2497
2498 Studying a pattern does two things: first, a lower bound for the length
2499 of subject string that is needed to match the pattern is computed. This
2500 does not mean that there are any strings of that length that match, but
2501 it does guarantee that no shorter strings match. The value is used to
2502 avoid wasting time by trying to match strings that are shorter than the
2503 lower bound. You can find out the value in a calling program via the
2504 pcre_fullinfo() function.
2505
2506 Studying a pattern is also useful for non-anchored patterns that do not
2507 have a single fixed starting character. A bitmap of possible starting
2508 bytes is created. This speeds up finding a position in the subject at
2509 which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
2510 values less than 256. In 32-bit mode, the bitmap is used for 32-bit
2511 values less than 256.)
2512
2513 These two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
2514 and the information is also used by the JIT compiler. The optimiza-
2515 tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
2516 You might want to do this if your pattern contains callouts or (*MARK)
2517 and you want to make use of these facilities in cases where matching
2518 fails.
2519
2520 PCRE_NO_START_OPTIMIZE can be specified at either compile time or
2521 execution time. However, if PCRE_NO_START_OPTIMIZE is passed to
2522 pcre_exec() (that is, after any JIT compilation has happened), JIT
2523 execution is disabled. For JIT execution to work with
2524 PCRE_NO_START_OPTIMIZE, the option must be set at compile time.
2525
2526 There is a longer discussion of PCRE_NO_START_OPTIMIZE below.
2527
2528
2529 LOCALE SUPPORT
2530
2531 PCRE handles caseless matching, and determines whether characters are
2532 letters, digits, or whatever, by reference to a set of tables, indexed
2533 by character value. When running in UTF-8 mode, this applies only to
2534 characters with codes less than 128. By default, higher-valued codes
2535 never match escapes such as \w or \d, but they can be tested with \p if
2536 PCRE is built with Unicode character property support. Alternatively,
2537 the PCRE_UCP option can be set at compile time; this causes \w and
2538 friends to use Unicode property support instead of built-in tables. The
2539 use of locales with Unicode is discouraged. If you are handling
2540 characters with codes of 128 or greater, you should either use UTF-8 and
2541 Unicode, or use locales, but not try to mix the two.
2542
2543 PCRE contains an internal set of tables that are used when the final
2544 argument of pcre_compile() is NULL. These are sufficient for many
2545 applications. Normally, the internal tables recognize only ASCII char-
2546 acters. However, when PCRE is built, it is possible to cause the inter-
2547 nal tables to be rebuilt in the default "C" locale of the local system,
2548 which may cause them to be different.
2549
2550 The internal tables can always be overridden by tables supplied by the
2551 application that calls PCRE. These may be created in a different locale
2552 from the default. As more and more applications change to using Uni-
2553 code, the need for this locale support is expected to die away.
2554
2555 External tables are built by calling the pcre_maketables() function,
2556 which has no arguments, in the relevant locale. The result can then be
2557 passed to pcre_compile() or pcre_exec() as often as necessary. For
2558 example, to build and use tables that are appropriate for the French
2559 locale (where accented characters with values greater than 128 are
2560 treated as letters), the following code could be used:
2561
2562 setlocale(LC_CTYPE, "fr_FR");
2563 tables = pcre_maketables();
2564 re = pcre_compile(..., tables);
2565
2566 The locale name "fr_FR" is used on Linux and other Unix-like systems;
2567 if you are using Windows, the name for the French locale is "french".
2568
2569 When pcre_maketables() runs, the tables are built in memory that is
2570 obtained via pcre_malloc. It is the caller's responsibility to ensure
2571 that the memory containing the tables remains available for as long as
2572 it is needed.
2573
2574 The pointer that is passed to pcre_compile() is saved with the compiled
2575 pattern, and the same tables are used via this pointer by pcre_study()
2576 and normally also by pcre_exec(). Thus, by default, for any single pat-
2577 tern, compilation, studying and matching all happen in the same locale,
2578 but different patterns can be compiled in different locales.
2579
2580 It is possible to pass a table pointer or NULL (indicating the use of
2581 the internal tables) to pcre_exec(). Although not intended for this
2582 purpose, this facility could be used to match a pattern in a different
2583 locale from the one in which it was compiled. Passing table pointers at
2584 run time is discussed below in the section on matching a pattern.
2585
2586
2587 INFORMATION ABOUT A PATTERN
2588
2589 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
2590 int what, void *where);
2591
2592 The pcre_fullinfo() function returns information about a compiled pat-
2593 tern. It replaces the pcre_info() function, which was removed from the
2594 library at version 8.30, after more than 10 years of obsolescence.
2595
2596 The first argument for pcre_fullinfo() is a pointer to the compiled
2597 pattern. The second argument is the result of pcre_study(), or NULL if
2598 the pattern was not studied. The third argument specifies which piece
2599 of information is required, and the fourth argument is a pointer to a
2600 variable to receive the data. The yield of the function is zero for
2601 success, or one of the following negative numbers:
2602
2603 PCRE_ERROR_NULL the argument code was NULL
2604 the argument where was NULL
2605 PCRE_ERROR_BADMAGIC the "magic number" was not found
2606 PCRE_ERROR_BADENDIANNESS the pattern was compiled with different
2607 endianness
2608 PCRE_ERROR_BADOPTION the value of what was invalid
2609 PCRE_ERROR_UNSET the requested field is not set
2610
2611 The "magic number" is placed at the start of each compiled pattern as
2612 a simple check against passing an arbitrary memory pointer. The
2613 endianness error can occur if a compiled pattern is saved and reloaded on a
2614 different host. Here is a typical call of pcre_fullinfo(), to obtain
2615 the length of the compiled pattern:
2616
2617 int rc;
2618 size_t length;
2619 rc = pcre_fullinfo(
2620 re, /* result of pcre_compile() */
2621 sd, /* result of pcre_study(), or NULL */
2622 PCRE_INFO_SIZE, /* what is required */
2623 &length); /* where to put the data */
2624
2625 The possible values for the third argument are defined in pcre.h, and
2626 are as follows:
2627
2628 PCRE_INFO_BACKREFMAX
2629
2630 Return the number of the highest back reference in the pattern. The
2631 fourth argument should point to an int variable. Zero is returned if
2632 there are no back references.
2633
2634 PCRE_INFO_CAPTURECOUNT
2635
2636 Return the number of capturing subpatterns in the pattern. The fourth
2637 argument should point to an int variable.
2638
2639 PCRE_INFO_DEFAULT_TABLES
2640
2641 Return a pointer to the internal default character tables within PCRE.
2642 The fourth argument should point to an unsigned char * variable. This
2643 information call is provided for internal use by the pcre_study() func-
2644 tion. External callers can cause PCRE to use its internal tables by
2645 passing a NULL table pointer.
2646
2647 PCRE_INFO_FIRSTBYTE
2648
2649 Return information about the first data unit of any matched string, for
2650 a non-anchored pattern. (The name of this option refers to the 8-bit
2651 library, where data units are bytes.) The fourth argument should point
2652 to an int variable.
2653
2654 If there is a fixed first value, for example, the letter "c" from a
2655 pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
2656 library, the value is always less than 256. In the 16-bit library the
2657 value can be up to 0xffff. In the 32-bit library the value can be up to
2658 0x10ffff.
2659
2660 If there is no fixed first value, and if either
2661
2662 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
2663 branch starts with "^", or
2664
2665 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2666 set (if it were set, the pattern would be anchored),
2667
2668 -1 is returned, indicating that the pattern matches only at the start
2669 of a subject string or after any newline within the string. Otherwise
2670 -2 is returned. For anchored patterns, -2 is returned.
2671
2672 In the 32-bit library in non-UTF-32 mode, this call cannot return the
2673 full 32-bit range of character values, so PCRE_INFO_FIRSTBYTE is
2674 deprecated; the PCRE_INFO_FIRSTCHARACTERFLAGS and
2675 PCRE_INFO_FIRSTCHARACTER values should be used instead.
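
The three kinds of result that PCRE_INFO_FIRSTBYTE can store are easy to confuse, so here is a small sketch of how a caller might classify them. The enum and function names are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>

/* Hypothetical classification of the int stored by PCRE_INFO_FIRSTBYTE. */
enum first_unit_kind {
    FIRST_UNIT_FIXED,        /* value >= 0: every match starts with it */
    FIRST_UNIT_LINE_START,   /* -1: matches only at start or after a newline */
    FIRST_UNIT_UNKNOWN       /* -2: nothing recorded, or anchored pattern */
};

enum first_unit_kind classify_first_unit(int value)
{
    if (value >= 0)
        return FIRST_UNIT_FIXED;
    if (value == -1)
        return FIRST_UNIT_LINE_START;
    assert(value == -2);     /* the only other documented result */
    return FIRST_UNIT_UNKNOWN;
}
```

For the pattern (cat|cow|coyote) mentioned above, the stored value would be the code of the letter "c", which this helper classifies as FIRST_UNIT_FIXED.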
2676
2677 PCRE_INFO_FIRSTTABLE
2678
2679 If the pattern was studied, and this resulted in the construction of a
2680 256-bit table indicating a fixed set of values for the first data unit
2681 in any matching string, a pointer to the table is returned. Otherwise
2682 NULL is returned. The fourth argument should point to an unsigned char
2683 * variable.
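
A 256-bit table occupies 32 bytes. The sketch below shows one way a caller of the 8-bit library might test a byte against such a table. The bit layout used here (bit c%8 of byte c/8) is an assumption that is not stated in this document, and both can_start_match() and the example table are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if byte c is marked as a possible first byte of a match.
   Assumption: bit (c % 8) of byte (c / 8) corresponds to value c. */
int can_start_match(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] >> (c % 8)) & 1;
}

/* Illustrative table with only 'a' (97), 'b' (98) and 'c' (99) marked:
   all three fall in byte 12, at bits 1, 2 and 3, giving 0x0E. */
static const unsigned char example_table[32] = { [12] = 0x0E };
```

In a real program the table pointer would come from a call to pcre_fullinfo() with PCRE_INFO_FIRSTTABLE, and a NULL result would have to be handled before any lookup.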
2684
2685 PCRE_INFO_HASCRORLF
2686
2687 Return 1 if the pattern contains any explicit matches for CR or LF
2688 characters, otherwise 0. The fourth argument should point to an int
2689 variable. An explicit match is either a literal CR or LF character, or
2690 \r or \n.
2691
2692 PCRE_INFO_JCHANGED
2693
2694 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2695 otherwise 0. The fourth argument should point to an int variable. (?J)
2696 and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
2697
2698 PCRE_INFO_JIT
2699
2700 Return 1 if the pattern was studied with one of the JIT options, and
2701 just-in-time compiling was successful. The fourth argument should point
2702 to an int variable. A return value of 0 means that JIT support is not
2703 available in this version of PCRE, or that the pattern was not studied
2704 with a JIT option, or that the JIT compiler could not handle this par-
2705 ticular pattern. See the pcrejit documentation for details of what can
2706 and cannot be handled.
2707
2708 PCRE_INFO_JITSIZE
2709
2710 If the pattern was successfully studied with a JIT option, return the
2711 size of the JIT compiled code, otherwise return zero. The fourth argu-
2712 ment should point to a size_t variable.
2713
2714 PCRE_INFO_LASTLITERAL
2715
2716 Return the value of the rightmost literal data unit that must exist in
2717 any matched string, other than at its start, if such a value has been
2718 recorded. The fourth argument should point to an int variable. If there
2719 is no such value, -1 is returned. For anchored patterns, a last literal
2720 value is recorded only if it follows something of variable length. For
2721 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2722 /^a\dz\d/ the returned value is -1.
2723
2724 In the 32-bit library in non-UTF-32 mode, this call cannot return the
2725 full 32-bit range of character values, so PCRE_INFO_LASTLITERAL is
2726 deprecated; the PCRE_INFO_REQUIREDCHARFLAGS and
2727 PCRE_INFO_REQUIREDCHAR values should be used instead.
2728
2729 PCRE_INFO_MATCHLIMIT
2730
2731 If the pattern set a match limit by including an item of the form
2732 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth
2733 argument should point to an unsigned 32-bit integer. If no such value
2734 has been set, the call to pcre_fullinfo() returns the error
2735 PCRE_ERROR_UNSET.
2736
2737 PCRE_INFO_MAXLOOKBEHIND
2738
2739 Return the number of characters (NB not bytes) in the longest lookbe-
2740 hind assertion in the pattern. This information is useful when doing
2741 multi-segment matching using the partial matching facilities. Note that
2742 the simple assertions \b and \B require a one-character lookbehind. \A
2743 also registers a one-character lookbehind, though it does not actually
2744 inspect the previous character. This is to ensure that at least one
2745 character from the old segment is retained when a new segment is pro-
2746 cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
2747 match incorrectly at the start of a new segment.
2748
2749 PCRE_INFO_MINLENGTH
2750
2751 If the pattern was studied and a minimum length for matching subject
2752 strings was computed, its value is returned. Otherwise the returned
2753 value is -1. The value is a number of characters, which in UTF-8 mode
2754 may be different from the number of bytes. The fourth argument should
2755 point to an int variable. A non-negative value is a lower bound to the
2756 length of any matching string. There may not be any strings of that
2757 length that do actually match, but every string that does match is at
2758 least that long.
2759
2760 PCRE_INFO_NAMECOUNT
2761 PCRE_INFO_NAMEENTRYSIZE
2762 PCRE_INFO_NAMETABLE
2763
2764 PCRE supports the use of named as well as numbered capturing parenthe-
2765 ses. The names are just an additional way of identifying the parenthe-
2766 ses, which still acquire numbers. Several convenience functions such as
2767 pcre_get_named_substring() are provided for extracting captured sub-
2768 strings by name. It is also possible to extract the data directly, by
2769 first converting the name to a number in order to access the correct
2770 pointers in the output vector (described with pcre_exec() below). To do
2771 the conversion, you need to use the name-to-number map, which is
2772 described by these three values.
2773
2774 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
2775 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
2776 of each entry; both of these return an int value. The entry size
2777 depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
2778 a pointer to the first entry of the table. This is a pointer to char in
2779 the 8-bit library, where the first two bytes of each entry are the num-
2780 ber of the capturing parenthesis, most significant byte first. In the
2781 16-bit library, the pointer points to 16-bit data units, the first of
2782 which contains the parenthesis number. In the 32-bit library, the
2783 pointer points to 32-bit data units, the first of which contains the
2784 parenthesis number. The rest of the entry is the corresponding name,
2785 zero terminated.
2786
2787 The names are in alphabetical order. Duplicate names may appear if (?|
2788 is used to create multiple groups with the same number, as described in
2789 the section on duplicate subpattern numbers in the pcrepattern page.
2790 Duplicate names for subpatterns with different numbers are permitted
2791 only if PCRE_DUPNAMES is set. In all cases of duplicate names, they
2792 appear in the table in the order in which they were found in the pat-
2793 tern. In the absence of (?| this is the order of increasing number;
2794 when (?| is used this is not necessarily the case because later subpat-
2795 terns may have lower numbers.
2796
2797 As a simple example of the name/number table, consider the following
2798 pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
2799 set, so white space - including newlines - is ignored):
2800
2801 (?<date> (?<year>(\d\d)?\d\d) -
2802 (?<month>\d\d) - (?<day>\d\d) )
2803
2804 There are four named subpatterns, so the table has four entries, and
2805 each entry in the table is eight bytes long. The table is as follows,
2806 with non-printing bytes shown in hexadecimal, and undefined bytes shown
2807 as ??:
2808
2809 00 01 d a t e 00 ??
2810 00 05 d a y 00 ?? ??
2811 00 04 m o n t h 00
2812 00 02 y e a r 00 ??
2813
2814 When writing code to extract data from named subpatterns using the
2815 name-to-number map, remember that the length of the entries is likely
2816 to be different for each compiled pattern.
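
A minimal sketch of such a lookup for the 8-bit library follows. The function name_to_number() is hypothetical. In a real program its first three arguments would come from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE; here a hand-built copy of the table shown above stands in for them, with the undefined bytes set to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Search an 8-bit name table for a group name. Each entry is entrysize
   bytes: a two-byte group number (most significant byte first) followed
   by the zero-terminated name. Returns the group number, or -1. */
int name_to_number(const char *table, int namecount, int entrysize,
                   const char *name)
{
    int i;

    assert(table != NULL && name != NULL);

    for (i = 0; i < namecount; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entrysize;
        if (strcmp((const char *)entry + 2, name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Hand-built copy of the four-entry example table (8 bytes per entry). */
static const char date_table[] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

A linear scan keeps the sketch short. Because the names are stored in alphabetical order, a binary search would also work. When duplicate names are present, this version returns the number from the first matching entry only.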
2817
2818 PCRE_INFO_OKPARTIAL
2819
2820 Return 1 if the pattern can be used for partial matching with
2821 pcre_exec(), otherwise 0. The fourth argument should point to an int
2822 variable. From release 8.00, this always returns 1, because the
2823 restrictions that previously applied to partial matching have been
2824 lifted. The pcrepartial documentation gives details of partial match-
2825 ing.
2826
2827 PCRE_INFO_OPTIONS
2828
2829 Return a copy of the options with which the pattern was compiled. The
2830 fourth argument should point to an unsigned long int variable. These
2831 option bits are those specified in the call to pcre_compile(), modified
2832 by any top-level option settings at the start of the pattern itself. In
2833 other words, they are the options that will be in force when matching
2834 starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
2835 the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
2836 and PCRE_EXTENDED.
2837
2838 A pattern is automatically anchored by PCRE if all of its top-level
2839 alternatives begin with one of the following:
2840
2841 ^ unless PCRE_MULTILINE is set
2842 \A always
2843 \G always
2844 .* if PCRE_DOTALL is set and there are no back
2845 references to the subpattern in which .* appears
2846
2847 For such patterns, the PCRE_ANCHORED bit is set in the options returned
2848 by pcre_fullinfo().
2849
2850 PCRE_INFO_RECURSIONLIMIT
2851
2852 If the pattern set a recursion limit by including an item of the form
2853 (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
2854 argument should point to an unsigned 32-bit integer. If no such value
2855 has been set, the call to pcre_fullinfo() returns the error
2856 PCRE_ERROR_UNSET.
2857
2858 PCRE_INFO_SIZE
2859
2860 Return the size of the compiled pattern in bytes (for all three libraries).
2861 The fourth argument should point to a size_t variable. This value does
2862 not include the size of the pcre structure that is returned by
2863 pcre_compile(). The value that is passed as the argument to pcre_mal-
2864 loc() when pcre_compile() is getting memory in which to place the com-
2865 piled data is the value returned by this option plus the size of the
2866 pcre structure. Studying a compiled pattern, with or without JIT, does
2867 not alter the value returned by this option.
2868
2869 PCRE_INFO_STUDYSIZE
2870
2871 Return the size in bytes of the data block pointed to by the study_data
2872 field in a pcre_extra block. If pcre_extra is NULL, or there is no
2873 study data, zero is returned. The fourth argument should point to a
2874 size_t variable. The study_data field is set by pcre_study() to record
2875 information that will speed up matching (see the section entitled
2876 "Studying a pattern" above). The format of the study_data block is pri-
2877 vate, but its length is made available via this option so that it can
2878 be saved and restored (see the pcreprecompile documentation for
2879 details).
2880
2881 PCRE_INFO_FIRSTCHARACTERFLAGS
2882
2883 Return information about the first data unit of any matched string, for
2884 a non-anchored pattern. The fourth argument should point to an int
2885 variable.
2886
2887 If there is a fixed first value, for example, the letter "c" from a
2888 pattern such as (cat|cow|coyote), 1 is returned, and the character
2889 value can be retrieved using PCRE_INFO_FIRSTCHARACTER.
2890
2891 If there is no fixed first value, and if either
2892
2893 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
2894 branch starts with "^", or
2895
2896 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2897 set (if it were set, the pattern would be anchored),
2898
2899 2 is returned, indicating that the pattern matches only at the start of
2900 a subject string or after any newline within the string. Otherwise 0 is
2901 returned. For anchored patterns, 0 is returned.
2902
2903 PCRE_INFO_FIRSTCHARACTER
2904
2905 Return the fixed first character value, if PCRE_INFO_FIRSTCHARACTER-
2906 FLAGS returned 1; otherwise returns 0. The fourth argument should point
2907 to a uint32_t variable.
2908
2909 In the 8-bit library, the value is always less than 256. In the 16-bit
2910 library the value can be up to 0xffff. In the 32-bit library in UTF-32
2911 mode the value can be up to 0x10ffff, and up to 0xffffffff when not
2912 using UTF-32 mode.
2913
2914 If there is no fixed first value, and if either
2915
2916 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
2917 branch starts with "^", or
2918
2919 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2920 set (if it were set, the pattern would be anchored),
2921
2922 -1 is returned, indicating that the pattern matches only at the start
2923 of a subject string or after any newline within the string. Otherwise
2924 -2 is returned. For anchored patterns, -2 is returned.
2925
2926 PCRE_INFO_REQUIREDCHARFLAGS
2927
2928 Returns 1 if there is a rightmost literal data unit that must exist in
2929 any matched string, other than at its start. The fourth argument should
2930 point to an int variable. If there is no such value, 0 is returned. If
2931 returning 1, the character value itself can be retrieved using
2932 PCRE_INFO_REQUIREDCHAR.
2933
2934 For anchored patterns, a last literal value is recorded only if it fol-
2935 lows something of variable length. For example, for the pattern
2936 /^a\d+z\d+/ the returned value is 1 (with "z" returned from
2937 PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
2938
2939 PCRE_INFO_REQUIREDCHAR
2940
2941 Return the value of the rightmost literal data unit that must exist in
2942 any matched string, other than at its start, if such a value has been
2943 recorded. The fourth argument should point to a uint32_t variable. If
2944 there is no such value, 0 is returned.
2945
2946
2947 REFERENCE COUNTS
2948
2949 int pcre_refcount(pcre *code, int adjust);
2950
2951 The pcre_refcount() function is used to maintain a reference count in
2952 the data block that contains a compiled pattern. It is provided for the
2953 benefit of applications that operate in an object-oriented manner,
2954 where different parts of the application may be using the same compiled
2955 pattern, but you want to free the block when they are all done.
2956
2957 When a pattern is compiled, the reference count field is initialized to
2958 zero. It is changed only by calling this function, whose action is to
2959 add the adjust value (which may be positive or negative) to it. The
2960 yield of the function is the new value. However, the value of the count
2961 is constrained to lie between 0 and 65535, inclusive. If the new value
2962 is outside these limits, it is forced to the appropriate limit value.
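
The clamping rule can be illustrated with a small stand-alone model. This is only a sketch of the behaviour described above, not the library's code; the function name model_refcount_adjust() is hypothetical, and the real count lives inside the compiled pattern block.

```c
#include <assert.h>

/* Hypothetical model of the documented rule: add adjust to the count and
   force the result into the range 0 to 65535 inclusive. */
static int model_refcount_adjust(int count, int adjust)
{
  assert(count >= 0 && count <= 65535);
  /* Compare before adding, so that the sum cannot overflow an int. */
  if (adjust > 65535 - count) return 65535;
  if (adjust < -count) return 0;
  return count + adjust;
}
```

One consequence of the lower limit is that an unbalanced decrement leaves the count at zero rather than making it negative, so a caller that frees the block when the yield reaches zero should make sure that every decrement is paired with an earlier increment.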
2963
2964 Except when it is zero, the reference count is not correctly preserved
2965 if a pattern is compiled on one host and then transferred to a host
2966 whose byte-order is different. (This seems a highly unlikely scenario.)
2967
2968
2969 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2970
2971 int pcre_exec(const pcre *code, const pcre_extra *extra,
2972 const char *subject, int length, int startoffset,
2973 int options, int *ovector, int ovecsize);
2974
2975 The function pcre_exec() is called to match a subject string against a
2976 compiled pattern, which is passed in the code argument. If the pattern
2977 was studied, the result of the study should be passed in the extra
2978 argument. You can call pcre_exec() with the same code and extra argu-
2979 ments as many times as you like, in order to match different subject
2980 strings with the same pattern.
2981
2982 This function is the main matching facility of the library, and it
2983 operates in a Perl-like manner. For specialist use there is also an
2984 alternative matching function, which is described below in the section
2985 about the pcre_dfa_exec() function.
2986
2987 In most applications, the pattern will have been compiled (and option-
2988 ally studied) in the same process that calls pcre_exec(). However, it
2989 is possible to save compiled patterns and study data, and then use them
2990 later in different processes, possibly even on different hosts. For a
2991 discussion about this, see the pcreprecompile documentation.
2992
2993 Here is an example of a simple call to pcre_exec():
2994
2995 int rc;
2996 int ovector[30];
2997 rc = pcre_exec(
2998 re, /* result of pcre_compile() */
2999 NULL, /* we didn't study the pattern */
3000 "some string", /* the subject string */
3001 11, /* the length of the subject string */
3002 0, /* start at offset 0 in the subject */
3003 0, /* default options */
3004 ovector, /* vector of integers for substring information */
3005 30); /* number of elements (NOT size in bytes) */
3006
3007 Extra data for pcre_exec()
3008
3009 If the extra argument is not NULL, it must point to a pcre_extra data
3010 block. The pcre_study() function returns such a block (when it doesn't
3011 return NULL), but you can also create one for yourself, and pass addi-
3012 tional information in it. The pcre_extra block contains the following
3013 fields (not necessarily in this order):
3014
3015 unsigned long int flags;
3016 void *study_data;
3017 void *executable_jit;
3018 unsigned long int match_limit;
3019 unsigned long int match_limit_recursion;
3020 void *callout_data;
3021 const unsigned char *tables;
3022 unsigned char **mark;
3023
3024 In the 16-bit version of this structure, the mark field has type
3025 "PCRE_UCHAR16 **".
3026
3027 In the 32-bit version of this structure, the mark field has type
3028 "PCRE_UCHAR32 **".
3029
3030 The flags field is used to specify which of the other fields are set.
3031 The flag bits are:
3032
3033 PCRE_EXTRA_CALLOUT_DATA
3034 PCRE_EXTRA_EXECUTABLE_JIT
3035 PCRE_EXTRA_MARK
3036 PCRE_EXTRA_MATCH_LIMIT
3037 PCRE_EXTRA_MATCH_LIMIT_RECURSION
3038 PCRE_EXTRA_STUDY_DATA
3039 PCRE_EXTRA_TABLES
3040
3041 Other flag bits should be set to zero. The study_data field and some-
3042 times the executable_jit field are set in the pcre_extra block that is
3043 returned by pcre_study(), together with the appropriate flag bits. You
3044 should not set these yourself, but you may add to the block by setting
3045 other fields and their corresponding flag bits.
3046
3047 The match_limit field provides a means of preventing PCRE from using up
3048 a vast amount of resources when running patterns that are not going to
3049 match, but which have a very large number of possibilities in their
3050 search trees. The classic example is a pattern that uses nested unlim-
3051 ited repeats.
3052
3053 Internally, pcre_exec() uses a function called match(), which it calls
3054 repeatedly (sometimes recursively). The limit set by match_limit is
3055 imposed on the number of times this function is called during a match,
3056 which has the effect of limiting the amount of backtracking that can
3057 take place. For patterns that are not anchored, the count restarts from
3058 zero for each position in the subject string.
3059
3060 When pcre_exec() is called with a pattern that was successfully studied
3061 with a JIT option, the way that the matching is executed is entirely
3062 different. However, there is still the possibility of runaway matching
3063 that goes on for a very long time, and so the match_limit value is also
3064 used in this case (but in a different way) to limit how long the match-
3065 ing can continue.
3066
3067 The default value for the limit can be set when PCRE is built; the
3068 default for that setting is 10 million, which handles all but the most
3069 extreme cases. You can override the default by supplying pcre_exec() with a
3070 pcre_extra block in which match_limit is set, and
3071 PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
3072 exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
3073
3074 A value for the match limit may also be supplied by an item at the
3075 start of a pattern of the form
3076
3077 (*LIMIT_MATCH=d)
3078
3079 where d is a decimal number. However, such a setting is ignored unless
3080 d is less than the limit set by the caller of pcre_exec() or, if no
3081 such limit is set, less than the default.
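
The way the three sources of a match limit combine can be modelled as follows. This is a sketch of the rule as stated above, not code taken from PCRE; the function and parameter names are hypothetical.

```c
#include <assert.h>

/* Hypothetical model: the caller's limit (if one is set) replaces the
   build-time default, and a limit given in the pattern can only lower the
   result, never raise it. */
static unsigned long effective_match_limit(unsigned long default_limit,
                                           int caller_set,
                                           unsigned long caller_limit,
                                           int pattern_set,
                                           unsigned long pattern_limit)
{
  unsigned long limit;
  assert(default_limit > 0);    /* the model assumes a positive default */
  limit = caller_set ? caller_limit : default_limit;
  if (pattern_set && pattern_limit < limit)
    limit = pattern_limit;
  return limit;
}
```

The practical effect is that a pattern from an untrusted source cannot use (*LIMIT_MATCH=d) to obtain more resources than the application has allowed.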
3082
3083 The match_limit_recursion field is similar to match_limit, but instead
3084 of limiting the total number of times that match() is called, it limits
3085 the depth of recursion. The recursion depth is a smaller number than
3086 the total number of calls, because not all calls to match() are recur-
3087 sive. This limit is of use only if it is set smaller than match_limit.
3088
3089 Limiting the recursion depth limits the amount of machine stack that
3090 can be used, or, when PCRE has been compiled to use memory on the heap
3091 instead of the stack, the amount of heap memory that can be used. This
3092 limit is not relevant, and is ignored, when matching is done using JIT
3093 compiled code.
3094
3095 The default value for match_limit_recursion can be set when PCRE is
3096 built; if it is not set at build time, it takes the same value as the
3097 default for match_limit. You can override it by supplying pcre_exec() with
3098 a pcre_extra block in which match_limit_recursion is set, and
3099 PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
3100 limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
3101
3102 A value for the recursion limit may also be supplied by an item at the
3103 start of a pattern of the form
3104
3105 (*LIMIT_RECURSION=d)
3106
3107 where d is a decimal number. However, such a setting is ignored unless
3108 d is less than the limit set by the caller of pcre_exec() or, if no
3109 such limit is set, less than the default.
3110
3111 The callout_data field is used in conjunction with the "callout" fea-
3112 ture, and is described in the pcrecallout documentation.
3113
3114 The tables field is used to pass a character tables pointer to
3115 pcre_exec(); this overrides the value that is stored with the compiled
3116 pattern. A non-NULL value is stored with the compiled pattern only if
3117 custom tables were supplied to pcre_compile() via its tableptr argu-
3118 ment. If NULL is passed to pcre_exec() using this mechanism, it forces
3119 PCRE's internal tables to be used. This facility is helpful when re-
3120 using patterns that have been saved after compiling with an external
3121 set of tables, because the external tables might be at a different
3122 address when pcre_exec() is called. See the pcreprecompile documenta-
3123 tion for a discussion of saving compiled patterns for later use.
3124
3125 If PCRE_EXTRA_MARK is set in the flags field, the mark field must be
3126 set to point to a suitable variable. If the pattern contains any back-
3127 tracking control verbs such as (*MARK:NAME), and the execution ends up
3128 with a name to pass back, a pointer to the name string (zero termi-
3129 nated) is placed in the variable pointed to by the mark field. The
3130 names are within the compiled pattern; if you wish to retain such a
3131 name you must copy it before freeing the memory of a compiled pattern.
3132 If there is no name to pass back, the variable pointed to by the mark
3133 field is set to NULL. For details of the backtracking control verbs,
3134 see the section entitled "Backtracking control" in the pcrepattern doc-
3135 umentation.
3136
3137 Option bits for pcre_exec()
3138
3139 The unused bits of the options argument for pcre_exec() must be zero.
3140 The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
3141 PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
3142 PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and
3143 PCRE_PARTIAL_SOFT.
3144
3145 If the pattern was successfully studied with one of the just-in-time
3146 (JIT) compile options, the only supported options for JIT execution are
3147 PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
3148 PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
3149 unsupported option is used, JIT execution is disabled and the normal
3150 interpretive code in pcre_exec() is run.
3151
3152 PCRE_ANCHORED
3153
3154 The PCRE_ANCHORED option limits pcre_exec() to matching at the first
3155 matching position. If a pattern was compiled with PCRE_ANCHORED, or
3156 turned out to be anchored by virtue of its contents, it cannot be made
3157 unanchored at matching time.
3158
3159 PCRE_BSR_ANYCRLF
3160 PCRE_BSR_UNICODE
3161
3162 These options (which are mutually exclusive) control what the \R escape
3163 sequence matches. The choice is either to match only CR, LF, or CRLF,
3164 or to match any Unicode newline sequence. These options override the
3165 choice that was made or defaulted when the pattern was compiled.
3166
3167 PCRE_NEWLINE_CR
3168 PCRE_NEWLINE_LF
3169 PCRE_NEWLINE_CRLF
3170 PCRE_NEWLINE_ANYCRLF
3171 PCRE_NEWLINE_ANY
3172
3173 These options override the newline definition that was chosen or
3174 defaulted when the pattern was compiled. For details, see the descrip-
3175 tion of pcre_compile() above. During matching, the newline choice
3176 affects the behaviour of the dot, circumflex, and dollar metacharac-
3177 ters. It may also alter the way the match position is advanced after a
3178 match failure for an unanchored pattern.
3179
3180 When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
3181 set, and a match attempt for an unanchored pattern fails when the cur-
3182 rent position is at a CRLF sequence, and the pattern contains no
3183 explicit matches for CR or LF characters, the match position is
3184 advanced by two characters instead of one, in other words, to after the
3185 CRLF.
3186
3187 The above rule is a compromise that makes the most common cases work as
3188 expected. For example, if the pattern is .+A (and the PCRE_DOTALL
3189 option is not set), it does not match the string "\r\nA" because, after
3190 failing at the start, it skips both the CR and the LF before retrying.
3191 However, the pattern [\r\n]A does match that string, because it con-
3192 tains an explicit CR or LF reference, and so advances only by one char-
3193 acter after the first failure.
3194
3195 An explicit match for CR or LF is either a literal appearance of one of
3196 those characters, or one of the \r or \n escape sequences. Implicit
3197 matches such as [^X] do not count, nor does \s (which includes CR and
3198 LF in the characters that it matches).
3199
3200 Notwithstanding the above, anomalous effects may still occur when CRLF
3201 is a valid newline sequence and explicit \r or \n escapes appear in the
3202 pattern.
3203
3204 PCRE_NOTBOL
3205
3206 This option specifies that the first character of the subject string is not
3207 the beginning of a line, so the circumflex metacharacter should not
3208 match before it. Setting this without PCRE_MULTILINE (at compile time)
3209 causes circumflex never to match. This option affects only the behav-
3210 iour of the circumflex metacharacter. It does not affect \A.
3211
3212 PCRE_NOTEOL
3213
3214 This option specifies that the end of the subject string is not the end
3215 of a line, so the dollar metacharacter should not match it nor (except
3216 in multiline mode) a newline immediately before it. Setting this with-
3217 out PCRE_MULTILINE (at compile time) causes dollar never to match. This
3218 option affects only the behaviour of the dollar metacharacter. It does
3219 not affect \Z or \z.
3220
3221 PCRE_NOTEMPTY
3222
3223 An empty string is not considered to be a valid match if this option is
3224 set. If there are alternatives in the pattern, they are tried. If all
3225 the alternatives match the empty string, the entire match fails. For
3226 example, if the pattern
3227
3228 a?b?
3229
3230 is applied to a string not beginning with "a" or "b", it matches an
3231 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
3232 match is not valid, so PCRE searches further into the string for occur-
3233 rences of "a" or "b".
3234
3235 PCRE_NOTEMPTY_ATSTART
3236
3237 This is like PCRE_NOTEMPTY, except that an empty string match that is
3238 not at the start of the subject is permitted. If the pattern is
3239 anchored, such a match can occur only if the pattern contains \K.
3240
3241 Perl has no direct equivalent of PCRE_NOTEMPTY or
3242 PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
3243 match of the empty string within its split() function, and when using
3244 the /g modifier. It is possible to emulate Perl's behaviour after
3245 matching a null string by first trying the match again at the same off-
3246 set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
3247 fails, by advancing the starting offset (see below) and trying an ordi-
3248 nary match again. There is some code that demonstrates how to do this
3249 in the pcredemo sample program. In the most general case, you have to
3250 check to see if the newline convention recognizes CRLF as a newline,
3251 and if so, and the current character is CR followed by LF, advance the
3252 starting offset by two characters instead of one.
3253
3254 PCRE_NO_START_OPTIMIZE
3255
3256 There are a number of optimizations that pcre_exec() uses at the start
3257 of a match, in order to speed up the process. For example, if it is
3258 known that an unanchored match must start with a specific character, it
3259 searches the subject for that character, and fails immediately if it
3260 cannot find it, without actually running the main matching function.
3261 This means that a special item such as (*COMMIT) at the start of a pat-
3262 tern is not considered until after a suitable starting point for the
3263 match has been found. Also, when callouts or (*MARK) items are in use,
3264 these "start-up" optimizations can cause them to be skipped if the pat-
3265 tern is never actually used. The start-up optimizations are in effect a
3266 pre-scan of the subject that takes place before the pattern is run.
3267
3268 The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
3269 possibly causing performance to suffer, but ensuring that in cases
3270 where the result is "no match", the callouts do occur, and that items
3271 such as (*COMMIT) and (*MARK) are considered at every possible starting
3272 position in the subject string. If PCRE_NO_START_OPTIMIZE is set at
3273 compile time, it cannot be unset at matching time. The use of
3274 PCRE_NO_START_OPTIMIZE at matching time (that is, passing it to
3275 pcre_exec()) disables JIT execution; in this situation, matching is
3276 always done interpretively.
3277
3278 Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching
3279 operation. Consider the pattern
3280
3281 (*COMMIT)ABC
3282
3283 When this is compiled, PCRE records the fact that a match must start
3284 with the character "A". Suppose the subject string is "DEFABC". The
3285 start-up optimization scans along the subject, finds "A" and runs the
3286 first match attempt from there. The (*COMMIT) item means that the pat-
3287 tern must match the current starting position, which in this case, it
3288 does. However, if the same match is run with PCRE_NO_START_OPTIMIZE
3289 set, the initial scan along the subject string does not happen. The
3290 first match attempt is run starting from "D" and when this fails,
3291 (*COMMIT) prevents any further matches being tried, so the overall
3292 result is "no match". If the pattern is studied, more start-up opti-
3293 mizations may be used. For example, a minimum length for the subject
3294 may be recorded. Consider the pattern
3295
3296 (*MARK:A)(X|Y)
3297
3298 The minimum length for a match is one character. If the subject is
3299 "ABC", there will be attempts to match "ABC", "BC", "C", and then
3300 finally an empty string. If the pattern is studied, the final attempt
3301 does not take place, because PCRE knows that the subject is too short,
3302 and so the (*MARK) is never encountered. In this case, studying the
3303 pattern does not affect the overall match result, which is still "no
3304 match", but it does affect the auxiliary information that is returned.
3305
3306 PCRE_NO_UTF8_CHECK
3307
3308 When PCRE_UTF8 is set at compile time, the validity of the subject as a
3309 UTF-8 string is automatically checked when pcre_exec() is subsequently
3310 called. The entire string is checked before any other processing takes
3311 place. The value of startoffset is also checked to ensure that it
3312 points to the start of a UTF-8 character. There is a discussion about
3313 the validity of UTF-8 strings in the pcreunicode page. If an invalid
3314 sequence of bytes is found, pcre_exec() returns the error
3315 PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
3316 truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
3317 both cases, information about the precise nature of the error may also
3318 be returned (see the descriptions of these errors in the section enti-
3319 tled "Error return values from pcre_exec()" below). If startoffset con-
3320 tains a value that does not point to the start of a UTF-8 character (or
3321 to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
3322
3323 If you already know that your subject is valid, and you want to skip
3324 these checks for performance reasons, you can set the
3325 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
3326 do this for the second and subsequent calls to pcre_exec() if you are
3327 making repeated calls to find all the matches in a single subject
3328 string. However, you should be sure that the value of startoffset
3329 points to the start of a character (or the end of the subject). When
3330 PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
3331 subject or an invalid value of startoffset is undefined. Your program
3332 may crash.
3333
3334 PCRE_PARTIAL_HARD
3335 PCRE_PARTIAL_SOFT
3336
3337 These options turn on the partial matching feature. For backwards com-
3338 patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
3339 match occurs if the end of the subject string is reached successfully,
3340 but there are not enough subject characters to complete the match. If
3341 this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
3342 matching continues by testing any remaining alternatives. Only if no
3343 complete match can be found is PCRE_ERROR_PARTIAL returned instead of
3344 PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
3345 caller is prepared to handle a partial match, but only if no complete
3346 match can be found.
3347
3348 If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
3349 case, if a partial match is found, pcre_exec() immediately returns
3350 PCRE_ERROR_PARTIAL, without considering any other alternatives. In
3351 other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
3352 ered to be more important than an alternative complete match.
3353
3354 In both cases, the portion of the string that was inspected when the
3355 partial match was found is set as the first matching string. There is a
3356 more detailed discussion of partial and multi-segment matching, with
3357 examples, in the pcrepartial documentation.
3358
3359 The string to be matched by pcre_exec()
3360
3361 The subject string is passed to pcre_exec() as a pointer in subject, a
3362 length in bytes in length, and a starting byte offset in startoffset.
3363 If this is negative or greater than the length of the subject,
3364 pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
3365 zero, the search for a match starts at the beginning of the subject,
3366 and this is by far the most common case. In UTF-8 mode, the byte offset
3367 must point to the start of a UTF-8 character (or the end of the sub-
3368 ject). Unlike the pattern string, the subject may contain binary zero
3369 bytes.
3370
3371 A non-zero starting offset is useful when searching for another match
3372 in the same subject by calling pcre_exec() again after a previous suc-
3373 cess. Setting startoffset differs from just passing over a shortened
3374 string and setting PCRE_NOTBOL in the case of a pattern that begins
3375 with any kind of lookbehind. For example, consider the pattern
3376
3377 \Biss\B
3378
3379 which finds occurrences of "iss" in the middle of words. (\B matches
3380 only if the current position in the subject is not a word boundary.)
3381 When applied to the string "Mississippi" the first call to pcre_exec()
3382 finds the first occurrence. If pcre_exec() is called again with just
3383 the remainder of the subject, namely "issippi", it does not match,
3384 because \B is always false at the start of the subject, which is deemed
3385 to be a word boundary. However, if pcre_exec() is passed the entire
3386 string again, but with startoffset set to 4, it finds the second occur-
3387 rence of "iss" because it is able to look behind the starting point to
3388 discover that it is preceded by a letter.
3389
3390 Finding all the matches in a subject is tricky when the pattern can
3391 match an empty string. It is possible to emulate Perl's /g behaviour by
3392 first trying the match again at the same offset, with the
3393 PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
3394 fails, advancing the starting offset and trying an ordinary match
3395 again. There is some code that demonstrates how to do this in the pcre-
3396 demo sample program. In the most general case, you have to check to see
3397 if the newline convention recognizes CRLF as a newline, and if so, and
3398 the current character is CR followed by LF, advance the starting offset
3399 by two characters instead of one.
3400
3401 If a non-zero starting offset is passed when the pattern is anchored,
3402 one attempt to match at the given offset is made. This can only succeed
3403 if the pattern does not require the match to be at the start of the
3404 subject.
3405
3406 How pcre_exec() returns captured substrings
3407
3408 In general, a pattern matches a certain portion of the subject, and in
3409 addition, further substrings from the subject may be picked out by
3410 parts of the pattern. Following the usage in Jeffrey Friedl's book,
3411 this is called "capturing" in what follows, and the phrase "capturing
3412 subpattern" is used for a fragment of a pattern that picks out a sub-
3413 string. PCRE supports several other kinds of parenthesized subpattern
3414 that do not cause substrings to be captured.
3415
3416 Captured substrings are returned to the caller via a vector of integers
3417 whose address is passed in ovector. The number of elements in the vec-
3418 tor is passed in ovecsize, which must be a non-negative number. Note:
3419 this argument is NOT the size of ovector in bytes.
3420
3421 The first two-thirds of the vector is used to pass back captured sub-
3422 strings, each substring using a pair of integers. The remaining third
3423 of the vector is used as workspace by pcre_exec() while matching cap-
3424 turing subpatterns, and is not available for passing back information.
3425 The number passed in ovecsize should always be a multiple of three. If
3426 it is not, it is rounded down.
3427
3428 When a match is successful, information about captured substrings is
3429 returned in pairs of integers, starting at the beginning of ovector,
3430 and continuing up to two-thirds of its length at the most. The first
3431 element of each pair is set to the byte offset of the first character
3432 in a substring, and the second is set to the byte offset of the first
3433 character after the end of a substring. Note: these values are always
3434 byte offsets, even in UTF-8 mode. They are not character counts.
3435
3436 The first pair of integers, ovector[0] and ovector[1], identify the
3437 portion of the subject string matched by the entire pattern. The next
3438 pair is used for the first capturing subpattern, and so on. The value
3439 returned by pcre_exec() is one more than the highest numbered pair that
3440 has been set. For example, if two substrings have been captured, the
3441 returned value is 3. If there are no capturing subpatterns, the return
3442 value from a successful match is 1, indicating that just the first pair
3443 of offsets has been set.
3444
3445 If a capturing subpattern is matched repeatedly, it is the last portion
3446 of the string that it matched that is returned.
3447
3448 If the vector is too small to hold all the captured substring offsets,
3449 it is used as far as possible (up to two-thirds of its length), and the
3450 function returns a value of zero. If neither the actual string matched
3451 nor any captured substrings are of interest, pcre_exec() may be called
3452 with ovector passed as NULL and ovecsize as zero. However, if the pat-
3453 tern contains back references and the ovector is not big enough to
3454 remember the related substrings, PCRE has to get additional memory for
3455 use during matching. Thus it is usually advisable to supply an ovector
3456 of reasonable size.
3457
3458 There are some cases where zero is returned (indicating vector over-
3459 flow) when in fact the vector is exactly the right size for the final
3460 match. For example, consider the pattern
3461
3462 (a)(?:(b)c|bd)
3463
3464 If a vector of 6 elements (allowing for only 1 captured substring) is
3465 given with subject string "abd", pcre_exec() will try to set the second
3466 captured string, thereby recording a vector overflow, before failing to
3467 match "c" and backing up to try the second alternative. The zero
3468 return, however, does correctly indicate that the maximum number of
3469 slots (namely 2) have been filled. In similar cases where there is tem-
3470 porary overflow, but the final number of used slots is actually less
3471 than the maximum, a non-zero value is returned.
3472
3473 The pcre_fullinfo() function can be used to find out how many capturing
3474 subpatterns there are in a compiled pattern. The smallest size for
3475 ovector that will allow for n captured substrings, in addition to the
3476 offsets of the substring matched by the whole pattern, is (n+1)*3.
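
The sizing rule can be written as a one-line helper. This is only a
sketch: ovector_min_size() is a hypothetical name, and capture_count
stands for the value that pcre_fullinfo() reports when asked for
PCRE_INFO_CAPTURECOUNT.

```c
#include <assert.h>

/* Hypothetical helper: smallest ovector length, in ints, that leaves
   room for the whole match plus capture_count captured substrings.
   Two ints per pair hold the offsets; the remaining third of the
   vector is left for pcre_exec() to use as workspace. */
static int ovector_min_size(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}
```

A pattern with two capturing subpatterns therefore needs an ovector of
at least 9 elements.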
3477
3478 It is possible for capturing subpattern number n+1 to match some part
3479 of the subject when subpattern n has not been used at all. For example,
3480 if the string "abc" is matched against the pattern (a|(z))(bc) the
3481 return from the function is 4, and subpatterns 1 and 3 are matched, but
3482 2 is not. When this happens, both values in the offset pairs corre-
3483 sponding to unused subpatterns are set to -1.
3484
3485 Offset values that correspond to unused subpatterns at the end of the
3486 expression are also set to -1. For example, if the string "abc" is
3487 matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
3488 matched. The return from the function is 2, because the highest used
3489 capturing subpattern number is 1, and the offsets for the second
3490 and third capturing subpatterns (assuming the vector is large enough,
3491 of course) are set to -1.
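
Both cases, an unused subpattern in the middle and unused subpatterns
at the end, can be covered by a single test. The sketch below uses a
hypothetical helper, group_is_set(), which is not part of PCRE; the
vectors mentioned afterwards are the offsets that the two examples
above would be expected to produce, written out by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: report whether capturing subpattern n took
   part in a match. rc is the pcre_exec() return value and ovecsize is
   the number of elements in ovector. */
static int group_is_set(const int *ovector, int ovecsize, int rc, int n)
{
    int pairs = ovecsize / 3;             /* pairs that can hold offsets */
    int count = (rc == 0) ? pairs : rc;   /* zero means the vector overflowed */

    assert(ovector != NULL);
    if (rc < 0 || n < 0 || n >= count || n >= pairs)
        return 0;      /* error return, or beyond the pairs that were set */
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

With "abc" matched against (a|(z))(bc), rc is 4 and the vector starts
{0, 3, 0, 1, -1, -1, 1, 3}, so subpattern 2 is reported as unset. With
"abc" matched against (abc)(x(yz)?)?, rc is 2, so subpatterns 2 and 3
are reported as unset without their offsets being examined.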
3492
3493 Note: Elements in the first two-thirds of ovector that do not corre-
3494 spond to capturing parentheses in the pattern are never changed. That
3495 is, if a pattern contains n capturing parentheses, no more than ovec-
3496 tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in
3497 the first two-thirds) retain whatever values they previously had.
3498
3499 Some convenience functions are provided for extracting the captured
3500 substrings as separate strings. These are described below.
3501
3502 Error return values from pcre_exec()
3503
3504 If pcre_exec() fails, it returns a negative number. The following are
3505 defined in the header file:
3506
3507 PCRE_ERROR_NOMATCH (-1)
3508
3509 The subject string did not match the pattern.
3510
3511 PCRE_ERROR_NULL (-2)
3512
3513 Either code or subject was passed as NULL, or ovector was NULL and
3514 ovecsize was not zero.
3515
3516 PCRE_ERROR_BADOPTION (-3)
3517
3518 An unrecognized bit was set in the options argument.
3519
3520 PCRE_ERROR_BADMAGIC (-4)
3521
3522 PCRE stores a 4-byte "magic number" at the start of the compiled code,
3523 to catch the case when it is passed a junk pointer and to detect when a
3524 pattern that was compiled in an environment of one endianness is run in
3525 an environment with the other endianness. This is the error that PCRE
3526 gives when the magic number is not present.
3527
3528 PCRE_ERROR_UNKNOWN_OPCODE (-5)
3529
3530 While running the pattern match, an unknown item was encountered in the
3531 compiled pattern. This error could be caused by a bug in PCRE or by
3532 overwriting of the compiled pattern.
3533
3534 PCRE_ERROR_NOMEMORY (-6)
3535
3536 If a pattern contains back references, but the ovector that is passed
3537 to pcre_exec() is not big enough to remember the referenced substrings,
3538 PCRE gets a block of memory at the start of matching to use for this
3539 purpose. If the call via pcre_malloc() fails, this error is given. The
3540 memory is automatically freed at the end of matching.
3541
3542 This error is also given if pcre_stack_malloc() fails in pcre_exec().
3543 This can happen only when PCRE has been compiled with --disable-stack-
3544 for-recursion.
3545
3546 PCRE_ERROR_NOSUBSTRING (-7)
3547
3548 This error is used by the pcre_copy_substring(), pcre_get_substring(),
3549 and pcre_get_substring_list() functions (see below). It is never
3550 returned by pcre_exec().
3551
3552 PCRE_ERROR_MATCHLIMIT (-8)
3553
3554 The backtracking limit, as specified by the match_limit field in a
3555 pcre_extra structure (or defaulted) was reached. See the description
3556 above.
3557
3558 PCRE_ERROR_CALLOUT (-9)
3559
3560 This error is never generated by pcre_exec() itself. It is provided for
3561 use by callout functions that want to yield a distinctive error code.
3562 See the pcrecallout documentation for details.
3563
3564 PCRE_ERROR_BADUTF8 (-10)
3565
3566 A string that contains an invalid UTF-8 byte sequence was passed as a
3567 subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
3568 the output vector (ovecsize) is at least 2, the byte offset to the
3569 start of the invalid UTF-8 character is placed in the first element,
3570 and a reason code is placed in the second element. The reason
3571 codes are listed in the following section. For backward compatibility,
3572 if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
3573 acter at the end of the subject (reason codes 1 to 5),
3574 PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
3575
3576 PCRE_ERROR_BADUTF8_OFFSET (-11)
3577
3578 The UTF-8 byte sequence that was passed as a subject was checked and
3579 found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
3580 value of startoffset did not point to the beginning of a UTF-8 charac-
3581 ter or the end of the subject.
3582
3583 PCRE_ERROR_PARTIAL (-12)
3584
3585 The subject string did not match, but it did match partially. See the
3586 pcrepartial documentation for details of partial matching.
3587
3588 PCRE_ERROR_BADPARTIAL (-13)
3589
3590 This code is no longer in use. It was formerly returned when the
3591 PCRE_PARTIAL option was used with a compiled pattern containing items
3592 that were not supported for partial matching. From release 8.00
3593 onwards, there are no restrictions on partial matching.
3594
3595 PCRE_ERROR_INTERNAL (-14)
3596
3597 An unexpected internal error has occurred. This error could be caused
3598 by a bug in PCRE or by overwriting of the compiled pattern.
3599
3600 PCRE_ERROR_BADCOUNT (-15)
3601
3602 This error is given if the value of the ovecsize argument is negative.
3603
3604 PCRE_ERROR_RECURSIONLIMIT (-21)
3605
3606 The internal recursion limit, as specified by the match_limit_recursion
3607 field in a pcre_extra structure (or defaulted) was reached. See the
3608 description above.
3609
3610 PCRE_ERROR_BADNEWLINE (-23)
3611
3612 An invalid combination of PCRE_NEWLINE_xxx options was given.
3613
3614 PCRE_ERROR_BADOFFSET (-24)
3615
3616 The value of startoffset was negative or greater than the length of the
3617 subject, that is, the value in length.
3618
3619 PCRE_ERROR_SHORTUTF8 (-25)
3620
3621 This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
3622 string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
3623 option is set. Information about the failure is returned as for
3624 PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but
3625 this special error code for PCRE_PARTIAL_HARD precedes the implementa-
3626 tion of returned information; it is retained for backwards compatibil-
3627 ity.
3628
3629 PCRE_ERROR_RECURSELOOP (-26)
3630
3631 This error is returned when pcre_exec() detects a recursion loop within
3632 the pattern. Specifically, it means that either the whole pattern or a
3633 subpattern has been called recursively for the second time at the same
3634 position in the subject string. Some simple patterns that might do this
3635 are detected and faulted at compile time, but more complicated cases,
3636 in particular mutual recursions between two different subpatterns, can-
3637 not be detected until run time.
3638
3639 PCRE_ERROR_JIT_STACKLIMIT (-27)
3640
3641 This error is returned when a pattern that was successfully studied
3642 using a JIT compile option is being matched, but the memory available
3643 for the just-in-time processing stack is not large enough. See the
3644 pcrejit documentation for more details.
3645
3646 PCRE_ERROR_BADMODE (-28)
3647
3648 This error is given if a pattern that was compiled by the 8-bit library
3649 is passed to a 16-bit or 32-bit library function, or vice versa.
3650
3651 PCRE_ERROR_BADENDIANNESS (-29)
3652
3653 This error is given if a pattern that was compiled and saved is
3654 reloaded on a host with different endianness. The utility function
3655 pcre_pattern_to_host_byte_order() can be used to convert such a pattern
3656 so that it runs on the new host.
3657
3658 PCRE_ERROR_JIT_BADOPTION (-31)
3659
3660 This error is returned when a pattern that was successfully studied
3661 using a JIT compile option is being matched, but the matching mode
3662 (partial or complete match) does not correspond to any JIT compilation
3663 mode. When the JIT fast path function is used, this error may be also
3664 given for invalid options. See the pcrejit documentation for more
3665 details.
3666
3667 PCRE_ERROR_BADLENGTH (-32)
3668
3669 This error is given if pcre_exec() is called with a negative value for
3670 the length argument.
3671
3672 Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
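
When logging failures it can help to turn the numeric return into its
symbolic name. The following sketch does this with a local table, so
that it compiles without pcre.h; the table and the helper name
exec_error_name() are assumptions for this example, and real code
would normally compare against the PCRE_ERROR_xxx macros themselves.

```c
#include <stddef.h>

/* Local copy of the pcre_exec() error codes listed above
   (PCRE_ERROR_JIT_BADOPTION is -31 in pcre.h). */
struct exec_error { int code; const char *name; };

static const struct exec_error exec_errors[] = {
    {  -1, "PCRE_ERROR_NOMATCH" },
    {  -2, "PCRE_ERROR_NULL" },
    {  -3, "PCRE_ERROR_BADOPTION" },
    {  -4, "PCRE_ERROR_BADMAGIC" },
    {  -5, "PCRE_ERROR_UNKNOWN_OPCODE" },
    {  -6, "PCRE_ERROR_NOMEMORY" },
    {  -7, "PCRE_ERROR_NOSUBSTRING" },
    {  -8, "PCRE_ERROR_MATCHLIMIT" },
    {  -9, "PCRE_ERROR_CALLOUT" },
    { -10, "PCRE_ERROR_BADUTF8" },
    { -11, "PCRE_ERROR_BADUTF8_OFFSET" },
    { -12, "PCRE_ERROR_PARTIAL" },
    { -13, "PCRE_ERROR_BADPARTIAL" },
    { -14, "PCRE_ERROR_INTERNAL" },
    { -15, "PCRE_ERROR_BADCOUNT" },
    { -21, "PCRE_ERROR_RECURSIONLIMIT" },
    { -23, "PCRE_ERROR_BADNEWLINE" },
    { -24, "PCRE_ERROR_BADOFFSET" },
    { -25, "PCRE_ERROR_SHORTUTF8" },
    { -26, "PCRE_ERROR_RECURSELOOP" },
    { -27, "PCRE_ERROR_JIT_STACKLIMIT" },
    { -28, "PCRE_ERROR_BADMODE" },
    { -29, "PCRE_ERROR_BADENDIANNESS" },
    { -31, "PCRE_ERROR_JIT_BADOPTION" },
    { -32, "PCRE_ERROR_BADLENGTH" }
};

/* Hypothetical helper: map a negative pcre_exec() result to its name.
   Codes that pcre_exec() does not use give "unknown". */
static const char *exec_error_name(int rc)
{
    size_t i;

    for (i = 0; i < sizeof exec_errors / sizeof exec_errors[0]; i++)
        if (exec_errors[i].code == rc)
            return exec_errors[i].name;
    return "unknown";
}
```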
3673
3674 Reason codes for invalid UTF-8 strings
3675
3676 This section applies only to the 8-bit library. The corresponding
3677 information for the 16-bit and 32-bit libraries is given in the pcre16
3678 and pcre32 pages.
3679
3680 When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
3681 UTF8, and the size of the output vector (ovecsize) is at least 2, the
3682 offset of the start of the invalid UTF-8 character is placed in the
3683 first output vector element (ovector[0]) and a reason code is placed in
3684 the second element (ovector[1]). The reason codes are given names in
3685 the pcre.h header file:
3686
3687 PCRE_UTF8_ERR1
3688 PCRE_UTF8_ERR2
3689 PCRE_UTF8_ERR3
3690 PCRE_UTF8_ERR4
3691 PCRE_UTF8_ERR5
3692
3693 The string ends with a truncated UTF-8 character; the code specifies
3694 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
3695 characters to be no longer than 4 bytes, the encoding scheme (origi-
3696 nally defined by RFC 2279) allows for up to 6 bytes, and this is
3697 checked first; hence the possibility of 4 or 5 missing bytes.
3698
3699 PCRE_UTF8_ERR6
3700 PCRE_UTF8_ERR7
3701 PCRE_UTF8_ERR8
3702 PCRE_UTF8_ERR9
3703 PCRE_UTF8_ERR10
3704
3705 The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
3706 the character do not have the binary value 0b10 (that is, either the
3707 most significant bit is 0, or the next bit is 1).
3708
3709 PCRE_UTF8_ERR11
3710 PCRE_UTF8_ERR12
3711
3712 A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
3713 long; these code points are excluded by RFC 3629.
3714
3715 PCRE_UTF8_ERR13
3716
3717 A 4-byte character has a value greater than 0x10ffff; these code points
3718 are excluded by RFC 3629.
3719
3720 PCRE_UTF8_ERR14
3721
3722 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
3723 range of code points is reserved by RFC 3629 for use with UTF-16, and
3724 so are excluded from UTF-8.
3725
3726 PCRE_UTF8_ERR15
3727 PCRE_UTF8_ERR16
3728 PCRE_UTF8_ERR17
3729 PCRE_UTF8_ERR18
3730 PCRE_UTF8_ERR19
3731
3732 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
3733 for a value that can be represented by fewer bytes, which is invalid.
3734 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
3735 rect coding uses just one byte.
3736
3737 PCRE_UTF8_ERR20
3738
3739 The two most significant bits of the first byte of a character have the
3740 binary value 0b10 (that is, the most significant bit is 1 and the sec-
3741 ond is 0). Such a byte can only validly occur as the second or subse-
3742 quent byte of a multi-byte character.
3743
3744 PCRE_UTF8_ERR21
3745
3746 The first byte of a character has the value 0xfe or 0xff. These values
3747 can never occur in a valid UTF-8 string.
3748
3749 PCRE_UTF8_ERR22
3750
3751 This error code was formerly used when the presence of a so-called
3752 "non-character" caused an error. Unicode corrigendum #9 makes it clear
3753 that such characters should not cause a string to be rejected, and so
3754 this code is no longer in use and is never returned.
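
The groups of reason codes above can be folded into a small lookup for
diagnostic messages. The sketch below assumes that PCRE_UTF8_ERRn has
the numeric value n, which is consistent with the reference to "reason
codes 1 to 5" earlier; utf8_reason_text() is a hypothetical name and
the wording of the messages is this example's own.

```c
/* Hypothetical helper: short description of the reason code found in
   ovector[1] after PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8.
   Assumes PCRE_UTF8_ERRn == n. */
static const char *utf8_reason_text(int reason)
{
    if (reason >= 1 && reason <= 5)
        return "truncated character at end of subject";
    if (reason >= 6 && reason <= 10)
        return "continuation byte does not start with bits 10";
    if (reason == 11 || reason == 12)
        return "5- or 6-byte character (excluded by RFC 3629)";
    if (reason == 13)
        return "4-byte character greater than 0x10ffff";
    if (reason == 14)
        return "3-byte character in the range 0xd800 to 0xdfff";
    if (reason >= 15 && reason <= 19)
        return "overlong encoding";
    if (reason == 20)
        return "continuation byte at start of character";
    if (reason == 21)
        return "first byte is 0xfe or 0xff";
    if (reason == 22)
        return "non-character (no longer returned)";
    return "unknown reason code";
}
```

A caller would typically report ovector[0] as the byte offset of the
bad character alongside the text for ovector[1].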
3755
3756
3757 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3758
3759 int pcre_copy_substring(const char *subject, int *ovector,
3760 int stringcount, int stringnumber, char *buffer,
3761 int buffersize);
3762
3763 int pcre_get_substring(const char *subject, int *ovector,
3764 int stringcount, int stringnumber,
3765 const char **stringptr);
3766
3767 int pcre_get_substring_list(const char *subject,
3768 int *ovector, int stringcount, const char ***listptr);
3769
3770 Captured substrings can be accessed directly by using the offsets
3771 returned by pcre_exec() in ovector. For convenience, the functions
3772 pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
3773 string_list() are provided for extracting captured substrings as new,
3774 separate, zero-terminated strings. These functions identify substrings
3775 by number. The next section describes functions for extracting named
3776 substrings.
3777
3778 A substring that contains a binary zero is correctly extracted and has
3779 a further zero added on the end, but the result is not, of course, a C
3780 string. However, you can process such a string by referring to the
3781 length that is returned by pcre_copy_substring() and pcre_get_sub-
3782 string(). Unfortunately, the interface to pcre_get_substring_list() is
3783 not adequate for handling strings containing binary zeros, because the
3784 end of the final string is not independently indicated.
3785
3786 The first three arguments are the same for all three of these func-
3787 tions: subject is the subject string that has just been successfully
3788 matched, ovector is a pointer to the vector of integer offsets that was
3789 passed to pcre_exec(), and stringcount is the number of substrings that
3790 were captured by the match, including the substring that matched the
3791 entire regular expression. This is the value returned by pcre_exec() if
3792 it is greater than zero. If pcre_exec() returned zero, indicating that
3793 it ran out of space in ovector, the value passed as stringcount should
3794 be the number of elements in the vector divided by three.
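
The rule for choosing stringcount can be sketched as follows. The
helper name stringcount_for() is hypothetical; rc is the value
returned by pcre_exec() and ovecsize is the number of elements in the
vector that was passed to it.

```c
/* Hypothetical helper: choose the stringcount argument for the
   substring extraction functions. Negative values of rc are errors
   and are passed back unchanged, so the caller must check for them
   before extracting anything. */
static int stringcount_for(int rc, int ovecsize)
{
    if (rc == 0)
        return ovecsize / 3;    /* vector overflowed: use every usable pair */
    return rc;
}
```

With a 30-element vector, a return of 3 gives a stringcount of 3,
whereas a return of 0 gives a stringcount of 10.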
3795
3796 The functions pcre_copy_substring() and pcre_get_substring() extract a
3797 single substring, whose number is given as stringnumber. A value of
3798 zero extracts the substring that matched the entire pattern, whereas
3799 higher values extract the captured substrings. For pcre_copy_sub-
3800 string(), the string is placed in buffer, whose length is given by
3801 buffersize, while for pcre_get_substring() a new block of memory is
3802 obtained via pcre_malloc, and its address is returned via stringptr.
3803 The yield of the function is the length of the string, not including
3804 the terminating zero, or one of these error codes:
3805
3806 PCRE_ERROR_NOMEMORY (-6)
3807
3808 The buffer was too small for pcre_copy_substring(), or the attempt to
3809 get memory failed for pcre_get_substring().
3810
3811 PCRE_ERROR_NOSUBSTRING (-7)
3812
3813 There is no substring whose number is stringnumber.
3814
3815 The pcre_get_substring_list() function extracts all available sub-
3816 strings and builds a list of pointers to them. All this is done in a
3817 single block of memory that is obtained via pcre_malloc. The address of
3818 the memory block is returned via listptr, which is also the start of
3819 the list of string pointers. The end of the list is marked by a NULL
3820 pointer. The yield of the function is zero if all went well, or the
3821 error code
3822
3823 PCRE_ERROR_NOMEMORY (-6)
3824
3825 if the attempt to get the memory block failed.
3826
3827 When any of these functions encounters a substring that is unset, which
3828 can happen when capturing subpattern number n+1 matches some part of
3829 the subject, but subpattern n has not been used at all, they return an
3830 empty string. This can be distinguished from a genuine zero-length sub-
3831 string by inspecting the appropriate offset in ovector, which is nega-
3832 tive for unset substrings.
3833
3834 The two convenience functions pcre_free_substring() and pcre_free_sub-
3835 string_list() can be used to free the memory returned by a previous
3836 call of pcre_get_substring() or pcre_get_substring_list(), respec-
3837 tively. They do nothing more than call the function pointed to by
3838 pcre_free, which of course could be called directly from a C program.
3839 However, PCRE is used in some situations where it is linked via a spe-
3840 cial interface to another programming language that cannot use
3841 pcre_free directly; it is for these cases that the functions are pro-
3842 vided.
3843
3844
3845 EXTRACTING CAPTURED SUBSTRINGS BY NAME
3846
3847 int pcre_get_stringnumber(const pcre *code,
3848 const char *name);
3849
3850 int pcre_copy_named_substring(const pcre *code,
3851 const char *subject, int *ovector,
3852 int stringcount, const char *stringname,
3853 char *buffer, int buffersize);
3854
3855 int pcre_get_named_substring(const pcre *code,
3856 const char *subject, int *ovector,
3857 int stringcount, const char *stringname,
3858 const char **stringptr);
3859
3860 To extract a substring by name, you first have to find the associated
3861 number. For example, for this pattern
3862
3863 (a+)b(?<xxx>\d+)...
3864
3865 the number of the subpattern called "xxx" is 2. If the name is known to
3866 be unique (PCRE_DUPNAMES was not set), you can find the number from the
3867 name by calling pcre_get_stringnumber(). The first argument is the com-
3868 piled pattern, and the second is the name. The yield of the function is
3869 the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
3870 subpattern of that name.
3871
3872 Given the number, you can extract the substring directly, or use one of
3873 the functions described in the previous section. For convenience, there
3874 are also two functions that do the whole job.
3875
3876 Most of the arguments of pcre_copy_named_substring() and
3877 pcre_get_named_substring() are the same as those for the similarly
3878 named functions that extract by number. As these are described in the
3879 previous section, they are not re-described here. There are just two
3880 differences:
3881
3882 First, instead of a substring number, a substring name is given. Sec-
3883 ond, there is an extra argument, given at the start, which is a pointer
3884 to the compiled pattern. This is needed in order to gain access to the
3885 name-to-number translation table.
3886
3887 These functions call pcre_get_stringnumber(), and if it succeeds, they
3888 then call pcre_copy_substring() or pcre_get_substring(), as appropri-
3889 ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
3890 behaviour may not be what you want (see the next section).
3891
3892 Warning: If the pattern uses the (?| feature to set up multiple subpat-
3893 terns with the same number, as described in the section on duplicate
3894 subpattern numbers in the pcrepattern page, you cannot use names to
3895 distinguish the different subpatterns, because names are not included
3896 in the compiled code. The matching process uses only numbers. For this
3897 reason, the use of different names for subpatterns of the same number
3898 causes an error at compile time.
3899
3900
3901 DUPLICATE SUBPATTERN NAMES
3902
3903 int pcre_get_stringtable_entries(const pcre *code,
3904 const char *name, char **first, char **last);
3905
3906 When a pattern is compiled with the PCRE_DUPNAMES option, names for
3907 subpatterns are not required to be unique. (Duplicate names are always
3908 allowed for subpatterns with the same number, created by using the (?|
3909 feature. Indeed, if such subpatterns are named, they are required to
3910 use the same names.)
3911
3912 Normally, patterns with duplicate names are such that in any one match,
3913 only one of the named subpatterns participates. An example is shown in
3914 the pcrepattern documentation.
3915
3916 When duplicates are present, pcre_copy_named_substring() and
3917 pcre_get_named_substring() return the first substring corresponding to
3918 the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
3919 (-7) is returned; no data is returned. The pcre_get_stringnumber()
3920 function returns one of the numbers that are associated with the name,
3921 but it is not defined which it is.
3922
3923 If you want to get full details of all captured substrings for a given
3924 name, you must use the pcre_get_stringtable_entries() function. The
3925 first argument is the compiled pattern, and the second is the name. The
3926 third and fourth are pointers to variables which are updated by the
3927 function. After it has run, they point to the first and last entries in
3928 the name-to-number table for the given name. The function itself
3929 returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
3930 there are none. The format of the table is described in the section
3931 entitled Information about a pattern above. Given all the relevant
3932 entries for the name, you can extract each of their numbers, and
3933 hence the captured data, if any.
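
Extracting the numbers from the entries can be sketched as below. The
sketch assumes the 8-bit name table layout, in which each entry starts
with the subpattern number in two bytes, most significant byte first,
followed by the zero-terminated name. The helper collect_numbers() is
hypothetical; first, last, and entrysize stand for the values obtained
from pcre_get_stringtable_entries().

```c
#include <stddef.h>

/* Hypothetical helper: collect the subpattern numbers from the
   name-table entries between first and last (inclusive). At most max
   numbers are stored in out; the number stored is returned. */
static int collect_numbers(const char *first, const char *last,
                           int entrysize, int *out, int max)
{
    const unsigned char *entry = (const unsigned char *)first;
    const unsigned char *end = (const unsigned char *)last;
    int count = 0;

    if (first == NULL || last == NULL || out == NULL || entrysize <= 0)
        return 0;
    for (; entry <= end && count < max; entry += entrysize)
        out[count++] = (entry[0] << 8) | entry[1];  /* big-endian number */
    return count;
}
```

Each number n found in this way identifies the offset pair
ovector[2*n], ovector[2*n+1], which is negative if that subpattern did
not take part in the match.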
3934
3935
3936 FINDING ALL POSSIBLE MATCHES
3937
3938 The traditional matching function uses a similar algorithm to Perl,
3939 which stops when it finds the first match, starting at a given point in
3940 the subject. If you want to find all possible matches, or the longest
3941 possible match, consider using the alternative matching function (see
3942 below) instead. If you cannot use the alternative function, but still
3943 need to find all possible matches, you can kludge it up by making use
3944 of the callout facility, which is described in the pcrecallout documen-
3945 tation.
3946
3947 What you have to do is to insert a callout right at the end of the pat-
3948 tern. When your callout function is called, extract and save the cur-
3949 rent matched substring. Then return 1, which forces pcre_exec() to
3950 backtrack and try other alternatives. Ultimately, when it runs out of
3951 matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
3952
3953
3954 OBTAINING AN ESTIMATE OF STACK USAGE
3955
3956 Matching certain patterns using pcre_exec() can use a lot of process
3957 stack, which in certain environments can be rather limited in size.
3958 Some users find it helpful to have an estimate of the amount of stack
3959 that is used by pcre_exec(), to help them set recursion limits, as
3960 described in the pcrestack documentation. The estimate that is output
3961 by pcretest when called with the -m and -C options is obtained by call-
3962 ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
3963 first five arguments.
3964
3965 Normally, if its first argument is NULL, pcre_exec() immediately
3966 returns the negative error code PCRE_ERROR_NULL, but with this special
3967 combination of arguments, it returns instead a negative number whose
3968 absolute value is the approximate stack frame size in bytes. (A nega-
3969 tive number is used so that it is clear that no match has happened.)
3970 The value is approximate because in some cases, recursive calls to
3971 pcre_exec() occur when there are one or two additional variables on the
3972 stack.
3973
3974 If PCRE has been compiled to use the heap instead of the stack for
3975 recursion, the value returned is the size of each block that is
3976 obtained from the heap.
3977
3978
3979 MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3980
3981 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
3982 const char *subject, int length, int startoffset,
3983 int options, int *ovector, int ovecsize,
3984 int *workspace, int wscount);
3985
3986 The function pcre_dfa_exec() is called to match a subject string
3987 against a compiled pattern, using a matching algorithm that scans the
3988 subject string just once, and does not backtrack. This has different
3989 characteristics to the normal algorithm, and is not compatible with
3990 Perl. Some of the features of PCRE patterns are not supported. Never-
3991 theless, there are times when this kind of matching can be useful. For
3992 a discussion of the two matching algorithms, and a list of features
3993 that pcre_dfa_exec() does not support, see the pcrematching documenta-
3994 tion.
3995
3996 The arguments for the pcre_dfa_exec() function are the same as for
3997 pcre_exec(), plus two extras. The ovector argument is used in a differ-
3998 ent way, and this is described below. The other common arguments are
3999 used in the same way as for pcre_exec(), so their description is not
4000 repeated here.
4001
4002 The two additional arguments provide workspace for the function. The
4003 workspace vector should contain at least 20 elements. It is used for
4004 keeping track of multiple paths through the pattern tree. More
4005 workspace will be needed for patterns and subjects where there are a
4006 lot of potential matches.
4007
4008 Here is an example of a simple call to pcre_dfa_exec():
4009
4010 int rc;
4011 int ovector[10];
4012 int wspace[20];
4013 rc = pcre_dfa_exec(
4014 re, /* result of pcre_compile() */
4015 NULL, /* we didn't study the pattern */
4016 "some string", /* the subject string */
4017 11, /* the length of the subject string */
4018 0, /* start at offset 0 in the subject */
4019 0, /* default options */
4020 ovector, /* vector of integers for substring information */
4021 10, /* number of elements (NOT size in bytes) */
4022 wspace, /* working space vector */
4023 20); /* number of elements (NOT size in bytes) */
4024
4025 Option bits for pcre_dfa_exec()
4026
4027 The unused bits of the options argument for pcre_dfa_exec() must be
4028 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
4029 LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
4030 PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
4031 PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
4032 TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
4033 four of these are exactly the same as for pcre_exec(), so their
4034 description is not repeated here.
4035
4036 PCRE_PARTIAL_HARD
4037 PCRE_PARTIAL_SOFT
4038
4039 These have the same general effect as they do for pcre_exec(), but the
4040 details are slightly different. When PCRE_PARTIAL_HARD is set for
4041 pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
4042 ject is reached and there is still at least one matching possibility
4043 that requires additional characters. This happens even if some complete
4044 matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
4045 code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
4046 of the subject is reached, there have been no complete matches, but
4047 there is still at least one matching possibility. The portion of the
4048 string that was inspected when the longest partial match was found is
4049 set as the first matching string in both cases. There is a more
4050 detailed discussion of partial and multi-segment matching, with exam-
4051 ples, in the pcrepartial documentation.
4052
4053 PCRE_DFA_SHORTEST
4054
4055 Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
4056 stop as soon as it has found one match. Because of the way the alterna-
4057 tive algorithm works, this is necessarily the shortest possible match
4058 at the first possible matching point in the subject string.
4059
4060 PCRE_DFA_RESTART
4061
4062 When pcre_dfa_exec() returns a partial match, it is possible to call it
4063 again, with additional subject characters, and have it continue with
4064 the same match. The PCRE_DFA_RESTART option requests this action; when
4065 it is set, the workspace and wscount arguments must reference the same
4066 vector as before because data about the match so far is left in them
4067 after a partial match. There is more discussion of this facility in the
4068 pcrepartial documentation.
4069
4070 Successful returns from pcre_dfa_exec()
4071
4072 When pcre_dfa_exec() succeeds, it may have matched more than one sub-
4073 string in the subject. Note, however, that all the matches from one run
4074 of the function start at the same point in the subject. The shorter
4075 matches are all initial substrings of the longer matches. For example,
4076 if the pattern
4077
4078 <.*>
4079
4080 is matched against the string
4081
4082 This is <something> <something else> <something further> no more
4083
4084 the three matched strings are
4085
4086 <something>
4087 <something> <something else>
4088 <something> <something else> <something further>
4089
4090 On success, the yield of the function is a number greater than zero,
4091 which is the number of matched substrings. The substrings themselves
4092 are returned in ovector. Each string uses two elements; the first is
4093 the offset to the start, and the second is the offset to the end. In
4094 fact, all the strings have the same start offset. (Space could have
4095 been saved by giving this only once, but it was decided to retain some
4096 compatibility with the way pcre_exec() returns data, even though the
4097 meaning of the strings is different.)
4098
4099 The strings are returned in decreasing order of length; that is, the
4100 longest matching string is given first. If there were too many matches to
4101 fit into ovector, the yield of the function is zero, and the vector is
4102 filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
4103 can use the entire ovector for returning matched strings.
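
Reading the results back can be sketched as follows. Both helpers are
hypothetical and not part of PCRE. The treatment of a zero return, in
which every pair of the vector is taken to hold a match, is an
inference from the statement that the whole of ovector can be used,
not documented behaviour.

```c
#include <string.h>

/* Hypothetical helper: number of matches available in ovector after
   pcre_dfa_exec() returned rc. Assumes that a zero return means all
   ovecsize/2 pairs were filled. */
static int dfa_match_count(int rc, int ovecsize)
{
    if (rc == 0)
        return ovecsize / 2;
    return rc > 0 ? rc : 0;     /* negative returns carry no matches */
}

/* Hypothetical helper: copy match number i (0 is the longest) into
   buf as a zero-terminated string. Returns its length in bytes, or -1
   if i is out of range or buf is too small. */
static int dfa_copy_match(const char *subject, const int *ovector,
                          int count, int i, char *buf, int bufsize)
{
    int len;

    if (i < 0 || i >= count)
        return -1;
    len = ovector[2 * i + 1] - ovector[2 * i];
    if (len < 0 || len >= bufsize)
        return -1;
    memcpy(buf, subject + ovector[2 * i], (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For the <.*> example above, the three pairs would be expected to be
{8, 56}, {8, 36}, and {8, 19}: the same start offset each time, with
the end offsets in decreasing order.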
4104
4105 Error returns from pcre_dfa_exec()
4106
4107 The pcre_dfa_exec() function returns a negative number when it fails.
4108 Many of the errors are the same as for pcre_exec(), and these are
4109 described above. There are in addition the following errors that are
4110 specific to pcre_dfa_exec():
4111
4112 PCRE_ERROR_DFA_UITEM (-16)
4113
4114 This return is given if pcre_dfa_exec() encounters an item in the pat-
4115 tern that it does not support, for instance, the use of \C or a back
4116 reference.
4117
4118 PCRE_ERROR_DFA_UCOND (-17)
4119
4120 This return is given if pcre_dfa_exec() encounters a condition item
4121 that uses a back reference for the condition, or a test for recursion
4122 in a specific group. These are not supported.
4123
4124 PCRE_ERROR_DFA_UMLIMIT (-18)
4125
4126 This return is given if pcre_dfa_exec() is called with an extra block
4127 that contains a setting of the match_limit or match_limit_recursion
4128 fields. This is not supported (these fields are meaningless for DFA
4129 matching).
4130
4131 PCRE_ERROR_DFA_WSSIZE (-19)
4132
4133 This return is given if pcre_dfa_exec() runs out of space in the
4134 workspace vector.
4135
4136 PCRE_ERROR_DFA_RECURSE (-20)
4137
4138 When a recursive subpattern is processed, the matching function calls
4139 itself recursively, using private vectors for ovector and workspace.
4140 This error is given if the output vector is not large enough. This
4141 should be extremely rare, as a vector of size 1000 is used.
4142
4143 PCRE_ERROR_DFA_BADRESTART (-30)
4144
4145 When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
4146 plausibility checks are made on the contents of the workspace, which
4147 should contain data about the previous partial match. If any of these
4148 checks fail, this error is given.
4149
4150
4151 SEE ALSO
4152
4153 pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3),
4154 pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
4155 sample(3), pcrestack(3).
4156
4157
4158 AUTHOR
4159
4160 Philip Hazel
4161 University Computing Service
4162 Cambridge CB2 3QH, England.
4163
4164
4165 REVISION
4166
4167 Last updated: 26 April 2013
4168 Copyright (c) 1997-2013 University of Cambridge.
4169 ------------------------------------------------------------------------------
4170
4171
4172 PCRECALLOUT(3) Library Functions Manual PCRECALLOUT(3)
4173
4174
4175
4176 NAME
4177 PCRE - Perl-compatible regular expressions
4178
4179 SYNOPSIS
4180
4181 #include <pcre.h>
4182
4183 int (*pcre_callout)(pcre_callout_block *);
4184
4185 int (*pcre16_callout)(pcre16_callout_block *);
4186
4187 int (*pcre32_callout)(pcre32_callout_block *);
4188
4189
4190 DESCRIPTION
4191
4192 PCRE provides a feature called "callout", which is a means of temporar-
4193 ily passing control to the caller of PCRE in the middle of pattern
4194 matching. The caller of PCRE provides an external function by putting
4195 its entry point in the global variable pcre_callout (pcre16_callout for
4196 the 16-bit library, pcre32_callout for the 32-bit library). By default,
4197 this variable contains NULL, which disables all calling out.
4198
4199 Within a regular expression, (?C) indicates the points at which the
4200 external function is to be called. Different callout points can be
4201 identified by putting a number less than 256 after the letter C. The
4202 default value is zero. For example, this pattern has two callout
4203 points:
4204
4205 (?C1)abc(?C2)def
4206
4207 If the PCRE_AUTO_CALLOUT option bit is set when a pattern is compiled,
4208 PCRE automatically inserts callouts, all with number 255, before each
4209 item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
4210 pattern
4211
4212 A(\d{2}|--)
4213
4214 it is processed as if it were
4215
4216 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4217
4218 Notice that there is a callout before and after each parenthesis and
4219 alternation bar. If the pattern contains a conditional group whose con-
4220 dition is an assertion, an automatic callout is inserted immediately
4221 before the condition. Such a callout may also be inserted explicitly,
4222 for example:
4223
4224 (?(?C9)(?=a)ab|de)
4225
4226 This applies only to assertion conditions (because they are themselves
4227 independent groups).
4228
4229 Automatic callouts can be used for tracking the progress of pattern
4230 matching. The pcretest command has an option that sets automatic call-
4231 outs; when it is used, the output indicates how the pattern is matched.
4232 This is useful information when you are trying to optimize the perfor-
4233 mance of a particular pattern.
4234
4235
4236 MISSING CALLOUTS
4237
4238 You should be aware that, because of optimizations in the way PCRE
4239 matches patterns by default, callouts sometimes do not happen. For
4240 example, if the pattern is
4241
4242 ab(?C4)cd
4243
4244 PCRE knows that any matching string must contain the letter "d". If the
4245 subject string is "abyz", the lack of "d" means that matching doesn't
4246 ever start, and the callout is never reached. However, with "abyd",
4247 though the result is still no match, the callout is obeyed.
4248
4249 If the pattern is studied, PCRE knows the minimum length of a matching
4250 string, and will immediately give a "no match" return without actually
4251 running a match if the subject is not long enough, or, for unanchored
4252 patterns, if it has been scanned far enough.
4253
4254 You can disable these optimizations by passing the PCRE_NO_START_OPTI-
4255 MIZE option to the matching function, or by starting the pattern with
4256 (*NO_START_OPT). This slows down the matching process, but does ensure
4257 that callouts such as the example above are obeyed.
4258
4259
4260 THE CALLOUT INTERFACE
4261
4262 During matching, when PCRE reaches a callout point, the external func-
4263 tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
4264 set). This applies to both normal and DFA matching. The only argument
4265 to the callout function is a pointer to a pcre_callout_block or
4266 pcre[16|32]_callout_block structure. These structures contain the following
4267 fields:
4268
4269 int version;
4270 int callout_number;
4271 int *offset_vector;
4272 const char *subject; (8-bit version)
4273 PCRE_SPTR16 subject; (16-bit version)
4274 PCRE_SPTR32 subject; (32-bit version)
4275 int subject_length;
4276 int start_match;
4277 int current_position;
4278 int capture_top;
4279 int capture_last;
4280 void *callout_data;
4281 int pattern_position;
4282 int next_item_length;
4283 const unsigned char *mark; (8-bit version)
4284 const PCRE_UCHAR16 *mark; (16-bit version)
4285 const PCRE_UCHAR32 *mark; (32-bit version)
4286
4287 The version field is an integer containing the version number of the
4288 block format. The initial version was 0; the current version is 2. The
4289 version number will change again in future if additional fields are
4290 added, but the intention is never to remove any of the existing fields.
4291
4292 The callout_number field contains the number of the callout, as com-
4293 piled into the pattern (that is, the number after ?C for manual call-
4294 outs, and 255 for automatically generated callouts).
4295
4296 The offset_vector field is a pointer to the vector of offsets that was
4297 passed by the caller to the matching function. When pcre_exec() or
4298 pcre[16|32]_exec() is used, the contents can be inspected, in order to
4299 extract substrings that have been matched so far, in the same way as
4300 for extracting substrings after a match has completed. For the DFA
4301 matching functions, this field is not useful.
4302
4303 The subject and subject_length fields contain copies of the values that
4304 were passed to the matching function.
4305
4306 The start_match field normally contains the offset within the subject
4307 at which the current match attempt started. However, if the escape
4308 sequence \K has been encountered, this value is changed to reflect the
4309 modified starting point. If the pattern is not anchored, the callout
4310 function may be called several times from the same point in the pattern
4311 for different starting points in the subject.
4312
4313 The current_position field contains the offset within the subject of
4314 the current match pointer.
4315
4316 When pcre_exec() or pcre[16|32]_exec() is used, the capture_top
4317 field contains one more than the number of the highest numbered cap-
4318 tured substring so far. If no substrings have been captured, the value
4319 of capture_top is one. This is always the case when the DFA functions
4320 are used, because they do not support captured substrings.
4321
4322 The capture_last field contains the number of the most recently cap-
4323 tured substring. However, when a recursion exits, the value reverts to
4324 what it was outside the recursion, as do the values of all captured
4325 substrings. If no substrings have been captured, the value of cap-
4326 ture_last is -1. This is always the case for the DFA matching func-
4327 tions.
4328
4329 The callout_data field contains a value that is passed to a matching
4330 function specifically so that it can be passed back in callouts. It is
4331 passed in the callout_data field of a pcre_extra or pcre[16|32]_extra
4332 data structure. If no such data was passed, the value of callout_data
4333 in a callout block is NULL. There is a description of the pcre_extra
4334 structure in the pcreapi documentation.
4335
4336 The pattern_position field is present from version 1 of the callout
4337 structure. It contains the offset to the next item to be matched in the
4338 pattern string.
4339
4340 The next_item_length field is present from version 1 of the callout
4341 structure. It contains the length of the next item to be matched in the
4342 pattern string. When the callout immediately precedes an alternation
4343 bar, a closing parenthesis, or the end of the pattern, the length is
4344 zero. When the callout precedes an opening parenthesis, the length is
4345 that of the entire subpattern.
4346
4347 The pattern_position and next_item_length fields are intended to help
4348 in distinguishing between different automatic callouts, which all have
4349 the same callout number. However, they are set for all callouts.
4350
4351 The mark field is present from version 2 of the callout structure. In
4352 callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer
4353 to the zero-terminated name of the most recently passed (*MARK),
4354 (*PRUNE), or (*THEN) item in the match, or NULL if no such items have
4355 been passed. Instances of (*PRUNE) or (*THEN) without a name do not
4356 obliterate a previous (*MARK). In callouts from the DFA matching func-
4357 tions this field always contains NULL.
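
As an illustration of how a callout function might read these fields, the
following sketch copies the most recently captured substring out of the
subject. So that it compiles without the library, it declares a stand-in
structure, demo_callout_block, holding only the leading 8-bit fields listed
above; real code would include <pcre.h> and use pcre_callout_block. The name
demo_last_capture is likewise hypothetical. The sketch assumes that the
offsets for group n are held as a pair in offset_vector[2*n] and
offset_vector[2*n+1], as they are after a completed match, and that an unset
group has negative offsets.

```c
#include <string.h>

/* Stand-in for the leading fields of the 8-bit callout block (assumption:
   real code uses pcre_callout_block from <pcre.h> instead). */
typedef struct demo_callout_block {
  int         version;
  int         callout_number;
  int        *offset_vector;
  const char *subject;
  int         subject_length;
  int         start_match;
  int         current_position;
  int         capture_top;
  int         capture_last;
  void       *callout_data;
} demo_callout_block;

/* Copy the most recently captured substring into buf, zero-terminated.
   Returns its length, or -1 if nothing has been captured so far, the
   group is unset, or buf is too small. */
int demo_last_capture(const demo_callout_block *cb, char *buf, int bufsize)
{
  int n = cb->capture_last;
  int start, end, len;

  if (n < 0 || n >= cb->capture_top) return -1;  /* nothing captured yet */
  start = cb->offset_vector[2*n];                /* offsets come in pairs */
  end   = cb->offset_vector[2*n + 1];
  if (start < 0 || end < start) return -1;       /* group is unset */
  len = end - start;
  if (len >= bufsize) return -1;                 /* no room for the zero */
  memcpy(buf, cb->subject + start, (size_t)len);
  buf[len] = '\0';
  return len;
}
```

A callout function built on this would call demo_last_capture() with the
block it was given and then return zero to let matching continue. For the
DFA matching functions capture_last is always -1, so the helper returns -1.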
4358
4359
4360 RETURN VALUES
4361
4362 The external callout function returns an integer to PCRE. If the value
4363 is zero, matching proceeds as normal. If the value is greater than
4364 zero, matching fails at the current point, but the testing of other
4365 matching possibilities goes ahead, just as if a lookahead assertion had
4366 failed. If the value is less than zero, the match is abandoned, and the
4367 matching function returns the negative value.
4368
4369 Negative values should normally be chosen from the set of
4370 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
4371 dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
4372 reserved for use by callout functions; it will never be used by PCRE
4373 itself.
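
The three kinds of return value can be sketched as a small decision
function. Everything named demo_ here is hypothetical: demo_policy stands
for whatever the application passes through callout_data, and the two
DEMO_ERROR constants mirror the values of PCRE_ERROR_NOMATCH and
PCRE_ERROR_CALLOUT so that the sketch compiles without <pcre.h>. In a real
callout function the three arguments would be taken from the
callout_number, current_position, and callout_data fields of the block.

```c
#include <stddef.h>

/* Mirrors of two PCRE_ERROR_xxx values (assumption: real code takes
   them from <pcre.h>). */
#define DEMO_ERROR_NOMATCH (-1)   /* PCRE_ERROR_NOMATCH */
#define DEMO_ERROR_CALLOUT (-9)   /* PCRE_ERROR_CALLOUT */

/* Hypothetical application data handed over via callout_data. */
typedef struct demo_policy {
  int max_position;   /* fail match paths that go beyond this offset */
  int abort_number;   /* callout number that abandons the whole match */
} demo_policy;

/* Decide what a callout function should return to PCRE. */
int demo_callout_result(int callout_number, int current_position,
                        const demo_policy *policy)
{
  if (policy == NULL) return 0;            /* no data passed: proceed */
  if (callout_number == policy->abort_number)
    return DEMO_ERROR_CALLOUT;             /* < 0: abandon the match */
  if (current_position > policy->max_position)
    return 1;                              /* > 0: fail here, try elsewhere */
  return 0;                                /* 0: matching proceeds */
}
```

Returning DEMO_ERROR_NOMATCH instead of DEMO_ERROR_CALLOUT would make the
matching function report an ordinary "no match" rather than an error that
the caller can attribute to the callout.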
4374
4375
4376 AUTHOR
4377
4378 Philip Hazel
4379 University Computing Service
4380 Cambridge CB2 3QH, England.
4381
4382
4383 REVISION
4384
4385 Last updated: 03 March 2013
4386 Copyright (c) 1997-2013 University of Cambridge.
4387 ------------------------------------------------------------------------------
4388
4389
4390 PCRECOMPAT(3) Library Functions Manual PCRECOMPAT(3)
4391
4392
4393
4394 NAME
4395 PCRE - Perl-compatible regular expressions
4396
4397 DIFFERENCES BETWEEN PCRE AND PERL
4398
4399 This document describes the differences in the ways that PCRE and Perl
4400 handle regular expressions. The differences described here are with
4401 respect to Perl versions 5.10 and above.
4402
4403 1. PCRE has only a subset of Perl's Unicode support. Details of what it
4404 does have are given in the pcreunicode page.
4405
4406 2. PCRE allows repeat quantifiers only on parenthesized assertions, but
4407 they do not mean what you might think. For example, (?!a){3} does not
4408 assert that the next three characters are not "a". It just asserts that
4409 the next character is not "a" three times (in principle: PCRE optimizes
4410 this to run the assertion just once). Perl allows repeat quantifiers on
4411 other assertions such as \b, but these do not seem to have any use.
4412
4413 3. Capturing subpatterns that occur inside negative lookahead asser-
4414 tions are counted, but their entries in the offsets vector are never
4415 set. Perl sometimes (but not always) sets its numerical variables from
4416 inside negative assertions.
4417
4418 4. Though binary zero characters are supported in the subject string,
4419 they are not allowed in a pattern string because it is passed as a nor-
4420 mal C string, terminated by zero. The escape sequence \0 can be used in
4421 the pattern to represent a binary zero.
4422
4423 5. The following Perl escape sequences are not supported: \l, \u, \L,
4424 \U, and \N when followed by a character name or Unicode value. (\N on
4425 its own, matching a non-newline character, is supported.) In fact these
4426 are implemented by Perl's general string-handling and are not part of
4427 its pattern matching engine. If any of these are encountered by PCRE,
4428 an error is generated by default. However, if the PCRE_JAVASCRIPT_COM-
4429 PAT option is set, \U and \u are interpreted as JavaScript interprets
4430 them.
4431
4432 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
4433 is built with Unicode character property support. The properties that
4434 can be tested with \p and \P are limited to the general category prop-
4435 erties such as Lu and Nd, script names such as Greek or Han, and the
4436 derived properties Any and L&. PCRE does support the Cs (surrogate)
4437 property, which Perl does not; the Perl documentation says "Because
4438 Perl hides the need for the user to understand the internal representa-
4439 tion of Unicode characters, there is no need to implement the somewhat
4440 messy concept of surrogates."
4441
4442 7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
4443 ters in between are treated as literals. This is slightly different
4444 from Perl in that $ and @ are also handled as literals inside the
4445 quotes. In Perl, they cause variable interpolation (but of course PCRE
4446 does not have variables). Note the following examples:
4447
4448 Pattern            PCRE matches   Perl matches
4449
4450 \Qabc$xyz\E        abc$xyz        abc followed by the
4451                                   contents of $xyz
4452 \Qabc\$xyz\E       abc\$xyz       abc\$xyz
4453 \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
4454
4455 The \Q...\E sequence is recognized both inside and outside character
4456 classes.
4457
4458 8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
4459 constructions. However, there is support for recursive patterns. This
4460 is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
4461 "callout" feature allows an external function to be called during pat-
4462 tern matching. See the pcrecallout documentation for details.
4463
4464 9. Subpatterns that are called as subroutines (whether or not recur-
4465 sively) are always treated as atomic groups in PCRE. This is like
4466 Python, but unlike Perl. Captured values that are set outside a
4467 subroutine call can be referenced from inside in PCRE, but not in Perl.
4468 There is a discussion that explains these differences in more detail in
4469 the section on recursion differences from Perl in the pcrepattern page.
4470
4471 10. If any of the backtracking control verbs are used in a subpattern
4472 that is called as a subroutine (whether or not recursively), their
4473 effect is confined to that subpattern; it does not extend to the sur-
4474 rounding pattern. This is not always the case in Perl. In particular,
4475 if (*THEN) is present in a group that is called as a subroutine, its
4476 action is limited to that group, even if the group does not contain any
4477 | characters. Note that such subpatterns are processed as anchored at
4478 the point where they are tested.
4479
4480 11. If a pattern contains more than one backtracking control verb, the
4481 first one that is backtracked onto acts. For example, in the pattern
4482 A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
4483 in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4484 it is the same as PCRE, but there are examples where it differs.
4485
4486 12. Most backtracking verbs in assertions have their normal actions.
4487 They are not confined to the assertion.
4488
4489 13. There are some differences that are concerned with the settings of
4490 captured strings when part of a pattern is repeated. For example,
4491 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
4492 unset, but in PCRE it is set to "b".
4493
4494 14. PCRE's handling of duplicate subpattern numbers and duplicate sub-
4495 pattern names is not as general as Perl's. This is a consequence of the
4496 fact that PCRE works internally just with numbers, using an external
4497 table to translate between numbers and names. In particular, a pattern
4498 such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
4499 the same number but different names, is not supported, and causes an
4500 error at compile time. If it were allowed, it would not be possible to
4501 distinguish which parentheses matched, because both names map to cap-
4502 turing subpattern number 1. To avoid this confusing situation, an error
4503 is given at compile time.
4504
4505 15. Perl recognizes comments in some places that PCRE does not, for
4506 example, between the ( and ? at the start of a subpattern. If the /x
4507 modifier is set, Perl allows white space between ( and ? but PCRE never
4508 does, even if the PCRE_EXTENDED option is set.
4509
4510 16. In PCRE, the upper/lower case character properties Lu and Ll are
4511 not affected when case-independent matching is specified. For example,
4512 \p{Lu} always matches an upper case letter. I think Perl has changed in
4513 this respect; in the release at the time of writing (5.16), \p{Lu} and
4514 \p{Ll} match all letters, regardless of case, when case independence is
4515 specified.
4516
4517 17. PCRE provides some extensions to the Perl regular expression facil-
4518 ities. Perl 5.10 includes new features that are not in earlier ver-
4519 sions of Perl, some of which (such as named parentheses) have been in
4520 PCRE for some time. This list is with respect to Perl 5.10:
4521
4522 (a) Although lookbehind assertions in PCRE must match fixed length
4523 strings, each alternative branch of a lookbehind assertion can match a
4524 different length of string. Perl requires them all to have the same
4525 length.
4526
4527 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
4528 meta-character matches only at the very end of the string.
4529
4530 (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
4531 cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
4532 ignored. (Perl can be made to issue a warning.)
4533
4534 (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
4535 fiers is inverted, that is, by default they are not greedy, but if fol-
4536 lowed by a question mark they are.
4537
4538 (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
4539 tried only at the first matching position in the subject string.
4540
4541 (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
4542 and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva-
4543 lents.
4544
4545 (g) The \R escape sequence can be restricted to match only CR, LF, or
4546 CRLF by the PCRE_BSR_ANYCRLF option.
4547
4548 (h) The callout facility is PCRE-specific.
4549
4550 (i) The partial matching facility is PCRE-specific.
4551
4552 (j) Patterns compiled by PCRE can be saved and re-used at a later time,
4553 even on different hosts that have the other endianness. However, this
4554 does not apply to optimized data created by the just-in-time compiler.
4555
4556 (k) The alternative matching functions (pcre_dfa_exec(),
4557 pcre16_dfa_exec(), and pcre32_dfa_exec()) match in a different way and
4558 are not Perl-compatible.
4559
4560 (l) PCRE recognizes some special sequences such as (*CR) at the start
4561 of a pattern that set overall options that cannot be changed within the
4562 pattern.
4563
4564
4565 AUTHOR
4566
4567 Philip Hazel
4568 University Computing Service
4569 Cambridge CB2 3QH, England.
4570
4571
4572 REVISION
4573
4574 Last updated: 19 March 2013
4575 Copyright (c) 1997-2013 University of Cambridge.
4576 ------------------------------------------------------------------------------
4577
4578
4579 PCREPATTERN(3) Library Functions Manual PCREPATTERN(3)
4580
4581
4582
4583 NAME
4584 PCRE - Perl-compatible regular expressions
4585
4586 PCRE REGULAR EXPRESSION DETAILS
4587
4588 The syntax and semantics of the regular expressions that are supported
4589 by PCRE are described in detail below. There is a quick-reference syn-
4590 tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
4591 semantics as closely as it can. PCRE also supports some alternative
4592 regular expression syntax (which does not conflict with the Perl syn-
4593 tax) in order to provide some compatibility with regular expressions in
4594 Python, .NET, and Oniguruma.
4595
4596 Perl's regular expressions are described in its own documentation, and
4597 regular expressions in general are covered in a number of books, some
4598 of which have copious examples. Jeffrey Friedl's "Mastering Regular
4599 Expressions", published by O'Reilly, covers regular expressions in
4600 great detail. This description of PCRE's regular expressions is
4601 intended as reference material.
4602
4603 This document discusses the patterns that are supported by PCRE when
4604 one of its main matching functions, pcre_exec() (8-bit) or
4605 pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has alternative
4606 matching functions, pcre_dfa_exec() and pcre[16|32]_dfa_exec(), which
4607 match using a different algorithm that is not Perl-compatible. Some of
4608 the features discussed below are not available when DFA matching is
4609 used. The advantages and disadvantages of the alternative functions,
4610 and how they differ from the normal functions, are discussed in the
4611 pcrematching page.
4612
4613
4614 SPECIAL START-OF-PATTERN ITEMS
4615
4616 A number of options that can be passed to pcre_compile() can also be
4617 set by special items at the start of a pattern. These are not Perl-com-
4618 patible, but are provided to make these options accessible to pattern
4619 writers who are not able to change the program that processes the pat-
4620 tern. Any number of these items may appear, but they must all be
4621 together right at the start of the pattern string, and the letters must
4622 be in upper case.
4623
4624 UTF support
4625
4626 The original operation of PCRE was on strings of one-byte characters.
4627 However, there is now also support for UTF-8 strings in the original
4628 library, an extra library that supports 16-bit and UTF-16 character
4629 strings, and a third library that supports 32-bit and UTF-32 character
4630 strings. To use these features, PCRE must be built to include appropri-
4631 ate support. When using UTF strings you must either call the compiling
4632 function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the
4633 pattern must start with one of these special sequences:
4634
4635 (*UTF8)
4636 (*UTF16)
4637 (*UTF32)
4638 (*UTF)
4639
4640 (*UTF) is a generic sequence that can be used with any of the
4641 libraries. Starting a pattern with such a sequence is equivalent to
4642 setting the relevant option. How setting a UTF mode affects pattern
4643 matching is mentioned in several places below. There is also a summary
4644 of features in the pcreunicode page.
4645
4646 Some applications that allow their users to supply patterns may wish to
4647 restrict them to non-UTF data for security reasons. If the
4648 PCRE_NEVER_UTF option is set at compile time, (*UTF) etc. are not
4649 allowed, and their appearance causes an error.
4650
4651 Unicode property support
4652
4653 Another special sequence that may appear at the start of a pattern is
4654
4655 (*UCP)
4656
4657 This has the same effect as setting the PCRE_UCP option: it causes
4658 sequences such as \d and \w to use Unicode properties to determine
4659 character types, instead of recognizing only characters with codes less
4660 than 128 via a lookup table.
4661
4662 Disabling start-up optimizations
4663
4664 If a pattern starts with (*NO_START_OPT), it has the same effect as
4665 setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
4666 time.
4667
4668 Newline conventions
4669
4670 PCRE supports five different conventions for indicating line breaks in
4671 strings: a single CR (carriage return) character, a single LF (line-
4672 feed) character, the two-character sequence CRLF, any of the three pre-
4673 ceding, or any Unicode newline sequence. The pcreapi page has further
4674 discussion about newlines, and shows how to set the newline convention
4675 in the options arguments for the compiling and matching functions.
4676
4677 It is also possible to specify a newline convention by starting a pat-
4678 tern string with one of the following five sequences:
4679
4680 (*CR) carriage return
4681 (*LF) linefeed
4682 (*CRLF) carriage return, followed by linefeed
4683 (*ANYCRLF) any of the three above
4684 (*ANY) all Unicode newline sequences
4685
4686 These override the default and the options given to the compiling func-
4687 tion. For example, on a Unix system where LF is the default newline
4688 sequence, the pattern
4689
4690 (*CR)a.b
4691
4692 changes the convention to CR. That pattern matches "a\nb" because LF is
4693 no longer a newline. If more than one of these settings is present, the
4694 last one is used.
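
The "last one is used" rule can be illustrated with a small scanner over the
start of a pattern string. This is only a sketch of the rule, not PCRE's own
parser: the names demo_newline and demo_leading_newline are hypothetical,
and for simplicity the scan stops at the first text that is not one of the
five newline items, so it does not skip other start-of-pattern items such as
(*UTF8) that may be mixed in.

```c
#include <string.h>

typedef enum {
  DEMO_NL_DEFAULT,   /* no newline item found: the default applies */
  DEMO_NL_CR, DEMO_NL_LF, DEMO_NL_CRLF, DEMO_NL_ANYCRLF, DEMO_NL_ANY
} demo_newline;

/* Return the convention selected by newline items at the start of the
   pattern. Items must be in upper case. */
demo_newline demo_leading_newline(const char *pattern)
{
  static const struct { const char *item; demo_newline value; } table[] = {
    { "(*CR)",      DEMO_NL_CR      },
    { "(*LF)",      DEMO_NL_LF      },
    { "(*CRLF)",    DEMO_NL_CRLF    },
    { "(*ANYCRLF)", DEMO_NL_ANYCRLF },
    { "(*ANY)",     DEMO_NL_ANY     }
  };
  const size_t n = sizeof table / sizeof table[0];
  demo_newline result = DEMO_NL_DEFAULT;

  for (;;) {
    size_t i;
    for (i = 0; i < n; i++) {
      size_t len = strlen(table[i].item);
      if (strncmp(pattern, table[i].item, len) == 0) {
        result = table[i].value;   /* a later item overrides an earlier one */
        pattern += len;
        break;
      }
    }
    if (i == n) return result;     /* what follows is not a newline item */
  }
}
```

Because each table entry includes its closing parenthesis, "(*CR)" cannot be
mistaken for the start of "(*CRLF)", nor "(*ANY)" for "(*ANYCRLF)".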
4695
4696 The newline convention affects where the circumflex and dollar asser-
4697 tions are true. It also affects the interpretation of the dot metachar-
4698 acter when PCRE_DOTALL is not set, and the behaviour of \N. However, it
4699 does not affect what the \R escape sequence matches. By default, this
4700 is any Unicode newline sequence, for Perl compatibility. However, this
4701 can be changed; see the description of \R in the section entitled "New-
4702 line sequences" below. A change of \R setting can be combined with a
4703 change of newline convention.
4704
4705 Setting match and recursion limits
4706
4707 The caller of pcre_exec() can set a limit on the number of times the
4708 internal match() function is called and on the maximum depth of recur-
4709 sive calls. These facilities are provided to catch runaway matches that
4710 are provoked by patterns with huge matching trees (a typical example is
4711 a pattern with nested unlimited repeats) and to avoid running out of
4712 system stack by too much recursion. When one of these limits is
4713 reached, pcre_exec() gives an error return. The limits can also be set
4714 by items at the start of the pattern of the form
4715
4716 (*LIMIT_MATCH=d)
4717 (*LIMIT_RECURSION=d)
4718
4719 where d is any number of decimal digits. However, the value of the set-
4720 ting must be less than the value set by the caller of pcre_exec() for
4721 it to have any effect. In other words, the pattern writer can lower the
4722 limit set by the programmer, but not raise it. If there is more than
4723 one setting of one of these limits, the lower value is used.
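
Put another way, the limit in force is the smallest of the caller's value
and every value given in the pattern. A minimal sketch of that rule follows;
the function name demo_effective_limit and the idea of collecting the
pattern's settings into an array are assumptions made for the illustration,
not a description of how PCRE stores them.

```c
#include <stddef.h>

/* Combine the limit set by the caller of the matching function with the
   values from (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) items. A pattern
   setting can lower the limit but never raise it. */
unsigned long demo_effective_limit(unsigned long caller_limit,
                                   const unsigned long *pattern_limits,
                                   size_t count)
{
  unsigned long limit = caller_limit;
  size_t i;

  for (i = 0; i < count; i++)
    if (pattern_limits[i] < limit) limit = pattern_limits[i];
  return limit;
}
```

For example, with a caller limit of 1000 and a pattern that begins
(*LIMIT_MATCH=5000)(*LIMIT_MATCH=200), the result is 200; with only
(*LIMIT_MATCH=5000) the caller's 1000 remains in force.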
4724
4725
4726 EBCDIC CHARACTER CODES
4727
4728 PCRE can be compiled to run in an environment that uses EBCDIC as its
4729 character code rather than ASCII or Unicode (typically a mainframe sys-
4730 tem). In the sections below, character code values are ASCII or Uni-
4731 code; in an EBCDIC environment these characters may have different code
4732 values, and there are no code points greater than 255.
4733
4734
4735 CHARACTERS AND METACHARACTERS
4736
4737 A regular expression is a pattern that is matched against a subject
4738 string from left to right. Most characters stand for themselves in a
4739 pattern, and match the corresponding characters in the subject. As a
4740 trivial example, the pattern
4741
4742 The quick brown fox
4743
4744 matches a portion of a subject string that is identical to itself. When
4745 caseless matching is specified (the PCRE_CASELESS option), letters are
4746 matched independently of case. In a UTF mode, PCRE always understands
4747 the concept of case for characters whose values are less than 128, so
4748 caseless matching is always possible. For characters with higher val-
4749 ues, the concept of case is supported if PCRE is compiled with Unicode
4750 property support, but not otherwise. If you want to use caseless
4751 matching for characters 128 and above, you must ensure that PCRE is
4752 compiled with Unicode property support as well as with UTF support.
4753
4754 The power of regular expressions comes from the ability to include
4755 alternatives and repetitions in the pattern. These are encoded in the
4756 pattern by the use of metacharacters, which do not stand for themselves
4757 but instead are interpreted in some special way.
4758
4759 There are two different sets of metacharacters: those that are recog-
4760 nized anywhere in the pattern except within square brackets, and those
4761 that are recognized within square brackets. Outside square brackets,
4762 the metacharacters are as follows:
4763
4764 \ general escape character with several uses
4765 ^ assert start of string (or line, in multiline mode)
4766 $ assert end of string (or line, in multiline mode)
4767 . match any character except newline (by default)
4768 [ start character class definition
4769 | start of alternative branch
4770 ( start subpattern
4771 ) end subpattern
4772 ? extends the meaning of (
4773 also 0 or 1 quantifier
4774 also quantifier minimizer
4775 * 0 or more quantifier
4776 + 1 or more quantifier
4777 also "possessive quantifier"
4778 { start min/max quantifier
4779
4780 Part of a pattern that is in square brackets is called a "character
4781 class". In a character class the only metacharacters are:
4782
4783 \ general escape character
4784 ^ negate the class, but only if the first character
4785 - indicates character range
4786 [ POSIX character class (only if followed by POSIX
4787 syntax)
4788 ] terminates the character class
4789
4790 The following sections describe the use of each of the metacharacters.
4791
4792
4793 BACKSLASH
4794
4795 The backslash character has several uses. Firstly, if it is followed by
4796 a character that is not a number or a letter, it takes away any special
4797 meaning that character may have. This use of backslash as an escape
4798 character applies both inside and outside character classes.
4799
4800 For example, if you want to match a * character, you write \* in the
4801 pattern. This escaping action applies whether or not the following
4802 character would otherwise be interpreted as a metacharacter, so it is
4803 always safe to precede a non-alphanumeric with backslash to specify
4804 that it stands for itself. In particular, if you want to match a back-
4805 slash, you write \\.
4806
4807 In a UTF mode, only ASCII numbers and letters have any special meaning
4808 after a backslash. All other characters (in particular, those whose
4809 codepoints are greater than 127) are treated as literals.
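
Because it is always safe to put a backslash before a non-alphanumeric
character, an application can turn arbitrary literal text into a pattern
fragment mechanically. The sketch below shows one way to do so; the name
demo_escape_literal is hypothetical. It escapes only ASCII characters that
are not letters or digits and copies bytes with values of 128 or more
unchanged, on the assumption that in a UTF-8 pattern those bytes belong to
multi-byte characters that are literals anyway.

```c
#include <ctype.h>
#include <stddef.h>

/* Write a zero-terminated copy of text into out, with a backslash in
   front of every ASCII character that is not a letter or a digit.
   Returns the number of characters written (excluding the zero), or -1
   if out is too small. */
int demo_escape_literal(const char *text, char *out, size_t outsize)
{
  size_t used = 0;

  for (; *text != '\0'; text++) {
    unsigned char c = (unsigned char)*text;
    int needs_escape = c < 128 && !isalnum(c);

    /* Leave room for this item and for the terminating zero. */
    if (used + (needs_escape ? 2 : 1) >= outsize) return -1;
    if (needs_escape) out[used++] = '\\';
    out[used++] = (char)c;
  }
  if (outsize == 0) return -1;
  out[used] = '\0';
  return (int)used;
}
```

Given the text 1+1=2 this produces 1\+1\=2, which matches exactly that
string. Text containing a binary zero cannot be handled this way, because
the pattern is itself a zero-terminated string.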
4810
4811 If a pattern is compiled with the PCRE_EXTENDED option, white space in
4812 the pattern (other than in a character class) and characters between a
4813 # outside a character class and the next newline are ignored. An escap-
4814 ing backslash can be used to include a white space or # character as
4815 part of the pattern.
4816
4817 If you want to remove the special meaning from a sequence of charac-
4818 ters, you can do so by putting them between \Q and \E. This is differ-
4819 ent from Perl in that $ and @ are handled as literals in \Q...\E
4820 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
4821 tion. Note the following examples:
4822
4823 Pattern            PCRE matches   Perl matches
4824
4825 \Qabc$xyz\E        abc$xyz        abc followed by the
4826                                   contents of $xyz
4827 \Qabc\$xyz\E       abc\$xyz       abc\$xyz
4828 \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
4829
4830 The \Q...\E sequence is recognized both inside and outside character
4831 classes. An isolated \E that is not preceded by \Q is ignored. If \Q
4832 is not followed by \E later in the pattern, the literal interpretation
4833 continues to the end of the pattern (that is, \E is assumed at the
4834 end). If the isolated \Q is inside a character class, this causes an
4835 error, because the character class is not terminated.
4836
4837 Non-printing characters
4838
4839 A second use of backslash provides a way of encoding non-printing char-
4840 acters in patterns in a visible manner. There is no restriction on the
4841 appearance of non-printing characters, apart from the binary zero that
4842 terminates a pattern, but when a pattern is being prepared by text
4843 editing, it is often easier to use one of the following escape
4844 sequences than the binary character it represents:
4845
4846 \a alarm, that is, the BEL character (hex 07)
4847 \cx "control-x", where x is any ASCII character
4848 \e escape (hex 1B)
4849 \f form feed (hex 0C)
4850 \n linefeed (hex 0A)
4851 \r carriage return (hex 0D)
4852 \t tab (hex 09)
4853 \ddd character with octal code ddd, or back reference
4854 \xhh character with hex code hh
4855 \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
4856 \uhhhh character with hex code hhhh (JavaScript mode only)
4857
4858 The precise effect of \cx on ASCII characters is as follows: if x is a
4859 lower case letter, it is converted to upper case. Then bit 6 of the
4860 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
4861 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
4862 hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c
4863 has a value greater than 127, a compile-time error occurs. This locks
4864 out non-ASCII characters in all modes.
4865
4866 The \c facility was designed for use with ASCII characters, but with
4867 the extension to Unicode it is even less useful than it once was. It
4868 is, however, recognized when PCRE is compiled in EBCDIC mode, where
4869 data items are always bytes. In this mode, all values are valid after
4870 \c. If the next character is a lower case letter, it is converted to
4871 upper case. Then the 0xc0 bits of the byte are inverted. Thus \cA
4872 becomes hex 01, as in ASCII (A is C1), but because the EBCDIC letters
4873 are disjoint, \cZ becomes hex 29 (Z is E9), and other characters also
4874 generate different values.
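
The arithmetic behind \cx in both modes can be sketched in Python as follows. This is an illustrative model, not PCRE code; the function names are hypothetical, and the EBCDIC helper assumes that its argument is the byte value of a character that is already upper case.

```python
def control_ascii(x: str) -> int:
    """Model of \\cx for an ASCII character x."""
    code = ord(x)
    if code > 127:
        # PCRE reports a compile-time error for non-ASCII data items.
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters are converted to upper case
    return code ^ 0x40          # invert bit 6 (hex 40)


def control_ebcdic(byte: int) -> int:
    """Model of \\cx in EBCDIC mode for an upper case byte value."""
    return byte ^ 0xC0          # invert the 0xc0 bits
```

For example, control_ascii("{") gives hex 3B, and control_ebcdic(0xE9), that is \cZ in EBCDIC, gives hex 29.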
4875
4876 By default, after \x, from zero to two hexadecimal digits are read
4877 (letters can be in upper or lower case). Any number of hexadecimal dig-
4878 its may appear between \x{ and }, but the character code is constrained
4879 as follows:
4880
4881 8-bit non-UTF mode    less than 0x100
4882 8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
4883 16-bit non-UTF mode   less than 0x10000
4884 16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
4885 32-bit non-UTF mode   less than 0x80000000
4886 32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint
4887
4888 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
4889 called "surrogate" codepoints), and 0xffef.
4890
4891 If characters other than hexadecimal digits appear between \x{ and },
4892 or if there is no terminating }, this form of escape is not recognized.
4893 Instead, the initial \x will be interpreted as a basic hexadecimal
4894 escape, with no following digits, giving a character whose value is
4895 zero.
4896
4897 If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
4898 is as just described only when it is followed by two hexadecimal dig-
4899 its. Otherwise, it matches a literal "x" character. In JavaScript
4900 mode, support for code points greater than 256 is provided by \u, which
4901 must be followed by four hexadecimal digits; otherwise it matches a
4902 literal "u" character. Character codes specified by \u in JavaScript
4903 mode are constrained in the same way as those specified by \x in non-
4904 JavaScript mode.
4905
4906 Characters whose value is less than 256 can be defined by either of the
4907 two syntaxes for \x (or by \u in JavaScript mode). There is no differ-
4908 ence in the way they are handled. For example, \xdc is exactly the same
4909 as \x{dc} (or \u00dc in JavaScript mode).
4910
4911 After \0 up to two further octal digits are read. If there are fewer
4912 than two digits, just those that are present are used. Thus the
4913 sequence \0\x\07 specifies two binary zeros followed by a BEL character
4914 (code value 7). Make sure you supply two digits after the initial zero
4915 if the pattern character that follows is itself an octal digit.
4916
4917 The handling of a backslash followed by a digit other than 0 is compli-
4918 cated. Outside a character class, PCRE reads it and any following dig-
4919 its as a decimal number. If the number is less than 10, or if there
4920 have been at least that many previous capturing left parentheses in the
4921 expression, the entire sequence is taken as a back reference. A
4922 description of how this works is given later, following the discussion
4923 of parenthesized subpatterns.
4924
4925 Inside a character class, or if the decimal number is greater than 9
4926 and there have not been that many capturing subpatterns, PCRE re-reads
4927 up to three octal digits following the backslash, and uses them to gen-
4928 erate a data character. Any subsequent digits stand for themselves. The
4929 value of the character is constrained in the same way as characters
4930 specified in hexadecimal. For example:
4931
4932 \040   is another way of writing an ASCII space
4933 \40    is the same, provided there are fewer than 40
4934          previous capturing subpatterns
4935 \7     is always a back reference
4936 \11    might be a back reference, or another way of
4937          writing a tab
4938 \011   is always a tab
4939 \0113  is a tab followed by the character "3"
4940 \113   might be a back reference, otherwise the
4941          character with octal code 113
4942 \377   might be a back reference, otherwise
4943          the value 255 (decimal)
4944 \81    is either a back reference, or a binary zero
4945          followed by the two characters "8" and "1"
4946
4947 Note that octal values of 100 or greater must not be introduced by a
4948 leading zero, because no more than three octal digits are ever read.
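
The decision between a back reference and an octal character can be summarized by the following Python sketch. It models only the rules stated above and is not taken from the PCRE source; the function name and its return format are hypothetical. The argument digits is the run of decimal digits that follows the backslash, and groups_so_far is the number of capturing left parentheses seen so far.

```python
def interpret_digit_escape(digits: str, groups_so_far: int,
                           in_class: bool = False):
    """Return ("backref", n, "") or ("char", value, remaining_digits)."""
    if digits[0] != "0" and not in_class:
        number = int(digits)              # read all the digits as decimal
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch in "01234567":
            octal += ch
        else:
            break
    value = int(octal, 8) if octal else 0  # no octal digits gives binary zero
    # Any digits that were not consumed stand for themselves.
    return ("char", value, digits[len(octal):])
```

For instance, interpret_digit_escape("81", 0) yields a binary zero followed by the literal digits "81", while interpret_digit_escape("11", 11) yields a back reference to group 11.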
4949
4950 All the sequences that define a single character value can be used both
4951 inside and outside character classes. In addition, inside a character
4952 class, \b is interpreted as the backspace character (hex 08).
4953
4954 \N is not allowed in a character class. \B, \R, and \X are not special
4955 inside a character class. Like other unrecognized escape sequences,
4956 they are treated as the literal characters "B", "R", and "X" by
4957 default, but cause an error if the PCRE_EXTRA option is set. Outside a
4958 character class, these sequences have different meanings.
4959
4960 Unsupported escape sequences
4961
4962 In Perl, the sequences \l, \L, \u, and \U are recognized by its string
4963 handler and used to modify the case of following characters. By
4964 default, PCRE does not support these escape sequences. However, if the
4965 PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
4966 \u can be used to define a character by code point, as described in the
4967 previous section.
4968
4969 Absolute and relative back references
4970
4971 The sequence \g followed by an unsigned or a negative number, option-
4972 ally enclosed in braces, is an absolute or relative back reference. A
4973 named back reference can be coded as \g{name}. Back references are dis-
4974 cussed later, following the discussion of parenthesized subpatterns.
4975
4976 Absolute and relative subroutine calls
4977
4978 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
4979 name or a number enclosed either in angle brackets or single quotes, is
4980 an alternative syntax for referencing a subpattern as a "subroutine".
4981 Details are discussed later. Note that \g{...} (Perl syntax) and
4982 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
4983 reference; the latter is a subroutine call.
4984
4985 Generic character types
4986
4987 Another use of backslash is for specifying generic character types:
4988
4989 \d any decimal digit
4990 \D any character that is not a decimal digit
4991 \h any horizontal white space character
4992 \H any character that is not a horizontal white space character
4993 \s any white space character
4994 \S any character that is not a white space character
4995 \v any vertical white space character
4996 \V any character that is not a vertical white space character
4997 \w any "word" character
4998 \W any "non-word" character
4999
5000 There is also the single sequence \N, which matches a non-newline char-
5001 acter. This is the same as the "." metacharacter when PCRE_DOTALL is
5002 not set. Perl also uses \N to match characters by name; PCRE does not
5003 support this.
5004
5005 Each pair of lower and upper case escape sequences partitions the com-
5006 plete set of characters into two disjoint sets. Any given character
5007 matches one, and only one, of each pair. The sequences can appear both
5008 inside and outside character classes. They each match one character of
5009 the appropriate type. If the current matching point is at the end of
5010 the subject string, all of them fail, because there is no character to
5011 match.
5012
5013 For compatibility with Perl, \s does not match the VT character (code
5014 11). This makes it different from the POSIX "space" class. The \s
5015 characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
5016 "use locale;" is included in a Perl script, \s may match the VT charac-
5017 ter. In PCRE, it never does.
5018
5019 A "word" character is an underscore or any character that is a letter
5020 or digit. By default, the definition of letters and digits is con-
5021 trolled by PCRE's low-valued character tables, and may vary if locale-
5022 specific matching is taking place (see "Locale support" in the pcreapi
5023 page). For example, in a French locale such as "fr_FR" in Unix-like
5024 systems, or "french" in Windows, some character codes greater than 128
5025 are used for accented letters, and these are then matched by \w. The
5026 use of locales with Unicode is discouraged.
5027
5028 By default, in a UTF mode, characters with values greater than 127
5029 never match \d, \s, or \w, and always match \D, \S, and \W. These
5030 sequences retain their original meanings from before UTF support was
5031 available, mainly for efficiency reasons. However, if PCRE is compiled
5032 with Unicode property support, and the PCRE_UCP option is set, the be-
5033 haviour is changed so that Unicode properties are used to determine
5034 character types, as follows:
5035
5036 \d any character that \p{Nd} matches (decimal digit)
5037 \s any character that \p{Z} matches, plus HT, LF, FF, CR
5038 \w any character that \p{L} or \p{N} matches, plus underscore
5039
5040 The upper case escapes match the inverse sets of characters. Note that
5041 \d matches only decimal digits, whereas \w matches any Unicode digit,
5042 as well as any Unicode letter, and underscore. Note also that PCRE_UCP
5043 affects \b and \B, because they are defined in terms of \w and \W.
5044 Matching these sequences is noticeably slower when PCRE_UCP is set.
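
The PCRE_UCP definitions of \d, \s, and \w can be approximated with Python's unicodedata module, as in the sketch below. This is a model under stated assumptions, not a call into PCRE: the function names are hypothetical, and the Unicode tables shipped with Python may be from a different Unicode version than those built into PCRE.

```python
import unicodedata


def ucp_digit(ch: str) -> bool:
    """Model of \\d under PCRE_UCP: anything that \\p{Nd} matches."""
    return unicodedata.category(ch) == "Nd"


def ucp_space(ch: str) -> bool:
    """Model of \\s under PCRE_UCP: \\p{Z} plus HT, LF, FF, CR (not VT)."""
    return unicodedata.category(ch).startswith("Z") or ch in "\t\n\f\r"


def ucp_word(ch: str) -> bool:
    """Model of \\w under PCRE_UCP: \\p{L} or \\p{N}, plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

The superscript two (U+00B2) shows the difference noted above: it has the No property, so ucp_word accepts it but ucp_digit does not.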
5045
5046 The sequences \h, \H, \v, and \V are features that were added to Perl
5047 at release 5.10. In contrast to the other sequences, which match only
5048 ASCII characters by default, these always match certain high-valued
5049 codepoints, whether or not PCRE_UCP is set. The horizontal space char-
5050 acters are:
5051
5052 U+0009 Horizontal tab (HT)
5053 U+0020 Space
5054 U+00A0 Non-break space
5055 U+1680 Ogham space mark
5056 U+180E Mongolian vowel separator
5057 U+2000 En quad
5058 U+2001 Em quad
5059 U+2002 En space
5060 U+2003 Em space
5061 U+2004 Three-per-em space
5062 U+2005 Four-per-em space
5063 U+2006 Six-per-em space
5064 U+2007 Figure space
5065 U+2008 Punctuation space
5066 U+2009 Thin space
5067 U+200A Hair space
5068 U+202F Narrow no-break space
5069 U+205F Medium mathematical space
5070 U+3000 Ideographic space
5071
5072 The vertical space characters are:
5073
5074 U+000A Linefeed (LF)
5075 U+000B Vertical tab (VT)
5076 U+000C Form feed (FF)
5077 U+000D Carriage return (CR)
5078 U+0085 Next line (NEL)
5079 U+2028 Line separator
5080 U+2029 Paragraph separator
5081
5082 In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
5083 256 are relevant.
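
The two lists above can be written down as plain sets of code points, which makes the 8-bit restriction easy to see. The Python sketch below is illustrative only; the names HSPACE, VSPACE, is_hspace, and is_vspace are hypothetical.

```python
# Code points matched by \h, copied from the list above.
HSPACE = {0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
          *range(0x2000, 0x200B),          # U+2000 to U+200A inclusive
          0x202F, 0x205F, 0x3000}

# Code points matched by \v, copied from the list above.
VSPACE = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}


def is_hspace(cp: int, eight_bit_non_utf: bool = False) -> bool:
    """True if \\h would match cp; only values below 256 exist in 8-bit mode."""
    if eight_bit_non_utf and cp > 0xFF:
        return False
    return cp in HSPACE


def is_vspace(cp: int, eight_bit_non_utf: bool = False) -> bool:
    """True if \\v would match cp; only values below 256 exist in 8-bit mode."""
    if eight_bit_non_utf and cp > 0xFF:
        return False
    return cp in VSPACE
```

In 8-bit non-UTF-8 mode this leaves HT, space, and non-break space for \h, and LF, VT, FF, CR, and NEL for \v.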
5084
5085 Newline sequences
5086
5087 Outside a character class, by default, the escape sequence \R matches
5088 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
5089 to the following:
5090
5091 (?>\r\n|\n|\x0b|\f|\r|\x85)
5092
5093 This is an example of an "atomic group", details of which are given
5094 below. This particular group matches either the two-character sequence
5095 CR followed by LF, or one of the single characters LF (linefeed,
5096 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
5097 riage return, U+000D), or NEL (next line, U+0085). The two-character
5098 sequence is treated as a single unit that cannot be split.
5099
5100 In other modes, two additional characters whose codepoints are greater
5101 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
5102 rator, U+2029). Unicode character property support is not needed for
5103 these characters to be recognized.
5104
5105 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
5106 the complete set of Unicode line endings) by setting the option
5107 PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
5108 (BSR is an abbreviation for "backslash R".) This can be made the default
5109 when PCRE is built; if this is the case, the other behaviour can be
5110 requested via the PCRE_BSR_UNICODE option. It is also possible to
5111 specify these settings by starting a pattern string with one of the
5112 following sequences:
5113
5114 (*BSR_ANYCRLF) CR, LF, or CRLF only
5115 (*BSR_UNICODE) any Unicode newline sequence
5116
5117 These override the default and the options given to the compiling func-
5118 tion, but they can themselves be overridden by options given to a
5119 matching function. Note that these special settings, which are not
5120 Perl-compatible, are recognized only at the very start of a pattern,
5121 and that they must be in upper case. If more than one of them is
5122 present, the last one is used. They can be combined with a change of
5123 newline convention; for example, a pattern can start with:
5124
5125 (*ANY)(*BSR_ANYCRLF)
5126
5127 They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
5128 or (*UCP) special sequences. Inside a character class, \R is treated as
5129 an unrecognized escape sequence, and so matches the letter "R" by
5130 default, but causes an error if PCRE_EXTRA is set.
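
The behaviour of \R outside a character class can be modelled by a small Python function that reports how many characters \R would consume at a given position. This is a sketch of the rules in this section, not PCRE code; the function name and its keyword arguments are hypothetical stand-ins for the PCRE_BSR_ANYCRLF setting and for 8-bit non-UTF-8 mode.

```python
def newline_sequence_length(subject: str, pos: int,
                            anycrlf: bool = False,
                            eight_bit_non_utf: bool = False) -> int:
    """Length matched by \\R at pos, or 0 if \\R does not match there."""
    # CRLF is tried first and is treated as a single unit that is never split.
    if subject.startswith("\r\n", pos):
        return 2
    if anycrlf:
        singles = "\r\n"                    # CR or LF only
    else:
        singles = "\n\x0b\x0c\r\x85"        # LF, VT, FF, CR, NEL
        if not eight_bit_non_utf:
            singles += "\u2028\u2029"       # LS and PS in the other modes
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0
```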
5131
5132 Unicode character properties
5133
5134 When PCRE is built with Unicode character property support, three addi-
5135 tional escape sequences that match characters with specific properties
5136 are available. When in 8-bit non-UTF-8 mode, these sequences are of
5137 course limited to testing characters whose codepoints are less than
5138 256, but they do work in this mode. The extra escape sequences are:
5139
5140 \p{xx} a character with the xx property
5141 \P{xx} a character without the xx property
5142 \X a Unicode extended grapheme cluster
5143
5144 The property names represented by xx above are limited to the Unicode
5145 script names, the general category properties, "Any", which matches any
5146 character (including newline), and some special PCRE properties
5147 (described in the next section). Other Perl properties such as "InMu-
5148 sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
5149 does not match any characters, so always causes a match failure.
5150
5151 Sets of Unicode characters are defined as belonging to certain scripts.
5152 A character from one of these sets can be matched using a script name.
5153 For example:
5154
5155 \p{Greek}
5156 \P{Han}
5157
5158 Those that are not part of an identified script are lumped together as
5159 "Common". The current list of scripts is:
5160
5161 Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
5162 Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
5163 Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
5164 Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic,
5165 Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
5166 gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-
5167 tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
5168 Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
5169 Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive,
5170 Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko,
5171 Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
5172 Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari-
5173 tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese,
5174 Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet,
5175 Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
5176 Yi.
5177
5178 Each character has exactly one Unicode general category property, spec-
5179 ified by a two-letter abbreviation. For compatibility with Perl, nega-
5180 tion can be specified by including a circumflex between the opening
5181 brace and the property name. For example, \p{^Lu} is the same as
5182 \P{Lu}.
5183
5184 If only one letter is specified with \p or \P, it includes all the gen-
5185 eral category properties that start with that letter. In this case, in
5186 the absence of negation, the curly brackets in the escape sequence are
5187 optional; these two examples have the same effect:
5188
5189 \p{L}
5190 \pL
5191
5192 The following general category property codes are supported:
5193
5194 C Other
5195 Cc Control
5196 Cf Format
5197 Cn Unassigned
5198 Co Private use
5199 Cs Surrogate
5200
5201 L Letter
5202 Ll Lower case letter
5203 Lm Modifier letter
5204 Lo Other letter
5205 Lt Title case letter
5206 Lu Upper case letter
5207
5208 M Mark
5209 Mc Spacing mark
5210 Me Enclosing mark
5211 Mn Non-spacing mark
5212
5213 N Number
5214 Nd Decimal number
5215 Nl Letter number
5216 No Other number
5217
5218 P Punctuation
5219 Pc Connector punctuation
5220 Pd Dash punctuation
5221 Pe Close punctuation
5222 Pf Final punctuation
5223 Pi Initial punctuation
5224 Po Other punctuation
5225 Ps Open punctuation
5226
5227 S Symbol
5228 Sc Currency symbol
5229 Sk Modifier symbol
5230 Sm Mathematical symbol
5231 So Other symbol
5232
5233 Z Separator
5234 Zl Line separator
5235 Zp Paragraph separator
5236 Zs Space separator
5237
5238 The special property L& is also supported: it matches a character that
5239 has the Lu, Ll, or Lt property, in other words, a letter that is not
5240 classified as a modifier or "other".
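
The way one-letter and two-letter general category codes, negation, and L& relate to each other can be sketched with Python's unicodedata module. The function below is a hypothetical model, not part of PCRE, and it handles general category codes only (no script names).

```python
import unicodedata


def has_general_category(ch: str, prop: str) -> bool:
    """Model of \\p{prop} for general category codes, with ^ negation."""
    negate = prop.startswith("^")
    name = prop[1:] if negate else prop
    category = unicodedata.category(ch)      # always a two-letter code
    if name == "L&":
        hit = category in ("Lu", "Ll", "Lt")  # cased letters only
    elif len(name) == 1:
        hit = category.startswith(name)       # one letter covers the group
    else:
        hit = category == name
    return hit != negate
```

For example, has_general_category("A", "^Lu") is False, which corresponds to \P{Lu} failing to match an upper case letter.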
5241
5242 The Cs (Surrogate) property applies only to characters in the range
5243 U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
5244 so cannot be tested by PCRE, unless UTF validity checking has been
5245 turned off (see the discussion of PCRE_NO_UTF8_CHECK,
5246 PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl
5247 does not support the Cs property.
5248
5249 The long synonyms for property names that Perl supports (such as
5250 \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
5251 any of these properties with "Is".
5252
5253 No character that is in the Unicode table has the Cn (unassigned) prop-
5254 erty. Instead, this property is assumed for any code point that is not
5255 in the Unicode table.
5256
5257 Specifying caseless matching does not affect these escape sequences.
5258 For example, \p{Lu} always matches only upper case letters. This is
5259 different from the behaviour of current versions of Perl.
5260
5261 Matching characters by Unicode property is not fast, because PCRE has
5262 to do a multistage table lookup in order to find a character's prop-
5263 erty. That is why the traditional escape sequences such as \d and \w do
5264 not use Unicode properties in PCRE by default, though you can make them
5265 do so by setting the PCRE_UCP option or by starting the pattern with
5266 (*UCP).
5267
5268 Extended grapheme clusters
5269
5270 The \X escape matches any number of Unicode characters that form an
5271 "extended grapheme cluster", and treats the sequence as an atomic group
5272 (see below). Up to and including release 8.31, PCRE matched an ear-
5273 lier, simpler definition that was equivalent to
5274
5275 (?>\PM\pM*)
5276
5277 That is, it matched a character without the "mark" property, followed
5278 by zero or more characters with the "mark" property. Characters with
5279 the "mark" property are typically non-spacing accents that affect the
5280 preceding character.
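
The earlier definition, (?>\PM\pM*), is simple enough to model directly. The Python sketch below uses the unicodedata module to test for the "mark" property; it illustrates the old rule only, and the function names are hypothetical.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if ch has one of the M (mark) general categories."""
    return unicodedata.category(ch).startswith("M")


def legacy_x_length(subject: str, pos: int) -> int:
    """Characters matched at pos by the pre-8.32 \\X, or 0 for no match."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return 0                      # \PM needs a non-mark character first
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                      # \pM* takes every following mark
    return end - pos
```

Applied to "e" followed by U+0301 (combining acute accent), the function reports a length of 2, so the base letter and its accent are taken together.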
5281
5282 This simple definition was extended in Unicode to include more compli-
5283 cated kinds of composite character by giving each character a grapheme
5284 breaking property, and creating rules that use these properties to
5285 define the boundaries of extended grapheme clusters. In releases of
5286 PCRE later than 8.31, \X matches one of these clusters.
5287
5288 \X always matches at least one character. Then it decides whether to
5289 add additional characters according to the following rules for ending a
5290 cluster:
5291
5292 1. End at the end of the subject string.
5293
5294 2. Do not end between CR and LF; otherwise end after any control char-
5295 acter.
5296
5297 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
5298 characters are of five types: L, V, T, LV, and LVT. An L character may
5299 be followed by an L, V, LV, or LVT character; an LV or V character may
5300 be followed by a V or T character; an LVT or T character may be followed
5301 only by a T character.
5302
5303 4. Do not end before extending characters or spacing marks. Characters
5304 with the "mark" property always have the "extend" grapheme breaking
5305 property.
5306
5307 5. Do not end after prepend characters.
5308
5309 6. Otherwise, end the cluster.
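
Rule 3 amounts to a small table of which Hangul syllable type may follow which. The Python sketch below encodes that table. It is an illustration only: the names are hypothetical, and the code point ranges used to classify characters are an assumption based on the main Hangul Jamo block (U+1100 to U+11FF) and the precomposed syllables (U+AC00 to U+D7A3); the extended Jamo blocks are not covered.

```python
# For each type, the set of types that may follow without ending the cluster.
MAY_FOLLOW = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}


def hangul_type(ch: str):
    """Return L, V, T, LV, or LVT for a Hangul character, else None."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"                    # leading consonants
    if 0x1160 <= cp <= 0x11A7:
        return "V"                    # vowels
    if 0x11A8 <= cp <= 0x11FF:
        return "T"                    # trailing consonants
    if 0xAC00 <= cp <= 0xD7A3:
        # Every 28th precomposed syllable has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None


def no_break_between(first: str, second: str) -> bool:
    """True if rule 3 forbids ending a cluster between the two characters."""
    first_type = hangul_type(first)
    return first_type is not None and hangul_type(second) in MAY_FOLLOW[first_type]
```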
5310
5311 PCRE's additional properties
5312
5313 As well as the standard Unicode properties described above, PCRE sup-
5314 ports four more that make it possible to convert traditional escape
5315 sequences such as \w and \s and POSIX character classes to use Unicode
5316 properties. PCRE uses these non-standard, non-Perl properties inter-
5317 nally when PCRE_UCP is set. However, they may also be used explicitly.
5318 These properties are:
5319
5320 Xan Any alphanumeric character
5321 Xps Any POSIX space character
5322 Xsp Any Perl space character
5323 Xwd Any Perl "word" character
5324
5325 Xan matches characters that have either the L (letter) or the N (num-
5326 ber) property. Xps matches the characters tab, linefeed, vertical tab,
5327 form feed, or carriage return, and any other character that has the Z
5328 (separator) property. Xsp is the same as Xps, except that vertical tab
5329 is excluded. Xwd matches the same characters as Xan, plus underscore.
5330
5331 There is another non-standard property, Xuc, which matches any charac-
5332 ter that can be represented by a Universal Character Name in C++ and
5333 other programming languages. These are the characters $, @, ` (grave
5334 accent), and all characters with Unicode code points greater than or
5335 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
5336 most base (ASCII) characters are excluded. (Universal Character Names
5337 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
5338 Note that the Xuc property does not match these sequences but the char-
5339 acters that they represent.)
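
The Xuc definition translates directly into a membership test. The following Python sketch restates the rule given above; the function name is hypothetical.

```python
def is_xuc(ch: str) -> bool:
    """Model of \\p{Xuc}: can ch be written as a Universal Character Name?"""
    if ch in "$@`":
        return True                       # the three permitted ASCII characters
    cp = ord(ch)
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # surrogates are excluded
    return cp >= 0xA0                     # everything from U+00A0 upwards
```

Note that is_xuc("A") is False: ordinary ASCII letters are among the base characters that are excluded.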
5340
5341 Resetting the match start
5342
5343 The escape sequence \K causes any previously matched characters not to
5344 be included in the final matched sequence. For example, the pattern:
5345
5346 foo\Kbar
5347
5348 matches "foobar", but reports that it has matched "bar". This feature
5349 is similar to a lookbehind assertion (described below). However, in
5350 this case, the part of the subject before the real match does not have
5351 to be of fixed length, as lookbehind assertions do. The use of \K does
5352 not interfere with the setting of captured substrings. For example,
5353 when the pattern
5354
5355 (foo)\Kbar
5356
5357 matches "foobar", the first substring is still set to "foo".
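
The effect of \K can be imitated in an engine that lacks it by splitting the pattern at the point where \K would stand and reporting only the second part. The Python sketch below does this with a named group; Python's re module does not support \K itself, and the function and group names are hypothetical. It only models a single \K at the top level of a pattern.

```python
import re


def search_with_reset(prefix: str, rest: str, subject: str):
    """Search for prefix\\Krest; return (reported_span, match) or None."""
    # The prefix must match but is left out of the reported span.
    match = re.search("(?:%s)(?P<kept>%s)" % (prefix, rest), subject)
    if match is None:
        return None
    return match.span("kept"), match
```

Calling search_with_reset("(foo)", "bar", "foobar") reports the span of "bar" only, while group 1 of the returned match is still "foo". The prefix may be of variable length, for example "fo+".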
5358
5359 Perl documents that the use of \K within assertions is "not well
5360 defined". In PCRE, \K is acted upon when it occurs inside positive
5361 assertions, but is ignored in negative assertions.
5362
5363 Simple assertions
5364
5365 The final use of backslash is for certain simple assertions. An asser-
5366 tion specifies a condition that has to be met at a particular point in
5367 a match, without consuming any characters from the subject string. The
5368 use of subpatterns for more complicated assertions is described below.
5369 The backslashed assertions are:
5370
5371 \b   matches at a word boundary
5372 \B   matches when not at a word boundary
5373 \A   matches at the start of the subject
5374 \Z   matches at the end of the subject
5375        also matches before a newline at the end of the subject
5376 \z   matches only at the end of the subject
5377 \G   matches at the first matching position in the subject
5378
5379 Inside a character class, \b has a different meaning; it matches the
5380 backspace character. If any other of these assertions appears in a
5381 character class, by default it matches the corresponding literal char-
5382 acter (for example, \B matches the letter B). However, if the
5383 PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
5384 ated instead.
5385
5386 A word boundary is a position in the subject string where the current
5387 character and the previous character do not both match \w or \W (i.e.
5388 one matches \w and the other matches \W), or the start or end of the
5389 string if the first or last character matches \w, respectively. In a
5390 UTF mode, the meanings of \w and \W can be changed by setting the
5391 PCRE_UCP option. When this is done, it also affects \b and \B. Neither
5392 PCRE nor Perl has a separate "start of word" or "end of word" metase-
5393 quence. However, whatever follows \b normally determines which it is.
5394 For example, the fragment \ba matches "a" at the start of a word.
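
The definition of a word boundary can be expressed as a comparison of the characters on either side of a position. The Python sketch below models the default (non-UCP) case, where \w is an ASCII letter, digit, or underscore; the function names are hypothetical.

```python
def is_word_char(ch: str) -> bool:
    """Default \\w: ASCII letter, ASCII digit, or underscore."""
    return ch == "_" or (ch.isascii() and ch.isalnum())


def is_word_boundary(subject: str, pos: int) -> bool:
    """True if \\b would succeed at offset pos in subject."""
    # Outside the string there is no character, which counts as non-word.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after
```

For the subject "ab cd", the boundaries are at offsets 0, 2, 3, and 5. \B succeeds at exactly the positions where this function returns False.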
5395
5396 The \A, \Z, and \z assertions differ from the traditional circumflex
5397 and dollar (described in the next section) in that they only ever match
5398 at the very start and end of the subject string, whatever options are
5399 set. Thus, they are independent of multiline mode. These three asser-
5400 tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
5401 affect only the behaviour of the circumflex and dollar metacharacters.
5402 However, if the startoffset argument of pcre_exec() is non-zero, indi-
5403 cating that matching is to start at a point other than the beginning of
5404 the subject, \A can never match. The difference between \Z and \z is
5405 that \Z matches before a newline at the end of the string as well as at
5406 the very end, whereas \z matches only at the end.
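
The difference between \Z and \z can be demonstrated in Python, with one caveat: Python's re module spells these assertions differently. Its \Z matches only at the very end of the string, like PCRE's \z, so the sketch below writes PCRE's \Z as a lookahead. It assumes LF as the newline character; the constant and function names are hypothetical.

```python
import re

PCRE_BIG_Z = r"(?=\n?\Z)"    # PCRE \Z: at the end, or before a final newline
PCRE_SMALL_Z = r"\Z"         # PCRE \z: only at the very end (Python's \Z)


def found(pattern: str, subject: str, flags: int = 0) -> bool:
    """True if pattern matches somewhere in subject."""
    return re.search(pattern, subject, flags) is not None
```

With these definitions, "abc" followed by PCRE_BIG_Z is found in "abc\n", whereas "abc" followed by PCRE_SMALL_Z is not. In multiline mode, \Aabc is still not found in "x\nabc", although ^abc is.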
5407
5408 The \G assertion is true only when the current matching position is at
5409 the start point of the match, as specified by the startoffset argument
5410 of pcre_exec(). It differs from \A when the value of startoffset is
5411 non-zero. By calling pcre_exec() multiple times with appropriate argu-
5412 ments, you can mimic Perl's /g option, and it is in this kind of imple-
5413 mentation that \G can be useful.
5414
5415 Note, however, that PCRE's interpretation of \G, as the start of the
5416 current match, is subtly different from Perl's, which defines it as the
5417 end of the previous match. In Perl, these can be different when the
5418 previously matched string was empty. Because PCRE does just one match
5419 at a time, it cannot reproduce this behaviour.
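
The repeated-call technique can be sketched in Python. Python's re module has no \G, so the sketch relies on Pattern.match() with a starting position, which only succeeds when the match begins exactly at that position; this plays the role of a pattern that starts with \G combined with a non-zero startoffset. The function name is hypothetical, and the handling of empty matches is a simplification.

```python
import re


def scan_all(pattern: "re.Pattern", subject: str) -> list:
    """Collect consecutive matches, each anchored where the last one ended."""
    results = []
    pos = 0
    while pos <= len(subject):
        match = pattern.match(subject, pos)   # anchored at pos, like \G
        if match is None:
            break                             # the chain of matches is broken
        results.append(match.group())
        # Step past an empty match so that the loop always makes progress.
        pos = match.end() if match.end() > pos else pos + 1
    return results
```

For the subject "123a45" and a pattern that matches one digit, the scan returns the first three digits and then stops at "a", because each match must start where the previous one ended.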
5420
5421 If all the alternatives of a pattern begin with \G, the expression is
5422 anchored to the starting match position, and the "anchored" flag is set
5423 in the compiled regular expression.
5424
5425
5426 CIRCUMFLEX AND DOLLAR
5427
5428 The circumflex and dollar metacharacters are zero-width assertions.
5429 That is, they test for a particular condition being true without con-
5430 suming any characters from the subject string.
5431
5432 Outside a character class, in the default matching mode, the circumflex
5433 character is an assertion that is true only if the current matching
5434 point is at the start of the subject string. If the startoffset argu-
5435 ment of pcre_exec() is non-zero, circumflex can never match if the
5436 PCRE_MULTILINE option is unset. Inside a character class, circumflex
5437 has an entirely different meaning (see below).
5438
5439 Circumflex need not be the first character of the pattern if a number
5440 of alternatives are involved, but it should be the first thing in each
5441 alternative in which it appears if the pattern is ever to match that
5442 branch. If all possible alternatives start with a circumflex, that is,
5443 if the pattern is constrained to match only at the start of the sub-
5444 ject, it is said to be an "anchored" pattern. (There are also other
5445 constructs that can cause a pattern to be anchored.)
5446
5447 The dollar character is an assertion that is true only if the current
5448 matching point is at the end of the subject string, or immediately
5449 before a newline at the end of the string (by default). Note, however,
5450 that it does not actually match the newline. Dollar need not be the
5451 last character of the pattern if a number of alternatives are involved,
5452 but it should be the last item in any branch in which it appears. Dol-
5453 lar has no special meaning in a character class.
5454
5455 The meaning of dollar can be changed so that it matches only at the
5456 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
5457 compile time. This does not affect the \Z assertion.
5458
5459 The meanings of the circumflex and dollar characters are changed if the
5460 PCRE_MULTILINE option is set. When this is the case, a circumflex
5461 matches immediately after internal newlines as well as at the start of
5462 the subject string. It does not match after a newline that ends the
5463 string. A dollar matches before any newlines in the string, as well as
5464 at the very end, when PCRE_MULTILINE is set. When newline is specified
5465 as the two-character sequence CRLF, isolated CR and LF characters do
5466 not indicate newlines.
5467
5468 For example, the pattern /^abc$/ matches the subject string "def\nabc"
5469 (where \n represents a newline) in multiline mode, but not otherwise.
5470 Consequently, patterns that are anchored in single line mode because
5471 all branches start with ^ are not anchored in multiline mode, and a
5472 match for circumflex is possible when the startoffset argument of
5473 pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
5474 PCRE_MULTILINE is set.
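
For subjects that use LF as the newline, Python's re module follows the same convention for circumflex and dollar in the cases discussed here, so the /^abc$/ example can be tried out as below. This is an illustration using a different engine, not PCRE itself; as far as I am aware, Python differs in one detail, in that its multiline circumflex also matches after a newline that ends the string.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: $ matches before a newline at the end of the string.
before_final_newline = re.search(r"abc$", "abc\n")
```

Here single_line is None, multi_line starts at offset 4, and before_final_newline ends at offset 3, leaving the newline unmatched.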
5475
5476 Note that the sequences \A, \Z, and \z can be used to match the start
5477 and end of the subject in both modes, and if all branches of a pattern
5478 start with \A it is always anchored, whether or not PCRE_MULTILINE is
5479 set.
5480
5481
5482 FULL STOP (PERIOD, DOT) AND \N
5483
5484 Outside a character class, a dot in the pattern matches any one charac-
5485 ter in the subject string except (by default) a character that signi-
5486 fies the end of a line.
5487
5488 When a line ending is defined as a single character, dot never matches
5489 that character; when the two-character sequence CRLF is used, dot does
5490 not match CR if it is immediately followed by LF, but otherwise it
5491 matches all characters (including isolated CRs and LFs). When any Uni-
5492 code line endings are being recognized, dot does not match CR or LF or
5493 any of the other line ending characters.
5494
5495 The behaviour of dot with regard to newlines can be changed. If the
5496 PCRE_DOTALL option is set, a dot matches any one character, without
5497 exception. If the two-character sequence CRLF is present in the subject
5498 string, it takes two dots to match it.
5499
5500 The handling of dot is entirely independent of the handling of circum-
5501 flex and dollar, the only relationship being that they both involve
5502 newlines. Dot has no special meaning in a character class.
5503
5504 The escape sequence \N behaves like a dot, except that it is not
5505 affected by the PCRE_DOTALL option. In other words, it matches any
5506 character except one that signifies the end of a line. Perl also uses
5507 \N to match characters by name; PCRE does not support this.
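
The effect of PCRE_DOTALL can be illustrated with Python's re module, whose re.DOTALL flag plays the same role when LF is the only line ending. This is a sketch using a different engine: Python does not support \N as a non-newline escape, so the class [^\n] is used as a stand-in for it.

```python
import re

# By default a dot does not match the newline character.
default_dot = re.fullmatch(r".", "\n")

# With DOTALL a dot matches any one character, without exception.
dotall_dot = re.fullmatch(r".", "\n", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
crlf_one_dot = re.fullmatch(r".", "\r\n", re.DOTALL)
crlf_two_dots = re.fullmatch(r"..", "\r\n", re.DOTALL)

# The stand-in for \N is not affected by DOTALL.
non_newline = re.fullmatch(r"[^\n]", "\n", re.DOTALL)
```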
5508
5509
5510 MATCHING A SINGLE DATA UNIT
5511
5512 Outside a character class, the escape sequence \C matches any one data
5513 unit, whether or not a UTF mode is set. In the 8-bit library, one data
5514 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
5515 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
5516 line-ending characters. The feature is provided in Perl in order to
5517 match individual bytes in UTF-8 mode, but it is unclear how it can use-
5518 fully be used. Because \C breaks up characters into individual data
5519 units, matching one unit with \C in a UTF mode means that the rest of
5520 the string may start with a malformed UTF character. This has undefined
5521 results, because PCRE assumes that it is dealing with valid UTF strings
5522 (and by default it checks this at the start of processing unless the
5523 PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option
5524 is used).
5525
5526 PCRE does not allow \C to appear in lookbehind assertions (described
5527 below) in a UTF mode, because this would make it impossible to calcu-
5528 late the length of the lookbehind.
5529
5530 In general, the \C escape sequence is best avoided. However, one way of
5531 using it that avoids the problem of malformed UTF characters is to use
5532 a lookahead to check the length of the next character, as in this pat-
5533 tern, which could be used with a UTF-8 string (ignore white space and
5534 line breaks):
5535
5536 (?| (?=[\x00-\x7f])(\C) |
5537 (?=[\x80-\x{7ff}])(\C)(\C) |
5538 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
5539 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
5540
5541 A group that starts with (?| resets the capturing parentheses numbers
5542 in each alternative (see "Duplicate Subpattern Numbers" below). The
5543 assertions at the start of each branch check the next UTF-8 character
5544 for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
5545 character's individual bytes are then captured by the appropriate num-
5546 ber of groups.
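
The ranges used in the lookaheads follow the UTF-8 encoding lengths. The following Python sketch does not use PCRE; it only checks, for the boundary code points of each range, that the encoded length equals the number of \C items in the corresponding branch. The helper name utf8_length is made up for this illustration, and 0x10FFFF (the highest Unicode code point) stands in for the upper bound of the last range.

```python
# Boundary code points of the four lookahead ranges, paired with the number
# of \C items used in the corresponding branch of the pattern.
boundaries = [
    (0x00, 1), (0x7F, 1),
    (0x80, 2), (0x7FF, 2),
    (0x800, 3), (0xFFFF, 3),
    (0x10000, 4), (0x10FFFF, 4),
]


def utf8_length(code_point):
    """Return how many bytes the UTF-8 encoding of code_point occupies."""
    return len(chr(code_point).encode("utf-8"))


# True when every boundary code point has the expected encoded length.
lengths_agree = all(utf8_length(cp) == n for cp, n in boundaries)
```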
5547
5548
5549 SQUARE BRACKETS AND CHARACTER CLASSES
5550
5551 An opening square bracket introduces a character class, terminated by a
5552 closing square bracket. A closing square bracket on its own is not spe-
5553 cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
5554 a lone closing square bracket causes a compile-time error. If a closing
5555 square bracket is required as a member of the class, it should be the
5556 first data character in the class (after an initial circumflex, if
5557 present) or escaped with a backslash.
5558
5559 A character class matches a single character in the subject. In a UTF
5560 mode, the character may be more than one data unit long. A matched
5561 character must be in the set of characters defined by the class, unless
5562 the first character in the class definition is a circumflex, in which
5563 case the subject character must not be in the set defined by the class.
5564 If a circumflex is actually required as a member of the class, ensure
5565 it is not the first character, or escape it with a backslash.
5566
5567 For example, the character class [aeiou] matches any lower case vowel,
5568 while [^aeiou] matches any character that is not a lower case vowel.
5569 Note that a circumflex is just a convenient notation for specifying the
5570 characters that are in the class by enumerating those that are not. A
5571 class that starts with a circumflex is not an assertion; it still con-
5572 sumes a character from the subject string, and therefore it fails if
5573 the current pointer is at the end of the string.
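
A short sketch of these points, written with Python's re module as a stand-in for PCRE (an assumption; the two engines agree on this basic class behaviour):

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

vowel_matches_e = vowel.fullmatch("e") is not None
non_vowel_matches_e = non_vowel.fullmatch("e") is not None
non_vowel_matches_z = non_vowel.fullmatch("z") is not None

# A negated class is not an assertion: it must consume a character, so
# "x[^y]" cannot match when nothing follows the "x".
matches_at_end = re.search(r"x[^y]", "x") is not None
```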
5574
5575 In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
5576 (0xffff) can be included in a class as a literal string of data units,
5577 or by using the \x{ escaping mechanism.
5578
5579 When caseless matching is set, any letters in a class represent both
5580 their upper case and lower case versions, so for example, a caseless
5581 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
5582 match "A", whereas a caseful version would. In a UTF mode, PCRE always
5583 understands the concept of case for characters whose values are less
5584 than 128, so caseless matching is always possible. For characters with
5585 higher values, the concept of case is supported if PCRE is compiled
5586 with Unicode property support, but not otherwise. If you want to use
5587 caseless matching in a UTF mode for characters 128 and above, you must
5588 ensure that PCRE is compiled with Unicode property support as well as
5589 with UTF support.
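
The ASCII part of this can be sketched with Python's re module, using re.IGNORECASE as the analogue of caseless matching. This is an illustration with a different engine, not PCRE itself; the Unicode property support described above is not exercised.

```python
import re

caseless_class = re.compile(r"[aeiou]", re.IGNORECASE)
caseless_negated = re.compile(r"[^aeiou]", re.IGNORECASE)
caseful_negated = re.compile(r"[^aeiou]")

# Caseless [aeiou] matches "A" as well as "a".
upper_a_in_caseless = caseless_class.fullmatch("A") is not None

# Caseless [^aeiou] does not match "A", whereas the caseful version does.
upper_a_in_caseless_negated = caseless_negated.fullmatch("A") is not None
upper_a_in_caseful_negated = caseful_negated.fullmatch("A") is not None
```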
5590
5591 Characters that might indicate line breaks are never treated in any
5592 special way when matching character classes, whatever line-ending
5593 sequence is in use, and whatever setting of the PCRE_DOTALL and
5594 PCRE_MULTILINE options is used. A class such as [^a] always matches one
5595 of these characters.
5596
5597 The minus (hyphen) character can be used to specify a range of charac-
5598 ters in a character class. For example, [d-m] matches any letter
5599 between d and m, inclusive. If a minus character is required in a
5600 class, it must be escaped with a backslash or appear in a position
5601 where it cannot be interpreted as indicating a range, typically as the
5602 first or last character in the class.
5603
5604 It is not possible to have the literal character "]" as the end charac-
5605 ter of a range. A pattern such as [W-]46] is interpreted as a class of
5606 two characters ("W" and "-") followed by a literal string "46]", so it
5607 would match "W46]" or "-46]". However, if the "]" is escaped with a
5608 backslash it is interpreted as the end of range, so [W-\]46] is inter-
5609 preted as a class containing a range followed by two other characters.
5610 The octal or hexadecimal representation of "]" can also be used to end
5611 a range.
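
The two readings of the hyphen can be compared in a small sketch. It uses Python's re module, which happens to parse these two classes in the same way; PCRE itself is not involved.

```python
import re

# Class containing "W" and "-", followed by the literal string "46]".
literal_hyphen = re.compile(r"[W-]46]")

# Class containing the range W..] plus the characters "4" and "6".
escaped_range = re.compile(r"[W-\]46]")

literal_hits = [s for s in ("W46]", "-46]", "X46]")
                if literal_hyphen.fullmatch(s)]
range_hits = [c for c in ("W", "X", "[", "]", "4", "6", "-", "V")
              if escaped_range.fullmatch(c)]
```

Here "-" and "V" are absent from range_hits because neither lies between "W" and "]" in character value order.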
5612
5613 Ranges operate in the collating sequence of character values. They can
5614 also be used for characters specified numerically, for example
5615 [\000-\037]. Ranges can include any characters that are valid for the
5616 current mode.
5617
5618 If a range that includes letters is used when caseless matching is set,
5619 it matches the letters in either case. For example, [W-c] is equivalent
5620 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
5621 character tables for a French locale are in use, [\xc8-\xcb] matches
5622 accented E characters in both cases. In UTF modes, PCRE supports the
5623 concept of case for characters with values greater than 127 only when
5624 it is compiled with Unicode property support.
5625
5626 The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
5627 \w, and \W may appear in a character class, and add the characters that
5628 they match to the class. For example, [\dABCDEF] matches any hexadeci-
5629 mal digit. In UTF modes, the PCRE_UCP option affects the meanings of
5630 \d, \s, \w and their upper case partners, just as it does when they
5631 appear outside a character class, as described in the section entitled
5632 "Generic character types" above. The escape sequence \b has a different
5633 meaning inside a character class; it matches the backspace character.
5634 The sequences \B, \N, \R, and \X are not special inside a character
5635 class. Like any other unrecognized escape sequences, they are treated
5636 as the literal characters "B", "N", "R", and "X" by default, but cause
5637 an error if the PCRE_EXTRA option is set.
5638
5639 A circumflex can conveniently be used with the upper case character
5640 types to specify a more restricted set of characters than the matching
5641 lower case type. For example, the class [^\W_] matches any letter or
5642 digit, but not underscore, whereas [\w] includes underscore. A positive
5643 character class should be read as "something OR something OR ..." and a
5644 negative class as "NOT something AND NOT something AND NOT ...".
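
The classes mentioned above behave the same way in Python's re module, which is used here only as a convenient stand-in for PCRE:

```python
import re

hex_digit = re.compile(r"[\dABCDEF]")   # digits plus upper case A-F
alnum_only = re.compile(r"[^\W_]")      # NOT non-word AND NOT underscore
word_class = re.compile(r"[\w]")        # includes underscore
backspace = re.compile(r"[\b]")         # \b inside a class is backspace

hex_hits = [c for c in ("7", "E", "G") if hex_digit.fullmatch(c)]
alnum_hits = [c for c in ("a", "5", "_", "-") if alnum_only.fullmatch(c)]
underscore_is_word = word_class.fullmatch("_") is not None
backspace_hits = [c for c in ("\x08", "b") if backspace.fullmatch(c)]
```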
5645
5646 The only metacharacters that are recognized in character classes are
5647 backslash, hyphen (only where it can be interpreted as specifying a
5648 range), circumflex (only at the start), opening square bracket (only
5649 when it can be interpreted as introducing a POSIX class name - see the
5650 next section), and the terminating closing square bracket. However,
5651 escaping other non-alphanumeric characters does no harm.
5652
5653
5654 POSIX CHARACTER CLASSES
5655
5656 Perl supports the POSIX notation for character classes. This uses names
5657 enclosed by [: and :] within the enclosing square brackets. PCRE also
5658 supports this notation. For example,
5659
5660 [01[:alpha:]%]
5661
5662 matches "0", "1", any alphabetic character, or "%". The supported class
5663 names are:
5664
5665 alnum    letters and digits
5666 alpha    letters
5667 ascii    character codes 0 - 127
5668 blank    space or tab only
5669 cntrl    control characters
5670 digit    decimal digits (same as \d)
5671 graph    printing characters, excluding space
5672 lower    lower case letters
5673 print    printing characters, including space
5674 punct    printing characters, excluding letters and digits and space
5675 space    white space (not quite the same as \s)
5676 upper    upper case letters
5677 word     "word" characters (same as \w)
5678 xdigit   hexadecimal digits
5679
5680 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
5681 and space (32). Notice that this list includes the VT character (code
5682 11). This makes "space" different to \s, which does not include VT (for
5683 Perl compatibility).
5684
5685 The name "word" is a Perl extension, and "blank" is a GNU extension
5686 from Perl 5.8. Another Perl extension is negation, which is indicated
5687 by a ^ character after the colon. For example,
5688
5689 [12[:^digit:]]
5690
5691 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
5692 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
5693 these are not supported, and an error is given if they are encountered.
5694
5695 By default, in UTF modes, characters with values greater than 127 do
5696 not match any of the POSIX character classes. However, if the PCRE_UCP
5697 option is passed to pcre_compile(), some of the classes are changed so
5698 that Unicode character properties are used. This is achieved by replac-
5699 ing the POSIX classes by other sequences, as follows:
5700
5701 [:alnum:] becomes \p{Xan}
5702 [:alpha:] becomes \p{L}
5703 [:blank:] becomes \h
5704 [:digit:] becomes \p{Nd}
5705 [:lower:] becomes \p{Ll}
5706 [:space:] becomes \p{Xps}
5707 [:upper:] becomes \p{Lu}
5708 [:word:] becomes \p{Xwd}
5709
5710 Negated versions, such as [:^alpha:] use \P instead of \p. The other
5711 POSIX classes are unchanged, and match only characters with code points
5712 less than 128.
5713
5714
5715 VERTICAL BAR
5716
5717 Vertical bar characters are used to separate alternative patterns. For
5718 example, the pattern
5719
5720 gilbert|sullivan
5721
5722 matches either "gilbert" or "sullivan". Any number of alternatives may
5723 appear, and an empty alternative is permitted (matching the empty
5724 string). The matching process tries each alternative in turn, from left
5725 to right, and the first one that succeeds is used. If the alternatives
5726 are within a subpattern (defined below), "succeeds" means matching the
5727 rest of the main pattern as well as the alternative in the subpattern.
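
The left-to-right rule, and the meaning of "succeeds" inside a subpattern, can be sketched as follows. The example uses Python's re module, a backtracking matcher with the same alternation semantics; it is an illustration, not PCRE.

```python
import re

# At top level the first alternative that matches is used, even though a
# later one would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group(0)

# Inside a subpattern, an alternative only "succeeds" if the rest of the
# pattern also matches, so "a" is abandoned in favour of "ab" here.
inside_group = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alternative = re.fullmatch(r"cat(s|)", "cat").group(1)
```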
5728
5729
5730 INTERNAL OPTION SETTING
5731
5732 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
5733 PCRE_EXTENDED options (which are Perl-compatible) can be changed from
5734 within the pattern by a sequence of Perl option letters enclosed
5735 between "(?" and ")". The option letters are
5736
5737 i for PCRE_CASELESS
5738 m for PCRE_MULTILINE
5739 s for PCRE_DOTALL
5740 x for PCRE_EXTENDED
5741
5742 For example, (?im) sets caseless, multiline matching. It is also possi-
5743 ble to unset these options by preceding the letter with a hyphen, and a
5744 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
5745 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
5746 is also permitted. If a letter appears both before and after the
5747 hyphen, the option is unset.
5748
5749 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
5750 can be changed in the same way as the Perl-compatible options by using
5751 the characters J, U and X respectively.
5752
5753 When one of these option changes occurs at top level (that is, not
5754 inside subpattern parentheses), the change applies to the remainder of
5755 the pattern that follows. If the change is placed right at the start of
5756 a pattern, PCRE extracts it into the global options (and it will there-
5757 fore show up in data extracted by the pcre_fullinfo() function).
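
Python's re module has a loosely comparable behaviour, shown here only as an analogy: option letters at the very start of a pattern end up in the flags of the compiled object, much as PCRE moves them into the global options. The flags attribute belongs to Python and is not part of the PCRE API.

```python
import re

compiled = re.compile(r"(?im)^abc$")

# The leading option letters are visible in the compiled pattern's flags.
caseless_set = bool(compiled.flags & re.IGNORECASE)
multiline_set = bool(compiled.flags & re.MULTILINE)
dotall_set = bool(compiled.flags & re.DOTALL)

# Both options are in effect: "ABC" matches on the second line.
second_line_match = compiled.search("xyz\nABC") is not None
```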
5758
5759 An option change within a subpattern (see below for a description of
5760 subpatterns) affects only that part of the subpattern that follows it,
5761 so
5762
5763 (a(?i)b)c
5764
5765 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
5766 used). By this means, options can be made to have different settings
5767 in different parts of the pattern. Any changes made in one alternative
5768 do carry on into subsequent branches within the same subpattern. For
5769 example,
5770
5771 (a(?i)b|c)
5772
5773 matches "ab", "aB", "c", and "C", even though when matching "C" the
5774 first branch is abandoned before the option setting. This is because
5775 the effects of option settings happen at compile time. There would be
5776 some very weird behaviour otherwise.
5777
5778 Note: There are other PCRE-specific options that can be set by the
5779 application when the compiling or matching functions are called. In
5780 some cases the pattern can contain special leading sequences such as
5781 (*CRLF) to override what the application has set or what has been
5782 defaulted. Details are given in the section entitled "Newline
5783 sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32), and
5784 (*UCP) leading sequences that can be used to set UTF and Unicode prop-
5785 erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16,
5786 PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence
5787 is a generic version that can be used with any of the libraries. How-
5788 ever, the application can set the PCRE_NEVER_UTF option, which locks
5789 out the use of the (*UTF) sequences.
5790
5791
5792 SUBPATTERNS
5793
5794 Subpatterns are delimited by parentheses (round brackets), which can be
5795 nested. Turning part of a pattern into a subpattern does two things:
5796
5797 1. It localizes a set of alternatives. For example, the pattern
5798
5799 cat(aract|erpillar|)
5800
5801 matches "cataract", "caterpillar", or "cat". Without the parentheses,
5802 it would match "cataract", "erpillar" or an empty string.
5803
5804 2. It sets up the subpattern as a capturing subpattern. This means
5805 that, when the whole pattern matches, that portion of the subject
5806 string that matched the subpattern is passed back to the caller via the
5807 ovector argument of the matching function. (This applies only to the
5808 traditional matching functions; the DFA matching functions do not sup-
5809 port capturing.)
5810
5811 Opening parentheses are counted from left to right (starting from 1) to
5812 obtain numbers for the capturing subpatterns. For example, if the
5813 string "the red king" is matched against the pattern
5814
5815 the ((red|white) (king|queen))
5816
5817 the captured substrings are "red king", "red", and "king", and are num-
5818 bered 1, 2, and 3, respectively.
5819
5820 The fact that plain parentheses fulfil two functions is not always
5821 helpful. There are often times when a grouping subpattern is required
5822 without a capturing requirement. If an opening parenthesis is followed
5823 by a question mark and a colon, the subpattern does not do any captur-
5824 ing, and is not counted when computing the number of any subsequent
5825 capturing subpatterns. For example, if the string "the white queen" is
5826 matched against the pattern
5827
5828 the ((?:red|white) (king|queen))
5829
5830 the captured substrings are "white queen" and "queen", and are numbered
5831 1 and 2. The maximum number of capturing subpatterns is 65535.
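
The numbering rules can be checked with a short sketch. It uses Python's re module, which numbers capturing parentheses in the same way; the groups() method is Python's, standing in for the ovector of the PCRE matching functions.

```python
import re

# Localizing alternatives: the third alternative is empty.
cat_hits = [s for s in ("cataract", "caterpillar", "cat", "erpillar")
            if re.fullmatch(r"cat(aract|erpillar|)", s)]

# Opening parentheses are counted from left to right, starting from 1.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = capturing.groups()

# (?: ... ) groups without capturing and is skipped in the numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")
captured_without_colour = non_capturing.groups()
```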
5832
5833 As a convenient shorthand, if any option settings are required at the
5834 start of a non-capturing subpattern, the option letters may appear
5835 between the "?" and the ":". Thus the two patterns
5836
5837 (?i:saturday|sunday)
5838 (?:(?i)saturday|sunday)
5839
5840 match exactly the same set of strings. Because alternative branches are
5841 tried from left to right, and options are not reset until the end of
5842 the subpattern is reached, an option setting in one branch does affect
5843 subsequent branches, so the above patterns match "SUNDAY" as well as
5844 "Saturday".
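
The shorthand form can be tried in Python's re module, which accepts the same (?i:...) syntax. This is a sketch with a different engine; recent Python versions reject an option setting such as (?i) in the middle of a pattern, so only the shorthand form is shown.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option applies to every branch of the subpattern.
weekend_hits = [s for s in ("Saturday", "SUNDAY", "monday")
                if weekend.fullmatch(s)]

# The option is reset at the end of the subpattern, so "b" stays caseful.
scoped = re.compile(r"(?i:a)b")
scoped_hits = [s for s in ("ab", "Ab", "AB") if scoped.fullmatch(s)]
```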
5845
5846
5847 DUPLICATE SUBPATTERN NUMBERS
5848
5849 Perl 5.10 introduced a feature whereby each alternative in a subpattern
5850 uses the same numbers for its capturing parentheses. Such a subpattern
5851 starts with (?| and is itself a non-capturing subpattern. For example,
5852 consider this pattern:
5853
5854 (?|(Sat)ur|(Sun))day
5855
5856 Because the two alternatives are inside a (?| group, both sets of cap-
5857 turing parentheses are numbered one. Thus, when the pattern matches,
5858 you can look at captured substring number one, whichever alternative
5859 matched. This construct is useful when you want to capture part, but
5860 not all, of one of a number of alternatives. Inside a (?| group, paren-
5861 theses are numbered as usual, but the number is reset at the start of
5862 each branch. The numbers of any capturing parentheses that follow the
5863 subpattern start after the highest number used in any branch. The fol-
5864 lowing example is taken from the Perl documentation. The numbers under-
5865 neath show in which buffer the captured content will be stored.
5866
5867 # before  ---------------branch-reset----------- after
5868 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
5869 # 1            2         2  3        2     3     4
5870
5871 A back reference to a numbered subpattern uses the most recent value
5872 that is set for that number by any subpattern. The following pattern
5873 matches "abcabc" or "defdef":
5874
5875 /(?|(abc)|(def))\1/
5876
5877 In contrast, a subroutine call to a numbered subpattern always refers
5878 to the first one in the pattern with the given number. The following
5879 pattern matches "abcabc" or "defabc":
5880
5881 /(?|(abc)|(def))(?1)/
5882
5883 If a condition test for a subpattern's having matched refers to a non-
5884 unique number, the test is true if any of the subpatterns of that num-
5885 ber have matched.
5886
5887 An alternative approach to using this "branch reset" feature is to use
5888 duplicate named subpatterns, as described in the next section.
5889
5890
5891 NAMED SUBPATTERNS
5892
5893 Identifying capturing parentheses by number is simple, but it can be
5894 very hard to keep track of the numbers in complicated regular expres-
5895 sions. Furthermore, if an expression is modified, the numbers may
5896 change. To help with this difficulty, PCRE supports the naming of sub-
5897 patterns. This feature was not added to Perl until release 5.10. Python
5898 had the feature earlier, and PCRE introduced it at release 4.0, using
5899 the Python syntax. PCRE now supports both the Perl and the Python syn-
5900 tax. Perl allows identically numbered subpatterns to have different
5901 names, but PCRE does not.
5902
5903 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
5904 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
5905 to capturing parentheses from other parts of the pattern, such as back
5906 references, recursion, and conditions, can be made by name as well as
5907 by number.
5908
5909 Names consist of up to 32 alphanumeric characters and underscores.
5910 Named capturing parentheses are still allocated numbers as well as
5911 names, exactly as if the names were not present. The PCRE API provides
5912 function calls for extracting the name-to-number translation table from
5913 a compiled pattern. There is also a convenience function for extracting
5914 a captured substring by name.
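
Since the (?P<name>...) form comes from Python, the idea can be sketched with Python's own re module. The groupindex attribute and the group() method are Python features, used here as rough counterparts of the PCRE name table and the extract-by-name convenience function; the pattern and the names "year" and "month" are invented for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups are still numbered exactly as if the names were not present.
name_to_number = dict(date.groupindex)

match = date.fullmatch("2013-05")
month_by_name = match.group("month")
month_by_number = match.group(2)
```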
5915
5916 By default, a name must be unique within a pattern, but it is possible
5917 to relax this constraint by setting the PCRE_DUPNAMES option at compile
5918 time. (Duplicate names are also always permitted for subpatterns with
5919 the same number, set up as described in the previous section.) Dupli-
5920 cate names can be useful for patterns where only one instance of the
5921 named parentheses can match. Suppose you want to match the name of a
5922 weekday, either as a 3-letter abbreviation or as the full name, and in
5923 both cases you want to extract the abbreviation. This pattern (ignoring
5924 the line breaks) does the job:
5925
5926 (?<DN>Mon|Fri|Sun)(?:day)?|
5927 (?<DN>Tue)(?:sday)?|
5928 (?<DN>Wed)(?:nesday)?|
5929 (?<DN>Thu)(?:rsday)?|
5930 (?<DN>Sat)(?:urday)?
5931
5932 There are five capturing substrings, but only one is ever set after a
5933 match. (An alternative way of solving this problem is to use a "branch
5934 reset" subpattern, as described in the previous section.)
5935
5936 The convenience function for extracting the data by name returns the
5937 substring for the first (and in this example, the only) subpattern of
5938 that name that matched. This saves searching to find which numbered
5939 subpattern it was.
5940
5941 If you make a back reference to a non-unique named subpattern from
5942 elsewhere in the pattern, the one that corresponds to the first occur-
5943 rence of the name is used. In the absence of duplicate numbers (see the
5944 previous section) this is the one with the lowest number. If you use a
5945 named reference in a condition test (see the section about conditions
5946 below), either to check whether a subpattern has matched, or to check
5947 for recursion, all subpatterns with the same name are tested. If the
5948 condition is true for any one of them, the overall condition is true.
5949 This is the same behaviour as testing by number. For further details of
5950 the interfaces for handling named subpatterns, see the pcreapi documen-
5951 tation.
5952
5953 Warning: You cannot use different names to distinguish between two sub-
5954 patterns with the same number because PCRE uses only the numbers when
5955 matching. For this reason, an error is given at compile time if differ-
5956 ent names are given to subpatterns with the same number. However, you
5957 can give the same name to subpatterns with the same number, even when
5958 PCRE_DUPNAMES is not set.
5959
5960
5961 REPETITION
5962
5963 Repetition is specified by quantifiers, which can follow any of the
5964 following items:
5965
5966 a literal data character
5967 the dot metacharacter
5968 the \C escape sequence
5969 the \X escape sequence
5970 the \R escape sequence
5971 an escape such as \d or \pL that matches a single character
5972 a character class
5973 a back reference (see next section)
5974 a parenthesized subpattern (including assertions)
5975 a subroutine call to a subpattern (recursive or otherwise)
5976
5977 The general repetition quantifier specifies a minimum and maximum num-
5978 ber of permitted matches, by giving the two numbers in curly brackets
5979 (braces), separated by a comma. The numbers must be less than 65536,
5980 and the first must be less than or equal to the second. For example:
5981
5982 z{2,4}
5983
5984 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
5985 special character. If the second number is omitted, but the comma is
5986 present, there is no upper limit; if the second number and the comma
5987 are both omitted, the quantifier specifies an exact number of required
5988 matches. Thus
5989
5990 [aeiou]{3,}
5991
5992 matches at least 3 successive vowels, but may match many more, while
5993
5994 \d{8}
5995
5996 matches exactly 8 digits. An opening curly bracket that appears in a
5997 position where a quantifier is not allowed, or one that does not match
5998 the syntax of a quantifier, is taken as a literal character. For exam-
5999 ple, {,6} is not a quantifier, but a literal string of four characters.
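
The quantifier examples above can be tried with Python's re module, used here as a stand-in for PCRE. One known difference is left out on purpose: Python treats {,6} as a quantifier with a minimum of zero, whereas PCRE takes it literally.

```python
import re

# z{2,4} matches between two and four "z" characters.
z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]

# {3,} has no upper limit; {8} is an exact count.
long_vowel_run = re.fullmatch(r"[aeiou]{3,}", "aeiouaeiou") is not None
short_vowel_run = re.fullmatch(r"[aeiou]{3,}", "ae") is not None
eight_digits = re.fullmatch(r"\d{8}", "20130501") is not None
seven_digits = re.fullmatch(r"\d{8}", "2013050") is not None

# An opening brace that does not begin a quantifier is a literal character.
literal_brace = re.fullmatch(r"a{b", "a{b") is not None
```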
6000
6001 In UTF modes, quantifiers apply to characters rather than to individual
6002 data units. Thus, for example, \x{100}{2} matches two characters, each
6003 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
6004 larly, \X{3} matches three Unicode extended grapheme clusters, each of
6005 which may be several data units long (and they may be of different
6006 lengths).
6007
6008 The quantifier {0} is permitted, causing the expression to behave as if
6009 the previous item and the quantifier were not present. This may be use-
6010 ful for subpatterns that are referenced as subroutines from elsewhere
6011 in the pattern (but see also the section entitled "Defining subpatterns
6012 for use by reference only" below). Items other than subpatterns that
6013 have a {0} quantifier are omitted from the compiled pattern.
6014
6015 For convenience, the three most common quantifiers have single-charac-
6016 ter abbreviations:
6017
6018 * is equivalent to {0,}
6019 + is equivalent to {1,}
6020 ? is equivalent to {0,1}
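
A quick check of these equivalences, again using Python's re module as an assumed stand-in for PCRE. Each abbreviation is compared with its long form over a few sample subjects chosen for the example.

```python
import re

subjects = ["", "a", "aa", "aaa", "b"]
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]


def accepted(pattern):
    """Return the sample subjects that the pattern matches in full."""
    return [s for s in subjects if re.fullmatch(pattern, s)]


# True when every abbreviation accepts the same subjects as its long form.
abbreviations_agree = all(accepted(short) == accepted(full)
                          for short, full in pairs)
```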
6021
6022 It is possible to construct infinite loops by following a subpattern
6023 that can match no characters with a quantifier that has no upper limit,
6024 for example:
6025
6026 (a?)*
6027
6028 Earlier versions of Perl and PCRE used to give an error at compile time
6029 for such patterns. However, because there are cases where this can be
6030 useful, such patterns are now accepted, but if any repetition of the
6031 subpattern does in fact match no characters, the loop is forcibly bro-
6032 ken.
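
Python's re module accepts such patterns and also terminates on them, so it can serve as an illustration, although its internal handling is not necessarily the same as PCRE's:

```python
import re

# The subpattern can match the empty string, yet matching still terminates.
whole_subject = re.fullmatch(r"(a?)*", "aaa").group(0)

# With no "a" at the start, the group matches nothing and the loop ends.
empty_prefix = re.match(r"(a?)*", "bbb").group(0)
```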
6033
6034 By default, the quantifiers are "greedy", that is, they match as much
6035 as possible (up to the maximum number of permitted times), without
6036 causing the rest of the pattern to fail. The classic example of where
6037 this gives problems is in trying to match comments in C programs. These
6038 appear between /* and */ and within the comment, individual * and /
6039 characters may appear. An attempt to match C comments by applying the
6040 pattern