Contents of /code/trunk/doc/pcre.txt


Revision 1298
Fri Mar 22 16:13:13 2013 UTC by ph10
File MIME type: text/plain
File size: 483295 byte(s)
Fix COMMIT in recursion; document backtracking verbs in assertions and 
subroutines.
1 -----------------------------------------------------------------------------
2 This file contains a concatenation of the PCRE man pages, converted to plain
3 text format for ease of searching with a text editor, or for use on systems
4 that do not have a man page processor. The small individual files that give
5 synopses of each function in the library have not been included. Neither has
6 the pcredemo program. There are separate text files for the pcregrep and
7 pcretest commands.
8 -----------------------------------------------------------------------------
9
10
11 PCRE(3) Library Functions Manual PCRE(3)
12
13
14
15 NAME
16 PCRE - Perl-compatible regular expressions
17
18 INTRODUCTION
19
20 The PCRE library is a set of functions that implement regular expres-
21 sion pattern matching using the same syntax and semantics as Perl, with
22 just a few differences. Some features that appeared in Python and PCRE
23 before they appeared in Perl are also available using the Python syn-
24 tax, there is some support for one or two .NET and Oniguruma syntax
25 items, and there is an option for requesting some minor changes that
26 give better JavaScript compatibility.
27
28 Starting with release 8.30, it is possible to compile two separate PCRE
29 libraries: the original, which supports 8-bit character strings
30 (including UTF-8 strings), and a second library that supports 16-bit
31 character strings (including UTF-16 strings). The build process allows
32 either one or both to be built. The majority of the work to make this
33 possible was done by Zoltan Herczeg.
34
35 Starting with release 8.32 it is possible to compile a third separate
36 PCRE library, which supports 32-bit character strings (including UTF-32
37 strings). The build process allows any set of the 8-, 16- and 32-bit
38 libraries. The work to make this possible was done by Christian Persch.
39
40 The three libraries contain identical sets of functions, except that
41 the names in the 16-bit library start with pcre16_ instead of pcre_,
42 and the names in the 32-bit library start with pcre32_ instead of
43 pcre_. To avoid over-complication and reduce the documentation mainte-
44 nance load, most of the documentation describes the 8-bit library, with
45 the differences for the 16-bit and 32-bit libraries described sepa-
46 rately in the pcre16 and pcre32 pages. References to functions or
47 structures of the form pcre[16|32]_xxx should be read as meaning
48 "pcre_xxx when using the 8-bit library, pcre16_xxx when using the
49 16-bit library, or pcre32_xxx when using the 32-bit library".
50
51 The current implementation of PCRE corresponds approximately with Perl
52 5.12, including support for UTF-8/16/32 encoded strings and Unicode
53 general category properties. However, UTF-8/16/32 and Unicode support
54 has to be explicitly enabled; it is not the default. The Unicode tables
55 correspond to Unicode release 6.2.0.
56
57 In addition to the Perl-compatible matching function, PCRE contains an
58 alternative function that matches the same compiled patterns in a dif-
59 ferent way. In certain circumstances, the alternative function has some
60 advantages. For a discussion of the two matching algorithms, see the
61 pcrematching page.
62
63 PCRE is written in C and released as a C library. A number of people
64 have written wrappers and interfaces of various kinds. In particular,
65 Google Inc. have provided a comprehensive C++ wrapper for the 8-bit
66 library. This is now included as part of the PCRE distribution. The
67 pcrecpp page has details of this interface. Other people's contribu-
68 tions can be found in the Contrib directory at the primary FTP site,
69 which is:
70
71 ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
72
73 Details of exactly which Perl regular expression features are and are
74 not supported by PCRE are given in separate documents. See the pcrepat-
75 tern and pcrecompat pages. There is a syntax summary in the pcresyntax
76 page.
77
78 Some features of PCRE can be included, excluded, or changed when the
79 library is built. The pcre_config() function makes it possible for a
80 client to discover which features are available. The features them-
81 selves are described in the pcrebuild page. Documentation about build-
82 ing PCRE for various operating systems can be found in the README and
83 NON-AUTOTOOLS_BUILD files in the source distribution.
84
85 The libraries contain a number of undocumented internal functions and
86 data tables that are used by more than one of the exported external
87 functions, but which are not intended for use by external callers.
88 Their names all begin with "_pcre_" or "_pcre16_" or "_pcre32_", which
89 hopefully will not provoke any name clashes. In some environments, it
90 is possible to control which external symbols are exported when a
91 shared library is built, and in these cases the undocumented symbols
92 are not exported.
93
94
95 SECURITY CONSIDERATIONS
96
97 If you are using PCRE in a non-UTF application that permits users to
98 supply arbitrary patterns for compilation, you should be aware of a
99 feature that allows users to turn on UTF support from within a pattern,
100 provided that PCRE was built with UTF support. For example, an 8-bit
101 pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
102 which interprets patterns and subjects as strings of UTF-8 characters
103 instead of individual 8-bit characters. This causes both the pattern
104 and any data against which it is matched to be checked for UTF-8 valid-
105 ity. If the data string is very long, such a check might use suffi-
106 ciently many resources as to cause your application to lose perfor-
107 mance.
108
109 The best way of guarding against this possibility is to use the
110 pcre_fullinfo() function to check the compiled pattern's options for
111 UTF.
112
113 If your application is one that supports UTF, be aware that validity
114 checking can take time. If the same data string is to be matched many
115 times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
116 and subsequent matches to save redundant checks.
117
118 Another way that performance can be hit is by running a pattern that
119 has a very large search tree against a string that will never match.
120 Nested unlimited repeats in a pattern are a common example. PCRE pro-
121 vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
122 ture in the pcreapi page.
123
124
125 USER DOCUMENTATION
126
127 The user documentation for PCRE comprises a number of different sec-
128 tions. In the "man" format, each of these is a separate "man page". In
129 the HTML format, each is a separate page, linked from the index page.
130 In the plain text format, all the sections, except the pcredemo sec-
131 tion, are concatenated, for ease of searching. The sections are as fol-
132 lows:
133
134 pcre this document
135 pcre16 details of the 16-bit library
136 pcre32 details of the 32-bit library
137 pcre-config show PCRE installation configuration information
138 pcreapi details of PCRE's native C API
139 pcrebuild options for building PCRE
140 pcrecallout details of the callout feature
141 pcrecompat discussion of Perl compatibility
142 pcrecpp details of the C++ wrapper for the 8-bit library
143 pcredemo a demonstration C program that uses PCRE
144 pcregrep description of the pcregrep command (8-bit only)
145 pcrejit discussion of the just-in-time optimization support
146 pcrelimits details of size and other limits
147 pcrematching discussion of the two matching algorithms
148 pcrepartial details of the partial matching facility
149 pcrepattern syntax and semantics of supported
150 regular expressions
151 pcreperform discussion of performance issues
152 pcreposix the POSIX-compatible C API for the 8-bit library
153 pcreprecompile details of saving and re-using precompiled patterns
154 pcresample discussion of the pcredemo program
155 pcrestack discussion of stack usage
156 pcresyntax quick syntax reference
157 pcretest description of the pcretest testing command
158 pcreunicode discussion of Unicode and UTF-8/16/32 support
159
160 In addition, in the "man" and HTML formats, there is a short page for
161 each C library function, listing its arguments and results.
162
163
164 AUTHOR
165
166 Philip Hazel
167 University Computing Service
168 Cambridge CB2 3QH, England.
169
170 Putting an actual email address here seems to have been a spam magnet,
171 so I've taken it away. If you want to email me, use my two initials,
172 followed by the two digits 10, at the domain cam.ac.uk.
173
174
175 REVISION
176
177 Last updated: 11 November 2012
178 Copyright (c) 1997-2012 University of Cambridge.
179 ------------------------------------------------------------------------------
180
181
182 PCRE(3) Library Functions Manual PCRE(3)
183
184
185
186 NAME
187 PCRE - Perl-compatible regular expressions
188
189 #include <pcre.h>
190
191
192 PCRE 16-BIT API BASIC FUNCTIONS
193
194 pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
195 const char **errptr, int *erroffset,
196 const unsigned char *tableptr);
197
198 pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
199 int *errorcodeptr,
200 const char **errptr, int *erroffset,
201 const unsigned char *tableptr);
202
203 pcre16_extra *pcre16_study(const pcre16 *code, int options,
204 const char **errptr);
205
206 void pcre16_free_study(pcre16_extra *extra);
207
208 int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
209 PCRE_SPTR16 subject, int length, int startoffset,
210 int options, int *ovector, int ovecsize);
211
212 int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
213 PCRE_SPTR16 subject, int length, int startoffset,
214 int options, int *ovector, int ovecsize,
215 int *workspace, int wscount);
216
217
218 PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
219
220 int pcre16_copy_named_substring(const pcre16 *code,
221 PCRE_SPTR16 subject, int *ovector,
222 int stringcount, PCRE_SPTR16 stringname,
223 PCRE_UCHAR16 *buffer, int buffersize);
224
225 int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
226 int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
227 int buffersize);
228
229 int pcre16_get_named_substring(const pcre16 *code,
230 PCRE_SPTR16 subject, int *ovector,
231 int stringcount, PCRE_SPTR16 stringname,
232 PCRE_SPTR16 *stringptr);
233
234 int pcre16_get_stringnumber(const pcre16 *code,
235 PCRE_SPTR16 name);
236
237 int pcre16_get_stringtable_entries(const pcre16 *code,
238 PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
239
240 int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
241 int stringcount, int stringnumber,
242 PCRE_SPTR16 *stringptr);
243
244 int pcre16_get_substring_list(PCRE_SPTR16 subject,
245 int *ovector, int stringcount, PCRE_SPTR16 **listptr);
246
247 void pcre16_free_substring(PCRE_SPTR16 stringptr);
248
249 void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
250
251
252 PCRE 16-BIT API AUXILIARY FUNCTIONS
253
254 pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
255
256 void pcre16_jit_stack_free(pcre16_jit_stack *stack);
257
258 void pcre16_assign_jit_stack(pcre16_extra *extra,
259 pcre16_jit_callback callback, void *data);
260
261 const unsigned char *pcre16_maketables(void);
262
263 int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
264 int what, void *where);
265
266 int pcre16_refcount(pcre16 *code, int adjust);
267
268 int pcre16_config(int what, void *where);
269
270 const char *pcre16_version(void);
271
272 int pcre16_pattern_to_host_byte_order(pcre16 *code,
273 pcre16_extra *extra, const unsigned char *tables);
274
275
276 PCRE 16-BIT API INDIRECTED FUNCTIONS
277
278 void *(*pcre16_malloc)(size_t);
279
280 void (*pcre16_free)(void *);
281
282 void *(*pcre16_stack_malloc)(size_t);
283
284 void (*pcre16_stack_free)(void *);
285
286 int (*pcre16_callout)(pcre16_callout_block *);
287
288
289 PCRE 16-BIT API 16-BIT-ONLY FUNCTION
290
291 int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
292 PCRE_SPTR16 input, int length, int *byte_order,
293 int keep_boms);
294
295
296 THE PCRE 16-BIT LIBRARY
297
298 Starting with release 8.30, it is possible to compile a PCRE library
299 that supports 16-bit character strings, including UTF-16 strings, as
300 well as or instead of the original 8-bit library. The majority of the
301 work to make this possible was done by Zoltan Herczeg. The two
302 libraries contain identical sets of functions, used in exactly the same
303 way. Only the names of the functions and the data types of their argu-
304 ments and results are different. To avoid over-complication and reduce
305 the documentation maintenance load, most of the PCRE documentation
306 describes the 8-bit library, with only occasional references to the
307 16-bit library. This page describes what is different when you use the
308 16-bit library.
309
310 WARNING: A single application can be linked with both libraries, but
311 you must take care when processing any particular pattern to use func-
312 tions from just one library. For example, if you want to study a pat-
313 tern that was compiled with pcre16_compile(), you must do so with
314 pcre16_study(), not pcre_study(), and you must free the study data with
315 pcre16_free_study().
316
317
318 THE HEADER FILE
319
320 There is only one header file, pcre.h. It contains prototypes for all
321 the functions in all libraries, as well as definitions of flags, struc-
322 tures, error codes, etc.
323
324
325 THE LIBRARY NAME
326
327 In Unix-like systems, the 16-bit library is called libpcre16, and can
328 normally be accessed by adding -lpcre16 to the command for linking an
329 application that uses PCRE.
330
331
332 STRING TYPES
333
334 In the 8-bit library, strings are passed to PCRE library functions as
335 vectors of bytes with the C type "char *". In the 16-bit library,
336 strings are passed as vectors of unsigned 16-bit quantities. The macro
337 PCRE_UCHAR16 specifies an appropriate data type, and PCRE_SPTR16 is
338 defined as "const PCRE_UCHAR16 *". In very many environments, "short
339 int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
340 as "unsigned short int", but checks that it really is a 16-bit data
341 type. If it is not, the build fails with an error message telling the
342 maintainer to modify the definition appropriately.
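
The width check described above can be mirrored in application code. The following is a minimal sketch, not part of PCRE: it does not include pcre.h, and the names uchar16_sketch, uchar16_is_16_bits and widen_to_16 are hypothetical. The typedef stands in for PCRE_UCHAR16, and the helper shows one way of turning an ordinary 8-bit C string into the kind of 16-bit vector that the 16-bit functions expect. It assumes each input byte is a character value in the range 0 to 255; no UTF-8 decoding is attempted.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16. Real code would include pcre.h
   and use PCRE_UCHAR16 directly. */
typedef unsigned short uchar16_sketch;

/* Compile-time width check: the array size becomes negative, and the
   compilation fails, if the type is not exactly 16 bits wide. */
typedef char uchar16_is_16_bits[
    (sizeof(uchar16_sketch) * CHAR_BIT == 16) ? 1 : -1];

/* Widen a zero-terminated 8-bit string into a 16-bit vector. Each byte is
   copied as an unsigned value; no UTF-8 decoding is done. Returns the
   number of units written, excluding the terminator, or -1 if the
   destination (dst_units units long) is too small. */
static int widen_to_16(const char *src, uchar16_sketch *dst, size_t dst_units)
{
    size_t i = 0;

    if (dst_units == 0)
        return -1;
    for (; src[i] != '\0'; i++) {
        if (i + 1 >= dst_units)
            return -1;              /* no room left for the terminator */
        dst[i] = (uchar16_sketch)(unsigned char)src[i];
    }
    dst[i] = 0;
    return (int)i;
}
```

A pointer to the filled buffer has the same shape as PCRE_SPTR16, so in a real program it could be passed where the prototypes above ask for a pattern or subject.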
343
344
345 STRUCTURE TYPES
346
347 The types of the opaque structures that are used for compiled 16-bit
348 patterns and JIT stacks are pcre16 and pcre16_jit_stack respectively.
349 The type of the user-accessible structure that is returned by
350 pcre16_study() is pcre16_extra, and the type of the structure that is
351 used for passing data to a callout function is pcre16_callout_block.
352 These structures contain the same fields, with the same names, as their
353 8-bit counterparts. The only difference is that pointers to character
354 strings are 16-bit instead of 8-bit types.
355
356
357 16-BIT FUNCTIONS
358
359 For every function in the 8-bit library there is a corresponding func-
360 tion in the 16-bit library with a name that starts with pcre16_ instead
361 of pcre_. The prototypes are listed above. In addition, there is one
362 extra function, pcre16_utf16_to_host_byte_order(). This is a utility
363 function that converts a UTF-16 character string to host byte order if
364 necessary. The other 16-bit functions expect the strings they are
365 passed to be in host byte order.
366
367 The input and output arguments of pcre16_utf16_to_host_byte_order() may
368 point to the same address, that is, conversion in place is supported.
369 The output buffer must be at least as long as the input.
370
371 The length argument specifies the number of 16-bit data units in the
372 input string; a negative value specifies a zero-terminated string.
373
374 If byte_order is NULL, it is assumed that the string starts off in host
375 byte order. This may be changed by byte-order marks (BOMs) anywhere in
376 the string (commonly as the first character).
377
378 If byte_order is not NULL, a non-zero value of the integer to which it
379 points means that the input starts off in host byte order, otherwise
380 the opposite order is assumed. Again, BOMs in the string can change
381 this. The final byte order is passed back at the end of processing.
382
383 If keep_boms is not zero, byte-order mark characters (0xfeff) are
384 copied into the output string. Otherwise they are discarded.
385
386 The result of the function is the number of 16-bit units placed into
387 the output buffer, including the zero terminator if the string was
388 zero-terminated.
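
The rules above for length, byte_order, keep_boms and the result can be restated as a short piece of C. This is an illustrative sketch only, written from the description on this page; it is not the library's source, and the name utf16_to_host_sketch is hypothetical. It assumes that a BOM read as 0xfeff means the following units are in host byte order, and that a BOM read as 0xfffe means they are byte-swapped.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative restatement of the documented behaviour. All names are
   hypothetical; real code would call pcre16_utf16_to_host_byte_order(). */
static int utf16_to_host_sketch(uint16_t *output, const uint16_t *input,
                                int length, int *byte_order, int keep_boms)
{
    /* A NULL byte_order means "assume the input starts in host order". */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    int written = 0;
    int i;

    if (length < 0) {
        /* Zero-terminated input: count the units, then the terminator.
           A zero unit is zero in either byte order. */
        length = 0;
        while (input[length] != 0)
            length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        uint16_t c = input[i];
        if (c == 0xfeff || c == 0xfffe) {
            /* A BOM anywhere in the string resets the current order. */
            host_order = (c == 0xfeff);
            if (keep_boms)
                output[written++] = 0xfeff;
        } else {
            /* Swap the two bytes when the input is not in host order. */
            output[written++] =
                host_order ? c : (uint16_t)((c >> 8) | (c << 8));
        }
    }

    /* Pass the final byte order back to the caller, if requested. */
    if (byte_order != NULL)
        *byte_order = host_order;
    return written;
}
```

Because the write position never overtakes the read position, the sketch also works when output and input point to the same buffer, which matches the in-place conversion described above.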
389
390
391 SUBJECT STRING OFFSETS
392
393 The offsets within subject strings that are returned by the matching
394 functions are in 16-bit units rather than bytes.
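
Code that copies matched text with byte-oriented routines such as memcpy() has to scale these offsets. A minimal sketch of the arithmetic follows; the type byte_span_sketch and the function span16_to_bytes are hypothetical names, not part of PCRE, and the two arguments are assumed to be a start/end pair of offsets counted in 16-bit units.

```c
#include <stddef.h>
#include <stdint.h>

/* Position and size of a match expressed in bytes. Hypothetical type. */
typedef struct {
    size_t byte_start;
    size_t byte_length;
} byte_span_sketch;

/* Convert a start/end pair counted in 16-bit units into a byte span.
   Assumes 0 <= start_units <= end_units. */
static byte_span_sketch span16_to_bytes(int start_units, int end_units)
{
    byte_span_sketch s;
    s.byte_start = (size_t)start_units * sizeof(uint16_t);
    s.byte_length = (size_t)(end_units - start_units) * sizeof(uint16_t);
    return s;
}
```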
395
396
397 NAMED SUBPATTERNS
398
399 The name-to-number translation table that is maintained for named sub-
400 patterns uses 16-bit characters. The pcre16_get_stringtable_entries()
401 function returns the length of each entry in the table as the number of
402 16-bit data units.
403
404
405 OPTION NAMES
406
407 There are two new general option names, PCRE_UTF16 and
408 PCRE_NO_UTF16_CHECK, which correspond to PCRE_UTF8 and
409 PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
410 define the same bits in the options word. There is a discussion about
411 the validity of UTF-16 strings in the pcreunicode page.
412
413 For the pcre16_config() function there is an option PCRE_CONFIG_UTF16
414 that returns 1 if UTF-16 support is configured, otherwise 0. If this
415 option is given to pcre_config() or pcre32_config(), or if the
416 PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF32 option is given to pcre16_con-
417 fig(), the result is the PCRE_ERROR_BADOPTION error.
418
419
420 CHARACTER CODES
421
422 In 16-bit mode, when PCRE_UTF16 is not set, character values are
423 treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
424 that they can range from 0 to 0xffff instead of 0 to 0xff. Character
425 types for characters less than 0xff can therefore be influenced by the
426 locale in the same way as before. Characters greater than 0xff have
427 only one case, and no "type" (such as letter or digit).
428
429 In UTF-16 mode, the character code is Unicode, in the range 0 to
430 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
431 because those are "surrogate" values that are used in pairs to encode
432 values greater than 0xffff.
433
434 A UTF-16 string can indicate its endianness by a special code known as a
435 byte-order mark (BOM). The PCRE functions do not handle this, expecting
436 strings to be in host byte order. A utility function called
437 pcre16_utf16_to_host_byte_order() is provided to help with this (see
438 above).
439
440
441 ERROR NAMES
442
443 The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
444 spond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is
445 given when a compiled pattern is passed to a function that processes
446 patterns in the other mode, for example, if a pattern compiled with
447 pcre_compile() is passed to pcre16_exec().
448
449 There are new error codes whose names begin with PCRE_UTF16_ERR for
450 invalid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for
451 UTF-8 strings that are described in the section entitled "Reason codes
452 for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
453 are:
454
455 PCRE_UTF16_ERR1 Missing low surrogate at end of string
456 PCRE_UTF16_ERR2 Invalid low surrogate follows high surrogate
457 PCRE_UTF16_ERR3 Isolated low surrogate
458 PCRE_UTF16_ERR4 Non-character
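
The first three reasons in this list all concern surrogate pairing, which can be illustrated without the library. The sketch below is an assumption-laden illustration, not PCRE's own validity check: the function check_utf16_surrogates and the SKETCH_ codes are hypothetical, the codes are merely numbered to mirror the list above and are not the PCRE_UTF16_ERRn constants from pcre.h, and the non-character test (the fourth reason) is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes, numbered to mirror the first three reasons
   in the list above. They are NOT the constants defined in pcre.h. */
enum {
    SKETCH_UTF16_OK = 0,
    SKETCH_UTF16_MISSING_LOW = 1,   /* high surrogate at end of string */
    SKETCH_UTF16_INVALID_LOW = 2,   /* high surrogate not followed by low */
    SKETCH_UTF16_ISOLATED_LOW = 3   /* low surrogate with no high before it */
};

/* Check surrogate pairing in a host-byte-order string of `length` 16-bit
   units. On failure, *bad_offset (if not NULL) receives the offset, in
   16-bit units, of the unit at which the problem was detected.
   Non-characters are not checked. */
static int check_utf16_surrogates(const uint16_t *s, size_t length,
                                  size_t *bad_offset)
{
    size_t i;

    for (i = 0; i < length; i++) {
        uint16_t c = s[i];
        int err = SKETCH_UTF16_OK;

        if (c >= 0xd800 && c <= 0xdbff) {           /* high surrogate */
            if (i + 1 >= length)
                err = SKETCH_UTF16_MISSING_LOW;
            else if (s[i + 1] < 0xdc00 || s[i + 1] > 0xdfff)
                err = SKETCH_UTF16_INVALID_LOW;
            else
                i++;                                /* skip the low half */
        } else if (c >= 0xdc00 && c <= 0xdfff) {    /* low surrogate */
            err = SKETCH_UTF16_ISOLATED_LOW;
        }

        if (err != SKETCH_UTF16_OK) {
            if (bad_offset != NULL)
                *bad_offset = i;
            return err;
        }
    }
    return SKETCH_UTF16_OK;
}
```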
459
460
461 ERROR TEXTS
462
463 If there is an error while compiling a pattern, the error text that is
464 passed back by pcre16_compile() or pcre16_compile2() is still an 8-bit
465 character string, zero-terminated.
466
467
468 CALLOUTS
469
470 The subject and mark fields in the callout block that is passed to a
471 callout function point to 16-bit vectors.
472
473
474 TESTING
475
476 The pcretest program continues to operate with 8-bit input and output
477 files, but it can be used for testing the 16-bit library. If it is run
478 with the command line option -16, patterns and subject strings are con-
479 verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
480 library functions are used instead of the 8-bit ones. Returned 16-bit
481 strings are converted to 8-bit for output. If neither the 8-bit nor the
482 32-bit library was compiled, pcretest defaults to 16-bit and the
483 -16 option is ignored.
484
485 When PCRE is being built, the RunTest script that is called by "make
486 check" uses the pcretest -C option to discover which of the 8-bit,
487 16-bit and 32-bit libraries has been built, and runs the tests appro-
488 priately.
489
490
491 NOT SUPPORTED IN 16-BIT MODE
492
493 Not all the features of the 8-bit library are available with the 16-bit
494 library. The C++ and POSIX wrapper functions support only the 8-bit
495 library, and the pcregrep program is at present 8-bit only.
496
497
498 AUTHOR
499
500 Philip Hazel
501 University Computing Service
502 Cambridge CB2 3QH, England.
503
504
505 REVISION
506
507 Last updated: 08 November 2012
508 Copyright (c) 1997-2012 University of Cambridge.
509 ------------------------------------------------------------------------------
510
511
512 PCRE(3) Library Functions Manual PCRE(3)
513
514
515
516 NAME
517 PCRE - Perl-compatible regular expressions
518
519 #include <pcre.h>
520
521
522 PCRE 32-BIT API BASIC FUNCTIONS
523
524 pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
525 const char **errptr, int *erroffset,
526 const unsigned char *tableptr);
527
528 pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
529 int *errorcodeptr,
530 const char **errptr, int *erroffset,
531 const unsigned char *tableptr);
532
533 pcre32_extra *pcre32_study(const pcre32 *code, int options,
534 const char **errptr);
535
536 void pcre32_free_study(pcre32_extra *extra);
537
538 int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
539 PCRE_SPTR32 subject, int length, int startoffset,
540 int options, int *ovector, int ovecsize);
541
542 int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
543 PCRE_SPTR32 subject, int length, int startoffset,
544 int options, int *ovector, int ovecsize,
545 int *workspace, int wscount);
546
547
548 PCRE 32-BIT API STRING EXTRACTION FUNCTIONS
549
550 int pcre32_copy_named_substring(const pcre32 *code,
551 PCRE_SPTR32 subject, int *ovector,
552 int stringcount, PCRE_SPTR32 stringname,
553 PCRE_UCHAR32 *buffer, int buffersize);
554
555 int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
556 int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
557 int buffersize);
558
559 int pcre32_get_named_substring(const pcre32 *code,
560 PCRE_SPTR32 subject, int *ovector,
561 int stringcount, PCRE_SPTR32 stringname,
562 PCRE_SPTR32 *stringptr);
563
564 int pcre32_get_stringnumber(const pcre32 *code,
565 PCRE_SPTR32 name);
566
567 int pcre32_get_stringtable_entries(const pcre32 *code,
568 PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
569
570 int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
571 int stringcount, int stringnumber,
572 PCRE_SPTR32 *stringptr);
573
574 int pcre32_get_substring_list(PCRE_SPTR32 subject,
575 int *ovector, int stringcount, PCRE_SPTR32 **listptr);
576
577 void pcre32_free_substring(PCRE_SPTR32 stringptr);
578
579 void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);
580
581
582 PCRE 32-BIT API AUXILIARY FUNCTIONS
583
584 pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);
585
586 void pcre32_jit_stack_free(pcre32_jit_stack *stack);
587
588 void pcre32_assign_jit_stack(pcre32_extra *extra,
589 pcre32_jit_callback callback, void *data);
590
591 const unsigned char *pcre32_maketables(void);
592
593 int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
594 int what, void *where);
595
596 int pcre32_refcount(pcre32 *code, int adjust);
597
598 int pcre32_config(int what, void *where);
599
600 const char *pcre32_version(void);
601
602 int pcre32_pattern_to_host_byte_order(pcre32 *code,
603 pcre32_extra *extra, const unsigned char *tables);
604
605
606 PCRE 32-BIT API INDIRECTED FUNCTIONS
607
608 void *(*pcre32_malloc)(size_t);
609
610 void (*pcre32_free)(void *);
611
612 void *(*pcre32_stack_malloc)(size_t);
613
614 void (*pcre32_stack_free)(void *);
615
616 int (*pcre32_callout)(pcre32_callout_block *);
617
618
619 PCRE 32-BIT API 32-BIT-ONLY FUNCTION
620
621 int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
622 PCRE_SPTR32 input, int length, int *byte_order,
623 int keep_boms);
624
625
626 THE PCRE 32-BIT LIBRARY
627
628 Starting with release 8.32, it is possible to compile a PCRE library
629 that supports 32-bit character strings, including UTF-32 strings, as
630 well as or instead of the original 8-bit library. This work was done by
631 Christian Persch, based on the work done by Zoltan Herczeg for the
632 16-bit library. All three libraries contain identical sets of func-
633 tions, used in exactly the same way. Only the names of the functions
634 and the data types of their arguments and results are different. To
635 avoid over-complication and reduce the documentation maintenance load,
636 most of the PCRE documentation describes the 8-bit library, with only
637 occasional references to the 16-bit and 32-bit libraries. This page
638 describes what is different when you use the 32-bit library.
639
640 WARNING: A single application can be linked with all or any of the
641 three libraries, but you must take care when processing any particular
642 pattern to use functions from just one library. For example, if you
643 want to study a pattern that was compiled with pcre32_compile(), you
644 must do so with pcre32_study(), not pcre_study(), and you must free the
645 study data with pcre32_free_study().
646
647
648 THE HEADER FILE
649
650 There is only one header file, pcre.h. It contains prototypes for all
651 the functions in all libraries, as well as definitions of flags, struc-
652 tures, error codes, etc.
653
654
655 THE LIBRARY NAME
656
657 In Unix-like systems, the 32-bit library is called libpcre32, and can
658 normally be accessed by adding -lpcre32 to the command for linking an
659 application that uses PCRE.
660
661
662 STRING TYPES
663
664 In the 8-bit library, strings are passed to PCRE library functions as
665 vectors of bytes with the C type "char *". In the 32-bit library,
666 strings are passed as vectors of unsigned 32-bit quantities. The macro
667 PCRE_UCHAR32 specifies an appropriate data type, and PCRE_SPTR32 is
668 defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
669 int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
670 as "unsigned int", but checks that it really is a 32-bit data type. If
671 it is not, the build fails with an error message telling the maintainer
672 to modify the definition appropriately.
673
674
675 STRUCTURE TYPES
676
677 The types of the opaque structures that are used for compiled 32-bit
678 patterns and JIT stacks are pcre32 and pcre32_jit_stack respectively.
679 The type of the user-accessible structure that is returned by
680 pcre32_study() is pcre32_extra, and the type of the structure that is
681 used for passing data to a callout function is pcre32_callout_block.
682 These structures contain the same fields, with the same names, as their
683 8-bit counterparts. The only difference is that pointers to character
684 strings are 32-bit instead of 8-bit types.
685
686
687 32-BIT FUNCTIONS
688
689 For every function in the 8-bit library there is a corresponding func-
690 tion in the 32-bit library with a name that starts with pcre32_ instead
691 of pcre_. The prototypes are listed above. In addition, there is one
692 extra function, pcre32_utf32_to_host_byte_order(). This is a utility
693 function that converts a UTF-32 character string to host byte order if
694 necessary. The other 32-bit functions expect the strings they are
695 passed to be in host byte order.
696
697 The input and output arguments of pcre32_utf32_to_host_byte_order() may
698 point to the same address, that is, conversion in place is supported.
699 The output buffer must be at least as long as the input.
700
701 The length argument specifies the number of 32-bit data units in the
702 input string; a negative value specifies a zero-terminated string.
703
704 If byte_order is NULL, it is assumed that the string starts off in host
705 byte order. This may be changed by byte-order marks (BOMs) anywhere in
706 the string (commonly as the first character).
707
708 If byte_order is not NULL, a non-zero value of the integer to which it
709 points means that the input starts off in host byte order, otherwise
710 the opposite order is assumed. Again, BOMs in the string can change
711 this. The final byte order is passed back at the end of processing.
712
713 If keep_boms is not zero, byte-order mark characters (0xfeff) are
714 copied into the output string. Otherwise they are discarded.
715
716 The result of the function is the number of 32-bit units placed into
717 the output buffer, including the zero terminator if the string was
718 zero-terminated.
719
720
721 SUBJECT STRING OFFSETS
722
723 The offsets within subject strings that are returned by the matching
724 functions are in 32-bit units rather than bytes.
725
726
727 NAMED SUBPATTERNS
728
729 The name-to-number translation table that is maintained for named sub-
730 patterns uses 32-bit characters. The pcre32_get_stringtable_entries()
731 function returns the length of each entry in the table as the number of
732 32-bit data units.
733
734
735 OPTION NAMES
736
737 There are two new general option names, PCRE_UTF32 and
738 PCRE_NO_UTF32_CHECK, which correspond to PCRE_UTF8 and
739 PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
740 define the same bits in the options word. There is a discussion about
741 the validity of UTF-32 strings in the pcreunicode page.
742
743 For the pcre32_config() function there is an option PCRE_CONFIG_UTF32
744 that returns 1 if UTF-32 support is configured, otherwise 0. If this
745 option is given to pcre_config() or pcre16_config(), or if the
746 PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF16 option is given to pcre32_con-
747 fig(), the result is the PCRE_ERROR_BADOPTION error.
748
749
750 CHARACTER CODES
751
752 In 32-bit mode, when PCRE_UTF32 is not set, character values are
753 treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
754 that they can range from 0 to 0x7fffffff instead of 0 to 0xff. Charac-
755 ter types for characters less than 0xff can therefore be influenced by
756 the locale in the same way as before. Characters greater than 0xff
757 have only one case, and no "type" (such as letter or digit).
758
759 In UTF-32 mode, the character code is Unicode, in the range 0 to
760 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
761 because those are "surrogate" values that are ill-formed in UTF-32.
762
763 A UTF-32 string can indicate its endianness by a special code known as a
764 byte-order mark (BOM). The PCRE functions do not handle this, expecting
765 strings to be in host byte order. A utility function called
766 pcre32_utf32_to_host_byte_order() is provided to help with this (see
767 above).
768
769
770 ERROR NAMES
771
772 The error PCRE_ERROR_BADUTF32 corresponds to its 8-bit counterpart.
773 The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
774 to a function that processes patterns in the other mode, for example,
775 if a pattern compiled with pcre_compile() is passed to pcre32_exec().
776
777 There are new error codes whose names begin with PCRE_UTF32_ERR for
778 invalid UTF-32 strings, corresponding to the PCRE_UTF8_ERR codes for
779 UTF-8 strings that are described in the section entitled "Reason codes
780 for invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
781 are:
782
783 PCRE_UTF32_ERR1 Surrogate character (range from 0xd800 to 0xdfff)
784 PCRE_UTF32_ERR2 Non-character
785 PCRE_UTF32_ERR3 Character > 0x10ffff
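The three reason codes above correspond to three simple range checks. The following sketch shows one way to classify a single 32-bit data unit. The enum values and function names are hypothetical (they are not the PCRE_UTF32_ERRn constants from pcre.h), and the non-character test uses the Unicode definition of noncharacters, which is an assumption about what the library checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical codes mirroring the three categories listed above. */
enum utf32_reason {
  UTF32_OK = 0,
  UTF32_SURROGATE = 1,     /* 0xd800 to 0xdfff */
  UTF32_NONCHARACTER = 2,  /* assumed: the Unicode noncharacter set */
  UTF32_TOO_LARGE = 3      /* greater than 0x10ffff */
};

/* Classify one data unit. */
enum utf32_reason classify_utf32_unit(uint32_t c)
{
  if (c > 0x10ffffu) return UTF32_TOO_LARGE;
  if (c >= 0xd800u && c <= 0xdfffu) return UTF32_SURROGATE;
  /* U+FDD0..U+FDEF, and the last two code points of every plane */
  if ((c >= 0xfdd0u && c <= 0xfdefu) || (c & 0xfffeu) == 0xfffeu)
    return UTF32_NONCHARACTER;
  return UTF32_OK;
}

/* Return the index of the first offending unit, or -1 if the whole
   string passes; the reason is stored through *reason when non-NULL. */
long first_invalid_utf32(const uint32_t *s, size_t len,
                         enum utf32_reason *reason)
{
  size_t i;

  assert(s != NULL || len == 0);
  for (i = 0; i < len; i++) {
    enum utf32_reason r = classify_utf32_unit(s[i]);
    if (r != UTF32_OK) {
      if (reason != NULL) *reason = r;
      return (long)i;
    }
  }
  if (reason != NULL) *reason = UTF32_OK;
  return -1;
}
```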
786
787
788 ERROR TEXTS
789
790 If there is an error while compiling a pattern, the error text that is
791 passed back by pcre32_compile() or pcre32_compile2() is still an 8-bit
792 character string, zero-terminated.
793
794
795 CALLOUTS
796
797 The subject and mark fields in the callout block that is passed to a
798 callout function point to 32-bit vectors.
799
800
801 TESTING
802
803 The pcretest program continues to operate with 8-bit input and output
804 files, but it can be used for testing the 32-bit library. If it is run
805 with the command line option -32, patterns and subject strings are con-
806 verted from 8-bit to 32-bit before being passed to PCRE, and the 32-bit
807 library functions are used instead of the 8-bit ones. Returned 32-bit
808 strings are converted to 8-bit for output. If neither the 8-bit nor the
809 16-bit library was compiled, pcretest defaults to 32-bit and the
810 -32 option is ignored.
811
812 When PCRE is being built, the RunTest script that is called by "make
813 check" uses the pcretest -C option to discover which of the 8-bit,
814 16-bit and 32-bit libraries has been built, and runs the tests appro-
815 priately.
816
817
818 NOT SUPPORTED IN 32-BIT MODE
819
820 Not all the features of the 8-bit library are available with the 32-bit
821 library. The C++ and POSIX wrapper functions support only the 8-bit
822 library, and the pcregrep program is at present 8-bit only.
823
824
825 AUTHOR
826
827 Philip Hazel
828 University Computing Service
829 Cambridge CB2 3QH, England.
830
831
832 REVISION
833
834 Last updated: 08 November 2012
835 Copyright (c) 1997-2012 University of Cambridge.
836 ------------------------------------------------------------------------------
837
838
839 PCREBUILD(3) Library Functions Manual PCREBUILD(3)
840
841
842
843 NAME
844 PCRE - Perl-compatible regular expressions
845
846 PCRE BUILD-TIME OPTIONS
847
848 This document describes the optional features of PCRE that can be
849 selected when the library is compiled. It assumes use of the configure
850 script, where the optional features are selected or deselected by pro-
851 viding options to configure before running the make command. However,
852 the same options can be selected in both Unix-like and non-Unix-like
853 environments using the GUI facility of cmake-gui if you are using CMake
854 instead of configure to build PCRE.
855
856 There is a lot more information about building PCRE without using con-
857 figure (including information about using CMake or building "by hand")
858 in the file called NON-AUTOTOOLS-BUILD, which is part of the PCRE dis-
859 tribution. You should consult this file as well as the README file if
860 you are building in a non-Unix-like environment.
861
862 The complete list of options for configure (which includes the standard
863 ones such as the selection of the installation directory) can be
864 obtained by running
865
866 ./configure --help
867
868 The following sections include descriptions of options whose names
869 begin with --enable or --disable. These settings specify changes to the
870 defaults for the configure command. Because of the way that configure
871 works, --enable and --disable always come in pairs, so the complemen-
872 tary option always exists as well, but as it specifies the default, it
873 is not described.
874
875
876 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
877
878 By default, a library called libpcre is built, containing functions
879 that take string arguments contained in vectors of bytes, either as
880 single-byte characters, or interpreted as UTF-8 strings. You can also
881 build a separate library, called libpcre16, in which strings are con-
882 tained in vectors of 16-bit data units and interpreted either as sin-
883 gle-unit characters or UTF-16 strings, by adding
884
885 --enable-pcre16
886
887 to the configure command. You can also build a separate library, called
888 libpcre32, in which strings are contained in vectors of 32-bit data
889 units and interpreted either as single-unit characters or UTF-32
890 strings, by adding
891
892 --enable-pcre32
893
894 to the configure command. If you do not want the 8-bit library, add
895
896 --disable-pcre8
897
898 as well. At least one of the three libraries must be built. Note that
899 the C++ and POSIX wrappers are for the 8-bit library only, and that
900 pcregrep is an 8-bit program. None of these are built if you select
901 only the 16-bit or 32-bit libraries.
902
903
904 BUILDING SHARED AND STATIC LIBRARIES
905
906 The PCRE building process uses libtool to build both shared and static
907 Unix libraries by default. You can suppress one of these by adding one
908 of
909
910 --disable-shared
911 --disable-static
912
913 to the configure command, as required.
914
915
916 C++ SUPPORT
917
918 By default, if the 8-bit library is being built, the configure script
919 will search for a C++ compiler and C++ header files. If it finds them,
920 it automatically builds the C++ wrapper library (which supports only
921 8-bit strings). You can disable this by adding
922
923 --disable-cpp
924
925 to the configure command.
926
927
928 UTF-8, UTF-16 AND UTF-32 SUPPORT
929
930 To build PCRE with support for UTF Unicode character strings, add
931
932 --enable-utf
933
934 to the configure command. This setting applies to all three libraries,
935 adding support for UTF-8 to the 8-bit library, support for UTF-16 to
936 the 16-bit library, and support for UTF-32 to the 32-bit
937 library. There are no separate options for enabling UTF-8, UTF-16 and
938 UTF-32 independently because that would allow ridiculous settings such
939 as requesting UTF-16 support while building only the 8-bit library. It
940 is not possible to build one library with UTF support and another with-
941 out in the same configuration. (For backwards compatibility, --enable-
942 utf8 is a synonym of --enable-utf.)
943
944 Of itself, this setting does not make PCRE treat strings as UTF-8,
945 UTF-16 or UTF-32. As well as compiling PCRE with this option, you also
946 have to set the PCRE_UTF8, PCRE_UTF16 or PCRE_UTF32 option (as
947 appropriate) when you call one of the pattern compiling functions.
948
949 If you set --enable-utf when compiling in an EBCDIC environment, PCRE
950 expects its input to be either ASCII or UTF-8 (depending on the run-
951 time option). It is not possible to support both EBCDIC and UTF-8 codes
952 in the same version of the library. Consequently, --enable-utf and
953 --enable-ebcdic are mutually exclusive.
954
955
956 UNICODE CHARACTER PROPERTY SUPPORT
957
958 UTF support allows the libraries to process character codepoints up to
959 0x10ffff in the strings that they handle. On its own, however, it does
960 not provide any facilities for accessing the properties of such charac-
961 ters. If you want to be able to use the pattern escapes \P, \p, and \X,
962 which refer to Unicode character properties, you must add
963
964 --enable-unicode-properties
965
966 to the configure command. This implies UTF support, even if you have
967 not explicitly requested it.
968
969 Including Unicode property support adds around 30K of tables to the
970 PCRE library. Only the general category properties such as Lu and Nd
971 are supported. Details are given in the pcrepattern documentation.
972
973
974 JUST-IN-TIME COMPILER SUPPORT
975
976 Just-in-time compiler support is included in the build by specifying
977
978 --enable-jit
979
980 This support is available only for certain hardware architectures. If
981 this option is set for an unsupported architecture, a compile time
982 error occurs. See the pcrejit documentation for a discussion of JIT
983 usage. When JIT support is enabled, pcregrep automatically makes use of
984 it, unless you add
985
986 --disable-pcregrep-jit
987
988 to the "configure" command.
989
990
991 CODE VALUE OF NEWLINE
992
993 By default, PCRE interprets the linefeed (LF) character as indicating
994 the end of a line. This is the normal newline character on Unix-like
995 systems. You can compile PCRE to use carriage return (CR) instead, by
996 adding
997
998 --enable-newline-is-cr
999
1000 to the configure command. There is also a --enable-newline-is-lf
1001 option, which explicitly specifies linefeed as the newline character.
1002
1003 Alternatively, you can specify that line endings are to be indicated by
1004 the two character sequence CRLF. If you want this, add
1005
1006 --enable-newline-is-crlf
1007
1008 to the configure command. There is a fourth option, specified by
1009
1010 --enable-newline-is-anycrlf
1011
1012 which causes PCRE to recognize any of the three sequences CR, LF, or
1013 CRLF as indicating a line ending. Finally, a fifth option, specified by
1014
1015 --enable-newline-is-any
1016
1017 causes PCRE to recognize any Unicode newline sequence.
1018
1019 Whatever line ending convention is selected when PCRE is built can be
1020 overridden when the library functions are called. At build time it is
1021 conventional to use the standard for your operating system.
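To make the difference between the conventions concrete, the sketch below tests whether a newline starts at a given position under four of the five settings. It is an illustration only: the enum and function names are hypothetical, and the "any Unicode newline" convention is left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for four of the newline conventions. */
enum newline_mode { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Return the length in bytes of the newline that starts at p, or 0 if
   there is none. end points just past the last byte of the subject. */
size_t newline_length(const char *p, const char *end, enum newline_mode mode)
{
  assert(p != NULL && end != NULL && p <= end);
  if (p == end) return 0;

  switch (mode) {
    case NL_CR:
      return (*p == '\r') ? 1 : 0;
    case NL_LF:
      return (*p == '\n') ? 1 : 0;
    case NL_CRLF:
      return (end - p >= 2 && p[0] == '\r' && p[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
      /* CRLF is taken as one two-character newline, not CR then LF */
      if (*p == '\r') return (end - p >= 2 && p[1] == '\n') ? 2 : 1;
      return (*p == '\n') ? 1 : 0;
  }
  return 0;
}
```

With the subject "a\r\nb", the position after "a" is a one-character newline under the CR convention, a two-character newline under CRLF and ANYCRLF, and not a newline at all under LF.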
1022
1023
1024 WHAT \R MATCHES
1025
1026 By default, the sequence \R in a pattern matches any Unicode newline
1027 sequence, whatever has been selected as the line ending sequence. If
1028 you specify
1029
1030 --enable-bsr-anycrlf
1031
1032 the default is changed so that \R matches only CR, LF, or CRLF. What-
1033 ever is selected when PCRE is built can be overridden when the library
1034 functions are called.
1035
1036
1037 POSIX MALLOC USAGE
1038
1039 When the 8-bit library is called through the POSIX interface (see the
1040 pcreposix documentation), additional working storage is required for
1041 holding the pointers to capturing substrings, because PCRE requires
1042 three integers per substring, whereas the POSIX interface provides only
1043 two. If the number of expected substrings is small, the wrapper func-
1044 tion uses space on the stack, because this is faster than using mal-
1045 loc() for each call. The default threshold above which the stack is no
1046 longer used is 10; it can be changed by adding a setting such as
1047
1048 --with-posix-malloc-threshold=20
1049
1050 to the configure command.
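The threshold logic described above can be sketched as follows. This is a simplified stand-in, not the wrapper's actual code: run_with_ovector() and its used_heap output are hypothetical, the constant is fixed at the documented default of 10, and the fill loop merely stands in for the call that would perform the match.

```c
#include <assert.h>
#include <stdlib.h>

#define POSIX_MALLOC_THRESHOLD 10   /* the documented default */

/* Hypothetical helper: obtains working storage for nmatch substrings at
   three integers each, from the stack when nmatch is at or below the
   threshold and from malloc() otherwise. Returns 0 on success or -1 if
   malloc() fails; *used_heap reports which path was taken. */
int run_with_ovector(size_t nmatch, int *used_heap)
{
  int small_ovector[POSIX_MALLOC_THRESHOLD * 3];
  int *ovector = small_ovector;
  size_t i;

  assert(used_heap != NULL);
  *used_heap = 0;

  if (nmatch > POSIX_MALLOC_THRESHOLD) {
    ovector = (int *)malloc(sizeof(int) * nmatch * 3);
    if (ovector == NULL) return -1;
    *used_heap = 1;
  }

  /* Stand-in for the matching call that would fill the vector. */
  for (i = 0; i < nmatch * 3; i++) ovector[i] = -1;

  if (*used_heap) free(ovector);
  return 0;
}
```

Raising the threshold trades a larger stack frame on every call for fewer malloc() calls when many substrings are requested.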
1051
1052
1053 HANDLING VERY LARGE PATTERNS
1054
1055 Within a compiled pattern, offset values are used to point from one
1056 part to another (for example, from an opening parenthesis to an alter-
1057 nation metacharacter). By default, in the 8-bit and 16-bit libraries,
1058 two-byte values are used for these offsets, leading to a maximum size
1059 for a compiled pattern of around 64K. This is sufficient to handle all
1060 but the most gigantic patterns. Nevertheless, some people do want to
1061 process truly enormous patterns, so it is possible to compile PCRE to
1062 use three-byte or four-byte offsets by adding a setting such as
1063
1064 --with-link-size=3
1065
1066 to the configure command. The value given must be 2, 3, or 4. For the
1067 16-bit library, a value of 3 is rounded up to 4. In these libraries,
1068 using longer offsets slows down the operation of PCRE because it has to
1069 load additional data when handling them. For the 32-bit library the
1070 value is always 4 and cannot be overridden; the value of --with-link-
1071 size is ignored.
1072
1073
1074 AVOIDING EXCESSIVE STACK USAGE
1075
1076 When matching with the pcre_exec() function, PCRE implements backtrack-
1077 ing by making recursive calls to an internal function called match().
1078 In environments where the size of the stack is limited, this can se-
1079 verely limit PCRE's operation. (The Unix environment does not usually
1080 suffer from this problem, but it may sometimes be necessary to increase
1081 the maximum stack size. There is a discussion in the pcrestack docu-
1082 mentation.) An alternative approach to recursion that uses memory from
1083 the heap to remember data, instead of using recursive function calls,
1084 has been implemented to work round the problem of limited stack size.
1085 If you want to build a version of PCRE that works this way, add
1086
1087 --disable-stack-for-recursion
1088
1089 to the configure command. With this configuration, PCRE will use the
1090 pcre_stack_malloc and pcre_stack_free variables to call memory manage-
1091 ment functions. By default these point to malloc() and free(), but you
1092 can replace the pointers so that your own functions are used instead.
1093
1094 Separate functions are provided rather than using pcre_malloc and
1095 pcre_free because the usage is very predictable: the block sizes
1096 requested are always the same, and the blocks are always freed in
1097 reverse order. A calling program might be able to implement optimized
1098 functions that perform better than malloc() and free(). PCRE runs
1099 noticeably more slowly when built in this way. This option affects only
1100 the pcre_exec() function; it is not relevant for pcre_dfa_exec().
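Because every block has the same size, a replacement allocator can keep freed blocks on a small free list and hand them straight back. The sketch below is one possible approach under that assumption. The names my_stack_malloc() and my_stack_free() and the pool size are hypothetical, and the code is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SLOTS 64                 /* hypothetical cache capacity */

static void *pool[POOL_SLOTS];        /* freed blocks kept for reuse */
static size_t pool_count = 0;
static size_t pool_block_size = 0;    /* learned from the first request */

/* Return a cached block if one is available, otherwise call malloc().
   Relies on the documented property that all requests have one size. */
void *my_stack_malloc(size_t size)
{
  assert(size > 0);
  if (pool_block_size == 0) pool_block_size = size;
  assert(size == pool_block_size);    /* the assumption this sketch needs */
  if (pool_count > 0) return pool[--pool_count];
  return malloc(size);
}

/* Keep the block for later reuse, or release it if the cache is full. */
void my_stack_free(void *block)
{
  if (block == NULL) return;
  if (pool_count < POOL_SLOTS) pool[pool_count++] = block;
  else free(block);
}
```

In a program linked against a PCRE built with --disable-stack-for-recursion, functions of this shape would be installed by assigning them to the pcre_stack_malloc and pcre_stack_free variables before calling pcre_exec().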
1101
1102
1103 LIMITING PCRE RESOURCE USAGE
1104
1105 Internally, PCRE has a function called match(), which it calls repeat-
1106 edly (sometimes recursively) when matching a pattern with the
1107 pcre_exec() function. By controlling the maximum number of times this
1108 function may be called during a single matching operation, a limit can
1109 be placed on the resources used by a single call to pcre_exec(). The
1110 limit can be changed at run time, as described in the pcreapi documen-
1111 tation. The default is 10 million, but this can be changed by adding a
1112 setting such as
1113
1114 --with-match-limit=500000
1115
1116 to the configure command. This setting has no effect on the
1117 pcre_dfa_exec() matching function.
1118
1119 In some environments it is desirable to limit the depth of recursive
1120 calls of match() more strictly than the total number of calls, in order
1121 to restrict the maximum amount of stack (or heap, if --disable-stack-
1122 for-recursion is specified) that is used. A second limit controls this;
1123 it defaults to the value that is set for --with-match-limit, which
1124 imposes no additional constraints. However, you can set a lower limit
1125 by adding, for example,
1126
1127 --with-match-limit-recursion=10000
1128
1129 to the configure command. This value can also be overridden at run
1130 time.
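The two limits can be illustrated with a toy backtracking matcher that counts its own calls and tracks its recursion depth. This is a model of the idea only, not PCRE's match() function: the pattern language (literals, "." and a postfix "*", matched against the whole text), the names, and the return codes are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOMATCH              0
#define TOY_MATCH                1
#define TOY_ERR_MATCHLIMIT     (-1)   /* total number of calls exceeded */
#define TOY_ERR_RECURSIONLIMIT (-2)   /* nesting depth exceeded */

struct toy_limits {
  unsigned long match_limit;      /* maximum number of calls */
  unsigned long recursion_limit;  /* maximum nesting depth */
  unsigned long calls;            /* running total; set to 0 before use */
};

/* Match pat against the whole of text, giving up when a limit is hit.
   Call with depth 0. */
int toy_match(const char *pat, const char *text,
              struct toy_limits *lim, unsigned long depth)
{
  int rc;

  assert(pat != NULL && text != NULL && lim != NULL);
  if (++lim->calls > lim->match_limit) return TOY_ERR_MATCHLIMIT;
  if (depth > lim->recursion_limit) return TOY_ERR_RECURSIONLIMIT;

  if (*pat == '\0') return (*text == '\0') ? TOY_MATCH : TOY_NOMATCH;

  if (pat[1] == '*') {
    /* Greedy: first try to consume one more character ... */
    if (*text != '\0' && (*pat == '.' || *pat == *text)) {
      rc = toy_match(pat, text + 1, lim, depth + 1);
      if (rc != TOY_NOMATCH) return rc;   /* a match, or a limit error */
    }
    /* ... then back off and try the rest of the pattern here. */
    return toy_match(pat + 2, text, lim, depth + 1);
  }

  if (*text != '\0' && (*pat == '.' || *pat == *text))
    return toy_match(pat + 1, text + 1, lim, depth + 1);
  return TOY_NOMATCH;
}
```

A pattern such as "a*a*a*b" applied to a run of "a" characters with no "b" makes many calls before failing, so a small match_limit stops it early. The depth can never exceed the number of calls, which is why a recursion limit equal to the match limit imposes no additional constraint.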
1131
1132
1133 CREATING CHARACTER TABLES AT BUILD TIME
1134
1135 PCRE uses fixed tables for processing characters whose code values are
1136 less than 256. By default, PCRE is built with a set of tables that are
1137 distributed in the file pcre_chartables.c.dist. These tables are for
1138 ASCII codes only. If you add
1139
1140 --enable-rebuild-chartables
1141
1142 to the configure command, the distributed tables are no longer used.
1143 Instead, a program called dftables is compiled and run. This outputs
1144 the source for a new set of tables, created in the default locale of your
1145 C run-time system. (This method of replacing the tables does not work
1146 if you are cross compiling, because dftables is run on the local host.
1147 If you need to create alternative tables when cross compiling, you will
1148 have to do so "by hand".)
1149
1150
1151 USING EBCDIC CODE
1152
1153 PCRE assumes by default that it will run in an environment where the
1154 character code is ASCII (or Unicode, which is a superset of ASCII).
1155 This is the case for most computer operating systems. PCRE can, how-
1156 ever, be compiled to run in an EBCDIC environment by adding
1157
1158 --enable-ebcdic
1159
1160 to the configure command. This setting implies --enable-rebuild-charta-
1161 bles. You should only use it if you know that you are in an EBCDIC
1162 environment (for example, an IBM mainframe operating system). The
1163 --enable-ebcdic option is incompatible with --enable-utf.
1164
1165 The EBCDIC character that corresponds to an ASCII LF is assumed to have
1166 the value 0x15 by default. However, in some EBCDIC environments, 0x25
1167 is used. In such an environment you should use
1168
1169 --enable-ebcdic-nl25
1170
1171 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
1172 has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
1173 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
1174 acter (which, in Unicode, is 0x85).
1175
1176 The options that select newline behaviour, such as --enable-newline-is-
1177 cr, and equivalent run-time options, refer to these character values in
1178 an EBCDIC environment.
1179
1180
1181 PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
1182
1183 By default, pcregrep reads all files as plain text. You can build it so
1184 that it recognizes files whose names end in .gz or .bz2, and reads them
1185 with libz or libbz2, respectively, by adding one or both of
1186
1187 --enable-pcregrep-libz
1188 --enable-pcregrep-libbz2
1189
1190 to the configure command. These options naturally require that the rel-
1191 evant libraries are installed on your system. Configuration will fail
1192 if they are not.
1193
1194
1195 PCREGREP BUFFER SIZE
1196
1197 pcregrep uses an internal buffer to hold a "window" on the file it is
1198 scanning, in order to be able to output "before" and "after" lines when
1199 it finds a match. The size of the buffer is controlled by a parameter
1200 whose default value is 20K. The buffer itself is three times this size,
1201 but because of the way it is used for holding "before" lines, the long-
1202 est line that is guaranteed to be processable is the parameter size.
1203 You can change the default parameter value by adding, for example,
1204
1205 --with-pcregrep-bufsize=50K
1206
1207 to the configure command. The caller of pcregrep can, however, override
1208 this value by specifying a run-time option.
1209
1210
1211 PCRETEST OPTION FOR LIBREADLINE SUPPORT
1212
1213 If you add
1214
1215 --enable-pcretest-libreadline
1216
1217 to the configure command, pcretest is linked with the libreadline
1218 library, and when its input is from a terminal, it reads it using the
1219 readline() function. This provides line-editing and history facilities.
1220 Note that libreadline is GPL-licensed, so if you distribute a binary of
1221 pcretest linked in this way, there may be licensing issues.
1222
1223 Setting this option causes the -lreadline option to be added to the
1224 pcretest build. In many operating environments with a system-installed
1225 libreadline this is sufficient. However, in some environments (e.g. if
1226 an unmodified distribution version of readline is in use), some extra
1227 configuration may be necessary. The INSTALL file for libreadline says
1228 this:
1229
1230 "Readline uses the termcap functions, but does not link with the
1231 termcap or curses library itself, allowing applications which link
1232 with readline the to choose an appropriate library."
1233
1234 If your environment has not been set up so that an appropriate library
1235 is automatically included, you may need to add something like
1236
1237 LIBS="-lncurses"
1238
1239 immediately before the configure command.
1240
1241
1242 DEBUGGING WITH VALGRIND SUPPORT
1243
1244 By adding the
1245
1246 --enable-valgrind
1247
1248 option to the configure command, PCRE will use valgrind annotations
1249 to mark certain memory regions as unaddressable. This allows it to
1250 detect invalid memory accesses, and is mostly useful for debugging PCRE
1251 itself.
1252
1253
1254 CODE COVERAGE REPORTING
1255
1256 If your C compiler is gcc, you can build a version of PCRE that can
1257 generate a code coverage report for its test suite. To enable this, you
1258 must install lcov version 1.6 or above. Then specify
1259
1260 --enable-coverage
1261
1262 to the configure command and build PCRE in the usual way.
1263
1264 Note that using ccache (a caching C compiler) is incompatible with code
1265 coverage reporting. If you have configured ccache to run automatically
1266 on your system, you must set the environment variable
1267
1268 CCACHE_DISABLE=1
1269
1270 before running make to build PCRE, so that ccache is not used.
1271
1272 When --enable-coverage is used, the following additional targets are
1273 added to the Makefile:
1274
1275 make coverage
1276
1277 This creates a fresh coverage report for the PCRE test suite. It is
1278 equivalent to running "make coverage-reset", "make coverage-baseline",
1279 "make check", and then "make coverage-report".
1280
1281 make coverage-reset
1282
1283 This zeroes the coverage counters, but does nothing else.
1284
1285 make coverage-baseline
1286
1287 This captures baseline coverage information.
1288
1289 make coverage-report
1290
1291 This creates the coverage report.
1292
1293 make coverage-clean-report
1294
1295 This removes the generated coverage report without cleaning the cover-
1296 age data itself.
1297
1298 make coverage-clean-data
1299
1300 This removes the captured coverage data without removing the coverage
1301 files created at compile time (*.gcno).
1302
1303 make coverage-clean
1304
1305 This cleans all coverage data including the generated coverage report.
1306 For more information about code coverage, see the gcov and lcov docu-
1307 mentation.
1308
1309
1310 SEE ALSO
1311
1312 pcreapi(3), pcre16, pcre32, pcre_config(3).
1313
1314
1315 AUTHOR
1316
1317 Philip Hazel
1318 University Computing Service
1319 Cambridge CB2 3QH, England.
1320
1321
1322 REVISION
1323
1324 Last updated: 30 October 2012
1325 Copyright (c) 1997-2012 University of Cambridge.
1326 ------------------------------------------------------------------------------
1327
1328
1329 PCREMATCHING(3) Library Functions Manual PCREMATCHING(3)
1330
1331
1332
1333 NAME
1334 PCRE - Perl-compatible regular expressions
1335
1336 PCRE MATCHING ALGORITHMS
1337
1338 This document describes the two different algorithms that are available
1339 in PCRE for matching a compiled regular expression against a given sub-
1340 ject string. The "standard" algorithm is the one provided by the
1341 pcre_exec(), pcre16_exec() and pcre32_exec() functions. These work in
1342 the same way as Perl's matching function, and provide a Perl-compatible
1343 matching operation. The just-in-time (JIT) optimization that is
1344 described in the pcrejit documentation is compatible with these func-
1345 tions.
1346
1347 An alternative algorithm is provided by the pcre_dfa_exec(),
1348 pcre16_dfa_exec() and pcre32_dfa_exec() functions; they operate in a
1349 different way, and are not Perl-compatible. This alternative has advan-
1350 tages and disadvantages compared with the standard algorithm, and these
1351 are described below.
1352
1353 When there is only one possible way in which a given subject string can
1354 match a pattern, the two algorithms give the same answer. A difference
1355 arises, however, when there are multiple possibilities. For example, if
1356 the pattern
1357
1358 ^<.*>
1359
1360 is matched against the string
1361
1362 <something> <something else> <something further>
1363
1364 there are three possible answers. The standard algorithm finds only one
1365 of them, whereas the alternative algorithm finds all three.
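The three answers for this example can be enumerated with a single left-to-right scan, which is the spirit of the alternative algorithm. The sketch below is hand-coded for the one pattern ^<.*> and does not use the PCRE library. The function name is hypothetical, and treating "." as not matching a linefeed reflects the default behaviour only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Record the length of every match of ^<.*> that starts at offset 0,
   longest first, in lengths[]. At most max lengths are recorded, so the
   caller should pass a large enough array. Returns the number recorded. */
size_t all_angle_matches(const char *subject, size_t *lengths, size_t max)
{
  size_t n = 0, i, len;

  assert(subject != NULL && lengths != NULL);
  len = strlen(subject);
  if (len == 0 || subject[0] != '<') return 0;

  /* One pass: every '>' seen so far ends a valid match. */
  for (i = 1; i < len && subject[i] != '\n'; i++) {
    if (subject[i] == '>') {
      if (n == max) break;
      lengths[n++] = i + 1;
    }
  }

  /* Report in decreasing order of length. */
  for (i = 0; i < n / 2; i++) {
    size_t t = lengths[i];
    lengths[i] = lengths[n - 1 - i];
    lengths[n - 1 - i] = t;
  }
  return n;
}
```

For the subject shown above this records three lengths. The standard algorithm, with the greedy quantifier as written, would report only the first of them, the match that covers the whole string.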
1366
1367
1368 REGULAR EXPRESSIONS AS TREES
1369
1370 The set of strings that are matched by a regular expression can be rep-
1371 resented as a tree structure. An unlimited repetition in the pattern
1372 makes the tree of infinite size, but it is still a tree. Matching the
1373 pattern to a given subject string (from a given starting point) can be
1374 thought of as a search of the tree. There are two ways to search a
1375 tree: depth-first and breadth-first, and these correspond to the two
1376 matching algorithms provided by PCRE.
1377
1378
1379 THE STANDARD MATCHING ALGORITHM
1380
1381 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
1382 sions", the standard algorithm is an "NFA algorithm". It conducts a
1383 depth-first search of the pattern tree. That is, it proceeds along a
1384 single path through the tree, checking that the subject matches what is
1385 required. When there is a mismatch, the algorithm tries any alterna-
1386 tives at the current point, and if they all fail, it backs up to the
1387 previous branch point in the tree, and tries the next alternative
1388 branch at that level. This often involves backing up (moving to the
1389 left) in the subject string as well. The order in which repetition
1390 branches are tried is controlled by the greedy or ungreedy nature of
1391 the quantifier.
1392
1393 If a leaf node is reached, a matching string has been found, and at
1394 that point the algorithm stops. Thus, if there is more than one possi-
1395 ble match, this algorithm returns the first one that it finds. Whether
1396 this is the shortest, the longest, or some intermediate length depends
1397 on the way the greedy and ungreedy repetition quantifiers are specified
1398 in the pattern.
1399
1400 Because it ends up with a single path through the tree, it is rela-
1401 tively straightforward for this algorithm to keep track of the sub-
1402 strings that are matched by portions of the pattern in parentheses.
1403 This provides support for capturing parentheses and back references.
1404
1405
1406 THE ALTERNATIVE MATCHING ALGORITHM
1407
1408 This algorithm conducts a breadth-first search of the tree. Starting
1409 from the first matching point in the subject, it scans the subject
1410 string from left to right, once, character by character, and as it does
1411 this, it remembers all the paths through the tree that represent valid
1412 matches. In Friedl's terminology, this is a kind of "DFA algorithm",
1413 though it is not implemented as a traditional finite state machine (it
1414 keeps multiple states active simultaneously).
1415
1416 Although the general principle of this matching algorithm is that it
1417 scans the subject string only once, without backtracking, there is one
1418 exception: when a lookaround assertion is encountered, the characters
1419 following or preceding the current point have to be independently
1420 inspected.
1421
1422 The scan continues until either the end of the subject is reached, or
1423 there are no more unterminated paths. At this point, terminated paths
1424 represent the different matching possibilities (if there are none, the
1425 match has failed). Thus, if there is more than one possible match,
1426 this algorithm finds all of them, and in particular, it finds the long-
1427 est. The matches are returned in decreasing order of length. There is
1428 an option to stop the algorithm after the first match (which is neces-
1429 sarily the shortest) is found.
1430
1431 Note that all the matches that are found start at the same point in the
1432 subject. If the pattern
1433
1434 cat(er(pillar)?)?
1435
1436 is matched against the string "the caterpillar catchment", the result
1437 will be the three strings "caterpillar", "cater", and "cat" that start
1438 at the fifth character of the subject. The algorithm does not automati-
1439 cally move on to find matches that start at later positions.
1440
1441 There are a number of features of PCRE regular expressions that are not
1442 supported by the alternative matching algorithm. They are as follows:
1443
1444 1. Because the algorithm finds all possible matches, the greedy or
1445 ungreedy nature of repetition quantifiers is not relevant. Greedy and
1446 ungreedy quantifiers are treated in exactly the same way. However, pos-
1447 sessive quantifiers can make a difference when what follows could also
1448 match what is quantified, for example in a pattern like this:
1449
1450 ^a++\w!
1451
1452 This pattern matches "aaab!" but not "aaa!", which would be matched by
1453 a non-possessive quantifier. Similarly, if an atomic group is present,
1454 it is matched as if it were a standalone pattern at the current point,
1455 and the longest match is then "locked in" for the rest of the overall
1456 pattern.
1457
1458 2. When dealing with multiple paths through the tree simultaneously, it
1459 is not straightforward to keep track of captured substrings for the
1460 different matching possibilities, and PCRE's implementation of this
1461 algorithm does not attempt to do this. This means that no captured sub-
1462 strings are available.
1463
1464 3. Because no substrings are captured, back references within the pat-
1465 tern are not supported, and cause errors if encountered.
1466
1467 4. For the same reason, conditional expressions that use a backrefer-
1468 ence as the condition or test for a specific group recursion are not
1469 supported.
1470
1471 5. Because many paths through the tree may be active, the \K escape
1472 sequence, which resets the start of the match when encountered (but may
1473 be on some paths and not on others), is not supported. It causes an
1474 error if encountered.
1475
1476 6. Callouts are supported, but the value of the capture_top field is
1477 always 1, and the value of the capture_last field is always -1.
1478
1479 7. The \C escape sequence, which (in the standard algorithm) always
1480 matches a single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is
1481 not supported in these modes, because the alternative algorithm moves
1482 through the subject string one character (not data unit) at a time, for
1483 all active paths through the tree.
1484
1485 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
1486 are not supported. (*FAIL) is supported, and behaves like a failing
1487 negative assertion.
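The effect of the possessive quantifier in item 1 can be seen by hand-coding the two variants of that pattern. This sketch does not use the PCRE library. The function names are hypothetical, and \w is taken to mean an ASCII letter, digit or underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII approximation of \w. */
static int is_word(char c)
{
  return isalnum((unsigned char)c) || c == '_';
}

/* ^a++\w! : the run of 'a' characters is consumed once and never
   given back. */
int match_possessive(const char *s)
{
  size_t i = 0;

  assert(s != NULL);
  while (s[i] == 'a') i++;
  if (i == 0) return 0;
  return is_word(s[i]) && s[i + 1] == '!';
}

/* ^a+\w! : the run may give back characters, one at a time, so that
   \w can match the last 'a'. */
int match_backtracking(const char *s)
{
  size_t n = 0, k;

  assert(s != NULL);
  while (s[n] == 'a') n++;
  for (k = n; k >= 1; k--)
    if (is_word(s[k]) && s[k + 1] == '!') return 1;
  return 0;
}
```

Both functions accept "aaab!", but only match_backtracking() accepts "aaa!", because it lets the final "a" be matched by \w.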
1488
1489
1490 ADVANTAGES OF THE ALTERNATIVE ALGORITHM
1491
1492 Using the alternative matching algorithm provides the following advan-
1493 tages:
1494
1495 1. All possible matches (at a single point in the subject) are automat-
1496 ically found, and in particular, the longest match is found. To find
1497 more than one match using the standard algorithm, you have to do kludgy
1498 things with callouts.
1499
1500 2. Because the alternative algorithm scans the subject string just
1501 once, and never needs to backtrack (except for lookbehinds), it is pos-
1502 sible to pass very long subject strings to the matching function in
1503 several pieces, checking for partial matching each time. Although it is
1504 possible to do multi-segment matching using the standard algorithm by
1505 retaining partially matched substrings, it is more complicated. The
1506 pcrepartial documentation gives details of partial matching and dis-
1507 cusses multi-segment matching.
1508
1509
1510 DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
1511
1512 The alternative algorithm suffers from a number of disadvantages:
1513
1514 1. It is substantially slower than the standard algorithm. This is
1515 partly because it has to search for all possible matches, but is also
1516 because it is less susceptible to optimization.
1517
1518 2. Capturing parentheses and back references are not supported.
1519
1520 3. Although atomic groups are supported, their use does not provide the
1521 performance advantage that it does for the standard algorithm.
1522
1523
1524 AUTHOR
1525
1526 Philip Hazel
1527 University Computing Service
1528 Cambridge CB2 3QH, England.
1529
1530
1531 REVISION
1532
1533 Last updated: 08 January 2012
1534 Copyright (c) 1997-2012 University of Cambridge.
1535 ------------------------------------------------------------------------------
1536
1537
1538 PCREAPI(3) Library Functions Manual PCREAPI(3)
1539
1540
1541
1542 NAME
1543 PCRE - Perl-compatible regular expressions
1544
1545 #include <pcre.h>
1546
1547
1548 PCRE NATIVE API BASIC FUNCTIONS
1549
1550 pcre *pcre_compile(const char *pattern, int options,
1551 const char **errptr, int *erroffset,
1552 const unsigned char *tableptr);
1553
1554 pcre *pcre_compile2(const char *pattern, int options,
1555 int *errorcodeptr,
1556 const char **errptr, int *erroffset,
1557 const unsigned char *tableptr);
1558
1559 pcre_extra *pcre_study(const pcre *code, int options,
1560 const char **errptr);
1561
1562 void pcre_free_study(pcre_extra *extra);
1563
1564 int pcre_exec(const pcre *code, const pcre_extra *extra,
1565 const char *subject, int length, int startoffset,
1566 int options, int *ovector, int ovecsize);
1567
1568 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1569 const char *subject, int length, int startoffset,
1570 int options, int *ovector, int ovecsize,
1571 int *workspace, int wscount);
1572
1573
1574 PCRE NATIVE API STRING EXTRACTION FUNCTIONS
1575
1576 int pcre_copy_named_substring(const pcre *code,
1577 const char *subject, int *ovector,
1578 int stringcount, const char *stringname,
1579 char *buffer, int buffersize);
1580
1581 int pcre_copy_substring(const char *subject, int *ovector,
1582 int stringcount, int stringnumber, char *buffer,
1583 int buffersize);
1584
1585 int pcre_get_named_substring(const pcre *code,
1586 const char *subject, int *ovector,
1587 int stringcount, const char *stringname,
1588 const char **stringptr);
1589
1590 int pcre_get_stringnumber(const pcre *code,
1591 const char *name);
1592
1593 int pcre_get_stringtable_entries(const pcre *code,
1594 const char *name, char **first, char **last);
1595
1596 int pcre_get_substring(const char *subject, int *ovector,
1597 int stringcount, int stringnumber,
1598 const char **stringptr);
1599
1600 int pcre_get_substring_list(const char *subject,
1601 int *ovector, int stringcount, const char ***listptr);
1602
1603 void pcre_free_substring(const char *stringptr);
1604
1605 void pcre_free_substring_list(const char **stringptr);
1606
1607
1608 PCRE NATIVE API AUXILIARY FUNCTIONS
1609
1610 int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
1611 const char *subject, int length, int startoffset,
1612 int options, int *ovector, int ovecsize,
1613 pcre_jit_stack *jstack);
1614
1615 pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
1616
1617 void pcre_jit_stack_free(pcre_jit_stack *stack);
1618
1619 void pcre_assign_jit_stack(pcre_extra *extra,
1620 pcre_jit_callback callback, void *data);
1621
1622 const unsigned char *pcre_maketables(void);
1623
1624 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1625 int what, void *where);
1626
1627 int pcre_refcount(pcre *code, int adjust);
1628
1629 int pcre_config(int what, void *where);
1630
1631 const char *pcre_version(void);
1632
1633 int pcre_pattern_to_host_byte_order(pcre *code,
1634 pcre_extra *extra, const unsigned char *tables);
1635
1636
1637 PCRE NATIVE API INDIRECTED FUNCTIONS
1638
1639 void *(*pcre_malloc)(size_t);
1640
1641 void (*pcre_free)(void *);
1642
1643 void *(*pcre_stack_malloc)(size_t);
1644
1645 void (*pcre_stack_free)(void *);
1646
1647 int (*pcre_callout)(pcre_callout_block *);
1648
1649
1650 PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
1651
1652 As well as support for 8-bit character strings, PCRE also supports
1653 16-bit strings (from release 8.30) and 32-bit strings (from release
1654 8.32), by means of two additional libraries. They can be built as well
1655 as, or instead of, the 8-bit library. To avoid too much complication,
1656 this document describes the 8-bit versions of the functions, with only
1657 occasional references to the 16-bit and 32-bit libraries.
1658
1659 The 16-bit and 32-bit functions operate in the same way as their 8-bit
1660 counterparts; they just use different data types for their arguments
1661 and results, and their names start with pcre16_ or pcre32_ instead of
1662 pcre_. For every option that has UTF8 in its name (for example,
1663 PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8
1664 replaced by UTF16 or UTF32, respectively. This facility is in fact just
1665 cosmetic; the 16-bit and 32-bit option names define the same bit val-
1666 ues.
1667
1668 References to bytes and UTF-8 in this document should be read as refer-
1669 ences to 16-bit data quantities and UTF-16 when using the 16-bit
1670 library, or 32-bit data quantities and UTF-32 when using the 32-bit
1671 library, unless specified otherwise. More details of the specific dif-
1672 ferences for the 16-bit and 32-bit libraries are given in the pcre16
1673 and pcre32 pages.
1674
1675
1676 PCRE API OVERVIEW
1677
1678 PCRE has its own native API, which is described in this document. There
1679 are also some wrapper functions (for the 8-bit library only) that cor-
1680 respond to the POSIX regular expression API, but they do not give
1681 access to all the functionality. They are described in the pcreposix
1682 documentation. Both of these APIs define a set of C function calls. A
1683 C++ wrapper (again for the 8-bit library only) is also distributed with
1684 PCRE. It is documented in the pcrecpp page.
1685
1686 The native API C function prototypes are defined in the header file
1687 pcre.h, and on Unix-like systems the (8-bit) library itself is called
1688 libpcre. It can normally be accessed by adding -lpcre to the command
1689 for linking an application that uses PCRE. The header file defines the
1690 macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
1691 numbers for the library. Applications can use these to include support
1692 for different releases of PCRE.
1693
1694 In a Windows environment, if you want to statically link an application
1695 program against a non-dll pcre.a file, you must define PCRE_STATIC
1696 before including pcre.h or pcrecpp.h, because otherwise the pcre_mal-
1697 loc() and pcre_free() exported functions will be declared
1698 __declspec(dllimport), with unwanted results.
1699
1700 The functions pcre_compile(), pcre_compile2(), pcre_study(), and
1701 pcre_exec() are used for compiling and matching regular expressions in
1702 a Perl-compatible manner. A sample program that demonstrates the sim-
1703 plest way of using them is provided in the file called pcredemo.c in
1704 the PCRE source distribution. A listing of this program is given in the
1705 pcredemo documentation, and the pcresample documentation describes how
1706 to compile and run it.
1707
1708 Just-in-time compiler support is an optional feature of PCRE that can
1709 be built in appropriate hardware environments. It greatly speeds up the
1710 matching performance of many patterns. Simple programs can easily
1711 request that it be used if available, by setting an option that is
1712 ignored when it is not relevant. More complicated programs might need
1713 to make use of the functions pcre_jit_stack_alloc(),
1714 pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control
1715 the JIT code's memory usage.
1716
1717 From release 8.32 there is also a direct interface for JIT execution,
1718 which gives improved performance. The JIT-specific functions are dis-
1719 cussed in the pcrejit documentation.
1720
1721 A second matching function, pcre_dfa_exec(), which is not Perl-compati-
1722 ble, is also provided. This uses a different algorithm for the match-
1723 ing. The alternative algorithm finds all possible matches (at a given
1724 point in the subject), and scans the subject just once (unless there
1725 are lookbehind assertions). However, this algorithm does not return
1726 captured substrings. A description of the two matching algorithms and
1727 their advantages and disadvantages is given in the pcrematching docu-
1728 mentation.
1729
1730 In addition to the main compiling and matching functions, there are
1731 convenience functions for extracting captured substrings from a subject
1732 string that is matched by pcre_exec(). They are:
1733
1734 pcre_copy_substring()
1735 pcre_copy_named_substring()
1736 pcre_get_substring()
1737 pcre_get_named_substring()
1738 pcre_get_substring_list()
1739 pcre_get_stringnumber()
1740 pcre_get_stringtable_entries()
1741
1742 pcre_free_substring() and pcre_free_substring_list() are also provided,
1743 to free the memory used for extracted strings.
1744
1745 The function pcre_maketables() is used to build a set of character
1746 tables in the current locale for passing to pcre_compile(),
1747 pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
1748 provided for specialist use. Most commonly, no special tables are
1749 passed, in which case internal tables that are generated when PCRE is
1750 built are used.
1751
1752 The function pcre_fullinfo() is used to find out information about a
1753 compiled pattern. The function pcre_version() returns a pointer to a
1754 string containing the version of PCRE and its date of release.
1755
1756 The function pcre_refcount() maintains a reference count in a data
1757 block containing a compiled pattern. This is provided for the benefit
1758 of object-oriented applications.
1759
1760 The global variables pcre_malloc and pcre_free initially contain the
1761 entry points of the standard malloc() and free() functions, respec-
1762 tively. PCRE calls the memory management functions via these variables,
1763 so a calling program can replace them if it wishes to intercept the
1764 calls. This should be done before calling any PCRE functions.
1765
1766 The global variables pcre_stack_malloc and pcre_stack_free are also
1767 indirections to memory management functions. These special functions
1768 are used only when PCRE is compiled to use the heap for remembering
1769 data, instead of recursive function calls, when running the pcre_exec()
1770 function. See the pcrebuild documentation for details of how to do
1771 this. It is a non-standard way of building PCRE, for use in environ-
1772 ments that have limited stacks. Because of the greater use of memory
1773 management, it runs more slowly. Separate functions are provided so
1774 that special-purpose external code can be used for this case. When
1775 used, these functions are always called in a stack-like manner (last
1776 obtained, first freed), and always for memory blocks of the same size.
1777 There is a discussion about PCRE's stack usage in the pcrestack docu-
1778 mentation.
1779
1780 The global variable pcre_callout initially contains NULL. It can be set
1781 by the caller to a "callout" function, which PCRE will then call at
1782 specified points during a matching operation. Details are given in the
1783 pcrecallout documentation.
1784
1785
1786 NEWLINES
1787
1788 PCRE supports five different conventions for indicating line breaks in
1789 strings: a single CR (carriage return) character, a single LF (line-
1790 feed) character, the two-character sequence CRLF, any of the three pre-
1791 ceding, or any Unicode newline sequence. The Unicode newline sequences
1792 are the three just mentioned, plus the single characters VT (vertical
1793 tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
1794 separator, U+2028), and PS (paragraph separator, U+2029).
1795
1796 Each of the first three conventions is used by at least one operating
1797 system as its standard newline sequence. When PCRE is built, a default
1798 can be specified. If none is specified, LF is used, which is the Unix
1799 standard. When PCRE is run, the default can be overridden, either when a
1800 pattern is compiled, or when it is matched.
1801
1802 At compile time, the newline convention can be specified by the options
1803 argument of pcre_compile(), or it can be specified by special text at
1804 the start of the pattern itself; this overrides any other settings. See
1805 the pcrepattern page for details of the special character sequences.
1806
1807 In the PCRE documentation the word "newline" is used to mean "the char-
1808 acter or pair of characters that indicate a line break". The choice of
1809 newline convention affects the handling of the dot, circumflex, and
1810 dollar metacharacters, the handling of #-comments in /x mode, and, when
1811 CRLF is a recognized line ending sequence, the match position advance-
1812 ment for a non-anchored pattern. There is more detail about this in the
1813 section on pcre_exec() options below.
1814
1815 The choice of newline convention does not affect the interpretation of
1816 the \n or \r escape sequences, nor does it affect what \R matches,
1817 which is controlled in a similar way, but by separate options.
1818
1819
1820 MULTITHREADING
1821
1822 The PCRE functions can be used in multi-threading applications, with
1823 the proviso that the memory management functions pointed to by
1824 pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1825 callout function pointed to by pcre_callout, are shared by all threads.
1826
1827 The compiled form of a regular expression is not altered during match-
1828 ing, so the same compiled pattern can safely be used by several threads
1829 at once.
1830
1831 If the just-in-time optimization feature is being used, it needs sepa-
1832 rate memory stack areas for each thread. See the pcrejit documentation
1833 for more details.
1834
1835
1836 SAVING PRECOMPILED PATTERNS FOR LATER USE
1837
1838 The compiled form of a regular expression can be saved and re-used at a
1839 later time, possibly by a different program, and even on a host other
1840 than the one on which it was compiled. Details are given in the
1841 pcreprecompile documentation, which includes a description of the
1842 pcre_pattern_to_host_byte_order() function. However, compiling a regu-
1843 lar expression with one version of PCRE for use with a different ver-
1844 sion is not guaranteed to work and may cause crashes.
1845
1846
1847 CHECKING BUILD-TIME OPTIONS
1848
1849 int pcre_config(int what, void *where);
1850
1851 The function pcre_config() makes it possible for a PCRE client to dis-
1852 cover which optional features have been compiled into the PCRE library.
1853 The pcrebuild documentation has more details about these optional fea-
1854 tures.
1855
1856 The first argument for pcre_config() is an integer, specifying which
1857 information is required; the second argument is a pointer to a variable
1858 into which the information is placed. The returned value is zero on
1859 success, or the negative error code PCRE_ERROR_BADOPTION if the value
1860 in the first argument is not recognized. The following information is
1861 available:
1862
1863 PCRE_CONFIG_UTF8
1864
1865 The output is an integer that is set to one if UTF-8 support is avail-
1866 able; otherwise it is set to zero. This value should normally be given
1867 to the 8-bit version of this function, pcre_config(). If it is given to
1868 the 16-bit or 32-bit version of this function, the result is
1869 PCRE_ERROR_BADOPTION.
1870
1871 PCRE_CONFIG_UTF16
1872
1873 The output is an integer that is set to one if UTF-16 support is avail-
1874 able; otherwise it is set to zero. This value should normally be given
1875 to the 16-bit version of this function, pcre16_config(). If it is given
1876 to the 8-bit or 32-bit version of this function, the result is
1877 PCRE_ERROR_BADOPTION.
1878
1879 PCRE_CONFIG_UTF32
1880
1881 The output is an integer that is set to one if UTF-32 support is avail-
1882 able; otherwise it is set to zero. This value should normally be given
1883 to the 32-bit version of this function, pcre32_config(). If it is given
1884 to the 8-bit or 16-bit version of this function, the result is
1885 PCRE_ERROR_BADOPTION.
1886
1887 PCRE_CONFIG_UNICODE_PROPERTIES
1888
1889 The output is an integer that is set to one if support for Unicode
1890 character properties is available; otherwise it is set to zero.
1891
1892 PCRE_CONFIG_JIT
1893
1894 The output is an integer that is set to one if support for just-in-time
1895 compiling is available; otherwise it is set to zero.
1896
1897 PCRE_CONFIG_JITTARGET
1898
1899 The output is a pointer to a zero-terminated "const char *" string. If
1900 JIT support is available, the string contains the name of the architec-
1901 ture for which the JIT compiler is configured, for example "x86 32bit
1902 (little endian + unaligned)". If JIT support is not available, the
1903 result is NULL.
1904
1905 PCRE_CONFIG_NEWLINE
1906
1907 The output is an integer whose value specifies the default character
1908 sequence that is recognized as meaning "newline". The values that are
1909 supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
1910 for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
1911 ANYCRLF, and ANY yield the same values. However, the value for LF is
1912 normally 21, though some EBCDIC environments use 37. The corresponding
1913 values for CRLF are 3349 and 3365. The default should normally corre-
1914 spond to the standard sequence for your operating system.
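
The two-character values listed above are consistent with packing the
first character code times 256 plus the second: 13 * 256 + 10 = 3338,
13 * 256 + 21 = 3349, and 13 * 256 + 37 = 3365. The sketch below decodes
a value on that basis. The packing rule is inferred from the listed
numbers, and the function name is illustrative.

```c
/* Splits a PCRE_CONFIG_NEWLINE value into its character codes.
   Returns 2 for a two-character sequence, 1 for a single character,
   and 0 for the negative values (-1 for ANY, -2 for ANYCRLF), which
   name sets of sequences rather than one sequence. */
int split_newline_code(int value, int *first, int *second)
{
    if (value < 0)
        return 0;
    if (value > 255) {
        *first = value / 256;    /* 3338 / 256 gives 13 (CR) */
        *second = value % 256;   /* 3338 % 256 gives 10 (LF) */
        return 2;
    }
    *first = value;
    return 1;
}
```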
1915
1916 PCRE_CONFIG_BSR
1917
1918 The output is an integer whose value indicates what character sequences
1919 the \R escape sequence matches by default. A value of 0 means that \R
1920 matches any Unicode line ending sequence; a value of 1 means that \R
1921 matches only CR, LF, or CRLF. The default can be overridden when a pat-
1922 tern is compiled or matched.
1923
1924 PCRE_CONFIG_LINK_SIZE
1925
1926 The output is an integer that contains the number of bytes used for
1927 internal linkage in compiled regular expressions. For the 8-bit
1928 library, the value can be 2, 3, or 4. For the 16-bit and 32-bit
1929 libraries, the value is either 2 or 4 and is still a number of bytes,
1930 not a number of 16-bit or 32-bit code units. The
1931 default value of 2 is sufficient for all but the most massive patterns,
1932 since it allows the compiled pattern to be up to 64K in size. Larger
1933 values allow larger regular expressions to be compiled, at the expense
1934 of slower matching.
1935
1936 PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1937
1938 The output is an integer that contains the threshold above which the
1939 POSIX interface uses malloc() for output vectors. Further details are
1940 given in the pcreposix documentation.
1941
1942 PCRE_CONFIG_MATCH_LIMIT
1943
1944 The output is a long integer that gives the default limit for the num-
1945 ber of internal matching function calls in a pcre_exec() execution.
1946 Further details are given with pcre_exec() below.
1947
1948 PCRE_CONFIG_MATCH_LIMIT_RECURSION
1949
1950 The output is a long integer that gives the default limit for the depth
1951 of recursion when calling the internal matching function in a
1952 pcre_exec() execution. Further details are given with pcre_exec()
1953 below.
1954
1955 PCRE_CONFIG_STACKRECURSE
1956
1957 The output is an integer that is set to one if internal recursion when
1958 running pcre_exec() is implemented by recursive function calls that use
1959 the stack to remember their state. This is the usual way that PCRE is
1960 compiled. The output is zero if PCRE was compiled to use blocks of data
1961 on the heap instead of recursive function calls. In this case,
1962 pcre_stack_malloc and pcre_stack_free are called to manage memory
1963 blocks on the heap, thus avoiding the use of the stack.
1964
1965
1966 COMPILING A PATTERN
1967
1968 pcre *pcre_compile(const char *pattern, int options,
1969 const char **errptr, int *erroffset,
1970 const unsigned char *tableptr);
1971
1972 pcre *pcre_compile2(const char *pattern, int options,
1973 int *errorcodeptr,
1974 const char **errptr, int *erroffset,
1975 const unsigned char *tableptr);
1976
1977 Either of the functions pcre_compile() or pcre_compile2() can be called
1978 to compile a pattern into an internal form. The only difference between
1979 the two interfaces is that pcre_compile2() has an additional argument,
1980 errorcodeptr, via which a numerical error code can be returned. To
1981 avoid too much repetition, we refer just to pcre_compile() below, but
1982 the information applies equally to pcre_compile2().
1983
1984 The pattern is a C string terminated by a binary zero, and is passed in
1985 the pattern argument. A pointer to a single block of memory that is
1986 obtained via pcre_malloc is returned. This contains the compiled code
1987 and related data. The pcre type is defined for the returned block; this
1988 is a typedef for a structure whose contents are not externally defined.
1989 It is up to the caller to free the memory (via pcre_free) when it is no
1990 longer required.
1991
1992 Although the compiled code of a PCRE regex is relocatable, that is, it
1993 does not depend on memory location, the complete pcre data block is not
1994 fully relocatable, because it may contain a copy of the tableptr argu-
1995 ment, which is an address (see below).
1996
1997 The options argument contains various bit settings that affect the com-
1998 pilation. It should be zero if no options are required. The available
1999 options are described below. Some of them (in particular, those that
2000 are compatible with Perl, but some others as well) can also be set and
2001 unset from within the pattern (see the detailed description in the
2002 pcrepattern documentation). For those options that can be different in
2003 different parts of the pattern, the contents of the options argument
2004 specifies their settings at the start of compilation and execution. The
2005 PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
2006 PCRE_NO_START_OPTIMIZE options can be set at the time of matching as
2007 well as at compile time.
2008
2009 If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
2010 if compilation of a pattern fails, pcre_compile() returns NULL, and
2011 sets the variable pointed to by errptr to point to a textual error mes-
2012 sage. This is a static string that is part of the library. You must not
2013 try to free it. Normally, the offset from the start of the pattern to
2014 the byte that was being processed when the error was discovered is
2015 placed in the variable pointed to by erroffset, which must not be NULL
2016 (if it is, an immediate error is given). However, for an invalid UTF-8
2017 string, the offset is that of the first byte of the failing character.
2018
2019 Some errors are not detected until the whole pattern has been scanned;
2020 in these cases, the offset passed back is the length of the pattern.
2021 Note that the offset is in bytes, not characters, even in UTF-8 mode.
2022 It may sometimes point into the middle of a UTF-8 character.
2023
2024 If pcre_compile2() is used instead of pcre_compile(), and the error-
2025 codeptr argument is not NULL, a non-zero error code number is returned
2026 via this argument in the event of an error. This is in addition to the
2027 textual error message. Error codes and messages are listed below.
2028
2029 If the final argument, tableptr, is NULL, PCRE uses a default set of
2030 character tables that are built when PCRE is compiled, using the
2031 default C locale. Otherwise, tableptr must be an address that is the
2032 result of a call to pcre_maketables(). This value is stored with the
2033 compiled pattern, and used again by pcre_exec(), unless another table
2034 pointer is passed to it. For more discussion, see the section on locale
2035 support below.
2036
2037 This code fragment shows a typical straightforward call to pcre_com-
2038 pile():
2039
2040 pcre *re;
2041 const char *error;
2042 int erroffset;
2043 re = pcre_compile(
2044 "^A.*Z", /* the pattern */
2045 0, /* default options */
2046 &error, /* for error message */
2047 &erroffset, /* for error offset */
2048 NULL); /* use default character tables */
2049
2050 The following names for option bits are defined in the pcre.h header
2051 file:
2052
2053 PCRE_ANCHORED
2054
2055 If this bit is set, the pattern is forced to be "anchored", that is, it
2056 is constrained to match only at the first matching point in the string
2057 that is being searched (the "subject string"). This effect can also be
2058 achieved by appropriate constructs in the pattern itself, which is the
2059 only way to do it in Perl.
2060
2061 PCRE_AUTO_CALLOUT
2062
2063 If this bit is set, pcre_compile() automatically inserts callout items,
2064 all with number 255, before each pattern item. For discussion of the
2065 callout facility, see the pcrecallout documentation.
2066
2067 PCRE_BSR_ANYCRLF
2068 PCRE_BSR_UNICODE
2069
2070 These options (which are mutually exclusive) control what the \R escape
2071 sequence matches. The choice is either to match only CR, LF, or CRLF,
2072 or to match any Unicode newline sequence. The default is specified when
2073 PCRE is built. It can be overridden from within the pattern, or by set-
2074 ting an option when a compiled pattern is matched.
2075
2076 PCRE_CASELESS
2077
2078 If this bit is set, letters in the pattern match both upper and lower
2079 case letters. It is equivalent to Perl's /i option, and it can be
2080 changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
2081 always understands the concept of case for characters whose values are
2082 less than 128, so caseless matching is always possible. For characters
2083 with higher values, the concept of case is supported if PCRE is com-
2084 piled with Unicode property support, but not otherwise. If you want to
2085 use caseless matching for characters 128 and above, you must ensure
2086 that PCRE is compiled with Unicode property support as well as with
2087 UTF-8 support.
2088
2089 PCRE_DOLLAR_ENDONLY
2090
2091 If this bit is set, a dollar metacharacter in the pattern matches only
2092 at the end of the subject string. Without this option, a dollar also
2093 matches immediately before a newline at the end of the string (but not
2094 before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
2095 if PCRE_MULTILINE is set. There is no equivalent to this option in
2096 Perl, and no way to set it within a pattern.
2097
2098 PCRE_DOTALL
2099
2100 If this bit is set, a dot metacharacter in the pattern matches a char-
2101 acter of any value, including one that indicates a newline. However, it
2102 only ever matches one character, even if newlines are coded as CRLF.
2103 Without this option, a dot does not match when the current position is
2104 at a newline. This option is equivalent to Perl's /s option, and it can
2105 be changed within a pattern by a (?s) option setting. A negative class
2106 such as [^a] always matches newline characters, independent of the set-
2107 ting of this option.
2108
2109 PCRE_DUPNAMES
2110
2111 If this bit is set, names used to identify capturing subpatterns need
2112 not be unique. This can be helpful for certain types of pattern when it
2113 is known that only one instance of the named subpattern can ever be
2114 matched. There are more details of named subpatterns below; see also
2115 the pcrepattern documentation.
2116
2117 PCRE_EXTENDED
2118
2119 If this bit is set, white space data characters in the pattern are
2120 totally ignored except when escaped or inside a character class. White
2121 space does not include the VT character (code 11). In addition, charac-
2122 ters between an unescaped # outside a character class and the next new-
2123 line, inclusive, are also ignored. This is equivalent to Perl's /x
2124 option, and it can be changed within a pattern by a (?x) option set-
2125 ting.
2126
2127 Which characters are interpreted as newlines is controlled by the
2128 options passed to pcre_compile() or by a special sequence at the start
2129 of the pattern, as described in the section entitled "Newline conven-
2130 tions" in the pcrepattern documentation. Note that the end of this type
2131 of comment is a literal newline sequence in the pattern; escape
2132 sequences that happen to represent a newline do not count.
2133
2134 This option makes it possible to include comments inside complicated
2135 patterns. Note, however, that this applies only to data characters.
2136 White space characters may never appear within special character
2137 sequences in a pattern, for example within the sequence (?( that intro-
2138 duces a conditional subpattern.
2139
2140 PCRE_EXTRA
2141
2142 This option was invented in order to turn on additional functionality
2143 of PCRE that is incompatible with Perl, but it is currently of very
2144 little use. When set, any backslash in a pattern that is followed by a
2145 letter that has no special meaning causes an error, thus reserving
2146 these combinations for future expansion. By default, as in Perl, a
2147 backslash followed by a letter with no special meaning is treated as a
2148 literal. (Perl can, however, be persuaded to give an error for this, by
2149 running it with the -w option.) There are at present no other features
2150 controlled by this option. It can also be set by a (?X) option setting
2151 within a pattern.
2152
2153 PCRE_FIRSTLINE
2154
2155 If this option is set, an unanchored pattern is required to match
2156 before or at the first newline in the subject string, though the
2157 matched text may continue over the newline.
2158
2159 PCRE_JAVASCRIPT_COMPAT
2160
2161 If this option is set, PCRE's behaviour is changed in some ways so that
2162 it is compatible with JavaScript rather than Perl. The changes are as
2163 follows:
2164
2165 (1) A lone closing square bracket in a pattern causes a compile-time
2166 error, because this is illegal in JavaScript (by default it is treated
2167 as a data character). Thus, the pattern AB]CD becomes illegal when this
2168 option is set.
2169
2170 (2) At run time, a back reference to an unset subpattern group matches
2171 an empty string (by default this causes the current matching alterna-
2172 tive to fail). A pattern such as (\1)(a) succeeds when this option is
2173 set (assuming it can find an "a" in the subject), whereas it fails by
2174 default, for Perl compatibility.
2175
2176 (3) \U matches an upper case "U" character; by default \U causes a com-
2177 pile time error (Perl uses \U to upper case subsequent characters).
2178
2179 (4) \u matches a lower case "u" character unless it is followed by four
2180 hexadecimal digits, in which case the hexadecimal number defines the
2181 code point to match. By default, \u causes a compile time error (Perl
2182 uses it to upper case the following character).
2183
2184 (5) \x matches a lower case "x" character unless it is followed by two
2185 hexadecimal digits, in which case the hexadecimal number defines the
2186 code point to match. By default, as in Perl, a hexadecimal number is
2187 always expected after \x, but it may have zero, one, or two digits (so,
2188 for example, \xz matches a binary zero character followed by z).
2189
2190 PCRE_MULTILINE
2191
2192 By default, PCRE treats the subject string as consisting of a single
2193 line of characters (even if it actually contains newlines). The "start
2194 of line" metacharacter (^) matches only at the start of the string,
2195 while the "end of line" metacharacter ($) matches only at the end of
2196 the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
2197 is set). This is the same as Perl.
2198
2199 When PCRE_MULTILINE is set, the "start of line" and "end of line"
2200 constructs match immediately following or immediately before internal
2201 newlines in the subject string, respectively, as well as at the very
2202 start and end. This is equivalent to Perl's /m option, and it can be
2203 changed within a pattern by a (?m) option setting. If there are no new-
2204 lines in a subject string, or no occurrences of ^ or $ in a pattern,
2205 setting PCRE_MULTILINE has no effect.
2206
2207 PCRE_NEWLINE_CR
2208 PCRE_NEWLINE_LF
2209 PCRE_NEWLINE_CRLF
2210 PCRE_NEWLINE_ANYCRLF
2211 PCRE_NEWLINE_ANY
2212
2213 These options override the default newline definition that was chosen
2214 when PCRE was built. Setting the first or the second specifies that a
2215 newline is indicated by a single character (CR or LF, respectively).
2216 Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
2217 two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
2218 that any of the three preceding sequences should be recognized. Setting
2219 PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
2220 recognized.
2221
2222 In an ASCII/Unicode environment, the Unicode newline sequences are the
2223 three just mentioned, plus the single characters VT (vertical tab,
2224 U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
2225 arator, U+2028), and PS (paragraph separator, U+2029). For the 8-bit
2226 library, the last two are recognized only in UTF-8 mode.
2227
2228 When PCRE is compiled to run in an EBCDIC (mainframe) environment, the
2229 code for CR is 0x0d, the same as ASCII. However, the character code for
2230 LF is normally 0x15, though in some EBCDIC environments 0x25 is used.
2231 Whichever of these is not LF is made to correspond to Unicode's NEL
2232 character. EBCDIC codes are all less than 256. For more details, see
2233 the pcrebuild documentation.
2234
2235 The newline setting in the options word uses three bits that are
2236 treated as a number, giving eight possibilities. Currently only six are
2237 used (default plus the five values above). This means that if you set
2238 more than one newline option, the combination may or may not be sensi-
2239 ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
2240 PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
2241 cause an error.
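
The effect of combining newline options can be seen by treating the
three bits as a number. The sketch below uses local stand-in constants
rather than pcre.h; their values are assumed to mirror the PCRE_NEWLINE_*
definitions in the 8.3x header and should be checked against your own
copy. The function names are invented for illustration.

```c
/* Local stand-ins for the PCRE_NEWLINE_* bits (assumed values). */
#define NL_CR          0x00100000
#define NL_LF          0x00200000
#define NL_CRLF        0x00300000
#define NL_ANY         0x00400000
#define NL_ANYCRLF     0x00500000
#define NL_FIELD_MASK  0x00700000UL   /* the three-bit field */
#define NL_FIELD_SHIFT 20

/* Extract the three-bit newline field as a number from 0 to 7. */
unsigned newline_field(unsigned long options)
{
  return (unsigned)((options & NL_FIELD_MASK) >> NL_FIELD_SHIFT);
}

/* Return 1 if the field holds one of the six numbers in use (0 is the
   build-time default, 1 to 5 are the named options), otherwise 0. */
int newline_field_is_used(unsigned long options)
{
  return newline_field(options) <= 5;
}
```

Under these assumed values, NL_CR | NL_LF gives the same number as
NL_CRLF, whereas NL_LF | NL_ANY gives 6, one of the two unused numbers.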
2242
2243 The only time that a line break in a pattern is specially recognized
2244 when compiling is when PCRE_EXTENDED is set. CR and LF are white space
2245 characters, and so are ignored in this mode. Also, an unescaped # out-
2246 side a character class indicates a comment that lasts until after the
2247 next line break sequence. In other circumstances, line break sequences
2248 in patterns are treated as literal data.
2249
2250 The newline option that is set at compile time becomes the default that
2251 is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
2252
2253 PCRE_NO_AUTO_CAPTURE
2254
2255 If this option is set, it disables the use of numbered capturing paren-
2256 theses in the pattern. Any opening parenthesis that is not followed by
2257 ? behaves as if it were followed by ?: but named parentheses can still
2258 be used for capturing (and they acquire numbers in the usual way).
2259 There is no equivalent of this option in Perl.
2260
PCRE_NO_START_OPTIMIZE
2262
2263 This is an option that acts at matching time; that is, it is really an
2264 option for pcre_exec() or pcre_dfa_exec(). If it is set at compile
2265 time, it is remembered with the compiled pattern and assumed at match-
2266 ing time. For details see the discussion of PCRE_NO_START_OPTIMIZE
2267 below.
2268
2269 PCRE_UCP
2270
2271 This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
2272 \w, and some of the POSIX character classes. By default, only ASCII
2273 characters are recognized, but if PCRE_UCP is set, Unicode properties
2274 are used instead to classify characters. More details are given in the
2275 section on generic character types in the pcrepattern page. If you set
2276 PCRE_UCP, matching one of the items it affects takes much longer. The
2277 option is available only if PCRE has been compiled with Unicode prop-
2278 erty support.
2279
2280 PCRE_UNGREEDY
2281
2282 This option inverts the "greediness" of the quantifiers so that they
2283 are not greedy by default, but become greedy if followed by "?". It is
2284 not compatible with Perl. It can also be set by a (?U) option setting
2285 within the pattern.
2286
2287 PCRE_UTF8
2288
2289 This option causes PCRE to regard both the pattern and the subject as
2290 strings of UTF-8 characters instead of single-byte strings. However, it
2291 is available only when PCRE is built to include UTF support. If not,
2292 the use of this option provokes an error. Details of how this option
2293 changes the behaviour of PCRE are given in the pcreunicode page.
2294
2295 PCRE_NO_UTF8_CHECK
2296
2297 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
2298 automatically checked. There is a discussion about the validity of
2299 UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence is
2300 found, pcre_compile() returns an error. If you already know that your
2301 pattern is valid, and you want to skip this check for performance rea-
2302 sons, you can set the PCRE_NO_UTF8_CHECK option. When it is set, the
2303 effect of passing an invalid UTF-8 string as a pattern is undefined. It
2304 may cause your program to crash. Note that this option can also be
2305 passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity
2306 checking of subject strings only. If the same string is being matched
2307 many times, the option can be safely set for the second and subsequent
2308 matchings to improve performance.
2309
2310
2311 COMPILATION ERROR CODES
2312
The following table lists the error codes that may be returned by
pcre_compile2(), along with the error messages that may be returned by
both compiling functions. Note that error messages are always 8-bit
ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed,
some error codes have fallen out of use. To avoid confusion, they have
not been re-used.
2319
   0  no error
   1  \ at end of pattern
   2  \c at end of pattern
   3  unrecognized character follows \
   4  numbers out of order in {} quantifier
   5  number too big in {} quantifier
   6  missing terminating ] for character class
   7  invalid escape sequence in character class
   8  range out of order in character class
   9  nothing to repeat
  10  [this code is not in use]
  11  internal error: unexpected repeat
  12  unrecognized character after (? or (?-
  13  POSIX named classes are supported only within a class
  14  missing )
  15  reference to non-existent subpattern
  16  erroffset passed as NULL
  17  unknown option bit(s) set
  18  missing ) after comment
  19  [this code is not in use]
  20  regular expression is too large
  21  failed to get memory
  22  unmatched parentheses
  23  internal error: code overflow
  24  unrecognized character after (?<
  25  lookbehind assertion is not fixed length
  26  malformed number or name after (?(
  27  conditional group contains more than two branches
  28  assertion expected after (?(
  29  (?R or (?[+-]digits must be followed by )
  30  unknown POSIX class name
  31  POSIX collating elements are not supported
  32  this version of PCRE is compiled without UTF support
  33  [this code is not in use]
  34  character value in \x{...} sequence is too large
  35  invalid condition (?(0)
  36  \C not allowed in lookbehind assertion
  37  PCRE does not support \L, \l, \N{name}, \U, or \u
  38  number after (?C is > 255
  39  closing ) for (?C expected
  40  recursive call could loop indefinitely
  41  unrecognized character after (?P
  42  syntax error in subpattern name (missing terminator)
  43  two named subpatterns have the same name
  44  invalid UTF-8 string (specifically UTF-8)
  45  support for \P, \p, and \X has not been compiled
  46  malformed \P or \p sequence
  47  unknown property name after \P or \p
  48  subpattern name is too long (maximum 32 characters)
  49  too many named subpatterns (maximum 10000)
  50  [this code is not in use]
  51  octal value is greater than \377 in 8-bit non-UTF-8 mode
  52  internal error: overran compiling workspace
  53  internal error: previously-checked referenced subpattern
        not found
  54  DEFINE group contains more than one branch
  55  repeating a DEFINE group is not allowed
  56  inconsistent NEWLINE options
  57  \g is not followed by a braced, angle-bracketed, or quoted
        name/number or by a plain number
  58  a numbered reference must not be zero
  59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
  60  (*VERB) not recognized
  61  number is too big
  62  subpattern name expected
  63  digit expected after (?+
  64  ] is an invalid data character in JavaScript compatibility mode
  65  different names for subpatterns of the same number are
        not allowed
  66  (*MARK) must have an argument
  67  this version of PCRE is not compiled with Unicode property
        support
  68  \c must be followed by an ASCII character
  69  \k is not followed by a braced, angle-bracketed, or quoted name
  70  internal error: unknown opcode in find_fixedlength()
  71  \N is not supported in a class
  72  too many forward references
  73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
  74  invalid UTF-16 string (specifically UTF-16)
  75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
  76  character value in \u.... sequence is too large
  77  invalid UTF-32 string (specifically UTF-32)
2402
2403 The numbers 32 and 10000 in errors 48 and 49 are defaults; different
2404 values may be used if the limits were changed when PCRE was built.
2405
2406
2407 STUDYING A PATTERN
2408
pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);
2411
2412 If a compiled pattern is going to be used several times, it is worth
2413 spending more time analyzing it in order to speed up the time taken for
2414 matching. The function pcre_study() takes a pointer to a compiled pat-
2415 tern as its first argument. If studying the pattern produces additional
2416 information that will help speed up matching, pcre_study() returns a
2417 pointer to a pcre_extra block, in which the study_data field points to
2418 the results of the study.
2419
2420 The returned value from pcre_study() can be passed directly to
2421 pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con-
2422 tains other fields that can be set by the caller before the block is
2423 passed; these are described below in the section on matching a pattern.
2424
2425 If studying the pattern does not produce any useful information,
2426 pcre_study() returns NULL by default. In that circumstance, if the
2427 calling program wants to pass any of the other fields to pcre_exec() or
2428 pcre_dfa_exec(), it must set up its own pcre_extra block. However, if
2429 pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it
2430 returns a pcre_extra block even if studying did not find any additional
2431 information. It may still return NULL, however, if an error occurs in
2432 pcre_study().
2433
2434 The second argument of pcre_study() contains option bits. There are
2435 three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
2436
2437 PCRE_STUDY_JIT_COMPILE
2438 PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
2439 PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
2440
2441 If any of these are set, and the just-in-time compiler is available,
2442 the pattern is further compiled into machine code that executes much
2443 faster than the pcre_exec() interpretive matching function. If the
2444 just-in-time compiler is not available, these options are ignored. All
2445 undefined bits in the options argument must be zero.
2446
2447 JIT compilation is a heavyweight optimization. It can take some time
2448 for patterns to be analyzed, and for one-off matches and simple pat-
2449 terns the benefit of faster execution might be offset by a much slower
2450 study time. Not all patterns can be optimized by the JIT compiler. For
2451 those that cannot be handled, matching automatically falls back to the
2452 pcre_exec() interpreter. For more details, see the pcrejit documenta-
2453 tion.
2454
2455 The third argument for pcre_study() is a pointer for an error message.
2456 If studying succeeds (even if no data is returned), the variable it
2457 points to is set to NULL. Otherwise it is set to point to a textual
2458 error message. This is a static string that is part of the library. You
2459 must not try to free it. You should test the error pointer for NULL
2460 after calling pcre_study(), to be sure that it has run successfully.
2461
2462 When you are finished with a pattern, you can free the memory used for
2463 the study data by calling pcre_free_study(). This function was added to
2464 the API for release 8.20. For earlier versions, the memory could be
2465 freed with pcre_free(), just like the pattern itself. This will still
2466 work in cases where JIT optimization is not used, but it is advisable
2467 to change to the new function when convenient.
2468
2469 This is a typical way in which pcre_study() is used (except that in a
2470 real application there should be tests for errors):
2471
  int rc;
  int erroroffset;
  int ovector[30];
  const char *error;
  pcre *re;
  pcre_extra *sd;
  re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
  sd = pcre_study(
    re,             /* result of pcre_compile() */
    0,              /* no options */
    &error);        /* set to NULL or points to a message */
  rc = pcre_exec(   /* see below for details of pcre_exec() options */
    re, sd, "subject", 7, 0, 0, ovector, 30);
  ...
  pcre_free_study(sd);
  pcre_free(re);
2485
2486 Studying a pattern does two things: first, a lower bound for the length
2487 of subject string that is needed to match the pattern is computed. This
2488 does not mean that there are any strings of that length that match, but
2489 it does guarantee that no shorter strings match. The value is used to
2490 avoid wasting time by trying to match strings that are shorter than the
2491 lower bound. You can find out the value in a calling program via the
2492 pcre_fullinfo() function.
2493
2494 Studying a pattern is also useful for non-anchored patterns that do not
2495 have a single fixed starting character. A bitmap of possible starting
2496 bytes is created. This speeds up finding a position in the subject at
2497 which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
2498 values less than 256. In 32-bit mode, the bitmap is used for 32-bit
2499 values less than 256.)
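
The way these two pieces of study data can cut down the search for a
starting position is sketched below. The helpers are hypothetical and do
not call PCRE. The 256-bit table is held in 32 bytes, and the bit order
within each byte is an assumption of this sketch rather than a documented
format. The minimum length is counted in bytes, which corresponds to
non-UTF 8-bit mode.

```c
#include <assert.h>
#include <stddef.h>

/* Mark byte value c as a possible first byte of a match. */
void allow_start_byte(unsigned char table[32], unsigned char c)
{
  table[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Return 1 if a match may start with byte value c, otherwise 0. */
int may_start_with(const unsigned char table[32], unsigned char c)
{
  return (table[c / 8] >> (c % 8)) & 1;
}

/* Return the first offset at which a match attempt is worthwhile, or -1
   if there is none. Offsets that leave fewer than min_length bytes are
   skipped, as are bytes that the table rules out. */
long first_candidate(const unsigned char table[32], size_t min_length,
                     const char *subject, size_t length)
{
  assert(table != NULL && subject != NULL);
  for (size_t i = 0; i < length && length - i >= min_length; i++) {
    if (may_start_with(table, (unsigned char)subject[i])) return (long)i;
  }
  return -1;
}
```

With a table that allows only "c" and "d" and a minimum length of 3, the
subject "a cat" gives offset 2, while "abc" gives -1 because too little
of the subject remains after the "c".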
2500
2501 These two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
2502 and the information is also used by the JIT compiler. The optimiza-
2503 tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option when
2504 calling pcre_exec() or pcre_dfa_exec(), but if this is done, JIT execu-
2505 tion is also disabled. You might want to do this if your pattern con-
2506 tains callouts or (*MARK) and you want to make use of these facilities
2507 in cases where matching fails. See the discussion of
2508 PCRE_NO_START_OPTIMIZE below.
2509
2510
2511 LOCALE SUPPORT
2512
PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables, indexed
by character value. When running in UTF-8 mode, this applies only to
characters with codes less than 128. By default, higher-valued codes
never match escapes such as \w or \d, but they can be tested with \p if
PCRE is built with Unicode character property support. Alternatively,
the PCRE_UCP option can be set at compile time; this causes \w and
friends to use Unicode property support instead of built-in tables. The
use of locales with Unicode is discouraged. If you are handling
characters with codes greater than 127, you should either use UTF-8 and
Unicode, or use locales, but not try to mix the two.
2524
2525 PCRE contains an internal set of tables that are used when the final
2526 argument of pcre_compile() is NULL. These are sufficient for many
2527 applications. Normally, the internal tables recognize only ASCII char-
2528 acters. However, when PCRE is built, it is possible to cause the inter-
2529 nal tables to be rebuilt in the default "C" locale of the local system,
2530 which may cause them to be different.
2531
2532 The internal tables can always be overridden by tables supplied by the
2533 application that calls PCRE. These may be created in a different locale
2534 from the default. As more and more applications change to using Uni-
2535 code, the need for this locale support is expected to die away.
2536
External tables are built by calling the pcre_maketables() function,
which has no arguments, in the relevant locale. The result can then be
passed to pcre_compile() or pcre_exec() as often as necessary. For
example, to build and use tables that are appropriate for the French
locale (where accented characters with values greater than 127 are
treated as letters), the following code could be used:
2543
  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);
2547
2548 The locale name "fr_FR" is used on Linux and other Unix-like systems;
2549 if you are using Windows, the name for the French locale is "french".
2550
2551 When pcre_maketables() runs, the tables are built in memory that is
2552 obtained via pcre_malloc. It is the caller's responsibility to ensure
2553 that the memory containing the tables remains available for as long as
2554 it is needed.
2555
2556 The pointer that is passed to pcre_compile() is saved with the compiled
2557 pattern, and the same tables are used via this pointer by pcre_study()
2558 and normally also by pcre_exec(). Thus, by default, for any single pat-
2559 tern, compilation, studying and matching all happen in the same locale,
2560 but different patterns can be compiled in different locales.
2561
2562 It is possible to pass a table pointer or NULL (indicating the use of
2563 the internal tables) to pcre_exec(). Although not intended for this
2564 purpose, this facility could be used to match a pattern in a different
2565 locale from the one in which it was compiled. Passing table pointers at
2566 run time is discussed below in the section on matching a pattern.
2567
2568
2569 INFORMATION ABOUT A PATTERN
2570
2571 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
2572 int what, void *where);
2573
2574 The pcre_fullinfo() function returns information about a compiled pat-
2575 tern. It replaces the pcre_info() function, which was removed from the
2576 library at version 8.30, after more than 10 years of obsolescence.
2577
2578 The first argument for pcre_fullinfo() is a pointer to the compiled
2579 pattern. The second argument is the result of pcre_study(), or NULL if
2580 the pattern was not studied. The third argument specifies which piece
2581 of information is required, and the fourth argument is a pointer to a
2582 variable to receive the data. The yield of the function is zero for
2583 success, or one of the following negative numbers:
2584
  PCRE_ERROR_NULL           the argument code was NULL
                            the argument where was NULL
  PCRE_ERROR_BADMAGIC       the "magic number" was not found
  PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
                            endianness
  PCRE_ERROR_BADOPTION      the value of what was invalid
2591
The "magic number" is placed at the start of each compiled pattern as
a simple check against passing an arbitrary memory pointer. The
endianness error can occur if a compiled pattern is saved and reloaded
on a different host. Here is a typical call of pcre_fullinfo(), to
obtain the length of the compiled pattern:
2597
  int rc;
  size_t length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    sd,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */
2605
2606 The possible values for the third argument are defined in pcre.h, and
2607 are as follows:
2608
2609 PCRE_INFO_BACKREFMAX
2610
2611 Return the number of the highest back reference in the pattern. The
2612 fourth argument should point to an int variable. Zero is returned if
2613 there are no back references.
2614
2615 PCRE_INFO_CAPTURECOUNT
2616
2617 Return the number of capturing subpatterns in the pattern. The fourth
2618 argument should point to an int variable.
2619
2620 PCRE_INFO_DEFAULT_TABLES
2621
2622 Return a pointer to the internal default character tables within PCRE.
2623 The fourth argument should point to an unsigned char * variable. This
2624 information call is provided for internal use by the pcre_study() func-
2625 tion. External callers can cause PCRE to use its internal tables by
2626 passing a NULL table pointer.
2627
2628 PCRE_INFO_FIRSTBYTE
2629
2630 Return information about the first data unit of any matched string, for
2631 a non-anchored pattern. (The name of this option refers to the 8-bit
2632 library, where data units are bytes.) The fourth argument should point
2633 to an int variable.
2634
2635 If there is a fixed first value, for example, the letter "c" from a
2636 pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
2637 library, the value is always less than 256. In the 16-bit library the
2638 value can be up to 0xffff. In the 32-bit library the value can be up to
2639 0x10ffff.
2640
2641 If there is no fixed first value, and if either
2642
2643 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
2644 branch starts with "^", or
2645
2646 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2647 set (if it were set, the pattern would be anchored),
2648
2649 -1 is returned, indicating that the pattern matches only at the start
2650 of a subject string or after any newline within the string. Otherwise
2651 -2 is returned. For anchored patterns, -2 is returned.
2652
2653 Since for the 32-bit library using the non-UTF-32 mode, this function
2654 is unable to return the full 32-bit range of the character, this value
2655 is deprecated; instead the PCRE_INFO_FIRSTCHARACTERFLAGS and
2656 PCRE_INFO_FIRSTCHARACTER values should be used.
2657
2658 PCRE_INFO_FIRSTTABLE
2659
2660 If the pattern was studied, and this resulted in the construction of a
2661 256-bit table indicating a fixed set of values for the first data unit
2662 in any matching string, a pointer to the table is returned. Otherwise
2663 NULL is returned. The fourth argument should point to an unsigned char
2664 * variable.
2665
2666 PCRE_INFO_HASCRORLF
2667
2668 Return 1 if the pattern contains any explicit matches for CR or LF
2669 characters, otherwise 0. The fourth argument should point to an int
2670 variable. An explicit match is either a literal CR or LF character, or
2671 \r or \n.
2672
2673 PCRE_INFO_JCHANGED
2674
2675 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2676 otherwise 0. The fourth argument should point to an int variable. (?J)
2677 and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
2678
2679 PCRE_INFO_JIT
2680
2681 Return 1 if the pattern was studied with one of the JIT options, and
2682 just-in-time compiling was successful. The fourth argument should point
2683 to an int variable. A return value of 0 means that JIT support is not
2684 available in this version of PCRE, or that the pattern was not studied
2685 with a JIT option, or that the JIT compiler could not handle this par-
2686 ticular pattern. See the pcrejit documentation for details of what can
2687 and cannot be handled.
2688
2689 PCRE_INFO_JITSIZE
2690
2691 If the pattern was successfully studied with a JIT option, return the
2692 size of the JIT compiled code, otherwise return zero. The fourth argu-
2693 ment should point to a size_t variable.
2694
2695 PCRE_INFO_LASTLITERAL
2696
2697 Return the value of the rightmost literal data unit that must exist in
2698 any matched string, other than at its start, if such a value has been
2699 recorded. The fourth argument should point to an int variable. If there
2700 is no such value, -1 is returned. For anchored patterns, a last literal
2701 value is recorded only if it follows something of variable length. For
2702 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2703 /^a\dz\d/ the returned value is -1.
2704
2705 Since for the 32-bit library using the non-UTF-32 mode, this function
2706 is unable to return the full 32-bit range of the character, this value
2707 is deprecated; instead the PCRE_INFO_REQUIREDCHARFLAGS and
2708 PCRE_INFO_REQUIREDCHAR values should be used.
2709
2710 PCRE_INFO_MAXLOOKBEHIND
2711
2712 Return the number of characters (NB not bytes) in the longest lookbe-
2713 hind assertion in the pattern. This information is useful when doing
2714 multi-segment matching using the partial matching facilities. Note that
2715 the simple assertions \b and \B require a one-character lookbehind. \A
2716 also registers a one-character lookbehind, though it does not actually
2717 inspect the previous character. This is to ensure that at least one
2718 character from the old segment is retained when a new segment is pro-
2719 cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
2720 match incorrectly at the start of a new segment.
2721
2722 PCRE_INFO_MINLENGTH
2723
2724 If the pattern was studied and a minimum length for matching subject
2725 strings was computed, its value is returned. Otherwise the returned
2726 value is -1. The value is a number of characters, which in UTF-8 mode
2727 may be different from the number of bytes. The fourth argument should
2728 point to an int variable. A non-negative value is a lower bound to the
2729 length of any matching string. There may not be any strings of that
2730 length that do actually match, but every string that does match is at
2731 least that long.
2732
2733 PCRE_INFO_NAMECOUNT
2734 PCRE_INFO_NAMEENTRYSIZE
2735 PCRE_INFO_NAMETABLE
2736
2737 PCRE supports the use of named as well as numbered capturing parenthe-
2738 ses. The names are just an additional way of identifying the parenthe-
2739 ses, which still acquire numbers. Several convenience functions such as
2740 pcre_get_named_substring() are provided for extracting captured sub-
2741 strings by name. It is also possible to extract the data directly, by
2742 first converting the name to a number in order to access the correct
2743 pointers in the output vector (described with pcre_exec() below). To do
2744 the conversion, you need to use the name-to-number map, which is
2745 described by these three values.
2746
2747 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
2748 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
2749 of each entry; both of these return an int value. The entry size
2750 depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
2751 a pointer to the first entry of the table. This is a pointer to char in
2752 the 8-bit library, where the first two bytes of each entry are the num-
2753 ber of the capturing parenthesis, most significant byte first. In the
2754 16-bit library, the pointer points to 16-bit data units, the first of
2755 which contains the parenthesis number. In the 32-bit library, the
2756 pointer points to 32-bit data units, the first of which contains the
2757 parenthesis number. The rest of the entry is the corresponding name,
2758 zero terminated.
2759
2760 The names are in alphabetical order. Duplicate names may appear if (?|
2761 is used to create multiple groups with the same number, as described in
2762 the section on duplicate subpattern numbers in the pcrepattern page.
2763 Duplicate names for subpatterns with different numbers are permitted
2764 only if PCRE_DUPNAMES is set. In all cases of duplicate names, they
2765 appear in the table in the order in which they were found in the pat-
2766 tern. In the absence of (?| this is the order of increasing number;
2767 when (?| is used this is not necessarily the case because later subpat-
2768 terns may have lower numbers.
2769
2770 As a simple example of the name/number table, consider the following
2771 pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
2772 set, so white space - including newlines - is ignored):
2773
  (?<date> (?<year>(\d\d)?\d\d) -
  (?<month>\d\d) - (?<day>\d\d) )
2776
There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hexadecimal, and undefined bytes shown
as ??:
2781
  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
2786
2787 When writing code to extract data from named subpatterns using the
2788 name-to-number map, remember that the length of the entries is likely
2789 to be different for each compiled pattern.
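
A minimal sketch of a lookup in the 8-bit name-to-number map follows. It
works on a hard-coded copy of the example table above, with the undefined
bytes set to zero. In a real program the table pointer, entry count, and
entry size would come from pcre_fullinfo(). The function name is invented
for illustration; PCRE's own pcre_get_stringnumber() performs this
conversion for you.

```c
#include <assert.h>
#include <string.h>

/* The example table, laid out as in the 8-bit library: two bytes of
   group number (most significant first), then the name, zero
   terminated. Padding bytes are zero here. */
static const unsigned char sample_name_table[] = {
  0, 1, 'd', 'a', 't', 'e', 0,   0,
  0, 5, 'd', 'a', 'y', 0,   0,   0,
  0, 4, 'm', 'o', 'n', 't', 'h', 0,
  0, 2, 'y', 'e', 'a', 'r', 0,   0
};

/* Hypothetical helper: return the number of the first group called
   'name', or -1 if the table has no such name. */
int group_number_for_name(const unsigned char *table, int name_count,
                          int entry_size, const char *name)
{
  assert(table != NULL && name != NULL && entry_size > 2);
  for (int i = 0; i < name_count; i++) {
    const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
  }
  return -1;
}
```

The entry size is passed in rather than fixed at eight, because it
depends on the longest name in each compiled pattern.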
2790
2791 PCRE_INFO_OKPARTIAL
2792
2793 Return 1 if the pattern can be used for partial matching with
2794 pcre_exec(), otherwise 0. The fourth argument should point to an int
2795 variable. From release 8.00, this always returns 1, because the
2796 restrictions that previously applied to partial matching have been
2797 lifted. The pcrepartial documentation gives details of partial match-
2798 ing.
2799
2800 PCRE_INFO_OPTIONS
2801
2802 Return a copy of the options with which the pattern was compiled. The
2803 fourth argument should point to an unsigned long int variable. These
2804 option bits are those specified in the call to pcre_compile(), modified
2805 by any top-level option settings at the start of the pattern itself. In
2806 other words, they are the options that will be in force when matching
2807 starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
2808 the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
2809 and PCRE_EXTENDED.
2810
2811 A pattern is automatically anchored by PCRE if all of its top-level
2812 alternatives begin with one of the following:
2813
  ^     unless PCRE_MULTILINE is set
  \A    always
  \G    always
  .*    if PCRE_DOTALL is set and there are no back
          references to the subpattern in which .* appears
2819
2820 For such patterns, the PCRE_ANCHORED bit is set in the options returned
2821 by pcre_fullinfo().
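
The anchoring rule above can be written out as a small decision function.
This is only a sketch of the rule as stated. The types and names are
invented, it does not inspect a real pattern, and it assumes that the
multiline and dotall settings apply to the whole pattern.

```c
#include <assert.h>
#include <stddef.h>

/* What a top-level alternative begins with (hypothetical enumeration). */
enum branch_start {
  START_CARET,        /* ^  */
  START_BACKSLASH_A,  /* \A */
  START_BACKSLASH_G,  /* \G */
  START_DOT_STAR,     /* .* */
  START_OTHER
};

/* Does one alternative with this start anchor the match? */
static int branch_is_anchoring(enum branch_start start, int multiline,
                               int dotall, int dot_star_backreferenced)
{
  switch (start) {
    case START_CARET:       return !multiline;
    case START_BACKSLASH_A: return 1;
    case START_BACKSLASH_G: return 1;
    case START_DOT_STAR:    return dotall && !dot_star_backreferenced;
    default:                return 0;
  }
}

/* A pattern is anchored only if every top-level alternative anchors. */
int pattern_is_auto_anchored(const enum branch_start *starts, size_t count,
                             int multiline, int dotall,
                             int dot_star_backreferenced)
{
  assert(starts != NULL && count > 0);
  for (size_t i = 0; i < count; i++) {
    if (!branch_is_anchoring(starts[i], multiline, dotall,
                             dot_star_backreferenced))
      return 0;
  }
  return 1;
}
```

A single alternative that starts with anything else is enough to leave
the pattern unanchored.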
2822
2823 PCRE_INFO_SIZE
2824
Return the size of the compiled pattern in bytes (for all three
libraries). The fourth argument should point to a size_t variable. This
value does not include the size of the pcre structure that is returned
by pcre_compile(). The value that is passed as the argument to
pcre_malloc() when pcre_compile() is getting memory in which to place
the compiled data is the value returned by this option plus the size of
the pcre structure. Studying a compiled pattern, with or without JIT,
does not alter the value returned by this option.
2833
2834 PCRE_INFO_STUDYSIZE
2835
2836 Return the size in bytes of the data block pointed to by the study_data
2837 field in a pcre_extra block. If pcre_extra is NULL, or there is no
2838 study data, zero is returned. The fourth argument should point to a
2839 size_t variable. The study_data field is set by pcre_study() to record
2840 information that will speed up matching (see the section entitled
2841 "Studying a pattern" above). The format of the study_data block is pri-
2842 vate, but its length is made available via this option so that it can
2843 be saved and restored (see the pcreprecompile documentation for
2844 details).
2845
2846 PCRE_INFO_FIRSTCHARACTERFLAGS
2847
2848 Return information about the first data unit of any matched string, for
2849 a non-anchored pattern. The fourth argument should point to an int
2850 variable.
2851
2852 If there is a fixed first value, for example, the letter "c" from a
2853 pattern such as (cat|cow|coyote), 1 is returned, and the character
2854 value can be retrieved using PCRE_INFO_FIRSTCHARACTER.
2855
2856 If there is no fixed first value, and if either
2857
2858 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
2859 branch starts with "^", or
2860
2861 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2862 set (if it were set, the pattern would be anchored),
2863
2864 2 is returned, indicating that the pattern matches only at the start of
2865 a subject string or after any newline within the string. Otherwise 0 is
2866 returned. For anchored patterns, 0 is returned.
2867
2868 PCRE_INFO_FIRSTCHARACTER
2869
2870 Return the fixed first character value, if PCRE_INFO_FIRSTCHARACTERFLAGS
2871 returned 1; otherwise return 0. The fourth argument should point to a
2872 uint32_t variable.
2873
2874 In the 8-bit library, the value is always less than 256. In the 16-bit
2875 library the value can be up to 0xffff. In the 32-bit library in UTF-32
2876 mode the value can be up to 0x10ffff, and up to 0xffffffff when not
2877 using UTF-32 mode.
2878
2879 If there is no fixed first value, 0 is returned whatever the reason.
2880 The distinction between a pattern that can match only at the start of
2881 a subject string or after a newline within it, and a pattern for which
2882 nothing is known about the first character, is reported by
2883 PCRE_INFO_FIRSTCHARACTERFLAGS (as 2 and 0 respectively), not by this
2884 option.
2890
2891 PCRE_INFO_REQUIREDCHARFLAGS
2892
2893 Returns 1 if there is a rightmost literal data unit that must exist in
2894 any matched string, other than at its start. The fourth argument should
2895 point to an int variable. If there is no such value, 0 is returned. If
2896 returning 1, the character value itself can be retrieved using
2897 PCRE_INFO_REQUIREDCHAR.
2898
2899 For anchored patterns, a last literal value is recorded only if it fol-
2900 lows something of variable length. For example, for the pattern
2901 /^a\d+z\d+/ the returned value is 1 (with "z" returned from
2902 PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
2903
2904 PCRE_INFO_REQUIREDCHAR
2905
2906 Return the value of the rightmost literal data unit that must exist in
2907 any matched string, other than at its start, if such a value has been
2908 recorded. The fourth argument should point to an uint32_t variable. If
2909 there is no such value, 0 is returned.
2910
2911
2912 REFERENCE COUNTS
2913
2914 int pcre_refcount(pcre *code, int adjust);
2915
2916 The pcre_refcount() function is used to maintain a reference count in
2917 the data block that contains a compiled pattern. It is provided for the
2918 benefit of applications that operate in an object-oriented manner,
2919 where different parts of the application may be using the same compiled
2920 pattern, but you want to free the block when they are all done.
2921
2922 When a pattern is compiled, the reference count field is initialized to
2923 zero. It is changed only by calling this function, whose action is to
2924 add the adjust value (which may be positive or negative) to it. The
2925 yield of the function is the new value. However, the value of the count
2926 is constrained to lie between 0 and 65535, inclusive. If the new value
2927 is outside these limits, it is forced to the appropriate limit value.
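
The clamping rule described above can be modelled in a few lines of C.
This is only a sketch of the documented arithmetic, not code taken from
the PCRE source: model_refcount() and the two limit constants are
hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Hypothetical limits, taken from the text above (0 to 65535 inclusive). */
enum { REFCOUNT_MIN = 0, REFCOUNT_MAX = 65535 };

/* Model of the documented behaviour: add adjust to the current count,
   force the result into the permitted range, and yield the new value. */
int model_refcount(int current, int adjust)
{
    long long next;

    assert(current >= REFCOUNT_MIN && current <= REFCOUNT_MAX);

    next = (long long)current + adjust;   /* widened so the sum cannot overflow */
    if (next < REFCOUNT_MIN) next = REFCOUNT_MIN;
    if (next > REFCOUNT_MAX) next = REFCOUNT_MAX;
    return (int)next;
}
```

In this model a count that is already 65535 stays at 65535 when it is
incremented, and a count of zero stays at zero when it is decremented,
so the count is only a reliable signal for freeing the block if it
never reaches either limit by being forced there.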
2928
2929 Except when it is zero, the reference count is not correctly preserved
2930 if a pattern is compiled on one host and then transferred to a host
2931 whose byte-order is different. (This seems a highly unlikely scenario.)
2932
2933
2934 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2935
2936 int pcre_exec(const pcre *code, const pcre_extra *extra,
2937 const char *subject, int length, int startoffset,
2938 int options, int *ovector, int ovecsize);
2939
2940 The function pcre_exec() is called to match a subject string against a
2941 compiled pattern, which is passed in the code argument. If the pattern
2942 was studied, the result of the study should be passed in the extra
2943 argument. You can call pcre_exec() with the same code and extra argu-
2944 ments as many times as you like, in order to match different subject
2945 strings with the same pattern.
2946
2947 This function is the main matching facility of the library, and it
2948 operates in a Perl-like manner. For specialist use there is also an
2949 alternative matching function, which is described below in the section
2950 about the pcre_dfa_exec() function.
2951
2952 In most applications, the pattern will have been compiled (and option-
2953 ally studied) in the same process that calls pcre_exec(). However, it
2954 is possible to save compiled patterns and study data, and then use them
2955 later in different processes, possibly even on different hosts. For a
2956 discussion about this, see the pcreprecompile documentation.
2957
2958 Here is an example of a simple call to pcre_exec():
2959
2960 int rc;
2961 int ovector[30];
2962 rc = pcre_exec(
2963 re, /* result of pcre_compile() */
2964 NULL, /* we didn't study the pattern */
2965 "some string", /* the subject string */
2966 11, /* the length of the subject string */
2967 0, /* start at offset 0 in the subject */
2968 0, /* default options */
2969 ovector, /* vector of integers for substring information */
2970 30); /* number of elements (NOT size in bytes) */
2971
2972 Extra data for pcre_exec()
2973
2974 If the extra argument is not NULL, it must point to a pcre_extra data
2975 block. The pcre_study() function returns such a block (when it doesn't
2976 return NULL), but you can also create one for yourself, and pass addi-
2977 tional information in it. The pcre_extra block contains the following
2978 fields (not necessarily in this order):
2979
2980 unsigned long int flags;
2981 void *study_data;
2982 void *executable_jit;
2983 unsigned long int match_limit;
2984 unsigned long int match_limit_recursion;
2985 void *callout_data;
2986 const unsigned char *tables;
2987 unsigned char **mark;
2988
2989 In the 16-bit version of this structure, the mark field has type
2990 "PCRE_UCHAR16 **".
2991
2992 In the 32-bit version of this structure, the mark field has type
2993 "PCRE_UCHAR32 **".
2994
2995 The flags field is used to specify which of the other fields are set.
2996 The flag bits are:
2997
2998 PCRE_EXTRA_CALLOUT_DATA
2999 PCRE_EXTRA_EXECUTABLE_JIT
3000 PCRE_EXTRA_MARK
3001 PCRE_EXTRA_MATCH_LIMIT
3002 PCRE_EXTRA_MATCH_LIMIT_RECURSION
3003 PCRE_EXTRA_STUDY_DATA
3004 PCRE_EXTRA_TABLES
3005
3006 Other flag bits should be set to zero. The study_data field and some-
3007 times the executable_jit field are set in the pcre_extra block that is
3008 returned by pcre_study(), together with the appropriate flag bits. You
3009 should not set these yourself, but you may add to the block by setting
3010 other fields and their corresponding flag bits.
3011
3012 The match_limit field provides a means of preventing PCRE from using up
3013 a vast amount of resources when running patterns that are not going to
3014 match, but which have a very large number of possibilities in their
3015 search trees. The classic example is a pattern that uses nested unlim-
3016 ited repeats.
3017
3018 Internally, pcre_exec() uses a function called match(), which it calls
3019 repeatedly (sometimes recursively). The limit set by match_limit is
3020 imposed on the number of times this function is called during a match,
3021 which has the effect of limiting the amount of backtracking that can
3022 take place. For patterns that are not anchored, the count restarts from
3023 zero for each position in the subject string.
3024
3025 When pcre_exec() is called with a pattern that was successfully studied
3026 with a JIT option, the way that the matching is executed is entirely
3027 different. However, there is still the possibility of runaway matching
3028 that goes on for a very long time, and so the match_limit value is also
3029 used in this case (but in a different way) to limit how long the match-
3030 ing can continue.
3031
3032 The default value for the limit can be set when PCRE is built; if it
3033 is not set then, the default is 10 million, which handles all but the
3034 most extreme cases. You can override the default by supplying
3035 pcre_exec() with a pcre_extra block in which match_limit is set, and
3036 PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
3037 exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
3038
3039 The match_limit_recursion field is similar to match_limit, but instead
3040 of limiting the total number of times that match() is called, it limits
3041 the depth of recursion. The recursion depth is a smaller number than
3042 the total number of calls, because not all calls to match() are recur-
3043 sive. This limit is of use only if it is set smaller than match_limit.
3044
3045 Limiting the recursion depth limits the amount of machine stack that
3046 can be used, or, when PCRE has been compiled to use memory on the heap
3047 instead of the stack, the amount of heap memory that can be used. This
3048 limit is not relevant, and is ignored, when matching is done using JIT
3049 compiled code.
3050
3051 The default value for match_limit_recursion can be set when PCRE is
3052 built; if it is not set then, it is the same as the default for
3053 match_limit. You can override the default by supplying pcre_exec() with
3054 a pcre_extra block in which match_limit_recursion is set, and
3055 PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
3056 limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
3057
3058 The callout_data field is used in conjunction with the "callout" fea-
3059 ture, and is described in the pcrecallout documentation.
3060
3061 The tables field is used to pass a character tables pointer to
3062 pcre_exec(); this overrides the value that is stored with the compiled
3063 pattern. A non-NULL value is stored with the compiled pattern only if
3064 custom tables were supplied to pcre_compile() via its tableptr argu-
3065 ment. If NULL is passed to pcre_exec() using this mechanism, it forces
3066 PCRE's internal tables to be used. This facility is helpful when re-
3067 using patterns that have been saved after compiling with an external
3068 set of tables, because the external tables might be at a different
3069 address when pcre_exec() is called. See the pcreprecompile documenta-
3070 tion for a discussion of saving compiled patterns for later use.
3071
3072 If PCRE_EXTRA_MARK is set in the flags field, the mark field must be
3073 set to point to a suitable variable. If the pattern contains any back-
3074 tracking control verbs such as (*MARK:NAME), and the execution ends up
3075 with a name to pass back, a pointer to the name string (zero termi-
3076 nated) is placed in the variable pointed to by the mark field. The
3077 names are within the compiled pattern; if you wish to retain such a
3078 name you must copy it before freeing the memory of a compiled pattern.
3079 If there is no name to pass back, the variable pointed to by the mark
3080 field is set to NULL. For details of the backtracking control verbs,
3081 see the section entitled "Backtracking control" in the pcrepattern doc-
3082 umentation.
3083
3084 Option bits for pcre_exec()
3085
3086 The unused bits of the options argument for pcre_exec() must be zero.
3087 The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
3088 PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
3089 PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and
3090 PCRE_PARTIAL_SOFT.
3091
3092 If the pattern was successfully studied with one of the just-in-time
3093 (JIT) compile options, the only supported options for JIT execution are
3094 PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
3095 PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
3096 unsupported option is used, JIT execution is disabled and the normal
3097 interpretive code in pcre_exec() is run.
3098
3099 PCRE_ANCHORED
3100
3101 The PCRE_ANCHORED option limits pcre_exec() to matching at the first
3102 matching position. If a pattern was compiled with PCRE_ANCHORED, or
3103 turned out to be anchored by virtue of its contents, it cannot be made
3104 unanchored at matching time.
3105
3106 PCRE_BSR_ANYCRLF
3107 PCRE_BSR_UNICODE
3108
3109 These options (which are mutually exclusive) control what the \R escape
3110 sequence matches. The choice is either to match only CR, LF, or CRLF,
3111 or to match any Unicode newline sequence. These options override the
3112 choice that was made or defaulted when the pattern was compiled.
3113
3114 PCRE_NEWLINE_CR
3115 PCRE_NEWLINE_LF
3116 PCRE_NEWLINE_CRLF
3117 PCRE_NEWLINE_ANYCRLF
3118 PCRE_NEWLINE_ANY
3119
3120 These options override the newline definition that was chosen or
3121 defaulted when the pattern was compiled. For details, see the descrip-
3122 tion of pcre_compile() above. During matching, the newline choice
3123 affects the behaviour of the dot, circumflex, and dollar metacharac-
3124 ters. It may also alter the way the match position is advanced after a
3125 match failure for an unanchored pattern.
3126
3127 When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
3128 set, and a match attempt for an unanchored pattern fails when the cur-
3129 rent position is at a CRLF sequence, and the pattern contains no
3130 explicit matches for CR or LF characters, the match position is
3131 advanced by two characters instead of one, in other words, to after the
3132 CRLF.
3133
3134 The above rule is a compromise that makes the most common cases work as
3135 expected. For example, if the pattern is .+A (and the PCRE_DOTALL
3136 option is not set), it does not match the string "\r\nA" because, after
3137 failing at the start, it skips both the CR and the LF before retrying.
3138 However, the pattern [\r\n]A does match that string, because it con-
3139 tains an explicit CR or LF reference, and so advances only by one char-
3140 acter after the first failure.
3141
3142 An explicit match for CR or LF is either a literal appearance of one of
3143 those characters, or one of the \r or \n escape sequences. Implicit
3144 matches such as [^X] do not count, nor does \s (which includes CR and
3145 LF in the characters that it matches).
3146
3147 Notwithstanding the above, anomalous effects may still occur when CRLF
3148 is a valid newline sequence and explicit \r or \n escapes appear in the
3149 pattern.
3150
3151 PCRE_NOTBOL
3152
3153 This option specifies that the first character of the subject string is not
3154 the beginning of a line, so the circumflex metacharacter should not
3155 match before it. Setting this without PCRE_MULTILINE (at compile time)
3156 causes circumflex never to match. This option affects only the behav-
3157 iour of the circumflex metacharacter. It does not affect \A.
3158
3159 PCRE_NOTEOL
3160
3161 This option specifies that the end of the subject string is not the end
3162 of a line, so the dollar metacharacter should not match it nor (except
3163 in multiline mode) a newline immediately before it. Setting this with-
3164 out PCRE_MULTILINE (at compile time) causes dollar never to match. This
3165 option affects only the behaviour of the dollar metacharacter. It does
3166 not affect \Z or \z.
3167
3168 PCRE_NOTEMPTY
3169
3170 An empty string is not considered to be a valid match if this option is
3171 set. If there are alternatives in the pattern, they are tried. If all
3172 the alternatives match the empty string, the entire match fails. For
3173 example, if the pattern
3174
3175 a?b?
3176
3177 is applied to a string not beginning with "a" or "b", it matches an
3178 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
3179 match is not valid, so PCRE searches further into the string for occur-
3180 rences of "a" or "b".
3181
3182 PCRE_NOTEMPTY_ATSTART
3183
3184 This is like PCRE_NOTEMPTY, except that an empty string match that is
3185 not at the start of the subject is permitted. If the pattern is
3186 anchored, such a match can occur only if the pattern contains \K.
3187
3188 Perl has no direct equivalent of PCRE_NOTEMPTY or
3189 PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
3190 match of the empty string within its split() function, and when using
3191 the /g modifier. It is possible to emulate Perl's behaviour after
3192 matching a null string by first trying the match again at the same off-
3193 set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
3194 fails, by advancing the starting offset (see below) and trying an ordi-
3195 nary match again. There is some code that demonstrates how to do this
3196 in the pcredemo sample program. In the most general case, you have to
3197 check to see if the newline convention recognizes CRLF as a newline,
3198 and if so, and the current character is CR followed by LF, advance the
3199 starting offset by two characters instead of one.
3200
3201 PCRE_NO_START_OPTIMIZE
3202
3203 There are a number of optimizations that pcre_exec() uses at the start
3204 of a match, in order to speed up the process. For example, if it is
3205 known that an unanchored match must start with a specific character, it
3206 searches the subject for that character, and fails immediately if it
3207 cannot find it, without actually running the main matching function.
3208 This means that a special item such as (*COMMIT) at the start of a pat-
3209 tern is not considered until after a suitable starting point for the
3210 match has been found. When callouts or (*MARK) items are in use, these
3211 "start-up" optimizations can cause them to be skipped if the pattern is
3212 never actually used. The start-up optimizations are in effect a pre-
3213 scan of the subject that takes place before the pattern is run.
3214
3215 The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
3216 possibly causing performance to suffer, but ensuring that in cases
3217 where the result is "no match", the callouts do occur, and that items
3218 such as (*COMMIT) and (*MARK) are considered at every possible starting
3219 position in the subject string. If PCRE_NO_START_OPTIMIZE is set at
3220 compile time, it cannot be unset at matching time. The use of
3221 PCRE_NO_START_OPTIMIZE disables JIT execution; when it is set, matching
3222 is always done interpretively.
3223
3224 Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching
3225 operation. Consider the pattern
3226
3227 (*COMMIT)ABC
3228
3229 When this is compiled, PCRE records the fact that a match must start
3230 with the character "A". Suppose the subject string is "DEFABC". The
3231 start-up optimization scans along the subject, finds "A" and runs the
3232 first match attempt from there. The (*COMMIT) item means that the pat-
3233 tern must match the current starting position, which in this case, it
3234 does. However, if the same match is run with PCRE_NO_START_OPTIMIZE
3235 set, the initial scan along the subject string does not happen. The
3236 first match attempt is run starting from "D" and when this fails,
3237 (*COMMIT) prevents any further matches being tried, so the overall
3238 result is "no match". If the pattern is studied, more start-up opti-
3239 mizations may be used. For example, a minimum length for the subject
3240 may be recorded. Consider the pattern
3241
3242 (*MARK:A)(X|Y)
3243
3244 The minimum length for a match is one character. If the subject is
3245 "ABC", there will be attempts to match "ABC", "BC", "C", and then
3246 finally an empty string. If the pattern is studied, the final attempt
3247 does not take place, because PCRE knows that the subject is too short,
3248 and so the (*MARK) is never encountered. In this case, studying the
3249 pattern does not affect the overall match result, which is still "no
3250 match", but it does affect the auxiliary information that is returned.
3251
3252 PCRE_NO_UTF8_CHECK
3253
3254 When PCRE_UTF8 is set at compile time, the validity of the subject as a
3255 UTF-8 string is automatically checked when pcre_exec() is subsequently
3256 called. The entire string is checked before any other processing takes
3257 place. The value of startoffset is also checked to ensure that it
3258 points to the start of a UTF-8 character. There is a discussion about
3259 the validity of UTF-8 strings in the pcreunicode page. If an invalid
3260 sequence of bytes is found, pcre_exec() returns the error
3261 PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
3262 truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
3263 both cases, information about the precise nature of the error may also
3264 be returned (see the descriptions of these errors in the section enti-
3265 tled Error return values from pcre_exec() below). If startoffset con-
3266 tains a value that does not point to the start of a UTF-8 character (or
3267 to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
3268
3269 If you already know that your subject is valid, and you want to skip
3270 these checks for performance reasons, you can set the
3271 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
3272 do this for the second and subsequent calls to pcre_exec() if you are
3273 making repeated calls to find all the matches in a single subject
3274 string. However, you should be sure that the value of startoffset
3275 points to the start of a character (or the end of the subject). When
3276 PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
3277 subject or an invalid value of startoffset is undefined. Your program
3278 may crash.
3279
3280 PCRE_PARTIAL_HARD
3281 PCRE_PARTIAL_SOFT
3282
3283 These options turn on the partial matching feature. For backwards com-
3284 patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
3285 match occurs if the end of the subject string is reached successfully,
3286 but there are not enough subject characters to complete the match. If
3287 this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
3288 matching continues by testing any remaining alternatives. Only if no
3289 complete match can be found is PCRE_ERROR_PARTIAL returned instead of
3290 PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
3291 caller is prepared to handle a partial match, but only if no complete
3292 match can be found.
3293
3294 If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
3295 case, if a partial match is found, pcre_exec() immediately returns
3296 PCRE_ERROR_PARTIAL, without considering any other alternatives. In
3297 other words, when PCRE_PARTIAL_HARD is set, a partial match is
3298 considered to be more important than an alternative complete match.
3299
3300 In both cases, the portion of the string that was inspected when the
3301 partial match was found is set as the first matching string. There is a
3302 more detailed discussion of partial and multi-segment matching, with
3303 examples, in the pcrepartial documentation.
3304
3305 The string to be matched by pcre_exec()
3306
3307 The subject string is passed to pcre_exec() as a pointer in subject, a
3308 length in bytes in length, and a starting byte offset in startoffset.
3309 If this is negative or greater than the length of the subject,
3310 pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
3311 zero, the search for a match starts at the beginning of the subject,
3312 and this is by far the most common case. In UTF-8 mode, the byte offset
3313 must point to the start of a UTF-8 character (or the end of the sub-
3314 ject). Unlike the pattern string, the subject may contain binary zero
3315 bytes.
3316
3317 A non-zero starting offset is useful when searching for another match
3318 in the same subject by calling pcre_exec() again after a previous suc-
3319 cess. Setting startoffset differs from just passing over a shortened
3320 string and setting PCRE_NOTBOL in the case of a pattern that begins
3321 with any kind of lookbehind. For example, consider the pattern
3322
3323 \Biss\B
3324
3325 which finds occurrences of "iss" in the middle of words. (\B matches
3326 only if the current position in the subject is not a word boundary.)
3327 When applied to the string "Mississippi" the first call to pcre_exec()
3328 finds the first occurrence. If pcre_exec() is called again with just
3329 the remainder of the subject, namely "issippi", it does not match,
3330 because \B is always false at the start of the subject, which is deemed
3331 to be a word boundary. However, if pcre_exec() is passed the entire
3332 string again, but with startoffset set to 4, it finds the second occur-
3333 rence of "iss" because it is able to look behind the starting point to
3334 discover that it is preceded by a letter.
3335
3336 Finding all the matches in a subject is tricky when the pattern can
3337 match an empty string. It is possible to emulate Perl's /g behaviour by
3338 first trying the match again at the same offset, with the
3339 PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
3340 fails, advancing the starting offset and trying an ordinary match
3341 again. There is some code that demonstrates how to do this in the pcre-
3342 demo sample program. In the most general case, you have to check to see
3343 if the newline convention recognizes CRLF as a newline, and if so, and
3344 the current character is CR followed by LF, advance the starting offset
3345 by two characters instead of one.
3346
3347 If a non-zero starting offset is passed when the pattern is anchored,
3348 one attempt to match at the given offset is made. This can only succeed
3349 if the pattern does not require the match to be at the start of the
3350 subject.
3351
3352 How pcre_exec() returns captured substrings
3353
3354 In general, a pattern matches a certain portion of the subject, and in
3355 addition, further substrings from the subject may be picked out by
3356 parts of the pattern. Following the usage in Jeffrey Friedl's book,
3357 this is called "capturing" in what follows, and the phrase "capturing
3358 subpattern" is used for a fragment of a pattern that picks out a sub-
3359 string. PCRE supports several other kinds of parenthesized subpattern
3360 that do not cause substrings to be captured.
3361
3362 Captured substrings are returned to the caller via a vector of integers
3363 whose address is passed in ovector. The number of elements in the vec-
3364 tor is passed in ovecsize, which must be a non-negative number. Note:
3365 this argument is NOT the size of ovector in bytes.
3366
3367 The first two-thirds of the vector is used to pass back captured sub-
3368 strings, each substring using a pair of integers. The remaining third
3369 of the vector is used as workspace by pcre_exec() while matching cap-
3370 turing subpatterns, and is not available for passing back information.
3371 The number passed in ovecsize should always be a multiple of three. If
3372 it is not, it is rounded down.
3373
3374 When a match is successful, information about captured substrings is
3375 returned in pairs of integers, starting at the beginning of ovector,
3376 and continuing up to two-thirds of its length at the most. The first
3377 element of each pair is set to the byte offset of the first character
3378 in a substring, and the second is set to the byte offset of the first
3379 character after the end of a substring. Note: these values are always
3380 byte offsets, even in UTF-8 mode. They are not character counts.
3381
3382 The first pair of integers, ovector[0] and ovector[1], identify the
3383 portion of the subject string matched by the entire pattern. The next
3384 pair is used for the first capturing subpattern, and so on. The value
3385 returned by pcre_exec() is one more than the highest numbered pair that
3386 has been set. For example, if two substrings have been captured, the
3387 returned value is 3. If there are no capturing subpatterns, the return
3388 value from a successful match is 1, indicating that just the first pair
3389 of offsets has been set.
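
The relationship between the return value and the offset pairs can be
sketched as follows. copy_capture() is a hypothetical helper written
for this illustration; the library has its own convenience functions
for the same job. It takes the subject, the ovector, and the positive
value returned by a successful match, and copies one substring.

```c
#include <assert.h>
#include <string.h>

/* Copies substring n (0 is the whole match) into buf as a
   zero-terminated string and returns its length in bytes. Returns -1
   if pair n was not set by the match, if the subpattern did not take
   part in the match, or if buf is too small. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, int bufsize)
{
    int start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);

    if (n < 0 || n >= rc) return -1;       /* rc is one more than the highest pair set */

    start = ovector[2 * n];                /* offset of the first byte */
    end = ovector[2 * n + 1];              /* offset just past the last byte */
    if (start < 0 || end < start) return -1;

    len = end - start;
    if (len >= bufsize) return -1;         /* no room for the terminating zero */

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Because the offsets are byte offsets, the helper works unchanged in
UTF-8 mode, but the length it returns is then a byte count rather than
a character count.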
3390
3391 If a capturing subpattern is matched repeatedly, it is the last portion
3392 of the string that it matched that is returned.
3393
3394 If the vector is too small to hold all the captured substring offsets,
3395 it is used as far as possible (up to two-thirds of its length), and the
3396 function returns a value of zero. If neither the actual string matched
3397 nor any captured substrings are of interest, pcre_exec() may be called
3398 with ovector passed as NULL and ovecsize as zero. However, if the pat-
3399 tern contains back references and the ovector is not big enough to
3400 remember the related substrings, PCRE has to get additional memory for
3401 use during matching. Thus it is usually advisable to supply an ovector
3402 of reasonable size.
3403
3404 There are some cases where zero is returned (indicating vector over-
3405 flow) when in fact the vector is exactly the right size for the final
3406 match. For example, consider the pattern
3407
3408 (a)(?:(b)c|bd)
3409
3410 If a vector of 6 elements (allowing for only 1 captured substring) is
3411 given with subject string "abd", pcre_exec() will try to set the second
3412 captured string, thereby recording a vector overflow, before failing to
3413 match "c" and backing up to try the second alternative. The zero
3414 return, however, does correctly indicate that the maximum number of
3415 slots (namely 2) have been filled. In similar cases where there is tem-
3416 porary overflow, but the final number of used slots is actually less
3417 than the maximum, a non-zero value is returned.
3418
3419 The pcre_fullinfo() function can be used to find out how many capturing
3420 subpatterns there are in a compiled pattern. The smallest size for
3421 ovector that will allow for n captured substrings, in addition to the
3422 offsets of the substring matched by the whole pattern, is (n+1)*3.
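
The sizing rules can be written down directly. The two functions below
are hypothetical helpers, shown only to make the arithmetic concrete:
the first gives the smallest ovecsize for a given number of capturing
subpatterns (for example, the count obtained from pcre_fullinfo()), and
the second gives the number of offset pairs that a vector of a given
size can return.

```c
#include <assert.h>

/* Smallest ovecsize that has room for the whole match plus
   capture_count captured substrings: (n+1)*3. */
int ovecsize_for_captures(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that can be passed back in a vector of
   ovecsize integers. A size that is not a multiple of three is rounded
   down, and only the first two-thirds of the vector holds pairs, so
   the answer is ovecsize / 3 using integer division. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

A vector of 30 integers therefore has room for the whole match and nine
captured substrings, and a vector of 6 integers has room for the whole
match and one.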
3423
3424 It is possible for capturing subpattern number n+1 to match some part
3425 of the subject when subpattern n has not been used at all. For example,
3426 if the string "abc" is matched against the pattern (a|(z))(bc) the
3427 return from the function is 4, and subpatterns 1 and 3 are matched, but
3428 2 is not. When this happens, both values in the offset pairs corre-
3429 sponding to unused subpatterns are set to -1.
3430
3431 Offset values that correspond to unused subpatterns at the end of the
3432 expression are also set to -1. For example, if the string "abc" is
3433 matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
3434 matched. The return from the function is 2, because the highest used
3435 capturing subpattern number is 1, and the offsets for the second
3436 and third capturing subpatterns (assuming the vector is large enough,
3437 of course) are set to -1.
3438
3439 Note: Elements in the first two-thirds of ovector that do not corre-
3440 spond to capturing parentheses in the pattern are never changed. That
3441 is, if a pattern contains n capturing parentheses, no more than ovec-
3442 tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in
3443 the first two-thirds) retain whatever values they previously had.
3444
3445 Some convenience functions are provided for extracting the captured
3446 substrings as separate strings. These are described below.
3447
3448 Error return values from pcre_exec()
3449
3450 If pcre_exec() fails, it returns a negative number. The following are
3451 defined in the header file:
3452
3453 PCRE_ERROR_NOMATCH (-1)
3454
3455 The subject string did not match the pattern.
3456
3457 PCRE_ERROR_NULL (-2)
3458
3459 Either code or subject was passed as NULL, or ovector was NULL and
3460 ovecsize was not zero.
3461
3462 PCRE_ERROR_BADOPTION (-3)
3463
3464 An unrecognized bit was set in the options argument.
3465
3466 PCRE_ERROR_BADMAGIC (-4)
3467
3468 PCRE stores a 4-byte "magic number" at the start of the compiled code,
3469 to catch the case when it is passed a junk pointer and to detect when a
3470 pattern that was compiled in an environment of one endianness is run in
3471 an environment with the other endianness. This is the error that PCRE
3472 gives when the magic number is not present.
3473
3474 PCRE_ERROR_UNKNOWN_OPCODE (-5)
3475
3476 While running the pattern match, an unknown item was encountered in the
3477 compiled pattern. This error could be caused by a bug in PCRE or by
3478 overwriting of the compiled pattern.
3479
3480 PCRE_ERROR_NOMEMORY (-6)
3481
3482 If a pattern contains back references, but the ovector that is passed
3483 to pcre_exec() is not big enough to remember the referenced substrings,
3484 PCRE gets a block of memory at the start of matching to use for this
3485 purpose. If the call via pcre_malloc() fails, this error is given. The
3486 memory is automatically freed at the end of matching.
3487
3488 This error is also given if pcre_stack_malloc() fails in pcre_exec().
3489 This can happen only when PCRE has been compiled with --disable-stack-
3490 for-recursion.
3491
3492 PCRE_ERROR_NOSUBSTRING (-7)
3493
3494 This error is used by the pcre_copy_substring(), pcre_get_substring(),
3495 and pcre_get_substring_list() functions (see below). It is never
3496 returned by pcre_exec().
3497
3498 PCRE_ERROR_MATCHLIMIT (-8)
3499
3500 The backtracking limit, as specified by the match_limit field in a
3501 pcre_extra structure (or defaulted) was reached. See the description
3502 above.
3503
3504 PCRE_ERROR_CALLOUT (-9)
3505
3506 This error is never generated by pcre_exec() itself. It is provided for
3507 use by callout functions that want to yield a distinctive error code.
3508 See the pcrecallout documentation for details.
3509
3510 PCRE_ERROR_BADUTF8 (-10)
3511
3512 A string that contains an invalid UTF-8 byte sequence was passed as a
3513 subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
3514 the output vector (ovecsize) is at least 2, the byte offset to the
3515 start of the invalid UTF-8 character is placed in the first ele-
3516 ment, and a reason code is placed in the second element. The reason
3517 codes are listed in the following section. For backward compatibility,
3518 if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
3519 acter at the end of the subject (reason codes 1 to 5),
3520 PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
3521
3522 PCRE_ERROR_BADUTF8_OFFSET (-11)
3523
3524 The UTF-8 byte sequence that was passed as a subject was checked and
3525 found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
3526 value of startoffset did not point to the beginning of a UTF-8 charac-
3527 ter or the end of the subject.
3528
3529 PCRE_ERROR_PARTIAL (-12)
3530
3531 The subject string did not match, but it did match partially. See the
3532 pcrepartial documentation for details of partial matching.
3533
3534 PCRE_ERROR_BADPARTIAL (-13)
3535
3536 This code is no longer in use. It was formerly returned when the
3537 PCRE_PARTIAL option was used with a compiled pattern containing items
3538 that were not supported for partial matching. From release 8.00
3539 onwards, there are no restrictions on partial matching.
3540
3541 PCRE_ERROR_INTERNAL (-14)
3542
3543 An unexpected internal error has occurred. This error could be caused
3544 by a bug in PCRE or by overwriting of the compiled pattern.
3545
3546 PCRE_ERROR_BADCOUNT (-15)
3547
3548 This error is given if the value of the ovecsize argument is negative.
3549
3550 PCRE_ERROR_RECURSIONLIMIT (-21)
3551
3552 The internal recursion limit, as specified by the match_limit_recursion
3553 field in a pcre_extra structure (or defaulted) was reached. See the
3554 description above.
3555
3556 PCRE_ERROR_BADNEWLINE (-23)
3557
3558 An invalid combination of PCRE_NEWLINE_xxx options was given.
3559
3560 PCRE_ERROR_BADOFFSET (-24)
3561
3562 The value of startoffset was negative or greater than the length of the
3563 subject, that is, the value in length.
3564
3565 PCRE_ERROR_SHORTUTF8 (-25)
3566
3567 This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
3568 string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
3569 option is set. Information about the failure is returned as for
3570 PCRE_ERROR_BADUTF8. That information is in fact sufficient to detect
3571 this case, but the special error code for PCRE_PARTIAL_HARD predates
3572 the implementation of returned information; it is retained for back-
3573 wards compatibility.
3574
3575 PCRE_ERROR_RECURSELOOP (-26)
3576
3577 This error is returned when pcre_exec() detects a recursion loop within
3578 the pattern. Specifically, it means that either the whole pattern or a
3579 subpattern has been called recursively for the second time at the same
3580 position in the subject string. Some simple patterns that might do this
3581 are detected and faulted at compile time, but more complicated cases,
3582 in particular mutual recursions between two different subpatterns, can-
3583 not be detected until run time.
3584
3585 PCRE_ERROR_JIT_STACKLIMIT (-27)
3586
3587 This error is returned when a pattern that was successfully studied
3588 using a JIT compile option is being matched, but the memory available
3589 for the just-in-time processing stack is not large enough. See the
3590 pcrejit documentation for more details.
3591
3592 PCRE_ERROR_BADMODE (-28)
3593
3594 This error is given if a pattern that was compiled by the 8-bit library
3595 is passed to a 16-bit or 32-bit library function, or vice versa.
3596
3597 PCRE_ERROR_BADENDIANNESS (-29)
3598
3599 This error is given if a pattern that was compiled and saved is
3600 reloaded on a host with different endianness. The utility function
3601 pcre_pattern_to_host_byte_order() can be used to convert such a pattern
3602 so that it runs on the new host.
3603
3604 PCRE_ERROR_JIT_BADOPTION (-31)
3605
3606 This error is returned when a pattern that was successfully studied
3607 using a JIT compile option is being matched, but the matching mode
3608 (partial or complete match) does not correspond to any JIT compilation
3609 mode. When the JIT fast path function is used, this error may be also
3610 given for invalid options. See the pcrejit documentation for more
3611 details.
3612
3613 PCRE_ERROR_BADLENGTH (-32)
3614
3615 This error is given if pcre_exec() is called with a negative value for
3616 the length argument.
3617
3618 Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
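
Callers often need only a coarse classification of these return
values. The following sketch shows one possible grouping. The enum and
the function classify_exec_result() are hypothetical and not part of
PCRE, and the choice of groups is an assumption about what a typical
application cares about. The error values are repeated from the list
above only so that the fragment compiles without pcre.h.

```c
#include <assert.h>

/* Values copied from the list above so that the sketch compiles on its
   own; real code would include <pcre.h> instead. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH         (-1)
#define PCRE_ERROR_NOMEMORY        (-6)
#define PCRE_ERROR_MATCHLIMIT      (-8)
#define PCRE_ERROR_BADUTF8         (-10)
#define PCRE_ERROR_BADUTF8_OFFSET  (-11)
#define PCRE_ERROR_PARTIAL         (-12)
#define PCRE_ERROR_RECURSIONLIMIT  (-21)
#define PCRE_ERROR_BADOFFSET       (-24)
#define PCRE_ERROR_SHORTUTF8       (-25)
#define PCRE_ERROR_JIT_STACKLIMIT  (-27)
#define PCRE_ERROR_BADLENGTH       (-32)
#endif

/* Hypothetical application-level categories, not defined by PCRE. */
enum exec_outcome {
    OUTCOME_MATCH,              /* matched, all captures returned       */
    OUTCOME_MATCH_OVECTOR_FULL, /* matched, but ovector was too small   */
    OUTCOME_NOMATCH,
    OUTCOME_PARTIAL,
    OUTCOME_RESOURCE,           /* a memory, stack, or match limit hit  */
    OUTCOME_BAD_INPUT,          /* invalid subject, offset, or length   */
    OUTCOME_OTHER_ERROR         /* everything else, e.g. API misuse     */
};

enum exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)
        return OUTCOME_MATCH;
    if (rc == 0)
        return OUTCOME_MATCH_OVECTOR_FULL;

    assert(rc < 0);   /* only error codes remain */
    switch (rc) {
    case PCRE_ERROR_NOMATCH:
        return OUTCOME_NOMATCH;
    case PCRE_ERROR_PARTIAL:
        return OUTCOME_PARTIAL;
    case PCRE_ERROR_NOMEMORY:
    case PCRE_ERROR_MATCHLIMIT:
    case PCRE_ERROR_RECURSIONLIMIT:
    case PCRE_ERROR_JIT_STACKLIMIT:
        return OUTCOME_RESOURCE;
    case PCRE_ERROR_BADUTF8:
    case PCRE_ERROR_BADUTF8_OFFSET:
    case PCRE_ERROR_SHORTUTF8:
    case PCRE_ERROR_BADOFFSET:
    case PCRE_ERROR_BADLENGTH:
        return OUTCOME_BAD_INPUT;
    default:
        return OUTCOME_OTHER_ERROR;
    }
}
```

A caller would pass the value returned by pcre_exec() and, for
example, retry with a larger ovector on OUTCOME_MATCH_OVECTOR_FULL.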
3619
3620 Reason codes for invalid UTF-8 strings
3621
3622 This section applies only to the 8-bit library. The corresponding
3623 information for the 16-bit and 32-bit libraries is given in the pcre16
3624 and pcre32 pages.
3625
3626 When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
3627 UTF8, and the size of the output vector (ovecsize) is at least 2, the
3628 offset of the start of the invalid UTF-8 character is placed in the
3629 first output vector element (ovector[0]) and a reason code is placed in
3630 the second element (ovector[1]). The reason codes are given names in
3631 the pcre.h header file:
3632
3633 PCRE_UTF8_ERR1
3634 PCRE_UTF8_ERR2
3635 PCRE_UTF8_ERR3
3636 PCRE_UTF8_ERR4
3637 PCRE_UTF8_ERR5
3638
3639 The string ends with a truncated UTF-8 character; the code specifies
3640 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
3641 characters to be no longer than 4 bytes, the encoding scheme (origi-
3642 nally defined by RFC 2279) allows for up to 6 bytes, and this is
3643 checked first; hence the possibility of 4 or 5 missing bytes.
3644
3645 PCRE_UTF8_ERR6
3646 PCRE_UTF8_ERR7
3647 PCRE_UTF8_ERR8
3648 PCRE_UTF8_ERR9
3649 PCRE_UTF8_ERR10
3650
3651 The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
3652 the character do not have the binary value 0b10 (that is, either the
3653 most significant bit is 0, or the next bit is 1).
3654
3655 PCRE_UTF8_ERR11
3656 PCRE_UTF8_ERR12
3657
3658 A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
3659 long; these code points are excluded by RFC 3629.
3660
3661 PCRE_UTF8_ERR13
3662
3663 A 4-byte character has a value greater than 0x10ffff; these code points
3664 are excluded by RFC 3629.
3665
3666 PCRE_UTF8_ERR14
3667
3668 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
3669 range of code points is reserved by RFC 3629 for use with UTF-16, and
3670 so are excluded from UTF-8.
3671
3672 PCRE_UTF8_ERR15
3673 PCRE_UTF8_ERR16
3674 PCRE_UTF8_ERR17
3675 PCRE_UTF8_ERR18
3676 PCRE_UTF8_ERR19
3677
3678 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
3679 for a value that can be represented by fewer bytes, which is invalid.
3680 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
3681 rect coding uses just one byte.
3682
3683 PCRE_UTF8_ERR20
3684
3685 The two most significant bits of the first byte of a character have the
3686 binary value 0b10 (that is, the most significant bit is 1 and the sec-
3687 ond is 0). Such a byte can only validly occur as the second or subse-
3688 quent byte of a multi-byte character.
3689
3690 PCRE_UTF8_ERR21
3691
3692 The first byte of a character has the value 0xfe or 0xff. These values
3693 can never occur in a valid UTF-8 string.
3694
3695 PCRE_UTF8_ERR22
3696
3697 This error code was formerly used when the presence of a so-called
3698 "non-character" caused an error. Unicode corrigendum #9 makes it clear
3699 that such characters should not cause a string to be rejected, and so
3700 this code is no longer in use and is never returned.
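
The ranges above can be folded into a small decoding helper for
diagnostics. This is only a sketch: the enum and utf8_fault_from_rea-
son() are hypothetical names, and the code assumes that each
PCRE_UTF8_ERRn constant has the numeric value n, which is consistent
with the reference to "reason codes 1 to 5" earlier in this document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping of the reason codes listed above. */
enum utf8_fault {
    UTF8_FAULT_TRUNCATED,          /* reasons 1 to 5   */
    UTF8_FAULT_BAD_CONTINUATION,   /* reasons 6 to 10  */
    UTF8_FAULT_TOO_LONG,           /* reasons 11, 12   */
    UTF8_FAULT_OUT_OF_RANGE,       /* reason 13        */
    UTF8_FAULT_SURROGATE,          /* reason 14        */
    UTF8_FAULT_OVERLONG,           /* reasons 15 to 19 */
    UTF8_FAULT_STRAY_CONTINUATION, /* reason 20        */
    UTF8_FAULT_INVALID_LEAD,       /* reason 21        */
    UTF8_FAULT_UNUSED,             /* reason 22        */
    UTF8_FAULT_UNKNOWN
};

/* Maps a reason code (the value found in ovector[1]) to a category.
   *detail receives the number of missing bytes for a truncated
   character, the position (2 to 6) of the faulty byte for a bad
   continuation byte, the encoded length (2 to 6) for an overlong
   character, and 0 otherwise. */
enum utf8_fault utf8_fault_from_reason(int reason, int *detail)
{
    assert(detail != NULL);
    *detail = 0;
    if (reason >= 1 && reason <= 5) {
        *detail = reason;
        return UTF8_FAULT_TRUNCATED;
    }
    if (reason >= 6 && reason <= 10) {
        *detail = reason - 4;          /* reason 6 is the 2nd byte */
        return UTF8_FAULT_BAD_CONTINUATION;
    }
    if (reason == 11 || reason == 12)
        return UTF8_FAULT_TOO_LONG;
    if (reason == 13)
        return UTF8_FAULT_OUT_OF_RANGE;
    if (reason == 14)
        return UTF8_FAULT_SURROGATE;
    if (reason >= 15 && reason <= 19) {
        *detail = reason - 13;         /* reason 15 is a 2-byte form */
        return UTF8_FAULT_OVERLONG;
    }
    if (reason == 20)
        return UTF8_FAULT_STRAY_CONTINUATION;
    if (reason == 21)
        return UTF8_FAULT_INVALID_LEAD;
    if (reason == 22)
        return UTF8_FAULT_UNUSED;
    return UTF8_FAULT_UNKNOWN;
}
```

After a PCRE_ERROR_BADUTF8 return, a caller would pass ovector[1] as
the reason and report ovector[0] as the offset of the faulty character.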
3701
3702
3703 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3704
3705 int pcre_copy_substring(const char *subject, int *ovector,
3706 int stringcount, int stringnumber, char *buffer,
3707 int buffersize);
3708
3709 int pcre_get_substring(const char *subject, int *ovector,
3710 int stringcount, int stringnumber,
3711 const char **stringptr);
3712
3713 int pcre_get_substring_list(const char *subject,
3714 int *ovector, int stringcount, const char ***listptr);
3715
3716 Captured substrings can be accessed directly by using the offsets
3717 returned by pcre_exec() in ovector. For convenience, the functions
3718 pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
3719 string_list() are provided for extracting captured substrings as new,
3720 separate, zero-terminated strings. These functions identify substrings
3721 by number. The next section describes functions for extracting named
3722 substrings.
3723
3724 A substring that contains a binary zero is correctly extracted and has
3725 a further zero added on the end, but the result is not, of course, a C
3726 string. However, you can process such a string by referring to the
3727 length that is returned by pcre_copy_substring() and pcre_get_sub-
3728 string(). Unfortunately, the interface to pcre_get_substring_list() is
3729 not adequate for handling strings containing binary zeros, because the
3730 end of the final string is not independently indicated.
3731
3732 The first three arguments are the same for all three of these func-
3733 tions: subject is the subject string that has just been successfully
3734 matched, ovector is a pointer to the vector of integer offsets that was
3735 passed to pcre_exec(), and stringcount is the number of substrings that
3736 were captured by the match, including the substring that matched the
3737 entire regular expression. This is the value returned by pcre_exec() if
3738 it is greater than zero. If pcre_exec() returned zero, indicating that
3739 it ran out of space in ovector, the value passed as stringcount should
3740 be the number of elements in the vector divided by three.
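
The rule for choosing stringcount can be written as a one-line helper.
The function name stringcount_for() is hypothetical; the logic simply
restates the paragraph above, and returns 0 for a negative (error)
return so that no substrings are requested in that case.

```c
#include <assert.h>

/* Derives the stringcount argument for the substring functions from the
   value returned by pcre_exec() (rc) and the number of elements in
   ovector (ovecsize). */
int stringcount_for(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;              /* normal case: use the return value  */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small for all data */
    return 0;                   /* error return: nothing was captured */
}
```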
3741
3742 The functions pcre_copy_substring() and pcre_get_substring() extract a
3743 single substring, whose number is given as stringnumber. A value of
3744 zero extracts the substring that matched the entire pattern, whereas
3745 higher values extract the captured substrings. For pcre_copy_sub-
3746 string(), the string is placed in buffer, whose length is given by
3747 buffersize, while for pcre_get_substring() a new block of memory is
3748 obtained via pcre_malloc, and its address is returned via stringptr.
3749 The yield of the function is the length of the string, not including
3750 the terminating zero, or one of these error codes:
3751
3752 PCRE_ERROR_NOMEMORY (-6)
3753
3754 The buffer was too small for pcre_copy_substring(), or the attempt to
3755 get memory failed for pcre_get_substring().
3756
3757 PCRE_ERROR_NOSUBSTRING (-7)
3758
3759 There is no substring whose number is stringnumber.
3760
3761 The pcre_get_substring_list() function extracts all available sub-
3762 strings and builds a list of pointers to them. All this is done in a
3763 single block of memory that is obtained via pcre_malloc. The address of
3764 the memory block is returned via listptr, which is also the start of
3765 the list of string pointers. The end of the list is marked by a NULL
3766 pointer. The yield of the function is zero if all went well, or the
3767 error code
3768
3769 PCRE_ERROR_NOMEMORY (-6)
3770
3771 if the attempt to get the memory block failed.
3772
3773 When any of these functions encounter a substring that is unset, which
3774 can happen when capturing subpattern number n+1 matches some part of
3775 the subject, but subpattern n has not been used at all, they return an
3776 empty string. This can be distinguished from a genuine zero-length sub-
3777 string by inspecting the appropriate offset in ovector, which is nega-
3778 tive for unset substrings.
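
Where the difference between an unset and an empty substring matters,
the offsets can be used directly instead of the convenience functions.
The sketch below is one way to do that without copying; group_span()
is a hypothetical helper, not a PCRE function, and it reports an unset
substring by returning NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Locates captured substring n inside subject without copying it.
   On success, returns a pointer into subject and stores the length in
   *length (the result is NOT zero-terminated). Returns NULL if n is out
   of range or if the substring is unset; an empty substring yields a
   non-NULL pointer with *length equal to 0. */
const char *group_span(const char *subject, const int *ovector,
                       int stringcount, int n, int *length)
{
    assert(subject != NULL && ovector != NULL && length != NULL);
    *length = 0;
    if (n < 0 || n >= stringcount)
        return NULL;                    /* no such substring */
    if (ovector[2 * n] < 0)
        return NULL;                    /* unset: offsets are -1 */
    *length = ovector[2 * n + 1] - ovector[2 * n];
    return subject + ovector[2 * n];
}
```

The result can be printed with a "%.*s" format, passing the length and
the pointer, which also works for substrings that contain binary zeros
when written with fwrite() instead.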
3779
3780 The two convenience functions pcre_free_substring() and pcre_free_sub-
3781 string_list() can be used to free the memory returned by a previous
3782 call of pcre_get_substring() or pcre_get_substring_list(), respec-
3783 tively. They do nothing more than call the function pointed to by
3784 pcre_free, which of course could be called directly from a C program.
3785 However, PCRE is used in some situations where it is linked via a spe-
3786 cial interface to another programming language that cannot use
3787 pcre_free directly; it is for these cases that the functions are pro-
3788 vided.
3789
3790
3791 EXTRACTING CAPTURED SUBSTRINGS BY NAME
3792
3793 int pcre_get_stringnumber(const pcre *code,
3794 const char *name);
3795
3796 int pcre_copy_named_substring(const pcre *code,
3797 const char *subject, int *ovector,
3798 int stringcount, const char *stringname,
3799 char *buffer, int buffersize);
3800
3801 int pcre_get_named_substring(const pcre *code,
3802 const char *subject, int *ovector,
3803 int stringcount, const char *stringname,
3804 const char **stringptr);
3805
3806 To extract a substring by name, you first have to find the associated
3807 number. For example, for this pattern
3808
3809 (a+)b(?<xxx>\d+)...
3810
3811 the number of the subpattern called "xxx" is 2. If the name is known to
3812 be unique (PCRE_DUPNAMES was not set), you can find the number from the
3813 name by calling pcre_get_stringnumber(). The first argument is the com-
3814 piled pattern, and the second is the name. The yield of the function is
3815 the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
3816 subpattern of that name.
3817
3818 Given the number, you can extract the substring directly, or use one of
3819 the functions described in the previous section. For convenience, there
3820 are also two functions that do the whole job.
3821
3822 Most of the arguments of pcre_copy_named_substring() and
3823 pcre_get_named_substring() are the same as those for the similarly
3824 named functions that extract by number. As these are described in the
3825 previous section, they are not re-described here. There are just two
3826 differences:
3827
3828 First, instead of a substring number, a substring name is given. Sec-
3829 ond, there is an extra argument, given at the start, which is a pointer
3830 to the compiled pattern. This is needed in order to gain access to the
3831 name-to-number translation table.
3832
3833 These functions call pcre_get_stringnumber(), and if it succeeds, they
3834 then call pcre_copy_substring() or pcre_get_substring(), as appropri-
3835 ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
3836 behaviour may not be what you want (see the next section).
3837
3838 Warning: If the pattern uses the (?| feature to set up multiple subpat-
3839 terns with the same number, as described in the section on duplicate
3840 subpattern numbers in the pcrepattern page, you cannot use names to
3841 distinguish the different subpatterns, because names are not included
3842 in the compiled code. The matching process uses only numbers. For this
3843 reason, the use of different names for subpatterns of the same number
3844 causes an error at compile time.
3845
3846
3847 DUPLICATE SUBPATTERN NAMES
3848
3849 int pcre_get_stringtable_entries(const pcre *code,
3850 const char *name, char **first, char **last);
3851
3852 When a pattern is compiled with the PCRE_DUPNAMES option, names for
3853 subpatterns are not required to be unique. (Duplicate names are always
3854 allowed for subpatterns with the same number, created by using the (?|
3855 feature. Indeed, if such subpatterns are named, they are required to
3856 use the same names.)
3857
3858 Normally, patterns with duplicate names are such that in any one match,
3859 only one of the named subpatterns participates. An example is shown in
3860 the pcrepattern documentation.
3861
3862 When duplicates are present, pcre_copy_named_substring() and
3863 pcre_get_named_substring() return the first substring corresponding to
3864 the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
3865 (-7) is returned; no data is returned. The pcre_get_stringnumber()
3866 function returns one of the numbers that are associated with the name,
3867 but it is not defined which it is.
3868
3869 If you want to get full details of all captured substrings for a given
3870 name, you must use the pcre_get_stringtable_entries() function. The
3871 first argument is the compiled pattern, and the second is the name. The
3872 third and fourth are pointers to variables which are updated by the
3873 function. After it has run, they point to the first and last entries in
3874 the name-to-number table for the given name. The function itself
3875 returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
3876 there are none. The format of the table is described above in the sec-
3877 tion entitled Information about a pattern. Given all the relevant
3878 entries for the name, you can extract each of their numbers, and
3879 hence the captured data, if any.
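
The walk over the relevant entries might look like the sketch below.
It assumes the 8-bit library's table layout, in which each entry
starts with the subpattern number in two bytes, most significant
first, followed by the zero-terminated name. Both function names are
hypothetical, and entry_size stands for the value returned by
pcre_get_stringtable_entries().

```c
#include <assert.h>
#include <stddef.h>

/* Decodes the subpattern number stored in the first two bytes of a
   name-table entry (most significant byte first). */
int entry_group_number(const char *entry)
{
    const unsigned char *p = (const unsigned char *)entry;
    assert(entry != NULL);
    return (p[0] << 8) | p[1];
}

/* Walks the entries from first to last inclusive, entry_size bytes
   apart, and returns the number of the first subpattern whose offsets
   in ovector are set, or -1 if none of them is set. */
int first_set_group(const char *first, const char *last, int entry_size,
                    const int *ovector, int stringcount)
{
    const char *entry;
    assert(first != NULL && last != NULL && ovector != NULL);
    assert(entry_size > 2);
    for (entry = first; entry <= last; entry += entry_size) {
        int n = entry_group_number(entry);
        if (n < stringcount && ovector[2 * n] >= 0)
            return n;
    }
    return -1;
}
```

The same loop can collect every set subpattern instead of stopping at
the first one, if all captured data for the name is wanted.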
3880
3881
3882 FINDING ALL POSSIBLE MATCHES
3883
3884 The traditional matching function uses a similar algorithm to Perl,
3885 which stops when it finds the first match, starting at a given point in
3886 the subject. If you want to find all possible matches, or the longest
3887 possible match, consider using the alternative matching function (see
3888 below) instead. If you cannot use the alternative function, but still
3889 need to find all possible matches, you can kludge it up by making use
3890 of the callout facility, which is described in the pcrecallout documen-
3891 tation.
3892
3893 What you have to do is to insert a callout right at the end of the pat-
3894 tern. When your callout function is called, extract and save the cur-
3895 rent matched substring. Then return 1, which forces pcre_exec() to
3896 backtrack and try other alternatives. Ultimately, when it runs out of
3897 matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
3898
3899
3900 OBTAINING AN ESTIMATE OF STACK USAGE
3901
3902 Matching certain patterns using pcre_exec() can use a lot of process
3903 stack, which in certain environments can be rather limited in size.
3904 Some users find it helpful to have an estimate of the amount of stack
3905 that is used by pcre_exec(), to help them set recursion limits, as
3906 described in the pcrestack documentation. The estimate that is output
3907 by pcretest when called with the -m and -C options is obtained by call-
3908 ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
3909 first five arguments.
3910
3911 Normally, if its first argument is NULL, pcre_exec() immediately
3912 returns the negative error code PCRE_ERROR_NULL, but with this special
3913 combination of arguments, it returns instead a negative number whose
3914 absolute value is the approximate stack frame size in bytes. (A nega-
3915 tive number is used so that it is clear that no match has happened.)
3916 The value is approximate because in some cases, recursive calls to
3917 pcre_exec() occur when there are one or two additional variables on the
3918 stack.
3919
3920 If PCRE has been compiled to use the heap instead of the stack for
3921 recursion, the value returned is the size of each block that is
3922 obtained from the heap.
3923
3924
3925 MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3926
3927 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
3928 const char *subject, int length, int startoffset,
3929 int options, int *ovector, int ovecsize,
3930 int *workspace, int wscount);
3931
3932 The function pcre_dfa_exec() is called to match a subject string
3933 against a compiled pattern, using a matching algorithm that scans the
3934 subject string just once, and does not backtrack. This has different
3935 characteristics to the normal algorithm, and is not compatible with
3936 Perl. Some of the features of PCRE patterns are not supported. Never-
3937 theless, there are times when this kind of matching can be useful. For
3938 a discussion of the two matching algorithms, and a list of features
3939 that pcre_dfa_exec() does not support, see the pcrematching documenta-
3940 tion.
3941
3942 The arguments for the pcre_dfa_exec() function are the same as for
3943 pcre_exec(), plus two extras. The ovector argument is used in a differ-
3944 ent way, and this is described below. The other common arguments are
3945 used in the same way as for pcre_exec(), so their description is not
3946 repeated here.
3947
3948 The two additional arguments provide workspace for the function. The
3949 workspace vector should contain at least 20 elements. It is used for
3950 keeping track of multiple paths through the pattern tree. More
3951 workspace will be needed for patterns and subjects where there are a
3952 lot of potential matches.
3953
3954 Here is an example of a simple call to pcre_dfa_exec():
3955
3956 int rc;
3957 int ovector[10];
3958 int wspace[20];
3959 rc = pcre_dfa_exec(
3960 re, /* result of pcre_compile() */
3961 NULL, /* we didn't study the pattern */
3962 "some string", /* the subject string */
3963 11, /* the length of the subject string */
3964 0, /* start at offset 0 in the subject */
3965 0, /* default options */
3966 ovector, /* vector of integers for substring information */
3967 10, /* number of elements (NOT size in bytes) */
3968 wspace, /* working space vector */
3969 20); /* number of elements (NOT size in bytes) */
3970
3971 Option bits for pcre_dfa_exec()
3972
3973 The unused bits of the options argument for pcre_dfa_exec() must be
3974 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
3975 LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
3976 PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
3977 PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
3978 TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
3979 four of these are exactly the same as for pcre_exec(), so their
3980 description is not repeated here.
3981
3982 PCRE_PARTIAL_HARD
3983 PCRE_PARTIAL_SOFT
3984
3985 These have the same general effect as they do for pcre_exec(), but the
3986 details are slightly different. When PCRE_PARTIAL_HARD is set for
3987 pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
3988 ject is reached and there is still at least one matching possibility
3989 that requires additional characters. This happens even if some complete
3990 matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
3991 code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
3992 of the subject is reached, there have been no complete matches, but
3993 there is still at least one matching possibility. The portion of the
3994 string that was inspected when the longest partial match was found is
3995 set as the first matching string in both cases. There is a more
3996 detailed discussion of partial and multi-segment matching, with exam-
3997 ples, in the pcrepartial documentation.
3998
3999 PCRE_DFA_SHORTEST
4000
4001 Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
4002 stop as soon as it has found one match. Because of the way the alterna-
4003 tive algorithm works, this is necessarily the shortest possible match
4004 at the first possible matching point in the subject string.
4005
4006 PCRE_DFA_RESTART
4007
4008 When pcre_dfa_exec() returns a partial match, it is possible to call it
4009 again, with additional subject characters, and have it continue with
4010 the same match. The PCRE_DFA_RESTART option requests this action; when
4011 it is set, the workspace and wscount arguments must reference the same
4012 vector as before because data about the match so far is left in them
4013 after a partial match. There is more discussion of this facility in the
4014 pcrepartial documentation.
4015
4016 Successful returns from pcre_dfa_exec()
4017
4018 When pcre_dfa_exec() succeeds, it may have matched more than one sub-
4019 string in the subject. Note, however, that all the matches from one run
4020 of the function start at the same point in the subject. The shorter
4021 matches are all initial substrings of the longer matches. For example,
4022 if the pattern
4023
4024 <.*>
4025
4026 is matched against the string
4027
4028 This is <something> <something else> <something further> no more
4029
4030 the three matched strings are
4031
4032 <something>
4033 <something> <something else>
4034 <something> <something else> <something further>
4035
4036 On success, the yield of the function is a number greater than zero,
4037 which is the number of matched substrings. The substrings themselves
4038 are returned in ovector. Each string uses two elements; the first is
4039 the offset to the start, and the second is the offset to the end. In
4040 fact, all the strings have the same start offset. (Space could have
4041 been saved by giving this only once, but it was decided to retain some
4042 compatibility with the way pcre_exec() returns data, even though the
4043 meaning of the strings is different.)
4044
4045 The strings are returned in reverse order of length; that is, the long-
4046 est matching string is given first. If there were too many matches to
4047 fit into ovector, the yield of the function is zero, and the vector is
4048 filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
4049 can use the entire ovector for returning matched strings.
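
Reading the results back can be sketched as follows. The two helpers
are hypothetical names, not PCRE functions. The treatment of a zero
return (ovecsize/2 pairs available) is an assumption drawn from the
statement above that the entire vector is used for matched strings.

```c
#include <assert.h>
#include <stddef.h>

/* Number of (start, end) pairs available in ovector after a call to
   pcre_dfa_exec() that returned rc. */
int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;             /* all matches fitted              */
    if (rc == 0)
        return ovecsize / 2;   /* vector holds the longest ones   */
    return 0;                  /* error or no match               */
}

/* Length of the i-th match (0 is the longest), or -1 if i is out of
   range. count is the value returned by dfa_match_count(). */
int dfa_match_length(const int *ovector, int count, int i)
{
    assert(ovector != NULL);
    if (i < 0 || i >= count)
        return -1;
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

For the <.*> example above, the three pairs are 8,56, 8,36, and 8,19,
giving lengths of 48, 28, and 11 characters, longest first.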
4050
4051 Error returns from pcre_dfa_exec()
4052
4053 The pcre_dfa_exec() function returns a negative number when it fails.
4054 Many of the errors are the same as for pcre_exec(), and these are
4055 described above. There are in addition the following errors that are
4056 specific to pcre_dfa_exec():
4057
4058 PCRE_ERROR_DFA_UITEM (-16)
4059
4060 This return is given if pcre_dfa_exec() encounters an item in the pat-
4061 tern that it does not support, for instance, the use of \C or a back
4062 reference.
4063
4064 PCRE_ERROR_DFA_UCOND (-17)
4065
4066 This return is given if pcre_dfa_exec() encounters a condition item
4067 that uses a back reference for the condition, or a test for recursion
4068 in a specific group. These are not supported.
4069
4070 PCRE_ERROR_DFA_UMLIMIT (-18)
4071
4072 This return is given if pcre_dfa_exec() is called with an extra block
4073 that contains a setting of the match_limit or match_limit_recursion
4074 fields. This is not supported (these fields are meaningless for DFA
4075 matching).
4076
4077 PCRE_ERROR_DFA_WSSIZE (-19)
4078
4079 This return is given if pcre_dfa_exec() runs out of space in the
4080 workspace vector.
4081
4082 PCRE_ERROR_DFA_RECURSE (-20)
4083
4084 When a recursive subpattern is processed, the matching function calls
4085 itself recursively, using private vectors for ovector and workspace.
4086 This error is given if the output vector is not large enough. This
4087 should be extremely rare, as a vector of size 1000 is used.
4088
4089 PCRE_ERROR_DFA_BADRESTART (-30)
4090
4091 When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
4092 plausibility checks are made on the contents of the workspace, which
4093 should contain data about the previous partial match. If any of these
4094 checks fail, this error is given.
4095
4096
4097 SEE ALSO
4098
4099 pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3),
4100 pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
4101 sample(3), pcrestack(3).
4102
4103
4104 AUTHOR
4105
4106 Philip Hazel
4107 University Computing Service
4108 Cambridge CB2 3QH, England.
4109
4110
4111 REVISION
4112
4113 Last updated: 27 February 2013
4114 Copyright (c) 1997-2013 University of Cambridge.
4115 ------------------------------------------------------------------------------
4116
4117
4118 PCRECALLOUT(3) Library Functions Manual PCRECALLOUT(3)
4119
4120
4121
4122 NAME
4123 PCRE - Perl-compatible regular expressions
4124
4125 SYNOPSIS
4126
4127 #include <pcre.h>
4128
4129 int (*pcre_callout)(pcre_callout_block *);
4130
4131 int (*pcre16_callout)(pcre16_callout_block *);
4132
4133 int (*pcre32_callout)(pcre32_callout_block *);
4134
4135
4136 DESCRIPTION
4137
4138 PCRE provides a feature called "callout", which is a means of temporar-
4139 ily passing control to the caller of PCRE in the middle of pattern
4140 matching. The caller of PCRE provides an external function by putting
4141 its entry point in the global variable pcre_callout (pcre16_callout for
4142 the 16-bit library, pcre32_callout for the 32-bit library). By default,
4143 this variable contains NULL, which disables all calling out.
4144
4145 Within a regular expression, (?C) indicates the points at which the
4146 external function is to be called. Different callout points can be
4147 identified by putting a number less than 256 after the letter C. The
4148 default value is zero. For example, this pattern has two callout
4149 points:
4150
4151 (?C1)abc(?C2)def
4152
4153 If the PCRE_AUTO_CALLOUT option bit is set when a pattern is compiled,
4154 PCRE automatically inserts callouts, all with number 255, before each
4155 item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
4156 pattern
4157
4158 A(\d{2}|--)
4159
4160 it is processed as if it were
4161
4162 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4163
4164 Notice that there is a callout before and after each parenthesis and
4165 alternation bar. If the pattern contains a conditional group whose con-
4166 dition is an assertion, an automatic callout is inserted immediately
4167 before the condition. Such a callout may also be inserted explicitly,
4168 for example:
4169
4170 (?(?C9)(?=a)ab|de)
4171
4172 This applies only to assertion conditions (because they are themselves
4173 independent groups).
4174
4175 Automatic callouts can be used for tracking the progress of pattern
4176 matching. The pcretest command has an option that sets automatic call-
4177 outs; when it is used, the output indicates how the pattern is matched.
4178 This is useful information when you are trying to optimize the perfor-
4179 mance of a particular pattern.
4180
4181
4182 MISSING CALLOUTS
4183
4184 You should be aware that, because of optimizations in the way PCRE
4185 matches patterns by default, callouts sometimes do not happen. For
4186 example, if the pattern is
4187
4188 ab(?C4)cd
4189
4190 PCRE knows that any matching string must contain the letter "d". If the
4191 subject string is "abyz", the lack of "d" means that matching doesn't
4192 ever start, and the callout is never reached. However, with "abyd",
4193 though the result is still no match, the callout is obeyed.
4194
4195 If the pattern is studied, PCRE knows the minimum length of a matching
4196 string, and will immediately give a "no match" return without actually
4197 running a match if the subject is not long enough, or, for unanchored
4198 patterns, if it has been scanned far enough.
4199
4200 You can disable these optimizations by passing the PCRE_NO_START_OPTI-
4201 MIZE option to the matching function, or by starting the pattern with
4202 (*NO_START_OPT). This slows down the matching process, but does ensure
4203 that callouts such as the example above are obeyed.
4204
4205
4206 THE CALLOUT INTERFACE
4207
4208 During matching, when PCRE reaches a callout point, the external func-
4209 tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
4210 set). This applies to both normal and DFA matching. The only argument
4211 to the callout function is a pointer to a pcre_callout or
4212 pcre[16|32]_callout block. These structures contain the following
4213 fields:
4214
4215 int version;
4216 int callout_number;
4217 int *offset_vector;
4218 const char *subject; (8-bit version)
4219 PCRE_SPTR16 subject; (16-bit version)
4220 PCRE_SPTR32 subject; (32-bit version)
4221 int subject_length;
4222 int start_match;
4223 int current_position;
4224 int capture_top;
4225 int capture_last;
4226 void *callout_data;
4227 int pattern_position;
4228 int next_item_length;
4229 const unsigned char *mark; (8-bit version)
4230 const PCRE_UCHAR16 *mark; (16-bit version)
4231 const PCRE_UCHAR32 *mark; (32-bit version)
4232
4233 The version field is an integer containing the version number of the
4234 block format. The initial version was 0; the current version is 2. The
4235 version number will change again in future if additional fields are
4236 added, but the intention is never to remove any of the existing fields.
4237
4238 The callout_number field contains the number of the callout, as com-
4239 piled into the pattern (that is, the number after ?C for manual call-
4240 outs, and 255 for automatically generated callouts).
4241
4242 The offset_vector field is a pointer to the vector of offsets that was
4243 passed by the caller to the matching function. When pcre_exec() or
4244 pcre[16|32]_exec() is used, the contents can be inspected, in order to
4245 extract substrings that have been matched so far, in the same way as
4246 for extracting substrings after a match has completed. For the DFA
4247 matching functions, this field is not useful.
4248
4249 The subject and subject_length fields contain copies of the values that
4250 were passed to the matching function.
4251
4252 The start_match field normally contains the offset within the subject
4253 at which the current match attempt started. However, if the escape
4254 sequence \K has been encountered, this value is changed to reflect the
4255 modified starting point. If the pattern is not anchored, the callout
4256 function may be called several times from the same point in the pattern
4257 for different starting points in the subject.
4258
4259 The current_position field contains the offset within the subject of
4260 the current match pointer.
4261
4262 When pcre_exec() or pcre[16|32]_exec() is used, the capture_top
4263 field contains one more than the number of the highest numbered cap-
4264 tured substring so far. If no substrings have been captured, the value
4265 of capture_top is one. This is always the case when the DFA functions
4266 are used, because they do not support captured substrings.
4267
4268 The capture_last field contains the number of the most recently cap-
4269 tured substring. However, when a recursion exits, the value reverts to
4270 what it was outside the recursion, as do the values of all captured
4271 substrings. If no substrings have been captured, the value of cap-
4272 ture_last is -1. This is always the case for the DFA matching func-
4273 tions.
4274
4275 The callout_data field contains a value that is passed to a matching
4276 function specifically so that it can be passed back in callouts. It is
4277 passed in the callout_data field of a pcre_extra or pcre[16|32]_extra
4278 data structure. If no such data was passed, the value of callout_data
4279 in a callout block is NULL. There is a description of the pcre_extra
4280 structure in the pcreapi documentation.
4281
4282 The pattern_position field is present from version 1 of the callout
4283 structure. It contains the offset to the next item to be matched in the
4284 pattern string.
4285
4286 The next_item_length field is present from version 1 of the callout
4287 structure. It contains the length of the next item to be matched in the
4288 pattern string. When the callout immediately precedes an alternation
4289 bar, a closing parenthesis, or the end of the pattern, the length is
4290 zero. When the callout precedes an opening parenthesis, the length is
4291 that of the entire subpattern.
4292
4293 The pattern_position and next_item_length fields are intended to help
4294 in distinguishing between different automatic callouts, which all have
4295 the same callout number. However, they are set for all callouts.
4296
4297 The mark field is present from version 2 of the callout structure. In
4298 callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer
4299 to the zero-terminated name of the most recently passed (*MARK),
4300 (*PRUNE), or (*THEN) item in the match, or NULL if no such items have
4301 been passed. Instances of (*PRUNE) or (*THEN) without a name do not
4302 obliterate a previous (*MARK). In callouts from the DFA matching func-
4303 tions this field always contains NULL.
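
As a rough illustration of how the fields described above might be
read, the following sketch formats a one-line trace of a callout. So
that it compiles without the PCRE headers, it declares a stand-in
structure holding only the fields it uses; trace_view and
describe_callout are hypothetical names, not part of PCRE. A real
callout function would receive a pointer to the block declared in
pcre.h instead.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the callout block (hypothetical): only the fields that
   this sketch reads, with the names and types listed in the text. */
typedef struct {
  int version;
  int callout_number;
  const char *subject;
  int subject_length;
  int start_match;
  int current_position;
  int pattern_position;        /* meaningful from version 1 */
  int next_item_length;        /* meaningful from version 1 */
  const unsigned char *mark;   /* meaningful from version 2 */
} trace_view;

/* Write a one-line description of the callout into buf and return the
   character count reported by snprintf(). */
int describe_callout(const trace_view *cb, char *buf, size_t size)
{
  int matched, pos, len;
  const char *mark;

  assert(cb != NULL && buf != NULL);

  /* Text matched so far in this attempt: start_match up to the current
     match pointer. The clamp to zero is purely defensive. */
  matched = cb->current_position - cb->start_match;
  if (matched < 0) matched = 0;

  /* Fields added in later versions are read only when present. */
  pos = (cb->version >= 1) ? cb->pattern_position : -1;
  len = (cb->version >= 1) ? cb->next_item_length : -1;
  mark = (cb->version >= 2 && cb->mark != NULL)
           ? (const char *)cb->mark : "(none)";

  return snprintf(buf, size,
                  "callout %d: \"%.*s\" next=%d len=%d mark=%s",
                  cb->callout_number, matched,
                  cb->subject + cb->start_match, pos, len, mark);
}
```

Checking the version field before reading pattern_position,
next_item_length, or mark lets the same function work with a block in
an earlier format.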
4304
4305
4306 RETURN VALUES
4307
4308 The external callout function returns an integer to PCRE. If the value
4309 is zero, matching proceeds as normal. If the value is greater than
4310 zero, matching fails at the current point, but the testing of other
4311 matching possibilities goes ahead, just as if a lookahead assertion had
4312 failed. If the value is less than zero, the match is abandoned, and
4313 the matching function returns the negative value.
4314
4315 Negative values should normally be chosen from the set of
4316 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
4317 dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
4318 reserved for use by callout functions; it will never be used by PCRE
4319 itself.
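
The three kinds of return value can be sketched as a small policy
function. This is only an illustration under stated assumptions:
policy_view is a stand-in for the callout block that holds just the
fields used here, callout_policy is a hypothetical structure that the
application would pass via callout_data, and SKETCH_ERROR_CALLOUT
repeats the value that pcre.h gives to PCRE_ERROR_CALLOUT so that the
sketch compiles without the PCRE headers.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ERROR_CALLOUT (-9)  /* PCRE_ERROR_CALLOUT in pcre.h */

/* Stand-in for the callout block (hypothetical), reduced to the fields
   that this sketch reads. */
typedef struct {
  int callout_number;
  int current_position;
  void *callout_data;
} policy_view;

/* Application data handed back through callout_data (hypothetical). */
typedef struct {
  int remaining;      /* callouts still allowed before giving up */
  int max_position;   /* reject match paths that go past this offset */
} callout_policy;

int policy_callout(const policy_view *cb)
{
  callout_policy *p;

  assert(cb != NULL);
  p = (callout_policy *)cb->callout_data;

  if (p == NULL) return 0;            /* no data passed: proceed */

  if (p->remaining-- <= 0)
    return SKETCH_ERROR_CALLOUT;      /* negative: abandon the match */

  if (cb->current_position > p->max_position)
    return 1;                         /* positive: fail here, backtrack */

  return 0;                           /* zero: matching proceeds */
}
```

Returning a positive value only rejects the current path, so other
alternatives are still tried; the negative value ends the whole call
and becomes the result of the matching function.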
4320
4321
4322 AUTHOR
4323
4324 Philip Hazel
4325 University Computing Service
4326 Cambridge CB2 3QH, England.
4327
4328
4329 REVISION
4330
4331 Last updated: 03 March 2013
4332 Copyright (c) 1997-2013 University of Cambridge.
4333 ------------------------------------------------------------------------------
4334
4335
4336 PCRECOMPAT(3) Library Functions Manual PCRECOMPAT(3)
4337
4338
4339
4340 NAME
4341 PCRE - Perl-compatible regular expressions
4342
4343 DIFFERENCES BETWEEN PCRE AND PERL
4344
4345 This document describes the differences in the ways that PCRE and Perl
4346 handle regular expressions. The differences described here are with
4347 respect to Perl versions 5.10 and above.
4348
4349 1. PCRE has only a subset of Perl's Unicode support. Details of what it
4350 does have are given in the pcreunicode page.
4351
4352 2. PCRE allows repeat quantifiers only on parenthesized assertions, but
4353 they do not mean what you might think. For example, (?!a){3} does not
4354 assert that the next three characters are not "a". It just asserts that
4355 the next character is not "a" three times (in principle: PCRE optimizes
4356 this to run the assertion just once). Perl allows repeat quantifiers on
4357 other assertions such as \b, but these do not seem to have any use.
4358
4359 3. Capturing subpatterns that occur inside negative lookahead asser-
4360 tions are counted, but their entries in the offsets vector are never
4361 set. Perl sometimes (but not always) sets its numerical variables from
4362 inside negative assertions.
4363
4364 4. Though binary zero characters are supported in the subject string,
4365 they are not allowed in a pattern string because it is passed as a nor-
4366 mal C string, terminated by zero. The escape sequence \0 can be used in
4367 the pattern to represent a binary zero.
4368
4369 5. The following Perl escape sequences are not supported: \l, \u, \L,
4370 \U, and \N when followed by a character name or Unicode value. (\N on
4371 its own, matching a non-newline character, is supported.) In fact these
4372 are implemented by Perl's general string-handling and are not part of
4373 its pattern matching engine. If any of these are encountered by PCRE,
4374 an error is generated by default. However, if the PCRE_JAVASCRIPT_COM-
4375 PAT option is set, \U and \u are interpreted as JavaScript interprets
4376 them.
4377
4378 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
4379 is built with Unicode character property support. The properties that
4380 can be tested with \p and \P are limited to the general category prop-
4381 erties such as Lu and Nd, script names such as Greek or Han, and the
4382 derived properties Any and L&. PCRE does support the Cs (surrogate)
4383 property, which Perl does not; the Perl documentation says "Because
4384 Perl hides the need for the user to understand the internal representa-
4385 tion of Unicode characters, there is no need to implement the somewhat
4386 messy concept of surrogates."
4387
4388 7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
4389 ters in between are treated as literals. This is slightly different
4390 from Perl in that $ and @ are also handled as literals inside the
4391 quotes. In Perl, they cause variable interpolation (but of course PCRE
4392 does not have variables). Note the following examples:
4393
4394 Pattern PCRE matches Perl matches
4395
4396 \Qabc$xyz\E abc$xyz abc followed by the
4397 contents of $xyz
4398 \Qabc\$xyz\E abc\$xyz abc\$xyz
4399 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
4400
4401 The \Q...\E sequence is recognized both inside and outside character
4402 classes.
4403
4404 8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
4405 constructions. However, there is support for recursive patterns. This
4406 is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
4407 "callout" feature allows an external function to be called during pat-
4408 tern matching. See the pcrecallout documentation for details.
4409
4410 9. Subpatterns that are called as subroutines (whether or not recur-
4411 sively) are always treated as atomic groups in PCRE. This is like
4412 Python, but unlike Perl. Captured values that are set outside a
4413 subroutine call can be referenced from inside in PCRE, but not in Perl.
4414 There is a discussion that explains these differences in more detail in
4415 the section on recursion differences from Perl in the pcrepattern page.
4416
4417 10. If any of the backtracking control verbs are used in a subpattern
4418 that is called as a subroutine (whether or not recursively), their
4419 effect is confined to that subpattern; it does not extend to the sur-
4420 rounding pattern. This is not always the case in Perl. In particular,
4421 if (*THEN) is present in a group that is called as a subroutine, its
4422 action is limited to that group, even if the group does not contain any
4423 | characters. Note that such subpatterns are processed as anchored at
4424 the point where they are tested.
4425
4426 11. If a pattern contains more than one backtracking control verb, the
4427 first one that is backtracked onto acts. For example, in the pattern
4428 A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
4429 in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4430 it is the same as PCRE, but there are examples where it differs.
4431
4432 12. Most backtracking verbs in assertions have their normal actions.
4433 They are not confined to the assertion.
4434
4435 13. There are some differences that are concerned with the settings of
4436 captured strings when part of a pattern is repeated. For example,
4437 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
4438 unset, but in PCRE it is set to "b".
4439
4440 14. PCRE's handling of duplicate subpattern numbers and duplicate sub-
4441 pattern names is not as general as Perl's. This is a consequence of the
4442 fact that PCRE works internally just with numbers, using an external
4443 table to translate between numbers and names. In particular, a pattern
4444 such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
4445 the same number but different names, is not supported, and causes an
4446 error at compile time. If it were allowed, it would not be possible to
4447 distinguish which parentheses matched, because both names map to cap-
4448 turing subpattern number 1. To avoid this confusing situation, an error
4449 is given at compile time.
4450
4451 15. Perl recognizes comments in some places that PCRE does not, for
4452 example, between the ( and ? at the start of a subpattern. If the /x
4453 modifier is set, Perl allows white space between ( and ? but PCRE never
4454 does, even if the PCRE_EXTENDED option is set.
4455
4456 16. In PCRE, the upper/lower case character properties Lu and Ll are
4457 not affected when case-independent matching is specified. For example,
4458 \p{Lu} always matches an upper case letter. I think Perl has changed in
4459 this respect; in the release at the time of writing (5.16), \p{Lu} and
4460 \p{Ll} match all letters, regardless of case, when case independence is
4461 specified.
4462
4463 17. PCRE provides some extensions to the Perl regular expression facil-
4464 ities. Perl 5.10 includes new features that are not in earlier ver-
4465 sions of Perl, some of which (such as named parentheses) have been in
4466 PCRE for some time. This list is with respect to Perl 5.10:
4467
4468 (a) Although lookbehind assertions in PCRE must match fixed length
4469 strings, each alternative branch of a lookbehind assertion can match a
4470 different length of string. Perl requires them all to have the same
4471 length.
4472
4473 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
4474 meta-character matches only at the very end of the string.
4475
4476 (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
4477 cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
4478 ignored. (Perl can be made to issue a warning.)
4479
4480 (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
4481 fiers is inverted, that is, by default they are not greedy, but if fol-
4482 lowed by a question mark they are.
4483
4484 (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
4485 tried only at the first matching position in the subject string.
4486
4487 (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
4488 and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva-
4489 lents.
4490
4491 (g) The \R escape sequence can be restricted to match only CR, LF, or
4492 CRLF by the PCRE_BSR_ANYCRLF option.
4493
4494 (h) The callout facility is PCRE-specific.
4495
4496 (i) The partial matching facility is PCRE-specific.
4497
4498 (j) Patterns compiled by PCRE can be saved and re-used at a later time,
4499 even on different hosts that have the other endianness. However, this
4500 does not apply to optimized data created by the just-in-time compiler.
4501
4502 (k) The alternative matching functions (pcre_dfa_exec(),
4503 pcre16_dfa_exec(), and pcre32_dfa_exec()) match in a different way and
4504 are not Perl-compatible.
4505
4506 (l) PCRE recognizes some special sequences such as (*CR) at the start
4507 of a pattern that set overall options that cannot be changed within the
4508 pattern.
4509
4510
4511 AUTHOR
4512
4513 Philip Hazel
4514 University Computing Service
4515 Cambridge CB2 3QH, England.
4516
4517
4518 REVISION
4519
4520 Last updated: 19 March 2013
4521 Copyright (c) 1997-2013 University of Cambridge.
4522 ------------------------------------------------------------------------------
4523
4524
4525 PCREPATTERN(3) Library Functions Manual PCREPATTERN(3)
4526
4527
4528
4529 NAME
4530 PCRE - Perl-compatible regular expressions
4531
4532 PCRE REGULAR EXPRESSION DETAILS
4533
4534 The syntax and semantics of the regular expressions that are supported
4535 by PCRE are described in detail below. There is a quick-reference syn-
4536 tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
4537 semantics as closely as it can. PCRE also supports some alternative
4538 regular expression syntax (which does not conflict with the Perl syn-
4539 tax) in order to provide some compatibility with regular expressions in
4540 Python, .NET, and Oniguruma.
4541
4542 Perl's regular expressions are described in its own documentation, and
4543 regular expressions in general are covered in a number of books, some
4544 of which have copious examples. Jeffrey Friedl's "Mastering Regular
4545 Expressions", published by O'Reilly, covers regular expressions in
4546 great detail. This description of PCRE's regular expressions is
4547 intended as reference material.
4548
4549 The original operation of PCRE was on strings of one-byte characters.
4550 However, there is now also support for UTF-8 strings in the original
4551 library, an extra library that supports 16-bit and UTF-16 character
4552 strings, and a third library that supports 32-bit and UTF-32 character
4553 strings. To use these features, PCRE must be built to include appropri-
4554 ate support. When using UTF strings you must either call the compiling
4555 function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the
4556 pattern must start with one of these special sequences:
4557
4558 (*UTF8)
4559 (*UTF16)
4560 (*UTF32)
4561 (*UTF)
4562
4563 (*UTF) is a generic sequence that can be used with any of the
4564 libraries. Starting a pattern with such a sequence is equivalent to
4565 setting the relevant option. This feature is not Perl-compatible. How
4566 setting a UTF mode affects pattern matching is mentioned in several
4567 places below. There is also a summary of features in the pcreunicode
4568 page.
4569
4570 Another special sequence that may appear at the start of a pattern or
4571 in combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
4572
4573 (*UCP)
4574
4575 This has the same effect as setting the PCRE_UCP option: it causes
4576 sequences such as \d and \w to use Unicode properties to determine
4577 character types, instead of recognizing only characters with codes less
4578 than 128 via a lookup table.
4579
4580 If a pattern starts with (*NO_START_OPT), it has the same effect as
4581 setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
4582 time. There are also some more of these special sequences that are con-
4583 cerned with the handling of newlines; they are described below.
4584
4585 The remainder of this document discusses the patterns that are
4586 supported by PCRE when one of its main matching functions, pcre_exec()
4587 (8-bit) or pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has
4588 alternative matching functions, pcre_dfa_exec() and
4589 pcre[16|32]_dfa_exec(), which match using a different algorithm that is
4590 not Perl-compatible. Some of the features discussed below are not
4591 available when DFA matching is used. The advantages and disadvantages
4592 of the alternative functions, and how they differ from the normal func-
4593 tions, are discussed in the pcrematching page.
4594
4595
4596 EBCDIC CHARACTER CODES
4597
4598 PCRE can be compiled to run in an environment that uses EBCDIC as its
4599 character code rather than ASCII or Unicode (typically a mainframe sys-
4600 tem). In the sections below, character code values are ASCII or Uni-
4601 code; in an EBCDIC environment these characters may have different code
4602 values, and there are no code points greater than 255.
4603
4604
4605 NEWLINE CONVENTIONS
4606
4607 PCRE supports five different conventions for indicating line breaks in
4608 strings: a single CR (carriage return) character, a single LF (line-
4609 feed) character, the two-character sequence CRLF, any of the three pre-
4610 ceding, or any Unicode newline sequence. The pcreapi page has further
4611 discussion about newlines, and shows how to set the newline convention
4612 in the options arguments for the compiling and matching functions.
4613
4614 It is also possible to specify a newline convention by starting a pat-
4615 tern string with one of the following five sequences:
4616
4617 (*CR) carriage return
4618 (*LF) linefeed
4619 (*CRLF) carriage return, followed by linefeed
4620 (*ANYCRLF) any of the three above
4621 (*ANY) all Unicode newline sequences
4622
4623 These override the default and the options given to the compiling func-
4624 tion. For example, on a Unix system where LF is the default newline
4625 sequence, the pattern
4626
4627 (*CR)a.b
4628
4629 changes the convention to CR. That pattern matches "a\nb" because LF is
4630 no longer a newline. Note that these special settings, which are not
4631 Perl-compatible, are recognized only at the very start of a pattern,
4632 and that they must be in upper case. If more than one of them is
4633 present, the last one is used.
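
The rule that only leading, upper case sequences count, and that the
last one wins, can be sketched as follows. This is a simplified model
and not PCRE's own code: it looks only at the five newline sequences
and ignores the other special sequences that may also start a pattern.
The names newline_kind and leading_newline_setting are hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef enum {
  NL_DEFAULT, NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY
} newline_kind;

/* Scan the newline-setting sequences at the very start of a pattern.
   Returns the convention chosen by the last one found, or NL_DEFAULT
   if there is none; *skip receives the number of characters used. */
newline_kind leading_newline_setting(const char *pattern, size_t *skip)
{
  static const struct { const char *text; newline_kind kind; } table[] = {
    { "(*CR)", NL_CR },           { "(*LF)", NL_LF },
    { "(*CRLF)", NL_CRLF },       { "(*ANYCRLF)", NL_ANYCRLF },
    { "(*ANY)", NL_ANY }
  };
  const size_t count = sizeof table / sizeof table[0];
  newline_kind result = NL_DEFAULT;
  size_t pos = 0;

  assert(pattern != NULL && skip != NULL);

  for (;;) {
    size_t i;
    for (i = 0; i < count; i++) {
      size_t len = strlen(table[i].text);
      /* strncmp() is case-sensitive, so "(*cr)" is not recognized. */
      if (strncmp(pattern + pos, table[i].text, len) == 0) {
        result = table[i].kind;   /* a later sequence overrides */
        pos += len;
        break;
      }
    }
    if (i == count) break;        /* nothing matched here: stop */
  }

  *skip = pos;
  return result;
}
```

For the pattern (*CR)a.b this returns NL_CR with five characters
consumed, leaving a.b as the part that is actually matched.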
4634
4635 The newline convention affects where the circumflex and dollar asser-
4636 tions are true. It also affects the interpretation of the dot metachar-
4637 acter when PCRE_DOTALL is not set, and the behaviour of \N. However, it
4638 does not affect what the \R escape sequence matches. By default, this
4639 is any Unicode newline sequence, for Perl compatibility. However, this
4640 can be changed; see the description of \R in the section entitled "New-
4641 line sequences" below. A change of \R setting can be combined with a
4642 change of newline convention.
4643
4644
4645 CHARACTERS AND METACHARACTERS
4646
4647 A regular expression is a pattern that is matched against a subject
4648 string from left to right. Most characters stand for themselves in a
4649 pattern, and match the corresponding characters in the subject. As a
4650 trivial example, the pattern
4651
4652 The quick brown fox
4653
4654 matches a portion of a subject string that is identical to itself. When
4655 caseless matching is specified (the PCRE_CASELESS option), letters are
4656 matched independently of case. In a UTF mode, PCRE always understands
4657 the concept of case for characters whose values are less than 128, so
4658 caseless matching is always possible. For characters with higher val-
4659 ues, the concept of case is supported if PCRE is compiled with Unicode
4660 property support, but not otherwise. If you want to use caseless
4661 matching for characters 128 and above, you must ensure that PCRE is
4662 compiled with Unicode property support as well as with UTF support.
4663
4664 The power of regular expressions comes from the ability to include
4665 alternatives and repetitions in the pattern. These are encoded in the
4666 pattern by the use of metacharacters, which do not stand for themselves
4667 but instead are interpreted in some special way.
4668
4669 There are two different sets of metacharacters: those that are recog-
4670 nized anywhere in the pattern except within square brackets, and those
4671 that are recognized within square brackets. Outside square brackets,
4672 the metacharacters are as follows:
4673
4674 \ general escape character with several uses
4675 ^ assert start of string (or line, in multiline mode)
4676 $ assert end of string (or line, in multiline mode)
4677 . match any character except newline (by default)
4678 [ start character class definition
4679 | start of alternative branch
4680 ( start subpattern
4681 ) end subpattern
4682 ? extends the meaning of (
4683 also 0 or 1 quantifier
4684 also quantifier minimizer
4685 * 0 or more quantifier
4686 + 1 or more quantifier
4687 also "possessive quantifier"
4688 { start min/max quantifier
4689
4690 Part of a pattern that is in square brackets is called a "character
4691 class". In a character class the only metacharacters are:
4692
4693 \ general escape character
4694 ^ negate the class, but only if the first character
4695 - indicates character range
4696 [ POSIX character class (only if followed by POSIX
4697 syntax)
4698 ] terminates the character class
4699
4700 The following sections describe the use of each of the metacharacters.
4701
4702
4703 BACKSLASH
4704
4705 The backslash character has several uses. Firstly, if it is followed by
4706 a character that is not a number or a letter, it takes away any special
4707 meaning that character may have. This use of backslash as an escape
4708 character applies both inside and outside character classes.
4709
4710 For example, if you want to match a * character, you write \* in the
4711 pattern. This escaping action applies whether or not the following
4712 character would otherwise be interpreted as a metacharacter, so it is
4713 always safe to precede a non-alphanumeric with backslash to specify
4714 that it stands for itself. In particular, if you want to match a back-
4715 slash, you write \\.
4716
4717 In a UTF mode, only ASCII numbers and letters have any special meaning
4718 after a backslash. All other characters (in particular, those whose
4719 codepoints are greater than 127) are treated as literals.
4720
4721 If a pattern is compiled with the PCRE_EXTENDED option, white space in
4722 the pattern (other than in a character class) and characters between a
4723 # outside a character class and the next newline are ignored. An escap-
4724 ing backslash can be used to include a white space or # character as
4725 part of the pattern.
4726
4727 If you want to remove the special meaning from a sequence of charac-
4728 ters, you can do so by putting them between \Q and \E. This is differ-
4729 ent from Perl in that $ and @ are handled as literals in \Q...\E
4730 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
4731 tion. Note the following examples:
4732
4733 Pattern PCRE matches Perl matches
4734
4735 \Qabc$xyz\E abc$xyz abc followed by the
4736 contents of $xyz
4737 \Qabc\$xyz\E abc\$xyz abc\$xyz
4738 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
4739
4740 The \Q...\E sequence is recognized both inside and outside character
4741 classes. An isolated \E that is not preceded by \Q is ignored. If \Q
4742 is not followed by \E later in the pattern, the literal interpretation
4743 continues to the end of the pattern (that is, \E is assumed at the
4744 end). If the isolated \Q is inside a character class, this causes an
4745 error, because the character class is not terminated.
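
The quoting rules above can be modelled with a short function that
computes the literal text a pattern stands for. It is a sketch under
narrow assumptions: the pattern is outside a character class and
consists only of ordinary characters, \Q...\E sections, and backslashes
followed by non-alphanumerics; anything else makes it give up. The name
quoted_literal is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copy the literal text denoted by pat into out. Returns the length
   written (not counting the terminating zero), or -1 if out is too
   small or the pattern contains an escape this sketch does not
   handle. */
int quoted_literal(const char *pat, char *out, size_t size)
{
  size_t n = 0;
  int quoting = 0;

  assert(pat != NULL && out != NULL && size > 0);

  while (*pat != '\0') {
    char c = *pat;

    if (c == '\\' && pat[1] == 'E') {
      quoting = 0;           /* ends \Q; an isolated \E is ignored */
      pat += 2;
      continue;
    }

    if (!quoting && c == '\\') {
      if (pat[1] == 'Q') {   /* start of a quoted section */
        quoting = 1;
        pat += 2;
        continue;
      }
      if (pat[1] == '\0' || isalnum((unsigned char)pat[1]))
        return -1;           /* \d, \x and so on: not handled here */
      c = pat[1];            /* escaped non-alphanumeric is itself */
      pat++;
    }

    if (n + 1 >= size) return -1;
    out[n++] = c;            /* inside \Q...\E every character,
                                including a backslash, is literal */
    pat++;
  }

  out[n] = '\0';             /* an unterminated \Q runs to the end */
  return (int)n;
}
```

Applied to the three patterns in the table, it yields abc$xyz,
abc\$xyz, and abc$xyz respectively, matching the "PCRE matches" column.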
4746
4747 Non-printing characters
4748
4749 A second use of backslash provides a way of encoding non-printing char-
4750 acters in patterns in a visible manner. There is no restriction on the
4751 appearance of non-printing characters, apart from the binary zero that
4752 terminates a pattern, but when a pattern is being prepared by text
4753 editing, it is often easier to use one of the following escape
4754 sequences than the binary character it represents:
4755
4756 \a alarm, that is, the BEL character (hex 07)
4757 \cx "control-x", where x is any ASCII character
4758 \e escape (hex 1B)
4759 \f form feed (hex 0C)
4760 \n linefeed (hex 0A)
4761 \r carriage return (hex 0D)
4762 \t tab (hex 09)
4763 \ddd character with octal code ddd, or back reference
4764 \xhh character with hex code hh
4765 \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
4766 \uhhhh character with hex code hhhh (JavaScript mode only)
4767
4768 The precise effect of \cx on ASCII characters is as follows: if x is a
4769 lower case letter, it is converted to upper case. Then bit 6 of the
4770 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
4771 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
4772 hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c
4773 has a value greater than 127, a compile-time error occurs. This locks
4774 out non-ASCII characters in all modes.
4775
4776 The \c facility was designed for use with ASCII characters, but with
4777 the extension to Unicode it is even less useful than it once was. It
4778 is, however, recognized when PCRE is compiled in EBCDIC mode, where
4779 data items are always bytes. In this mode, all values are valid after
4780 \c. If the next character is a lower case letter, it is converted to
4781 upper case. Then the 0xc0 bits of the byte are inverted. Thus \cA
4782 becomes hex 01, as in ASCII (A is C1), but because the EBCDIC letters
4783 are disjoint, \cZ becomes hex 29 (Z is E9), and other characters also
4784 generate different values.
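
The two \c rules described above amount to a case fold followed by an
exclusive OR. The sketch below restates them in C so that the example
values can be checked; control_ascii and control_ebcdic are
hypothetical names, and character codes are written in hexadecimal so
that the result does not depend on the character set of the machine
that compiles it.

```c
#include <assert.h>

/* ASCII rule: convert a lower case letter to upper case, then invert
   bit 6 (hex 40). Returns -1 for values greater than 127, which PCRE
   rejects with a compile-time error. */
int control_ascii(int x)
{
  if (x < 0 || x > 127) return -1;
  if (x >= 0x61 && x <= 0x7a) x -= 0x20;   /* a-z becomes A-Z */
  return x ^ 0x40;
}

/* EBCDIC rule: invert the 0xc0 bits of the byte. The argument is the
   EBCDIC code of the character after any conversion to upper case,
   which this sketch leaves to the caller. */
int control_ebcdic(int byte)
{
  assert(byte >= 0 && byte <= 255);
  return byte ^ 0xc0;
}
```

With these definitions, control_ascii(0x7b) gives hex 3B and
control_ebcdic(0xe9) gives hex 29, as stated for \c{ and for \cZ in
EBCDIC mode.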
4785
4786 By default, after \x, from zero to two hexadecimal digits are read
4787 (letters can be in upper or lower case). Any number of hexadecimal dig-
4788 its may appear between \x{ and }, but the character code is constrained
4789 as follows:
4790
4791 8-bit non-UTF mode less than 0x100
4792 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
4793 16-bit non-UTF mode less than 0x10000
4794 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
4795 32-bit non-UTF mode less than 0x80000000
4796 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
4797
4798 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
4799 called "surrogate" codepoints), and 0xffef.
4800
4801 If characters other than hexadecimal digits appear between \x{ and },
4802 or if there is no terminating }, this form of escape is not recognized.
4803 Instead, the initial \x will be interpreted as a basic hexadecimal
4804 escape, with no following digits, giving a character whose value is
4805 zero.
4806
4807 If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
4808 is as just described only when it is followed by two hexadecimal dig-
4809 its. Otherwise, it matches a literal "x" character. In JavaScript
4810 mode, support for code points greater than 256 is provided by \u, which
4811 must be followed by four hexadecimal digits; otherwise it matches a
4812 literal "u" character. Character codes specified by \u in JavaScript
4813 mode are constrained in the same way as those specified by \x in
4814 non-JavaScript mode.
4815
4816 Characters whose value is less than 256 can be defined by either of the
4817 two syntaxes for \x (or by \u in JavaScript mode). There is no differ-
4818 ence in the way they are handled. For example, \xdc is exactly the same
4819 as \x{dc} (or \u00dc in JavaScript mode).
4820
4821 After \0 up to two further octal digits are read. If there are fewer
4822 than two digits, just those that are present are used. Thus the
4823 sequence \0\x\07 specifies two binary zeros followed by a BEL character
4824 (code value 7). Make sure you supply two digits after the initial zero
4825 if the pattern character that follows is itself an octal digit.
4826
4827 The handling of a backslash followed by a digit other than 0 is compli-
4828 cated. Outside a character class, PCRE reads it and any following dig-
4829 its as a decimal number. If the number is less than 10, or if there
4830 have been at least that many previous capturing left parentheses in the
4831 expression, the entire sequence is taken as a back reference. A
4832 description of how this works is given later, following the discussion
4833 of parenthesized subpatterns.
4834
4835 Inside a character class, or if the decimal number is greater than 9
4836 and there have not been that many capturing subpatterns, PCRE re-reads
4837 up to three octal digits following the backslash, and uses them to gen-
4838 erate a data character. Any subsequent digits stand for themselves. The
4839 value of the character is constrained in the same way as characters
4840 specified in hexadecimal. For example:
4841
4842 \040   is another way of writing an ASCII space
4843 \40    is the same, provided there are fewer than 40
4844        previous capturing subpatterns
4845 \7     is always a back reference
4846 \11    might be a back reference, or another way of
4847        writing a tab
4848 \011   is always a tab
4849 \0113  is a tab followed by the character "3"
4850 \113   might be a back reference, otherwise the
4851        character with octal code 113
4852 \377   might be a back reference, otherwise
4853        the value 255 (decimal)
4854 \81    is either a back reference, or a binary zero
4855        followed by the two characters "8" and "1"
4856
4857 Note that octal values of 100 or greater must not be introduced by a
4858 leading zero, because no more than three octal digits are ever read.
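
The two-stage reading described above can be sketched in plain C. This is
only an illustration of the documented rules, not PCRE's own code: the
names digit_escape and interpret_digit_escape are hypothetical, the first
digit is assumed to be 1-9 (the \0 case is covered separately above), and
the limit on character values is not checked.

```c
#include <assert.h>
#include <stddef.h>

/* Result of interpreting "\" followed by digits whose first digit is 1-9. */
typedef struct {
    int is_backref;   /* 1 = back reference, 0 = data character        */
    int value;        /* group number, or the character's code value   */
    size_t consumed;  /* how many digits after the backslash were used */
} digit_escape;

/* Hypothetical helper (not part of PCRE). `digits` points just past the
   backslash; `groups_so_far` is the number of capturing left parentheses
   already seen; `in_class` is non-zero inside a character class. */
static digit_escape interpret_digit_escape(const char *digits,
                                           int groups_so_far, int in_class)
{
    digit_escape r = {0, 0, 0};
    assert(digits != NULL && digits[0] >= '1' && digits[0] <= '9');

    if (!in_class) {
        /* First reading: all following digits as one decimal number. */
        int n = 0;
        size_t i = 0;
        while (digits[i] >= '0' && digits[i] <= '9') {
            if (n < 100000) n = n * 10 + (digits[i] - '0'); /* no overflow */
            i++;
        }
        if (n < 10 || n <= groups_so_far) {
            r.is_backref = 1;
            r.value = n;
            r.consumed = i;
            return r;
        }
    }

    /* Second reading: up to three octal digits form a data character.
       With no octal digit at all (as in \81) the value stays 0 and the
       digits that follow stand for themselves. */
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (digits[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```

Under these assumptions, interpret_digit_escape("40", 0, 0) yields the data
character 32 (a space), while the same call with groups_so_far set to 40
yields a back reference to group 40.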
4859
4860 All the sequences that define a single character value can be used both
4861 inside and outside character classes. In addition, inside a character
4862 class, \b is interpreted as the backspace character (hex 08).
4863
4864 \N is not allowed in a character class. \B, \R, and \X are not special
4865 inside a character class. Like other unrecognized escape sequences,
4866 they are treated as the literal characters "B", "R", and "X" by
4867 default, but cause an error if the PCRE_EXTRA option is set. Outside a
4868 character class, these sequences have different meanings.
4869
4870 Unsupported escape sequences
4871
4872 In Perl, the sequences \l, \L, \u, and \U are recognized by its string
4873 handler and used to modify the case of following characters. By
4874 default, PCRE does not support these escape sequences. However, if the
4875 PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
4876 \u can be used to define a character by code point, as described in the
4877 previous section.
4878
4879 Absolute and relative back references
4880
4881 The sequence \g followed by an unsigned or a negative number, option-
4882 ally enclosed in braces, is an absolute or relative back reference. A
4883 named back reference can be coded as \g{name}. Back references are dis-
4884 cussed later, following the discussion of parenthesized subpatterns.
4885
4886 Absolute and relative subroutine calls
4887
4888 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
4889 name or a number enclosed either in angle brackets or single quotes, is
4890 an alternative syntax for referencing a subpattern as a "subroutine".
4891 Details are discussed later. Note that \g{...} (Perl syntax) and
4892 \g<...> (Oniguruma syntax) are not synonymous. The former is a back
4893 reference; the latter is a subroutine call.
4894
4895 Generic character types
4896
4897 Another use of backslash is for specifying generic character types:
4898
4899 \d any decimal digit
4900 \D any character that is not a decimal digit
4901 \h any horizontal white space character
4902 \H any character that is not a horizontal white space character
4903 \s any white space character
4904 \S any character that is not a white space character
4905 \v any vertical white space character
4906 \V any character that is not a vertical white space character
4907 \w any "word" character
4908 \W any "non-word" character
4909
4910 There is also the single sequence \N, which matches a non-newline char-
4911 acter. This is the same as the "." metacharacter when PCRE_DOTALL is
4912 not set. Perl also uses \N to match characters by name; PCRE does not
4913 support this.
4914
4915 Each pair of lower and upper case escape sequences partitions the com-
4916 plete set of characters into two disjoint sets. Any given character
4917 matches one, and only one, of each pair. The sequences can appear both
4918 inside and outside character classes. They each match one character of
4919 the appropriate type. If the current matching point is at the end of
4920 the subject string, all of them fail, because there is no character to
4921 match.
4922
4923 For compatibility with Perl, \s does not match the VT character (code
4924 11). This makes it different from the POSIX "space" class. The \s
4925 characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
4926 "use locale;" is included in a Perl script, \s may match the VT charac-
4927 ter. In PCRE, it never does.
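
As a quick check of the list above, the default \s set can be written as a
predicate. This is a sketch that assumes PCRE's default character tables
and no PCRE_UCP; the function name matches_backslash_s is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: the default \s set for code values 0-255.
   VT (11) is deliberately absent, unlike the POSIX "space" class. */
static int matches_backslash_s(int c)
{
    assert(c >= 0 && c <= 255);
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}
```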
4928
4929 A "word" character is an underscore or any character that is a letter
4930 or digit. By default, the definition of letters and digits is con-
4931 trolled by PCRE's low-valued character tables, and may vary if locale-
4932 specific matching is taking place (see "Locale support" in the pcreapi
4933 page). For example, in a French locale such as "fr_FR" in Unix-like
4934 systems, or "french" in Windows, some character codes greater than 128
4935 are used for accented letters, and these are then matched by \w. The
4936 use of locales with Unicode is discouraged.
4937
4938 By default, in a UTF mode, characters with values greater than 128
4939 never match \d, \s, or \w, and always match \D, \S, and \W. These
4940 sequences retain their original meanings from before UTF support was
4941 available, mainly for efficiency reasons. However, if PCRE is compiled
4942 with Unicode property support, and the PCRE_UCP option is set, the be-
4943 haviour is changed so that Unicode properties are used to determine
4944 character types, as follows:
4945
4946 \d any character that \p{Nd} matches (decimal digit)
4947 \s any character that \p{Z} matches, plus HT, LF, FF, CR
4948 \w any character that \p{L} or \p{N} matches, plus underscore
4949
4950 The upper case escapes match the inverse sets of characters. Note that
4951 \d matches only decimal digits, whereas \w matches any Unicode digit,
4952 as well as any Unicode letter, and underscore. Note also that PCRE_UCP
4953 affects \b and \B, because they are defined in terms of \w and \W.
4954 Matching these sequences is noticeably slower when PCRE_UCP is set.
4955
4956 The sequences \h, \H, \v, and \V are features that were added to Perl
4957 at release 5.10. In contrast to the other sequences, which match only
4958 ASCII characters by default, these always match certain high-valued
4959 codepoints, whether or not PCRE_UCP is set. The horizontal space char-
4960 acters are:
4961
4962 U+0009 Horizontal tab (HT)
4963 U+0020 Space
4964 U+00A0 Non-break space
4965 U+1680 Ogham space mark
4966 U+180E Mongolian vowel separator
4967 U+2000 En quad
4968 U+2001 Em quad
4969 U+2002 En space
4970 U+2003 Em space
4971 U+2004 Three-per-em space
4972 U+2005 Four-per-em space
4973 U+2006 Six-per-em space
4974 U+2007 Figure space
4975 U+2008 Punctuation space
4976 U+2009 Thin space
4977 U+200A Hair space
4978 U+202F Narrow no-break space
4979 U+205F Medium mathematical space
4980 U+3000 Ideographic space
4981
4982 The vertical space characters are:
4983
4984 U+000A Linefeed (LF)
4985 U+000B Vertical tab (VT)
4986 U+000C Form feed (FF)
4987 U+000D Carriage return (CR)
4988 U+0085 Next line (NEL)
4989 U+2028 Line separator
4990 U+2029 Paragraph separator
4991
4992 In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
4993 256 are relevant.
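
The two lists above translate directly into code point tests. The sketch
below uses hypothetical function names and simply restates the tables; it
is not taken from the PCRE source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: code points matched by \h. */
static int is_horizontal_space(uint32_t c)
{
    assert(c <= 0x10FFFF);
    switch (c) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680: case 0x180E:
    case 0x202F: case 0x205F: case 0x3000:
        return 1;
    default:
        return c >= 0x2000 && c <= 0x200A;   /* En quad to Hair space */
    }
}

/* Hypothetical helper: code points matched by \v. */
static int is_vertical_space(uint32_t c)
{
    assert(c <= 0x10FFFF);
    return (c >= 0x000A && c <= 0x000D) || c == 0x0085 ||
           c == 0x2028 || c == 0x2029;
}
```

In 8-bit, non-UTF-8 mode a caller would only ever pass values below 256, so
only the first few entries of each list can be reached.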
4994
4995 Newline sequences
4996
4997 Outside a character class, by default, the escape sequence \R matches
4998 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
4999 to the following:
5000
5001 (?>\r\n|\n|\x0b|\f|\r|\x85)
5002
5003 This is an example of an "atomic group", details of which are given
5004 below. This particular group matches either the two-character sequence
5005 CR followed by LF, or one of the single characters LF (linefeed,
5006 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
5007 riage return, U+000D), or NEL (next line, U+0085). The two-character
5008 sequence is treated as a single unit that cannot be split.
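
The behaviour of this group can be modelled in a few lines of C. The
sketch covers 8-bit non-UTF-8 mode only; match_backslash_R_8bit is a
hypothetical name, and the function returns the number of bytes that \R
would consume at a given position (0 meaning no match).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes (0, 1 or 2) of the \R match that
   starts at s[pos] in 8-bit non-UTF-8 mode. */
static size_t match_backslash_R_8bit(const unsigned char *s, size_t len,
                                     size_t pos)
{
    assert(s != NULL && pos <= len);
    if (pos >= len) return 0;

    /* CRLF is tried first and, the group being atomic, is never split. */
    if (s[pos] == '\r' && pos + 1 < len && s[pos + 1] == '\n') return 2;

    switch (s[pos]) {
    case '\n': case 0x0b: case '\f': case '\r': case 0x85:
        return 1;
    default:
        return 0;
    }
}
```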
5009
5010 In other modes, two additional characters whose codepoints are greater
5011 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
5012 rator, U+2029). Unicode character property support is not needed for
5013 these characters to be recognized.
5014
5015 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
5016 the complete set of Unicode line endings) by setting the option
5017 PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
5018 (BSR is an abbreviation for "backslash R".) This can be made the default
5019 when PCRE is built; if this is the case, the other behaviour can be
5020 requested via the PCRE_BSR_UNICODE option. It is also possible to
5021 specify these settings by starting a pattern string with one of the
5022 following sequences:
5023
5024 (*BSR_ANYCRLF) CR, LF, or CRLF only
5025 (*BSR_UNICODE) any Unicode newline sequence
5026
5027 These override the default and the options given to the compiling func-
5028 tion, but they can themselves be overridden by options given to a
5029 matching function. Note that these special settings, which are not
5030 Perl-compatible, are recognized only at the very start of a pattern,
5031 and that they must be in upper case. If more than one of them is
5032 present, the last one is used. They can be combined with a change of
5033 newline convention; for example, a pattern can start with:
5034
5035 (*ANY)(*BSR_ANYCRLF)
5036
5037 They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
5038 or (*UCP) special sequences. Inside a character class, \R is treated as
5039 an unrecognized escape sequence, and so matches the letter "R" by
5040 default, but causes an error if PCRE_EXTRA is set.
5041
5042 Unicode character properties
5043
5044 When PCRE is built with Unicode character property support, three addi-
5045 tional escape sequences that match characters with specific properties
5046 are available. When in 8-bit non-UTF-8 mode, these sequences are of
5047 course limited to testing characters whose codepoints are less than
5048 256, but they do work in this mode. The extra escape sequences are:
5049
5050 \p{xx} a character with the xx property
5051 \P{xx} a character without the xx property
5052 \X a Unicode extended grapheme cluster
5053
5054 The property names represented by xx above are limited to the Unicode
5055 script names, the general category properties, "Any", which matches any
5056 character (including newline), and some special PCRE properties
5057 (described in the next section). Other Perl properties such as "InMu-
5058 sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
5059 does not match any characters, so always causes a match failure.
5060
5061 Sets of Unicode characters are defined as belonging to certain scripts.
5062 A character from one of these sets can be matched using a script name.
5063 For example:
5064
5065 \p{Greek}
5066 \P{Han}
5067
5068 Those that are not part of an identified script are lumped together as
5069 "Common". The current list of scripts is:
5070
5071 Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
5072 Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma,
5073 Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
5074 Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic,
5075 Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
5076 gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-
5077 tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
5078 Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
5079 Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive,
5080 Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko,
5081 Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
5082 Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari-
5083 tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese,
5084 Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet,
5085 Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
5086 Yi.
5087
5088 Each character has exactly one Unicode general category property, spec-
5089 ified by a two-letter abbreviation. For compatibility with Perl, nega-
5090 tion can be specified by including a circumflex between the opening
5091 brace and the property name. For example, \p{^Lu} is the same as
5092 \P{Lu}.
5093
5094 If only one letter is specified with \p or \P, it includes all the gen-
5095 eral category properties that start with that letter. In this case, in
5096 the absence of negation, the curly brackets in the escape sequence are
5097 optional; these two examples have the same effect:
5098
5099 \p{L}
5100 \pL
5101
5102 The following general category property codes are supported:
5103
5104 C Other
5105 Cc Control
5106 Cf Format
5107 Cn Unassigned
5108 Co Private use
5109 Cs Surrogate
5110
5111 L Letter
5112 Ll Lower case letter
5113 Lm Modifier letter
5114 Lo Other letter
5115 Lt Title case letter
5116 Lu Upper case letter
5117
5118 M Mark
5119 Mc Spacing mark
5120 Me Enclosing mark
5121 Mn Non-spacing mark
5122
5123 N Number
5124 Nd Decimal number
5125 Nl Letter number
5126 No Other number
5127
5128 P Punctuation
5129 Pc Connector punctuation
5130 Pd Dash punctuation
5131 Pe Close punctuation
5132 Pf Final punctuation
5133 Pi Initial punctuation
5134 Po Other punctuation
5135 Ps Open punctuation
5136
5137 S Symbol
5138 Sc Currency symbol
5139 Sk Modifier symbol
5140 Sm Mathematical symbol
5141 So Other symbol
5142
5143 Z Separator
5144 Zl Line separator
5145 Zp Paragraph separator
5146 Zs Space separator
5147
5148 The special property L& is also supported: it matches a character that
5149 has the Lu, Ll, or Lt property, in other words, a letter that is not
5150 classified as a modifier or "other".
5151
5152 The Cs (Surrogate) property applies only to characters in the range
5153 U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
5154 so cannot be tested by PCRE, unless UTF validity checking has been
5155 turned off (see the discussion of PCRE_NO_UTF8_CHECK,
5156 PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl
5157 does not support the Cs property.
5158
5159 The long synonyms for property names that Perl supports (such as
5160 \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
5161 any of these properties with "Is".
5162
5163 No character that is in the Unicode table has the Cn (unassigned) prop-
5164 erty. Instead, this property is assumed for any code point that is not
5165 in the Unicode table.
5166
5167 Specifying caseless matching does not affect these escape sequences.
5168 For example, \p{Lu} always matches only upper case letters. This is
5169 different from the behaviour of current versions of Perl.
5170
5171 Matching characters by Unicode property is not fast, because PCRE has
5172 to do a multistage table lookup in order to find a character's prop-
5173 erty. That is why the traditional escape sequences such as \d and \w do
5174 not use Unicode properties in PCRE by default, though you can make them
5175 do so by setting the PCRE_UCP option or by starting the pattern with
5176 (*UCP).
5177
5178 Extended grapheme clusters
5179
5180 The \X escape matches any number of Unicode characters that form an
5181 "extended grapheme cluster", and treats the sequence as an atomic group
5182 (see below). Up to and including release 8.31, PCRE matched an ear-
5183 lier, simpler definition that was equivalent to
5184
5185 (?>\PM\pM*)
5186
5187 That is, it matched a character without the "mark" property, followed
5188 by zero or more characters with the "mark" property. Characters with
5189 the "mark" property are typically non-spacing accents that affect the
5190 preceding character.
5191
5192 This simple definition was extended in Unicode to include more compli-
5193 cated kinds of composite character by giving each character a grapheme
5194 breaking property, and creating rules that use these properties to
5195 define the boundaries of extended grapheme clusters. In releases of
5196 PCRE later than 8.31, \X matches one of these clusters.
5197
5198 \X always matches at least one character. Then it decides whether to
5199 add additional characters according to the following rules for ending a
5200 cluster:
5201
5202 1. End at the end of the subject string.
5203
5204 2. Do not end between CR and LF; otherwise end after any control char-
5205 acter.
5206
5207 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
5208 characters are of five types: L, V, T, LV, and LVT. An L character may
5209 be followed by an L, V, LV, or LVT character; an LV or V character may
5210 be followed by a V or T character; an LVT or T character may be followed
5211 only by a T character.
5212
5213 4. Do not end before extending characters or spacing marks. Characters
5214 with the "mark" property always have the "extend" grapheme breaking
5215 property.
5216
5217 5. Do not end after prepend characters.
5218
5219 6. Otherwise, end the cluster.
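
Rule 3 is the only one in the list with a small lookup table of its own,
and it can be sketched as follows. The enum and function names are
hypothetical; assigning a Hangul type to each character needs the Unicode
grapheme breaking property data, which is not shown.

```c
#include <assert.h>

typedef enum {
    HANGUL_L, HANGUL_V, HANGUL_T, HANGUL_LV, HANGUL_LVT
} hangul_type;

/* Hypothetical helper: non-zero if a cluster must not end between a
   Hangul character of type `prev` and a following one of type `next`. */
static int hangul_keeps_together(hangul_type prev, hangul_type next)
{
    switch (prev) {
    case HANGUL_L:
        return next == HANGUL_L || next == HANGUL_V ||
               next == HANGUL_LV || next == HANGUL_LVT;
    case HANGUL_LV:
    case HANGUL_V:
        return next == HANGUL_V || next == HANGUL_T;
    case HANGUL_LVT:
    case HANGUL_T:
        return next == HANGUL_T;
    }
    assert(!"unknown hangul_type");
    return 0;
}
```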
5220
5221 PCRE's additional properties
5222
5223 As well as the standard Unicode properties described above, PCRE sup-
5224 ports four more that make it possible to convert traditional escape
5225 sequences such as \w and \s and POSIX character classes to use Unicode
5226 properties. PCRE uses these non-standard, non-Perl properties inter-
5227 nally when PCRE_UCP is set. However, they may also be used explicitly.
5228 These properties are:
5229
5230 Xan Any alphanumeric character
5231 Xps Any POSIX space character
5232 Xsp Any Perl space character
5233 Xwd Any Perl "word" character
5234
5235 Xan matches characters that have either the L (letter) or the N (num-
5236 ber) property. Xps matches the characters tab, linefeed, vertical tab,
5237 form feed, or carriage return, and any other character that has the Z
5238 (separator) property. Xsp is the same as Xps, except that vertical tab
5239 is excluded. Xwd matches the same characters as Xan, plus underscore.
5240
5241 There is another non-standard property, Xuc, which matches any charac-
5242 ter that can be represented by a Universal Character Name in C++ and
5243 other programming languages. These are the characters $, @, ` (grave
5244 accent), and all characters with Unicode code points greater than or
5245 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
5246 most base (ASCII) characters are excluded. (Universal Character Names
5247 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
5248 Note that the Xuc property does not match these sequences but the char-
5249 acters that they represent.)
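
The Xuc description amounts to a simple range test. In the sketch below
the function name has_xuc_property is hypothetical, and the upper limit of
U+10FFFF is an assumption (the Unicode code point limit) rather than
something stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: non-zero if code point c has the Xuc property as
   described in the text. */
static int has_xuc_property(uint32_t c)
{
    assert(c <= 0x10FFFF);                     /* assumed upper limit */
    if (c == '$' || c == '@' || c == '`') return 1;
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogates excluded */
    return c >= 0x00A0;
}
```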
5250
5251 Resetting the match start
5252
5253 The escape sequence \K causes any previously matched characters not to
5254 be included in the final matched sequence. For example, the pattern:
5255
5256 foo\Kbar
5257
5258 matches "foobar", but reports that it has matched "bar". This feature
5259 is similar to a lookbehind assertion (described below). However, in
5260 this case, the part of the subject before the real match does not have
5261 to be of fixed length, as lookbehind assertions do. The use of \K does
5262 not interfere with the setting of captured substrings. For example,
5263 when the pattern
5264
5265 (foo)\Kbar
5266
5267 matches "foobar", the first substring is still set to "foo".
5268
5269 Perl documents that the use of \K within assertions is "not well
5270 defined". In PCRE, \K is acted upon when it occurs inside positive
5271 assertions, but is ignored in negative assertions.
5272
5273 Simple assertions
5274
5275 The final use of backslash is for certain simple assertions. An asser-
5276 tion specifies a condition that has to be met at a particular point in
5277 a match, without consuming any characters from the subject string. The
5278 use of subpatterns for more complicated assertions is described below.
5279 The backslashed assertions are:
5280
5281 \b  matches at a word boundary
5282 \B  matches when not at a word boundary
5283 \A  matches at the start of the subject
5284 \Z  matches at the end of the subject
5285     also matches before a newline at the end of the subject
5286 \z  matches only at the end of the subject
5287 \G  matches at the first matching position in the subject
5288
5289 Inside a character class, \b has a different meaning; it matches the
5290 backspace character. If any other of these assertions appears in a
5291 character class, by default it matches the corresponding literal char-
5292 acter (for example, \B matches the letter B). However, if the
5293 PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
5294 ated instead.
5295
5296 A word boundary is a position in the subject string where the current
5297 character and the previous character do not both match \w or \W (i.e.
5298 one matches \w and the other matches \W), or the start or end of the
5299 string if the first or last character matches \w, respectively. In a
5300 UTF mode, the meanings of \w and \W can be changed by setting the
5301 PCRE_UCP option. When this is done, it also affects \b and \B. Neither
5302 PCRE nor Perl has a separate "start of word" or "end of word" metase-
5303 quence. However, whatever follows \b normally determines which it is.
5304 For example, the fragment \ba matches "a" at the start of a word.
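
The definition of a word boundary can be restated as a short C predicate.
This is a sketch for 8-bit subjects without PCRE_UCP; the function names
are hypothetical, and isalnum() in the current C locale stands in for
PCRE's character tables.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: a "word" character is a letter, digit or
   underscore. */
static int is_word_char(unsigned char c)
{
    return c == '_' || isalnum(c);
}

/* Hypothetical helper: non-zero if \b would succeed at offset pos.
   Positions before the start and after the end count as non-word. */
static int at_word_boundary(const char *s, size_t len, size_t pos)
{
    int before, after;
    assert(s != NULL && pos <= len);
    before = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    after  = pos < len && is_word_char((unsigned char)s[pos]);
    return before != after;
}
```

Negating the result gives \B.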
5305
5306 The \A, \Z, and \z assertions differ from the traditional circumflex
5307 and dollar (described in the next section) in that they only ever match
5308 at the very start and end of the subject string, whatever options are
5309 set. Thus, they are independent of multiline mode. These three asser-
5310 tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
5311 affect only the behaviour of the circumflex and dollar metacharacters.
5312 However, if the startoffset argument of pcre_exec() is non-zero, indi-
5313 cating that matching is to start at a point other than the beginning of
5314 the subject, \A can never match. The difference between \Z and \z is
5315 that \Z matches before a newline at the end of the string as well as at
5316 the very end, whereas \z matches only at the end.
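
The three assertions reduce to simple position tests. The sketch below
assumes that the newline convention is a single LF character; the
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers: pos is an offset into a subject of length len. */
static int at_backslash_A(size_t pos)
{
    return pos == 0;
}

static int at_backslash_z(size_t len, size_t pos)
{
    return pos == len;
}

static int at_backslash_Z(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    /* The very end, or just before a newline that ends the subject. */
    return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```

Because matching never visits offsets below startoffset, a non-zero
startoffset means at_backslash_A() is never true for any position tried.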
5317
5318 The \G assertion is true only when the current matching position is at
5319 the start point of the match, as specified by the startoffset argument
5320 of pcre_exec(). It differs from \A when the value of startoffset is
5321 non-zero. By calling pcre_exec() multiple times with appropriate argu-
5322 ments, you can mimic Perl's /g option; it is in this kind of implemen-
5323 tation that \G can be useful.
5324
5325 Note, however, that PCRE's interpretation of \G, as the start of the
5326 current match, is subtly different from Perl's, which defines it as the
5327 end of the previous match. In Perl, these can be different when the
5328 previously matched string was empty. Because PCRE does just one match
5329 at a time, it cannot reproduce this behaviour.
5330
5331 If all the alternatives of a pattern begin with \G, the expression is
5332 anchored to the starting match position, and the "anchored" flag is set
5333 in the compiled regular expression.
5334
5335
5336 CIRCUMFLEX AND DOLLAR
5337
5338 The circumflex and dollar metacharacters are zero-width assertions.
5339 That is, they test for a particular condition being true without con-
5340 suming any characters from the subject string.
5341
5342 Outside a character class, in the default matching mode, the circumflex
5343 character is an assertion that is true only if the current matching
5344 point is at the start of the subject string. If the startoffset argu-
5345 ment of pcre_exec() is non-zero, circumflex can never match if the
5346 PCRE_MULTILINE option is unset. Inside a character class, circumflex
5347 has an entirely different meaning (see below).
5348
5349 Circumflex need not be the first character of the pattern if a number
5350 of alternatives are involved, but it should be the first thing in each
5351 alternative in which it appears if the pattern is ever to match that
5352 branch. If all possible alternatives start with a circumflex, that is,
5353 if the pattern is constrained to match only at the start of the sub-
5354 ject, it is said to be an "anchored" pattern. (There are also other
5355 constructs that can cause a pattern to be anchored.)
5356
5357 The dollar character is an assertion that is true only if the current
5358 matching point is at the end of the subject string, or immediately
5359 before a newline at the end of the string (by default). Note, however,
5360 that it does not actually match the newline. Dollar need not be the
5361 last character of the pattern if a number of alternatives are involved,
5362 but it should be the last item in any branch in which it appears. Dol-
5363 lar has no special meaning in a character class.
5364
5365 The meaning of dollar can be changed so that it matches only at the
5366 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
5367 compile time. This does not affect the \Z assertion.
5368
5369 The meanings of the circumflex and dollar characters are changed if the
5370 PCRE_MULTILINE option is set. When this is the case, a circumflex
5371 matches immediately after internal newlines as well as at the start of
5372 the subject string. It does not match after a newline that ends the
5373 string. A dollar matches before any newlines in the string, as well as
5374 at the very end, when PCRE_MULTILINE is set. When newline is specified
5375 as the two-character sequence CRLF, isolated CR and LF characters do
5376 not indicate newlines.
5377
5378 For example, the pattern /^abc$/ matches the subject string "def\nabc"
5379 (where \n represents a newline) in multiline mode, but not otherwise.
5380 Consequently, patterns that are anchored in single line mode because
5381 all branches start with ^ are not anchored in multiline mode, and a
5382 match for circumflex is possible when the startoffset argument of
5383 pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
5384 PCRE_MULTILINE is set.
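
The rules for circumflex and dollar in both modes can be summarised in
code. This sketch assumes a single LF as the newline convention and
ignores PCRE_NOTBOL and PCRE_NOTEOL; the function names and the int flags
standing in for the options are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: non-zero if ^ would succeed at offset pos. */
static int circumflex_matches(const char *s, size_t len, size_t pos,
                              int multiline)
{
    assert(s != NULL && pos <= len);
    if (pos == 0) return 1;
    /* After an internal newline, but not after one that ends the string. */
    return multiline && pos < len && s[pos - 1] == '\n';
}

/* Hypothetical helper: non-zero if $ would succeed at offset pos. */
static int dollar_matches(const char *s, size_t len, size_t pos,
                          int multiline, int dollar_endonly)
{
    assert(s != NULL && pos <= len);
    if (pos == len) return 1;
    if (multiline) return s[pos] == '\n';   /* dollar_endonly is ignored */
    return !dollar_endonly && pos + 1 == len && s[pos] == '\n';
}
```

For the subject "def\nabc", circumflex_matches() is true at offset 4 only
when multiline is set, which is why /^abc$/ matches in multiline mode but
not otherwise.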
5385
5386 Note that the sequences \A, \Z, and \z can be used to match the start
5387 and end of the subject in both modes, and if all branches of a pattern
5388 start with \A it is always anchored, whether or not PCRE_MULTILINE is
5389 set.
5390
5391
5392 FULL STOP (PERIOD, DOT) AND \N
5393
5394 Outside a character class, a dot in the pattern matches any one charac-
5395 ter in the subject string except (by default) a character that signi-
5396 fies the end of a line.
5397
5398 When a line ending is defined as a single character, dot never matches
5399 that character; when the two-character sequence CRLF is used, dot does
5400 not match CR if it is immediately followed by LF, but otherwise it
5401 matches all characters (including isolated CRs and LFs). When any Uni-
5402 code line endings are being recognized, dot does not match CR or LF or
5403 any of the other line ending characters.
5404
5405 The behaviour of dot with regard to newlines can be changed. If the
5406 PCRE_DOTALL option is set, a dot matches any one character, without
5407 exception. If the two-character sequence CRLF is present in the subject
5408 string, it takes two dots to match it.
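
The way dot depends on the newline convention can be sketched as below.
The sketch is for 8-bit non-UTF-8 mode; the enum and function names are
hypothetical, and for the "any" convention the set of line-ending
characters is assumed to be the single characters listed earlier for \R
(LF, VT, FF, CR, NEL).

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANY } newline_convention;

/* Hypothetical helper: non-zero if a dot would match the character at
   s[pos]. `dotall` stands in for the PCRE_DOTALL option. */
static int dot_matches(const unsigned char *s, size_t len, size_t pos,
                       newline_convention nl, int dotall)
{
    assert(s != NULL && pos <= len);
    if (pos >= len) return 0;          /* no character left to match */
    if (dotall) return 1;              /* any one character */

    switch (nl) {
    case NL_CR:
        return s[pos] != '\r';
    case NL_LF:
        return s[pos] != '\n';
    case NL_CRLF:
        /* Only a CR immediately followed by LF is refused. */
        return !(s[pos] == '\r' && pos + 1 < len && s[pos + 1] == '\n');
    case NL_ANY:
        return s[pos] != '\n' && s[pos] != 0x0b && s[pos] != '\f' &&
               s[pos] != '\r' && s[pos] != 0x85;
    }
    return 0;
}
```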
5409
5410 The handling of dot is entirely independent of the handling of circum-
5411 flex and dollar, the only relationship being that they both involve
5412 newlines. Dot has no special meaning in a character class.
5413
5414 The escape sequence \N behaves like a dot, except that it is not
5415 affected by the PCRE_DOTALL option. In other words, it matches any
5416 character except one that signifies the end of a line. Perl also uses
5417 \N to match characters by name; PCRE does not support this.
5418
5419
5420 MATCHING A SINGLE DATA UNIT
5421
5422 Outside a character class, the escape sequence \C matches any one data
5423 unit, whether or not a UTF mode is set. In the 8-bit library, one data
5424 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
5425 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
5426 line-ending characters. The feature is provided in Perl in order to
5427 match individual bytes in UTF-8 mode, but it is unclear how it can use-
5428 fully be used. Because \C breaks up characters into individual data
5429 units, matching one unit with \C in a UTF mode means that the rest of
5430 the string may start with a malformed UTF character. This has undefined
5431 results, because PCRE assumes that it is dealing with valid UTF strings
5432 (and by default it checks this at the start of processing unless the
5433 PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option
5434 is used).
5435
5436 PCRE does not allow \C to appear in lookbehind assertions (described
5437 below) in a UTF mode, because this would make it impossible to calcu-
5438 late the length of the lookbehind.
5439
5440 In general, the \C escape sequence is best avoided. However, one way of
5441 using it that avoids the problem of malformed UTF characters is to use
5442 a lookahead to check the length of the next character, as in this pat-
5443 tern, which could be used with a UTF-8 string (ignore white space and
5444 line breaks):
5445
5446 (?| (?=[\x00-\x7f])(\C) |
5447     (?=[\x80-\x{7ff}])(\C)(\C) |
5448     (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
5449     (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
5450
5451 A group that starts with (?| resets the capturing parentheses numbers
5452 in each alternative (see "Duplicate Subpattern Numbers" below). The
5453 assertions at the start of each branch check the next UTF-8 character
5454 for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
5455 character's individual bytes are then captured by the appropriate num-
5456 ber of groups.
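
The four lookahead ranges in that pattern correspond to the UTF-8 encoded
length of a code point. A minimal C restatement, with the hypothetical
name utf8_unit_count and the same upper limit as the pattern, is:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of 8-bit data units (bytes) used by the
   UTF-8 encoding of code point c, using the ranges from the pattern. */
static int utf8_unit_count(uint32_t c)
{
    assert(c <= 0x1fffff);
    if (c <= 0x7f)   return 1;
    if (c <= 0x7ff)  return 2;
    if (c <= 0xffff) return 3;
    return 4;
}
```

The result is the number of \C items in the branch that the pattern would
take for that character.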
5457
5458
5459 SQUARE BRACKETS AND CHARACTER CLASSES
5460
5461 An opening square bracket introduces a character class, terminated by a
5462 closing square bracket. A closing square bracket on its own is not spe-
5463 cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
5464 a lone closing square bracket causes a compile-time error. If a closing
5465 square bracket is required as a member of the class, it should be the
5466 first data character in the class (after an initial circumflex, if
5467 present) or escaped with a backslash.
5468
5469 A character class matches a single character in the subject. In a UTF
5470 mode, the character may be more than one data unit long. A matched
5471 character must be in the set of characters defined by the class, unless
5472 the first character in the class definition is a circumflex, in which
5473 case the subject character must not be in the set defined by the class.
5474 If a circumflex is actually required as a member of the class, ensure
5475 it is not the first character, or escape it with a backslash.
5476
5477 For example, the character class [aeiou] matches any lower case vowel,
5478 while [^aeiou] matches any character that is not a lower case vowel.
5479 Note that a circumflex is just a convenient notation for specifying the
5480 characters that are in the class by enumerating those that are not. A
5481 class that starts with a circumflex is not an assertion; it still con-
5482 sumes a character from the subject string, and therefore it fails if
5483 the current pointer is at the end of the string.
5484
5485 In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
5486 (0xffff) can be included in a class as a literal string of data units,
5487 or by using the \x{ escaping mechanism.
5488
5489 When caseless matching is set, any letters in a class represent both
5490 their upper case and lower case versions, so for example, a caseless
5491 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
5492 match "A", whereas a caseful version would. In a UTF mode, PCRE always
5493 understands the concept of case for characters whose values are less
5494 than 128, so caseless matching is always possible. For characters with
5495 higher values, the concept of case is supported if PCRE is compiled
5496 with Unicode property support, but not otherwise. If you want to use
5497 caseless matching in a UTF mode for characters 128 and above, you must
5498 ensure that PCRE is compiled with Unicode property support as well as
5499 with UTF support.
5500
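The caseless behaviour for ASCII letters can be sketched as follows.
This uses Python's re module with re.IGNORECASE in place of PCRE with
PCRE_CASELESS (an assumption made for the sake of a runnable example;
the two engines agree for these ASCII cases). The helper class_matches
is hypothetical.

```python
import re

def class_matches(char_class, ch, caseless=False):
    """True if the one-character string ch is matched by char_class."""
    flags = re.IGNORECASE if caseless else 0
    return re.fullmatch(char_class, ch, flags) is not None

print(class_matches("[aeiou]", "A", caseless=True))    # True
print(class_matches("[^aeiou]", "A", caseless=True))   # False
print(class_matches("[^aeiou]", "A", caseless=False))  # True
```
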
5501 Characters that might indicate line breaks are never treated in any
5502 special way when matching character classes, whatever line-ending
5503 sequence is in use, and whatever setting of the PCRE_DOTALL and
5504 PCRE_MULTILINE options is used. A class such as [^a] always matches one
5505 of these characters.
5506
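A small sketch of this point, again using Python's re module as a
stand-in for PCRE (an assumption; re.DOTALL and re.MULTILINE play the
roles of PCRE_DOTALL and PCRE_MULTILINE here). The helper
matches_newline is invented for the example. The class [^a] matches a
newline whatever the flags, whereas the dot depends on the DOTALL
setting.

```python
import re

def matches_newline(pattern, flags=0):
    """True if pattern matches a subject consisting of one newline."""
    return re.fullmatch(pattern, "\n", flags) is not None

print(matches_newline(r"[^a]"))                            # True
print(matches_newline(r"[^a]", re.DOTALL | re.MULTILINE))  # True
print(matches_newline(r"."))                               # False
print(matches_newline(r".", re.DOTALL))                    # True
```
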
5507 The minus (hyphen) character can be used to specify a range of charac-
5508 ters in a character class. For example, [d-m] matches any letter
5509 between d and m, inclusive. If a minus character is required in a
5510 class, it must be escaped with a backslash or appear in a position
5511 where it cannot be interpreted as indicating a range, typically as the
5512 first or last character in the class.
5513
5514 It is not possible to have the literal character "]" as the end charac-
5515 ter of a range. A pattern such as [W-]46] is interpreted as a class of
5516 two characters ("W" and "-") followed by a literal string "46]", so it
5517 would match "W46]" or "-46]". However, if the "]" is escaped with a
5518 backslash it is interpreted as the end of range, so [W-\]46] is inter-
5519 preted as a class containing a range followed by two other characters.
5520 The octal or hexadecimal representation of "]" can also be used to end
5521 a range.
5522
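The hyphen and "]" rules above can be tried out with the following
sketch. It uses Python's re module rather than PCRE (an assumption: the
two engines parse these particular classes the same way), and the
helper name full is hypothetical.

```python
import re

def full(pattern, subject):
    """True if the whole subject is matched by pattern."""
    return re.fullmatch(pattern, subject) is not None

RANGE = r"[d-m]"        # any letter from "d" to "m" inclusive
HYPHEN = r"[-a]"        # leading hyphen is a literal, not a range
UNESCAPED = r"[W-]46]"  # class of "W" and "-", then literal "46]"
ESCAPED = r"[W-\]46]"   # class: range "W" to "]", plus "4" and "6"

print(full(RANGE, "h"), full(RANGE, "n"))               # True False
print(full(HYPHEN, "-"))                                # True
print(full(UNESCAPED, "W46]"), full(UNESCAPED, "-46]")) # True True
print(full(ESCAPED, "]"), full(ESCAPED, "-"))           # True False
```
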
5523 Ranges operate in the collating sequence of character values. They can
5524 also be used for characters specified numerically, for example
5525 [\000-\037]. Ranges can include any characters that are valid for the
5526 current mode.
5527
5528 If a range that includes letters is used when caseless matching is set,
5529 it matches the letters in either case. For example, [W-c] is equivalent
5530 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
5531 character tables for a French locale are in use, [\xc8-\xcb] matches
5532 accented E characters in both cases. In UTF modes, PCRE supports the
5533 concept of case for characters with values greater than 127 only when
5534 it is compiled with Unicode property support.
5535
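The [W-c] example can be checked with a short sketch. Python's re
module is used here in place of PCRE (an assumption; for this ASCII
range both engines accept the same characters), and in_class is a
made-up helper name.

```python
import re

CASELESS_RANGE = re.compile(r"[W-c]", re.IGNORECASE)

def in_class(ch):
    """True if the single character ch is in the caseless range."""
    return CASELESS_RANGE.fullmatch(ch) is not None

# Letters are accepted in either case; "d" and "v" lie outside the range.
accepted = [ch for ch in "WwZzAaCc" if in_class(ch)]
rejected = [ch for ch in "DdVv" if not in_class(ch)]
print(accepted)
print(rejected)
```

Without the caseless flag, the same class would accept "W" and "c" but
not "w" or "C".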
5536 The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
5537 \w, and \W may appear in a character class, and add the characters that
5538 they match to the class. For example, [\dABCDEF] matches any hexadeci-
5539 mal digit. In UTF modes, the PCRE_UCP option affects the meanings of
5540 \d, \s, \w and their upper case partners, just as it does when they
5541 appear outside a character class, as described in the section entitled
5542 "Generic character types" above. The escape sequence \b has a different
5543 meaning inside a character class; it matches the backspace character.
5544 The sequences \B, \N, \R, and \X are not special inside a character
5545 class. Like any other unrecognized escape sequences, they are treated
5546 as the literal characters "B", "N", "R", and "X" by default, but cause
5547 an error if the PCRE_EXTRA option is set.
5548
5549 A circumflex can conveniently be used with the upper case character
5550 types to specify a more restricted set of characters than the matching
5551 lower case type. For example, the class [^\W_] matches any letter or
5552 digit, but not underscore, whereas [\w] includes underscore. A positive
5553 character class should be read as "something OR something OR ..." and a
5554 negative class as "NOT something AND NOT something AND NOT ...".
5555
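The classes mentioned above can be exercised with the sketch below. It
relies on Python's re module rather than PCRE (an assumption; re.ASCII
is set so that \d and \w cover only ASCII characters, which is closer
to PCRE without PCRE_UCP). The constant names are invented for the
example.

```python
import re

HEX_DIGIT = re.compile(r"[\dABCDEF]", re.ASCII)
BACKSPACE = re.compile(r"[\b]")           # \b in a class is backspace
ALNUM = re.compile(r"[^\W_]", re.ASCII)   # NOT non-word AND NOT "_"
WORD = re.compile(r"[\w]", re.ASCII)

print(bool(HEX_DIGIT.fullmatch("E")), bool(HEX_DIGIT.fullmatch("G")))
print(bool(BACKSPACE.fullmatch("\x08")))
print(bool(ALNUM.fullmatch("_")), bool(WORD.fullmatch("_")))
```
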
5556 The only metacharacters that are recognized in character classes are
5557 backslash, hyphen (only where it can be interpreted as specifying a
5558 range), circumflex (only at the start), opening square bracket (only
5559 when it can be interpreted as introducing a POSIX class name - see the
5560 next section), and the terminating closing square bracket. However,
5561 escaping other non-alphanumeric characters does no harm.
5562
5563
5564 POSIX CHARACTER CLASSES
5565
5566 Perl supports the POSIX notation for character classes. This uses names
5567 enclosed by [: and :] within the enclosing square brackets. PCRE also
5568 supports this notation. For example,
5569
5570 [01[:alpha:]%]
5571
5572 matches "0", "1", any alphabetic character, or "%". The supported class
5573 names are:
5574
5575 alnum    letters and digits
5576 alpha    letters
5577 ascii    character codes 0 - 127
5578 blank    space or tab only
5579 cntrl    control characters
5580 digit    decimal digits (same as \d)
5581 graph    printing characters, excluding space
5582 lower    lower case letters
5583 print    printing characters, including space
5584 punct    printing characters, excluding letters and digits and space
5585 space    white space (not quite the same as \s)
5586 upper    upper case letters
5587 word     "word" characters (same as \w)
5588 xdigit   hexadecimal digits
5589
5590 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
5591 and space (32). Notice that this list includes the VT character (code
5592 11). This makes "space" different to \s, which does not include VT (for
5593 Perl compatibility).
5594
5595 The name "word" is a Perl extension, and "blank" is a GNU extension
5596 from Perl 5.8. Another Perl extension is negation, which is indicated
5597 by a ^ character after the colon. For example,
5598
5599 [12[:^digit:]]
5600
5601 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
5602 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
5603 these are not supported, and an error is given if they are encountered.
5604
5605 By default, in UTF modes, characters with values greater than 127 do
5606 not match any of the POSIX character classes. However, if the PCRE_UCP
5607 option is passed to pcre_compile(), some of the classes are changed so
5608 that Unicode character properties are used. This is achieved by replac-
5609 ing the POSIX classes by other sequences, as follows:
5610
5611 [:alnum:] becomes \p{Xan}
5612 [:alpha:] becomes \p{L}
5613 [:blank:] becomes \h
5614 [:digit:] becomes \p{Nd}
5615 [:lower:] becomes \p{Ll}
5616 [:space:] becomes \p{Xps}
5617 [:upper:] becomes \p{Lu}
5618 [:word:] becomes \p{Xwd}
5619
5620 Negated versions, such as [:^alpha:], use \P instead of \p. The other
5621 POSIX classes are unchanged, and match only characters with code points
5622 less than 128.
5623
5624
5625 VERTICAL BAR
5626
5627 Vertical bar characters are used to separate alternative patterns. For
5628 example, the pattern
5629
5630 gilbert|sullivan
5631
5632 matches either "gilbert" or "sullivan". Any number of alternatives may
5633 appear, and an empty alternative is permitted (matching the empty
5634 string). The matching process tries each alternative in turn, from left
5635 to right, and the first one that succeeds is used. If the alternatives
5636 are within a subpattern (defined below), "succeeds" means matching the
5637 rest of the main pattern as well as the alternative in the subpattern.
5638
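The left-to-right rule can be sketched as follows, using Python's re
module as a stand-in for PCRE (an assumption; both are backtracking
matchers that try alternatives in order). The helper matched_text is
hypothetical.

```python
import re

def matched_text(pattern, subject):
    """Text matched at the start of subject, or None."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None

print(matched_text(r"gilbert|sullivan", "sullivan"))  # sullivan
# The first alternative that succeeds wins, even if a later one is longer.
print(matched_text(r"a|ab", "ab"))                    # a
# Inside a subpattern, "succeeds" includes the rest of the pattern.
print(re.fullmatch(r"(a|ab)c", "abc").group(1))       # ab
# An empty alternative matches the empty string.
print(repr(matched_text(r"cat|", "dog")))             # ''
```
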
5639
5640 INTERNAL OPTION SETTING
5641
5642 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
5643 PCRE_EXTENDED options (which are Perl-compatible) can be changed from
5644 within the pattern by a sequence of Perl option letters enclosed
5645 between "(?" and ")". The option letters are
5646
5647 i for PCRE_CASELESS
5648 m for PCRE_MULTILINE
5649 s for PCRE_DOTALL
5650 x for PCRE_EXTENDED
5651
5652 For example, (?im) sets caseless, multiline matching. It is also possi-
5653 ble to unset these options by preceding the letter with a hyphen, and a
5654 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
5655 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
5656 is also permitted. If a letter appears both before and after the
5657 hyphen, the option is unset.
5658
5659 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
5660 can be changed in the same way as the Perl-compatible options by using
5661 the characters J, U and X respectively.
5662
5663 When one of these option changes occurs at top level (that is, not
5664 inside subpattern parentheses), the change applies to the remainder of
5665 the pattern that follows. If the change is placed right at the start of
5666 a pattern, PCRE extracts it into the global options (and it will there-
5667 fore show up in data extracted by the pcre_fullinfo() function).
5668
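A loosely analogous behaviour can be observed in Python's re module,
which is used here only as a stand-in (an assumption; this is not the
PCRE API, and the flags attribute is merely comparable to the options
reported by pcre_fullinfo()). Option letters at the very start of the
pattern become flags of the compiled pattern as a whole.

```python
import re

PATTERN = re.compile(r"(?im)^b$")

# The leading (?im) shows up in the compiled pattern's flags.
caseless = bool(PATTERN.flags & re.IGNORECASE)
multiline = bool(PATTERN.flags & re.MULTILINE)
hit = PATTERN.search("a\nB\nc")

print(caseless, multiline)  # True True
print(hit.group(0))         # B
```

Note that recent Python versions accept such unscoped option letters
only at the start of a pattern, which is more restrictive than the
rules described in this section.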
5669 An option change within a subpattern (see below for a description of
5670 subpatterns) affects only that part of the subpattern that follows it,
5671 so
5672
5673 (a(?i)b)c
5674
5675 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
5676 used). By this means, options can be made to have different settings
5677 in different parts of the pattern. Any changes made in one alternative
5678 do carry on into subsequent branches within the same subpattern. For
5679 example,
5680
5681 (a(?i)b|c)
5682
5683 matches "ab", "aB", "c", and "C", even though when matching "C" the
5684 first branch is abandoned before the option setting. This is because
5685 the effects of option settings happen at compile time. There would be
5686 some very weird behaviour otherwise.
5687
5688 Note: There are other PCRE-specific options that can be set by the
5689 application when the compiling or matching functions are called. In
5690 some cases the pattern can contain special leading sequences such as
5691 (*CRLF) to override what the application has set or what has been
5692 defaulted. Details are given in the section entitled "Newline
5693 sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32), and
5694 (*UCP) leading sequences that can be used to set UTF and Unicode prop-
5695 erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16,
5696 PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence
5697 is a generic version that can be used with any of the libraries.
5698
5699
5700 SUBPATTERNS
5701
5702 Subpatterns are delimited by parentheses (round brackets), which can be
5703 nested. Turning part of a pattern into a subpattern does two things:
5704
5705 1. It localizes a set of alternatives. For example, the pattern
5706
5707 cat(aract|erpillar|)
5708
5709 matches "cataract", "caterpillar", or "cat". Without the parentheses,
5710 it would match "cataract", "erpillar" or an empty string.
5711
5712 2. It sets up the subpattern as a capturing subpattern. This means
5713 that, when the whole pattern matches, that portion of the subject
5714 string that matched the subpattern is passed back to the caller via the
5715 ovector argument of the matching function. (This applies only to the
5716 traditional matching functions; the DFA matching functions do not sup-
5717 port capturing.)
5718
5719 Opening parentheses are counted from left to right (starting from 1) to
5720 obtain numbers for the capturing subpatterns. For example, if the
5721 string "the red king" is matched against the pattern
5722
5723 the ((red|white) (king|queen))
5724
5725 the captured substrings are "red king", "red", and "king", and are num-
5726 bered 1, 2, and 3, respectively.
5727
5728 The fact that plain parentheses fulfil two functions is not always
5729 helpful. There are often times when a grouping subpattern is required
5730 without a capturing requirement. If an opening parenthesis is followed
5731 by a question mark and a colon, the subpattern does not do any captur-
5732 ing, and is not counted when computing the number of any subsequent
5733 capturing subpatterns. For example, if the string "the white queen" is
5734 matched against the pattern
5735
5736 the ((?:red|white) (king|queen))
5737
5738 the captured substrings are "white queen" and "queen", and are numbered
5739 1 and 2. The maximum number of capturing subpatterns is 65535.
5740
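The numbering of capturing and non-capturing parentheses can be seen in
this sketch. It uses Python's re module in place of PCRE (an
assumption; groups are numbered the same way in both), so groups()
stands in for the substrings that PCRE returns through the ovector.

```python
import re

nested = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
grouped = re.fullmatch(r"the ((?:red|white) (king|queen))",
                       "the white queen")

print(nested.groups())   # ('red king', 'red', 'king')
print(grouped.groups())  # ('white queen', 'queen')
```
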
5741 As a convenient shorthand, if any option settings are required at the
5742 start of a non-capturing subpattern, the option letters may appear
5743 between the "?" and the ":". Thus the two patterns
5744
5745 (?i:saturday|sunday)
5746 (?:(?i)saturday|sunday)
5747
5748 match exactly the same set of strings. Because alternative branches are
5749 tried from left to right, and options are not reset until the end of
5750 the subpattern is reached, an option setting in one branch does affect
5751 subsequent branches, so the above patterns match "SUNDAY" as well as
5752 "Saturday".
5753
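The shorthand form can be tried with Python's re module, used here as a
stand-in for PCRE (an assumption; the (?i:...) syntax is accepted by
Python 3.6 and later). Only the first of the two patterns above is
shown, because recent Python versions reject an unscoped (?i) that is
not at the start of the pattern. The constant names are invented.

```python
import re

DAY = re.compile(r"(?i:saturday|sunday)")
PARTIAL = re.compile(r"(?i:sat)urday")  # option ends with the group

print(bool(DAY.fullmatch("SUNDAY")))        # True
print(bool(DAY.fullmatch("Saturday")))      # True
print(bool(PARTIAL.fullmatch("SATurday")))  # True
print(bool(PARTIAL.fullmatch("SATURDAY")))  # False
```
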
5754
5755 DUPLICATE SUBPATTERN NUMBERS
5756
5757 Perl 5.10 introduced a feature whereby each alternative in a subpattern
5758 uses the same numbers for its capturing parentheses. Such a subpattern
5759 starts with (?| and is itself a non-capturing subpattern. For example,
5760 consider this pattern:
5761
5762 (?|(Sat)ur|(Sun))day
5763
5764 Because the two alternatives are inside a (?| group, both sets of cap-
5765 turing parentheses are numbered one. Thus, when the pattern matches,
5766 you can look at captured substring number one, whichever alternative
5767 matched. This construct is useful when you want to capture part, but
5768 not all, of one of a number of alternatives. Inside a (?| group, paren-
5769 theses are numbered as usual, but the number is reset at the start of
5770 each branch. The numbers of any capturing parentheses that follow the
5771 subpattern start after the highest number used in any branch. The fol-
5772 lowing example is taken from the Perl documentation. The numbers under-
5773 neath show in which buffer the captured content will be stored.
5774
5775 # before  ---------------branch-reset----------- after
5776 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
5777 # 1            2         2  3        2     3     4
5778
5779 A back reference to a numbered subpattern uses the most recent value
5780 that is set for that number by any subpattern. The following pattern
5781 matches "abcabc" or "defdef":
5782
5783 /(?|(abc)|(def))\1/
5784
5785 In contrast, a subroutine call to a numbered subpattern always refers
5786 to the first one in the pattern with the given number. The following
5787 pattern matches "abcabc" or "defabc":
5788
5789 /(?|(abc)|(def))(?1)/
5790
5791 If a condition test for a subpattern's having matched refers to a non-
5792 unique number, the test is true if any of the subpatterns of that num-
5793 ber have matched.
5794
5795 An alternative approach to using this "branch reset" feature is to use
5796 duplicate named subpatterns, as described in the next section.
5797
5798
5799 NAMED SUBPATTERNS
5800
5801 Identifying capturing parentheses by number is simple, but it can be
5802 very hard to keep track of the numbers in complicated regular expres-
5803 sions. Furthermore, if an expression is modified, the numbers may
5804 change. To help with this difficulty, PCRE supports the naming of sub-
5805 patterns. This feature was not added to Perl until release 5.10. Python
5806 had the feature earlier, and PCRE introduced it at release 4.0, using
5807 the Python syntax. PCRE now supports both the Perl and the Python syn-
5808 tax. Perl allows identically numbered subpatterns to have different
5809 names, but PCRE does not.
5810
5811 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
5812 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
5813 to capturing parentheses from other parts of the pattern, such as back
5814 references, recursion, and conditions, can be made by name as well as
5815 by number.
5816
5817 Names consist of up to 32 alphanumeric characters and underscores.
5818 Named capturing parentheses are still allocated numbers as well as
5819 names, exactly as if the names were not present. The PCRE API provides
5820 function calls for extracting the name-to-number translation table from
5821 a compiled pattern. There is also a convenience function for extracting
5822 a captured substring by name.
5823
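Since the (?P<name>...) form comes from Python, the idea can be
sketched with Python's own re module (an assumption: this shows the
Python side only; in PCRE the equivalent facilities are the API calls
described in the pcreapi documentation). The groupindex mapping plays
the role of the name-to-number table, and the sample date is arbitrary.

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")
m = DATE.fullmatch("2013-03-22")

# Named groups are still numbered as if the names were absent.
print(m.group("year"), m.group(1))  # 2013 2013
print(m.group(3))                   # 22
print(dict(DATE.groupindex))        # {'year': 1, 'month': 2}
```
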
5824 By default, a name must be unique within a pattern, but it is possible
5825 to relax this constraint by setting the PCRE_DUPNAMES option at compile
5826 time. (Duplicate names are also always permitted for subpatterns with
5827 the same number, set up as described in the previous section.) Dupli-
5828 cate names can be useful for patterns where only one instance of the
5829 named parentheses can match. Suppose you want to match the name of a
5830 weekday, either as a 3-letter abbreviation or as the full name, and in
5831 both cases you want to extract the abbreviation. This pattern (ignoring
5832 the line breaks) does the job:
5833
5834 (?<DN>Mon|Fri|Sun)(?:day)?|
5835 (?<DN>Tue)(?:sday)?|
5836 (?<DN>Wed)(?:nesday)?|
5837 (?<DN>Thu)(?:rsday)?|
5838 (?<DN>Sat)(?:urday)?
5839
5840 There are five capturing substrings, but only one is ever set after a
5841 match. (An alternative way of solving this problem is to use a "branch
5842 reset" subpattern, as described in the previous section.)
5843
5844 The convenience function for extracting the data by name returns the
5845 substring for the first (and in this example, the only) subpattern of
5846 that name that matched. This saves searching to find which numbered
5847 subpattern it was.
5848
5849 If you make a back reference to a non-unique named subpattern from
5850 elsewhere in the pattern, the one that corresponds to the first occur-
5851 rence of the name is used. In the absence of duplicate numbers (see the
5852 previous section) this is the one with the lowest number. If you use a
5853 named reference in a condition test (see the section about conditions
5854 below), either to check whether a subpattern has matched, or to check
5855 for recursion, all subpatterns with the same name are tested. If the
5856 condition is true for any one of them, the overall condition is true.
5857 This is the same behaviour as testing by number. For further details of
5858 the interfaces for handling named subpatterns, see the pcreapi documen-
5859 tation.
5860
5861 Warning: You cannot use different names to distinguish between two sub-
5862 patterns with the same number because PCRE uses only the numbers when
5863 matching. For this reason, an error is given at compile time if differ-
5864 ent names are given to subpatterns with the same number. However, you
5865 can give the same name to subpatterns with the same number, even when
5866 PCRE_DUPNAMES is not set.
5867
5868
5869 REPETITION
5870
5871 Repetition is specified by quantifiers, which can follow any of the
5872 following items:
5873
5874 a literal data character
5875 the dot metacharacter
5876 the \C escape sequence
5877 the \X escape sequence
5878 the \R escape sequence
5879 an escape such as \d or \pL that matches a single character
5880 a character class
5881 a back reference (see next section)
5882 a parenthesized subpattern (including assertions)
5883 a subroutine call to a subpattern (recursive or otherwise)
5884
5885 The general repetition quantifier specifies a minimum and maximum num-
5886 ber of permitted matches, by giving the two numbers in curly brackets
5887 (braces), separated by a comma. The numbers must be less than 65536,
5888 and the first must be less than or equal to the second. For example:
5889
5890 z{2,4}
5891
5892 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
5893 special character. If the second number is omitted, but the comma is
5894 present, there is no upper limit; if the second number and the comma
5895 are both omitted, the quantifier specifies an exact number of required
5896 matches. Thus
5897
5898 [aeiou]{3,}
5899
5900 matches at least 3 successive vowels, but may match many more, while
5901
5902 \d{8}
5903
5904 matches exactly 8 digits. An opening curly bracket that appears in a
5905 position where a quantifier is not allowed, or one that does not match
5906 the syntax of a quantifier, is taken as a literal character. For exam-
5907 ple, {,6} is not a quantifier, but a literal string of four characters.
5908
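The three quantifier forms above can be checked with the sketch below,
which uses Python's re module as a stand-in for PCRE (an assumption).
The {,6} case is deliberately left out: Python treats {,6} as a
quantifier with a minimum of zero, so it does not reproduce the PCRE
behaviour described here. The helper name full is hypothetical.

```python
import re

def full(pattern, subject):
    """True if the whole subject is matched by pattern."""
    return re.fullmatch(pattern, subject) is not None

print(full(r"z{2,4}", "zz"), full(r"z{2,4}", "zzzzz"))       # True False
print(re.search(r"[aeiou]{3,}", "queueing").group(0))        # ueuei
print(full(r"\d{8}", "20130322"), full(r"\d{8}", "2013032")) # True False
```
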
5909 In UTF modes, quantifiers apply to characters rather than to individual
5910 data units. Thus, for example, \x{100}{2} matches two characters, each
5911 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
5912 larly, \X{3} matches three Unicode extended grapheme clusters, each of
5913 which may be several data units long (and they may be of different
5914 lengths).
5915
5916 The quantifier {0} is permitted, causing the expression to behave as if
5917 the previous item and the quantifier were not present. This may be use-
5918 ful for subpatterns that are referenced as subroutines from elsewhere
5919 in the pattern (but see also the section entitled "Defining subpatterns
5920 for use by reference only" below). Items other than subpatterns that
5921 have a {0} quantifier are omitted from the compiled pattern.
5922
5923 For convenience, the three most common quantifiers have single-charac-
5924 ter abbreviations:
5925
5926 * is equivalent to {0,}
5927 + is equivalent to {1,}
5928 ? is equivalent to {0,1}
5929
5930 It is possible to construct infinite loops by following a subpattern
5931 that can match no characters with a quantifier that has no upper limit,
5932 for example:
5933
5934 (a?)*
5935
5936 Earlier versions of Perl and PCRE used to give an error at compile time
5937 for such patterns. However, because there are cases where this can be
5938 useful, such patterns are now accepted, but if any repetition of the
5939 subpattern does in fact match no characters, the loop is forcibly bro-
5940 ken.
5941
5942 By default, the quantifiers are "greedy", that is, they match as much
5943 as possible (up to the maximum number of permitted times), without
5944 causing the rest of the pattern to fail. The classic example of where
5945 this gives problems is in trying to match comments in C programs. These
5946 appear between /* and */ and within the comment, individual * and /
5947 characters may appear. An attempt to match C comments by applying the
5948 pattern
5949
5950 /\*.*\*/
5951
5952 to the string
5953
5954 /* first comment */ not comment /* second comment */
5955
5956 fails, because it matches the entire string owing to the greediness of
5957 the .* item.
5958
5959 However, if a quantifier is followed by a question mark, it ceases to
5960 be greedy, and instead matches the minimum number of times possible, so
5961 the pattern
5962
5963 /\*.*?\*/
5964
5965 does the right thing with the C comments. The meaning of the various
5966 quantifiers is not otherwise changed, just the preferred number of
5967 matches. Do not confuse this use of question mark with its use as a
5968 quantifier in its own right. Because it has two uses, it can sometimes
5969 appear doubled, as in
5970
5971 \d??\d
5972
5973 which matches one digit by preference, but can match two if that is the
5974 only way the rest of the pattern matches.
5975
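The difference between greedy and minimizing quantifiers can be
reproduced with this sketch. It uses Python's re module rather than
PCRE (an assumption; the two engines behave alike for these patterns),
and the variable names are invented.

```python
import re

SUBJECT = "/* first comment */ not comment /* second comment */"

greedy = re.search(r"/\*.*\*/", SUBJECT).group(0)
lazy = re.search(r"/\*.*?\*/", SUBJECT).group(0)
all_comments = re.findall(r"/\*.*?\*/", SUBJECT)

print(greedy == SUBJECT)  # True: .* ran on to the last */
print(lazy)               # /* first comment */
print(all_comments)

# \d??\d prefers one digit, but takes two if the rest requires it.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
print(one_digit, two_digits)  # 1 12
```
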
5976 If the PCRE_UNGREEDY option is set (an option that is not available in
5977 Perl), the quantifiers are not greedy by default, but individual ones
5978 can be made greedy by following them with a question mark. In other
5979 words, it inverts the default behaviour.
5980
5981 When a parenthesized subpattern is quantified with a minimum repeat
5982 count that is greater than 1 or with a limited maximum, more memory is
5983 required for the compiled pattern, in proportion to the size of the
5984 minimum or maximum.
5985
5986 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
5987 alent to Perl's /s) is set, thus allowing the dot to match newlines,
5988 the pattern is implicitly anchored, because whatever follows will be
5989 tried against every character position in the subject string, so there
5990 is no point in retrying the overall match at any position after the
5991 first. PCRE normally treats such a pattern as though it were preceded
5992 by \A.
5993
5994 In cases where it is known that the subject string contains no new-
5995 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
5996 mization, or alternatively using ^ to indicate anchoring explicitly.
5997
5998 However, there are some cases where the optimization cannot be used.
5999 When .* is inside capturing parentheses that are the subject of a back
6000 reference elsewhere in the pattern, a match at the start may fail where
6001 a later one succeeds. Consider, for example:
6002
6003 (.*)abc\1
6004
6005 If the subject is "xyz123abc123" the match point is the fourth charac-
6006 ter. For this reason, such a pattern is not implicitly anchored.
6007
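The back reference case can be traced with a short sketch, using
Python's re module as a stand-in for PCRE (an assumption). Offsets in
Python count from zero, so a start offset of 3 corresponds to the
fourth character.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

print(m.start())   # 3
print(m.group(0))  # 123abc123
print(m.group(1))  # 123
```

Attempts at offsets 0, 1 and 2 all fail because no value of the first
group is repeated after "abc", which is why a match at the start cannot
be assumed.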
6008 Another case where implicit anchoring is not applied is when the lead-
6009 ing .* is inside an atomic group. Once again, a match at the start may
6010 fail where a later one succeeds. Consider this pattern:
6011
6012 (?>.*?a)b
6013
6014 It matches "ab" in the subject "aab". The use of the backtracking con-
6015 trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
6016
6017 When a capturing subpattern is repeated, the value captured is the sub-
6018 string that matched the final iteration. For example, after
6019
6020 (tweedle[dume]{3}\s*)+
6021
6022 has matched "tweedledum tweedledee" the value of the captured substring
6023 is "tweedledee". However, if there are nested capturing subpatterns,
6024 the corresponding captured values may have been set in previous itera-
6025 tions. For example, after
6026
6027 /(a|(b))+/
6028
6029 matches "aba" the value of the second captured substring is "b".
6030
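Both examples can be run with the sketch below, which uses Python's re
module in place of PCRE (an assumption; like PCRE, it keeps a nested
group's value from an earlier iteration rather than resetting it).

```python
import re

tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                       "tweedledum tweedledee")
nested = re.fullmatch(r"(a|(b))+", "aba")

print(tweedle.group(1))  # tweedledee (the final iteration)
print(nested.group(1))   # a
print(nested.group(2))   # b (set during the second iteration)
```
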
6031
6032 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
6033
6034 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
6035 repetition, failure of what follows normally causes the repeated item
6036 to be re-evaluated to see if a different number of repeats allows the
6037 rest of the pattern to match. Sometimes it is useful to prevent this,
6038 either to change the nature of the match, or to cause it to fail earlier
6039 than it otherwise might, when the author of the pattern knows there is
6040 no point in carrying on.
6041