Contents of /code/trunk/doc/pcre.txt

Revision 73, Sat Feb 24 21:40:30 2007 UTC, by nigel
Load pcre-4.5 into code/trunk.
1 This file contains a concatenation of the PCRE man pages, converted to plain
2 text format for ease of searching with a text editor, or for use on systems
3 that do not have a man page processor. The small individual files that give
4 synopses of each function in the library have not been included. There are
5 separate text files for the pcregrep and pcretest commands.
6 -----------------------------------------------------------------------------
7
8 PCRE(3) PCRE(3)
9
10
11
12 NAME
13 PCRE - Perl-compatible regular expressions
14
15 DESCRIPTION
16
17 The PCRE library is a set of functions that implement regular expres-
18 sion pattern matching using the same syntax and semantics as Perl, with
19 just a few differences. The current implementation of PCRE (release
20 4.x) corresponds approximately with Perl 5.8, including support for
21 UTF-8 encoded strings. However, this support has to be explicitly
22 enabled; it is not the default.
23
24 PCRE is written in C and released as a C library. However, a number of
25 people have written wrappers and interfaces of various kinds. A C++
26 class is included in these contributions, which can be found in the
27 Contrib directory at the primary FTP site, which is:
28
29 ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
30
31 Details of exactly which Perl regular expression features are and are
32 not supported by PCRE are given in separate documents. See the pcrepat-
33 tern and pcrecompat pages.
34
35 Some features of PCRE can be included, excluded, or changed when the
36 library is built. The pcre_config() function makes it possible for a
37 client to discover which features are available. Documentation about
38 building PCRE for various operating systems can be found in the README
39 file in the source distribution.
40
41
42 USER DOCUMENTATION
43
44 The user documentation for PCRE has been split up into a number of dif-
45 ferent sections. In the "man" format, each of these is a separate "man
46 page". In the HTML format, each is a separate page, linked from the
47 index page. In the plain text format, all the sections are concate-
48 nated, for ease of searching. The sections are as follows:
49
50 pcre this document
51 pcreapi details of PCRE's native API
52 pcrebuild options for building PCRE
53 pcrecallout details of the callout feature
54 pcrecompat discussion of Perl compatibility
55 pcregrep description of the pcregrep command
56 pcrepattern syntax and semantics of supported
57 regular expressions
58 pcreperform discussion of performance issues
59 pcreposix the POSIX-compatible API
60 pcresample discussion of the sample program
61 pcretest the pcretest testing command
62
63 In addition, in the "man" and HTML formats, there is a short page for
64 each library function, listing its arguments and results.
65
66
67 LIMITATIONS
68
69 There are some size limitations in PCRE but it is hoped that they will
70 never in practice be relevant.
71
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
is compiled with the default internal linkage size of 2. If you want to
process regular expressions that are truly enormous, you can compile
PCRE with an internal linkage size of 3 or 4 (see the README file in
the source distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However, the speed
of execution will be slower.
79
80 All values in repeating quantifiers must be less than 65536. The maxi-
81 mum number of capturing subpatterns is 65535.
82
83 There is no limit to the number of non-capturing subpatterns, but the
84 maximum depth of nesting of all kinds of parenthesized subpattern,
85 including capturing subpatterns, assertions, and other types of subpat-
86 tern, is 200.
87
88 The maximum length of a subject string is the largest positive number
89 that an integer variable can hold. However, PCRE uses recursion to han-
90 dle subpatterns and indefinite repetition. This means that the avail-
91 able stack space may limit the size of a subject string that can be
92 processed by certain patterns.
93
94
95 UTF-8 SUPPORT
96
97 Starting at release 3.3, PCRE has had some support for character
98 strings encoded in the UTF-8 format. For release 4.0 this has been
99 greatly extended to cover most common requirements.
100
In order to process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag. When you do this, both the pattern and
any subject strings that are matched against it are treated as UTF-8
strings instead of just strings of bytes.
106
107 If you compile PCRE with UTF-8 support, but do not use it at run time,
108 the library will be a bit bigger, but the additional run time overhead
109 is limited to testing the PCRE_UTF8 flag in several places, so should
110 not be very large.
111
112 The following comments apply when PCRE is running in UTF-8 mode:
113
114 1. When you set the PCRE_UTF8 flag, the strings passed as patterns and
115 subjects are checked for validity on entry to the relevant functions.
116 If an invalid UTF-8 string is passed, an error return is given. In some
117 situations, you may already know that your strings are valid, and
118 therefore want to skip these checks in order to improve performance. If
119 you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time,
120 PCRE assumes that the pattern or subject it is given (respectively)
121 contains only valid UTF-8 codes. In this case, it does not diagnose an
122 invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when
123 PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may
124 crash.
125
126 2. In a pattern, the escape sequence \x{...}, where the contents of the
127 braces is a string of hexadecimal digits, is interpreted as a UTF-8
128 character whose code number is the given hexadecimal number, for exam-
129 ple: \x{1234}. If a non-hexadecimal digit appears between the braces,
130 the item is not recognized. This escape sequence can be used either as
131 a literal, or within a character class.
132
133 3. The original hexadecimal escape sequence, \xhh, matches a two-byte
134 UTF-8 character if the value is greater than 127.
135
136 4. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
137 vidual bytes, for example: \x{100}{3}.
138
139 5. The dot metacharacter matches one UTF-8 character instead of a
140 single byte.
141
142 6. The escape sequence \C can be used to match a single byte in UTF-8
143 mode, but its use can lead to some strange effects.
144
145 7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
146 test characters of any code value, but the characters that PCRE recog-
147 nizes as digits, spaces, or word characters remain the same set as
148 before, all with values less than 256.
149
150 8. Case-insensitive matching applies only to characters whose values
151 are less than 256. PCRE does not support the notion of "case" for
152 higher-valued characters.
153
154 9. PCRE does not support the use of Unicode tables and properties or
155 the Perl escapes \p, \P, and \X.
156
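As an illustration of points 2 to 4 above, the following sketch shows
how a code number is laid out as UTF-8 bytes, which is why \x{100}
(and, in UTF-8 mode, \xhh with a value above 127) stands for a two-byte
character. The function name utf8_encode is hypothetical and is not
part of PCRE; the sketch follows the standard UTF-8 layout for code
numbers up to 0x10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode one code number as UTF-8.
   Writes up to 4 bytes into out and returns the number of bytes written,
   or 0 if the code number is above 0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, utf8_encode(0x100, buf) stores the two bytes 0xC4 0x80, so
the quantifier in \x{100}{3} repeats a two-byte unit rather than a
single byte.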
157
158 AUTHOR
159
160 Philip Hazel <ph10@cam.ac.uk>
161 University Computing Service,
162 Cambridge CB2 3QG, England.
163 Phone: +44 1223 334714
164
165 Last updated: 20 August 2003
166 Copyright (c) 1997-2003 University of Cambridge.
167 -----------------------------------------------------------------------------
168
169 PCRE(3) PCRE(3)
170
171
172
173 NAME
174 PCRE - Perl-compatible regular expressions
175
176 PCRE BUILD-TIME OPTIONS
177
178 This document describes the optional features of PCRE that can be
179 selected when the library is compiled. They are all selected, or dese-
180 lected, by providing options to the configure script which is run
181 before the make command. The complete list of options for configure
182 (which includes the standard ones such as the selection of the instal-
183 lation directory) can be obtained by running
184
185 ./configure --help
186
187 The following sections describe certain options whose names begin with
188 --enable or --disable. These settings specify changes to the defaults
189 for the configure command. Because of the way that configure works,
190 --enable and --disable always come in pairs, so the complementary
191 option always exists as well, but as it specifies the default, it is
192 not described.
193
194
195 UTF-8 SUPPORT
196
197 To build PCRE with support for UTF-8 character strings, add
198
199 --enable-utf8
200
to the configure command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the pcre_compile()
function.
205
206
207 CODE VALUE OF NEWLINE
208
209 By default, PCRE treats character 10 (linefeed) as the newline charac-
210 ter. This is the normal newline character on Unix-like systems. You can
211 compile PCRE to use character 13 (carriage return) instead by adding
212
213 --enable-newline-is-cr
214
215 to the configure command. For completeness there is also a --enable-
216 newline-is-lf option, which explicitly specifies linefeed as the new-
217 line character.
218
219
220 BUILDING SHARED AND STATIC LIBRARIES
221
222 The PCRE building process uses libtool to build both shared and static
223 Unix libraries by default. You can suppress one of these by adding one
224 of
225
226 --disable-shared
227 --disable-static
228
229 to the configure command, as required.
230
231
232 POSIX MALLOC USAGE
233
234 When PCRE is called through the POSIX interface (see the pcreposix
235 documentation), additional working storage is required for holding the
236 pointers to capturing substrings because PCRE requires three integers
237 per substring, whereas the POSIX interface provides only two. If the
238 number of expected substrings is small, the wrapper function uses space
239 on the stack, because this is faster than using malloc() for each call.
240 The default threshold above which the stack is no longer used is 10; it
241 can be changed by adding a setting such as
242
243 --with-posix-malloc-threshold=20
244
245 to the configure command.
246
247
248 LIMITING PCRE RESOURCE USAGE
249
250 Internally, PCRE has a function called match() which it calls repeat-
251 edly (possibly recursively) when performing a matching operation. By
252 limiting the number of times this function may be called, a limit can
253 be placed on the resources used by a single call to pcre_exec(). The
254 limit can be changed at run time, as described in the pcreapi documen-
255 tation. The default is 10 million, but this can be changed by adding a
256 setting such as
257
258 --with-match-limit=500000
259
260 to the configure command.
261
262
263 HANDLING VERY LARGE PATTERNS
264
265 Within a compiled pattern, offset values are used to point from one
266 part to another (for example, from an opening parenthesis to an alter-
267 nation metacharacter). By default two-byte values are used for these
268 offsets, leading to a maximum size for a compiled pattern of around
269 64K. This is sufficient to handle all but the most gigantic patterns.
270 Nevertheless, some people do want to process enormous patterns, so it
271 is possible to compile PCRE to use three-byte or four-byte offsets by
272 adding a setting such as
273
274 --with-link-size=3
275
276 to the configure command. The value given must be 2, 3, or 4. Using
277 longer offsets slows down the operation of PCRE because it has to load
278 additional bytes when handling them.
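
The relation between the link size and the largest offset it can hold
can be sketched as follows. This is an illustration only: the names
max_link_offset, put_link, and get_link are hypothetical, and the byte
order shown is an assumption, not a description of PCRE's internal
layout.

```c
/* Largest offset that fits in link_size bytes (2, 3, or 4). */
unsigned long long max_link_offset(int link_size)
{
    return (1ULL << (8 * link_size)) - 1;
}

/* Store an offset in link_size bytes, most significant byte first
   (assumed order, for illustration). */
void put_link(unsigned char *p, unsigned long value, int link_size)
{
    int i;
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Read the offset back: one load and one shift per byte, which is why
   longer links cost more to handle. */
unsigned long get_link(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}
```

With a link size of 2 the largest offset is 65535, which matches the
"around 64K" figure above; a link size of 3 raises it to 16777215.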
279
280 If you build PCRE with an increased link size, test 2 (and test 5 if
281 you are using UTF-8) will fail. Part of the output of these tests is a
282 representation of the compiled pattern, and this changes with the link
283 size.
284
285
286 AVOIDING EXCESSIVE STACK USAGE
287
288 PCRE implements backtracking while matching by making recursive calls
289 to an internal function called match(). In environments where the size
290 of the stack is limited, this can severely limit PCRE's operation. (The
291 Unix environment does not usually suffer from this problem.) An alter-
292 native approach that uses memory from the heap to remember data,
293 instead of using recursive function calls, has been implemented to work
294 round this problem. If you want to build a version of PCRE that works
295 this way, add
296
297 --disable-stack-for-recursion
298
299 to the configure command. With this configuration, PCRE will use the
300 pcre_stack_malloc and pcre_stack_free variables to call memory
301 management functions. Separate functions are provided because the usage
302 is very predictable: the block sizes requested are always the same, and
303 the blocks are always freed in reverse order. A calling program might
304 be able to implement optimized functions that perform better than the
305 standard malloc() and free() functions. PCRE runs noticeably more
306 slowly when built in this way.
307
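Because the requests are all the same size and are freed in reverse
order, a very simple allocator is enough. The following is a minimal
sketch of one, under stated assumptions: the names lifo_malloc and
lifo_free, the pool depth, and the block size bound are all
hypothetical, and the code is not thread-safe. It falls back to the
standard malloc() and free() when the pool cannot serve a request.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_BLOCKS     64      /* assumed pool depth */
#define POOL_BLOCK_SIZE 1024    /* assumed upper bound on one block */

/* The union forces suitable alignment for each pool block. */
typedef union {
    unsigned char bytes[POOL_BLOCK_SIZE];
    long double   align_ld;
    void         *align_ptr;
    long          align_l;
} pool_block;

static pool_block pool[POOL_BLOCKS];
static size_t pool_top = 0;     /* number of pool blocks in use */

/* Hand out the next pool block, or use malloc() if the request is too
   large or the pool is exhausted. */
void *lifo_malloc(size_t size)
{
    if (size <= sizeof(pool_block) && pool_top < POOL_BLOCKS)
        return &pool[pool_top++];
    return malloc(size);
}

/* Relies on blocks being freed in reverse order of allocation: the
   block is either the top of the pool or came from malloc(). */
void lifo_free(void *block)
{
    if (pool_top > 0 && block == (void *)&pool[pool_top - 1])
        pool_top--;
    else
        free(block);
}

/* With a PCRE built using --disable-stack-for-recursion, these could be
   installed before matching with:
       pcre_stack_malloc = lifo_malloc;
       pcre_stack_free   = lifo_free;
*/
```

Whether such a pool actually outperforms malloc() depends on the
platform, so it is worth measuring before adopting it.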
308
309 USING EBCDIC CODE
310
311 PCRE assumes by default that it will run in an environment where the
312 character code is ASCII (or UTF-8, which is a superset of ASCII). PCRE
313 can, however, be compiled to run in an EBCDIC environment by adding
314
315 --enable-ebcdic
316
317 to the configure command.
318
319 Last updated: 09 December 2003
320 Copyright (c) 1997-2003 University of Cambridge.
321 -----------------------------------------------------------------------------
322
323 PCRE(3) PCRE(3)
324
325
326
327 NAME
328 PCRE - Perl-compatible regular expressions
329
330 SYNOPSIS OF PCRE API
331
  #include <pcre.h>

  pcre *pcre_compile(const char *pattern, int options,
       const char **errptr, int *erroffset,
       const unsigned char *tableptr);

  pcre_extra *pcre_study(const pcre *code, int options,
       const char **errptr);

  int pcre_exec(const pcre *code, const pcre_extra *extra,
       const char *subject, int length, int startoffset,
       int options, int *ovector, int ovecsize);

  int pcre_copy_named_substring(const pcre *code,
       const char *subject, int *ovector,
       int stringcount, const char *stringname,
       char *buffer, int buffersize);

  int pcre_copy_substring(const char *subject, int *ovector,
       int stringcount, int stringnumber, char *buffer,
       int buffersize);

  int pcre_get_named_substring(const pcre *code,
       const char *subject, int *ovector,
       int stringcount, const char *stringname,
       const char **stringptr);

  int pcre_get_stringnumber(const pcre *code,
       const char *name);

  int pcre_get_substring(const char *subject, int *ovector,
       int stringcount, int stringnumber,
       const char **stringptr);

  int pcre_get_substring_list(const char *subject,
       int *ovector, int stringcount, const char ***listptr);

  void pcre_free_substring(const char *stringptr);

  void pcre_free_substring_list(const char **stringptr);

  const unsigned char *pcre_maketables(void);

  int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
       int what, void *where);

  int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

  int pcre_config(int what, void *where);

  char *pcre_version(void);

  void *(*pcre_malloc)(size_t);

  void (*pcre_free)(void *);

  void *(*pcre_stack_malloc)(size_t);

  void (*pcre_stack_free)(void *);

  int (*pcre_callout)(pcre_callout_block *);
393
394
395 PCRE API
396
397 PCRE has its own native API, which is described in this document. There
398 is also a set of wrapper functions that correspond to the POSIX regular
399 expression API. These are described in the pcreposix documentation.
400
401 The native API function prototypes are defined in the header file
402 pcre.h, and on Unix systems the library itself is called libpcre.a, so
403 can be accessed by adding -lpcre to the command for linking an applica-
404 tion which calls it. The header file defines the macros PCRE_MAJOR and
405 PCRE_MINOR to contain the major and minor release numbers for the
406 library. Applications can use these to include support for different
407 releases.
408
409 The functions pcre_compile(), pcre_study(), and pcre_exec() are used
410 for compiling and matching regular expressions. A sample program that
411 demonstrates the simplest way of using them is given in the file pcre-
412 demo.c. The pcresample documentation describes how to run it.
413
414 There are convenience functions for extracting captured substrings from
415 a matched subject string. They are:
416
417 pcre_copy_substring()
418 pcre_copy_named_substring()
419 pcre_get_substring()
420 pcre_get_named_substring()
421 pcre_get_substring_list()
422
423 pcre_free_substring() and pcre_free_substring_list() are also provided,
424 to free the memory used for extracted strings.
425
426 The function pcre_maketables() is used (optionally) to build a set of
427 character tables in the current locale for passing to pcre_compile().
428
429 The function pcre_fullinfo() is used to find out information about a
430 compiled pattern; pcre_info() is an obsolete version which returns only
431 some of the available information, but is retained for backwards com-
432 patibility. The function pcre_version() returns a pointer to a string
433 containing the version of PCRE and its date of release.
434
435 The global variables pcre_malloc and pcre_free initially contain the
436 entry points of the standard malloc() and free() functions respec-
437 tively. PCRE calls the memory management functions via these variables,
438 so a calling program can replace them if it wishes to intercept the
439 calls. This should be done before calling any PCRE functions.
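
As a sketch of how the calls might be intercepted, the following
wrappers count allocations and releases so that a program can check for
leaked blocks. The names counting_malloc, counting_free, and
outstanding_blocks are hypothetical; only the function signatures are
taken from the pcre_malloc and pcre_free declarations above.

```c
#include <stddef.h>
#include <stdlib.h>

static unsigned long alloc_calls = 0;
static unsigned long free_calls = 0;

/* Same signature as the function pcre_malloc points to. */
void *counting_malloc(size_t size)
{
    alloc_calls++;
    return malloc(size);
}

/* Same signature as the function pcre_free points to. */
void counting_free(void *block)
{
    free_calls++;
    free(block);
}

/* Blocks obtained but not yet released. */
unsigned long outstanding_blocks(void)
{
    return alloc_calls - free_calls;
}

/* To install, before calling any PCRE function:
       pcre_malloc = counting_malloc;
       pcre_free   = counting_free;
*/
```

The counters are plain static variables, so this sketch is not suitable
as it stands for a multi-threaded program.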
440
441 The global variables pcre_stack_malloc and pcre_stack_free are also
442 indirections to memory management functions. These special functions
443 are used only when PCRE is compiled to use the heap for remembering
444 data, instead of recursive function calls. This is a non-standard way
445 of building PCRE, for use in environments that have limited stacks.
446 Because of the greater use of memory management, it runs more slowly.
447 Separate functions are provided so that special-purpose external code
448 can be used for this case. When used, these functions are always called
449 in a stack-like manner (last obtained, first freed), and always for
450 memory blocks of the same size.
451
452 The global variable pcre_callout initially contains NULL. It can be set
453 by the caller to a "callout" function, which PCRE will then call at
454 specified points during a matching operation. Details are given in the
455 pcrecallout documentation.
456
457
458 MULTITHREADING
459
460 The PCRE functions can be used in multi-threading applications, with
461 the proviso that the memory management functions pointed to by
462 pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
463 callout function pointed to by pcre_callout, are shared by all threads.
464
465 The compiled form of a regular expression is not altered during match-
466 ing, so the same compiled pattern can safely be used by several threads
467 at once.
468
469
470 CHECKING BUILD-TIME OPTIONS
471
472 int pcre_config(int what, void *where);
473
474 The function pcre_config() makes it possible for a PCRE client to dis-
475 cover which optional features have been compiled into the PCRE library.
476 The pcrebuild documentation has more details about these optional fea-
477 tures.
478
479 The first argument for pcre_config() is an integer, specifying which
480 information is required; the second argument is a pointer to a variable
481 into which the information is placed. The following information is
482 available:
483
484 PCRE_CONFIG_UTF8
485
486 The output is an integer that is set to one if UTF-8 support is avail-
487 able; otherwise it is set to zero.
488
489 PCRE_CONFIG_NEWLINE
490
491 The output is an integer that is set to the value of the code that is
492 used for the newline character. It is either linefeed (10) or carriage
493 return (13), and should normally be the standard character for your
494 operating system.
495
496 PCRE_CONFIG_LINK_SIZE
497
498 The output is an integer that contains the number of bytes used for
499 internal linkage in compiled regular expressions. The value is 2, 3, or
500 4. Larger values allow larger regular expressions to be compiled, at
501 the expense of slower matching. The default value of 2 is sufficient
502 for all but the most massive patterns, since it allows the compiled
503 pattern to be up to 64K in size.
504
505 PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
506
507 The output is an integer that contains the threshold above which the
508 POSIX interface uses malloc() for output vectors. Further details are
509 given in the pcreposix documentation.
510
511 PCRE_CONFIG_MATCH_LIMIT
512
513 The output is an integer that gives the default limit for the number of
514 internal matching function calls in a pcre_exec() execution. Further
515 details are given with pcre_exec() below.
516
517 PCRE_CONFIG_STACKRECURSE
518
519 The output is an integer that is set to one if internal recursion is
520 implemented by recursive function calls that use the stack to remember
521 their state. This is the usual way that PCRE is compiled. The output is
522 zero if PCRE was compiled to use blocks of data on the heap instead of
523 recursive function calls. In this case, pcre_stack_malloc and
524 pcre_stack_free are called to manage memory blocks on the heap, thus
525 avoiding the use of the stack.
526
527
528 COMPILING A PATTERN
529
530 pcre *pcre_compile(const char *pattern, int options,
531 const char **errptr, int *erroffset,
532 const unsigned char *tableptr);
533
534
535 The function pcre_compile() is called to compile a pattern into an
536 internal form. The pattern is a C string terminated by a binary zero,
537 and is passed in the argument pattern. A pointer to a single block of
538 memory that is obtained via pcre_malloc is returned. This contains the
539 compiled code and related data. The pcre type is defined for the
540 returned block; this is a typedef for a structure whose contents are
541 not externally defined. It is up to the caller to free the memory when
542 it is no longer required.
543
544 Although the compiled code of a PCRE regex is relocatable, that is, it
545 does not depend on memory location, the complete pcre data block is not
546 fully relocatable, because it contains a copy of the tableptr argument,
547 which is an address (see below).
548
549 The options argument contains independent bits that affect the compila-
550 tion. It should be zero if no options are required. Some of the
551 options, in particular, those that are compatible with Perl, can also
552 be set and unset from within the pattern (see the detailed description
553 of regular expressions in the pcrepattern documentation). For these
554 options, the contents of the options argument specifies their initial
555 settings at the start of compilation and execution. The PCRE_ANCHORED
556 option can be set at the time of matching as well as at compile time.
557
558 If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
559 if compilation of a pattern fails, pcre_compile() returns NULL, and
560 sets the variable pointed to by errptr to point to a textual error mes-
561 sage. The offset from the start of the pattern to the character where
562 the error was discovered is placed in the variable pointed to by
563 erroffset, which must not be NULL. If it is, an immediate error is
564 given.
565
566 If the final argument, tableptr, is NULL, PCRE uses a default set of
567 character tables which are built when it is compiled, using the default
568 C locale. Otherwise, tableptr must be the result of a call to
569 pcre_maketables(). See the section on locale support below.
570
571 This code fragment shows a typical straightforward call to pcre_com-
572 pile():
573
  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */
583
584 The following option bits are defined:
585
586 PCRE_ANCHORED
587
588 If this bit is set, the pattern is forced to be "anchored", that is, it
589 is constrained to match only at the first matching point in the string
590 which is being searched (the "subject string"). This effect can also be
591 achieved by appropriate constructs in the pattern itself, which is the
592 only way to do it in Perl.
593
594 PCRE_CASELESS
595
596 If this bit is set, letters in the pattern match both upper and lower
597 case letters. It is equivalent to Perl's /i option, and it can be
598 changed within a pattern by a (?i) option setting.
599
600 PCRE_DOLLAR_ENDONLY
601
602 If this bit is set, a dollar metacharacter in the pattern matches only
603 at the end of the subject string. Without this option, a dollar also
604 matches immediately before the final character if it is a newline (but
605 not before any other newlines). The PCRE_DOLLAR_ENDONLY option is
606 ignored if PCRE_MULTILINE is set. There is no equivalent to this option
607 in Perl, and no way to set it within a pattern.
608
609 PCRE_DOTALL
610
If this bit is set, a dot metacharacter in the pattern matches all
characters, including newlines. Without it, newlines are excluded. This
option is equivalent to Perl's /s option, and it can be changed within
a pattern by a (?s) option setting. A negative class such as [^a]
always matches a newline character, independent of the setting of this
option.
617
618 PCRE_EXTENDED
619
620 If this bit is set, whitespace data characters in the pattern are
621 totally ignored except when escaped or inside a character class.
622 Whitespace does not include the VT character (code 11). In addition,
623 characters between an unescaped # outside a character class and the
624 next newline character, inclusive, are also ignored. This is equivalent
625 to Perl's /x option, and it can be changed within a pattern by a (?x)
626 option setting.
627
628 This option makes it possible to include comments inside complicated
629 patterns. Note, however, that this applies only to data characters.
630 Whitespace characters may never appear within special character
631 sequences in a pattern, for example within the sequence (?( which
632 introduces a conditional subpattern.
633
634 PCRE_EXTRA
635
636 This option was invented in order to turn on additional functionality
637 of PCRE that is incompatible with Perl, but it is currently of very
638 little use. When set, any backslash in a pattern that is followed by a
639 letter that has no special meaning causes an error, thus reserving
640 these combinations for future expansion. By default, as in Perl, a
641 backslash followed by a letter with no special meaning is treated as a
642 literal. There are at present no other features controlled by this
643 option. It can also be set by a (?X) option setting within a pattern.
644
645 PCRE_MULTILINE
646
647 By default, PCRE treats the subject string as consisting of a single
648 "line" of characters (even if it actually contains several newlines).
649 The "start of line" metacharacter (^) matches only at the start of the
650 string, while the "end of line" metacharacter ($) matches only at the
651 end of the string, or before a terminating newline (unless PCRE_DOL-
652 LAR_ENDONLY is set). This is the same as Perl.
653
When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before any
newline in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no "\n"
characters in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE_MULTILINE has no effect.
661
662 PCRE_NO_AUTO_CAPTURE
663
664 If this option is set, it disables the use of numbered capturing paren-
665 theses in the pattern. Any opening parenthesis that is not followed by
666 ? behaves as if it were followed by ?: but named parentheses can still
667 be used for capturing (and they acquire numbers in the usual way).
668 There is no equivalent of this option in Perl.
669
670 PCRE_UNGREEDY
671
672 This option inverts the "greediness" of the quantifiers so that they
673 are not greedy by default, but become greedy if followed by "?". It is
674 not compatible with Perl. It can also be set by a (?U) option setting
675 within the pattern.
676
677 PCRE_UTF8
678
679 This option causes PCRE to regard both the pattern and the subject as
680 strings of UTF-8 characters instead of single-byte character strings.
681 However, it is available only if PCRE has been built to include UTF-8
682 support. If not, the use of this option provokes an error. Details of
683 how this option changes the behaviour of PCRE are given in the section
684 on UTF-8 support in the main pcre page.
685
686 PCRE_NO_UTF8_CHECK
687
688 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
689 automatically checked. If an invalid UTF-8 sequence of bytes is found,
690 pcre_compile() returns an error. If you already know that your pattern
691 is valid, and you want to skip this check for performance reasons, you
692 can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of
693 passing an invalid UTF-8 string as a pattern is undefined. It may cause
694 your program to crash. Note that there is a similar option for sup-
695 pressing the checking of subject strings passed to pcre_exec().
696
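A program that sets PCRE_NO_UTF8_CHECK needs some other assurance that
its strings are valid. One option is to validate each string once, when
it enters the program. The sketch below is one way to do that; the name
is_valid_utf8 is hypothetical, and its rules (sequences of one to four
bytes, no overlong forms, no surrogates, nothing above 0x10FFFF) are an
assumption based on the current UTF-8 definition and may not coincide
exactly with the check that PCRE itself performs.

```c
#include <stddef.h>

/* Return 1 if the len bytes at s form valid UTF-8, otherwise 0. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t extra, k;

        if (c < 0x80) {                     /* single-byte character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* lead byte of 2 */
            extra = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {    /* lead byte of 3 */
            extra = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {    /* lead byte of 4 */
            extra = 3; cp = c & 0x07;
        } else {
            return 0;                       /* stray continuation or bad lead */
        }

        if (len - i <= extra)
            return 0;                       /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000))
            return 0;                       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogate code number */
        if (cp > 0x10FFFF)
            return 0;                       /* out of range */

        i += extra + 1;
    }
    return 1;
}
```

Validating once and then skipping the library's check pays off only
when the same string is matched many times.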
697
698
699 STUDYING A PATTERN
700
701 pcre_extra *pcre_study(const pcre *code, int options,
702 const char **errptr);
703
When a pattern is going to be used several times, it is worth spending
more time analyzing it in order to reduce the time taken for matching.
The function pcre_study() takes a pointer to a compiled pattern as its
first argument. If studying the pattern produces additional information
that will help speed up matching, pcre_study() returns a pointer to a
pcre_extra block, in which the study_data field points to the results
of the study.
711
The value returned by pcre_study() can be passed directly to
pcre_exec(). However, the pcre_extra block also contains other fields
that can be set by the caller before the block is passed; these are
described below. If studying the pattern does not produce any
additional information, pcre_study() returns NULL. In that
circumstance, if the calling program wants to pass some of the other
fields to pcre_exec(), it must set up its own pcre_extra block.
719
720 The second argument contains option bits. At present, no options are
721 defined for pcre_study(), and this argument should always be zero.
722
723 The third argument for pcre_study() is a pointer for an error message.
724 If studying succeeds (even if no data is returned), the variable it
725 points to is set to NULL. Otherwise it points to a textual error mes-
726 sage. You should therefore test the error pointer for NULL after call-
727 ing pcre_study(), to be sure that it has run successfully.
728
729 This is a typical call to pcre_study():
730
  pcre_extra *pe;
  pe = pcre_study(
    re,             /* result of pcre_compile() */
    0,              /* no options exist */
    &error);        /* set to NULL or points to a message */
736
737 At present, studying a pattern is useful only for non-anchored patterns
738 that do not have a single fixed starting character. A bitmap of possi-
739 ble starting characters is created.
740
741
742 LOCALE SUPPORT
743
744 PCRE handles caseless matching, and determines whether characters are
745 letters, digits, or whatever, by reference to a set of tables. When
746 running in UTF-8 mode, this applies only to characters with codes less
747 than 256. The library contains a default set of tables that is created
748 in the default C locale when PCRE is compiled. This is used when the
749 final argument of pcre_compile() is NULL, and is sufficient for many
750 applications.
751
752 An alternative set of tables can, however, be supplied. Such tables are
753 built by calling the pcre_maketables() function, which has no argu-
754 ments, in the relevant locale. The result can then be passed to
755 pcre_compile() as often as necessary. For example, to build and use
756 tables that are appropriate for the French locale (where accented char-
757 acters with codes greater than 128 are treated as letters), the follow-
758 ing code could be used:
759
760 setlocale(LC_CTYPE, "fr");
761 tables = pcre_maketables();
762 re = pcre_compile(..., tables);
763
764 The tables are built in memory that is obtained via pcre_malloc. The
765 pointer that is passed to pcre_compile is saved with the compiled pat-
766 tern, and the same tables are used via this pointer by pcre_study() and
767 pcre_exec(). Thus, for any single pattern, compilation, studying and
768 matching all happen in the same locale, but different patterns can be
769 compiled in different locales. It is the caller's responsibility to
770 ensure that the memory containing the tables remains available for as
771 long as it is needed.
772
773
774 INFORMATION ABOUT A PATTERN
775
776 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
777 int what, void *where);
778
The pcre_fullinfo() function returns information about a compiled
pattern. It replaces the obsolete pcre_info() function, which is
nevertheless retained for backwards compatibility (and is documented
below).
782
783 The first argument for pcre_fullinfo() is a pointer to the compiled
784 pattern. The second argument is the result of pcre_study(), or NULL if
785 the pattern was not studied. The third argument specifies which piece
786 of information is required, and the fourth argument is a pointer to a
787 variable to receive the data. The yield of the function is zero for
788 success, or one of the following negative numbers:
789
    PCRE_ERROR_NULL       the argument code was NULL
                          the argument where was NULL
    PCRE_ERROR_BADMAGIC   the "magic number" was not found
    PCRE_ERROR_BADOPTION  the value of what was invalid
794
795 Here is a typical call of pcre_fullinfo(), to obtain the length of the
796 compiled pattern:
797
    int rc;
    size_t length;
    rc = pcre_fullinfo(
      re,               /* result of pcre_compile() */
      pe,               /* result of pcre_study(), or NULL */
      PCRE_INFO_SIZE,   /* what is required */
      &length);         /* where to put the data (a size_t here) */
805
806 The possible values for the third argument are defined in pcre.h, and
807 are as follows:
808
809 PCRE_INFO_BACKREFMAX
810
811 Return the number of the highest back reference in the pattern. The
812 fourth argument should point to an int variable. Zero is returned if
813 there are no back references.
814
815 PCRE_INFO_CAPTURECOUNT
816
817 Return the number of capturing subpatterns in the pattern. The fourth
818 argument should point to an int variable.
819
820 PCRE_INFO_FIRSTBYTE
821
822 Return information about the first byte of any matched string, for a
823 non-anchored pattern. (This option used to be called
824 PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards
825 compatibility.)
826
827 If there is a fixed first byte, e.g. from a pattern such as
828 (cat|cow|coyote), it is returned in the integer pointed to by where.
829 Otherwise, if either
830
831 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
832 branch starts with "^", or
833
834 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
835 set (if it were set, the pattern would be anchored),
836
837 -1 is returned, indicating that the pattern matches only at the start
838 of a subject string or after any newline within the string. Otherwise
839 -2 is returned. For anchored patterns, -2 is returned.
840
841 PCRE_INFO_FIRSTTABLE
842
843 If the pattern was studied, and this resulted in the construction of a
844 256-bit table indicating a fixed set of bytes for the first byte in any
845 matching string, a pointer to the table is returned. Otherwise NULL is
846 returned. The fourth argument should point to an unsigned char * vari-
847 able.
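
As an illustration of how such a table might be consulted, the following
sketch tests whether a given byte is marked in a 256-bit (32-byte) table.
It is not part of the PCRE API: the function names are hypothetical, and
the bit order (bit c&7 of byte c/8 stands for byte value c) is an
assumption, since the text above does not specify the layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: is byte c marked in a 256-bit table?
   ASSUMPTION: bit (c & 7) of byte c/8 represents byte value c.
   A NULL table means no information, so every byte is possible. */
int first_byte_possible(const unsigned char *table, unsigned char c)
{
    if (table == NULL)
        return 1;
    return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Mark byte c in a table, using the same assumed layout. This exists
   only so that the test above can be exercised without PCRE. */
void mark_first_byte(unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```

A caller could use a test of this kind to skip subject positions whose
first byte cannot begin a match.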
848
849 PCRE_INFO_LASTLITERAL
850
851 Return the value of the rightmost literal byte that must exist in any
852 matched string, other than at its start, if such a byte has been
853 recorded. The fourth argument should point to an int variable. If there
854 is no such byte, -1 is returned. For anchored patterns, a last literal
855 byte is recorded only if it follows something of variable length. For
856 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
857 /^a\dz\d/ the returned value is -1.
858
859 PCRE_INFO_NAMECOUNT
860 PCRE_INFO_NAMEENTRYSIZE
861 PCRE_INFO_NAMETABLE
862
863 PCRE supports the use of named as well as numbered capturing parenthe-
864 ses. The names are just an additional way of identifying the parenthe-
865 ses, which still acquire a number. A caller that wants to extract data
866 from a named subpattern must convert the name to a number in order to
867 access the correct pointers in the output vector (described with
868 pcre_exec() below). In order to do this, it must first use these three
869 values to obtain the name-to-number mapping table for the pattern.
870
871 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
872 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
873 of each entry; both of these return an int value. The entry size
874 depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
875 a pointer to the first entry of the table (a pointer to char). The
876 first two bytes of each entry are the number of the capturing parenthe-
877 sis, most significant byte first. The rest of the entry is the corre-
878 sponding name, zero terminated. The names are in alphabetical order.
879 For example, consider the following pattern (assume PCRE_EXTENDED is
880 set, so white space - including newlines - is ignored):
881
882 (?P<date> (?P<year>(\d\d)?\d\d) -
883 (?P<month>\d\d) - (?P<day>\d\d) )
884
There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hex, and undefined bytes shown as ??:
888
    00 01 d  a  t  e  00 ??
    00 05 d  a  y  00 ?? ??
    00 04 m  o  n  t  h  00
    00 02 y  e  a  r  00 ??
893
894 When writing code to extract data from named subpatterns, remember that
895 the length of each entry may be different for each compiled pattern.
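
The layout described above can be searched with a few lines of C. The
following is a minimal sketch, not a PCRE function: lookup_name() and
example_table are hypothetical names, and the table is written out by
hand from the example rather than obtained from pcre_fullinfo(). In a
real program the table pointer, entry count, and entry size would come
from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and
PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: find a name in a table of `count` entries, each
   `entry_size` bytes long. The first two bytes of an entry hold the
   subpattern number (most significant byte first); the rest is the
   zero-terminated name. Returns the number, or -1 if the name is absent. */
int lookup_name(const unsigned char *table, int count, int entry_size,
                const char *name)
{
    int i;
    assert(table != NULL && name != NULL && count >= 0 && entry_size > 2);
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The four-entry table from the example, written by hand. '?' stands in
   for the undefined padding bytes. */
static const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   '?',
    0, 5, 'd', 'a', 'y', 0,   '?', '?',
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   '?'
};
```

Because the entry size is passed in rather than fixed, the same loop
works for any compiled pattern. A linear search is used for simplicity;
since the names are in alphabetical order, a binary search would also be
possible.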
896
897 PCRE_INFO_OPTIONS
898
899 Return a copy of the options with which the pattern was compiled. The
900 fourth argument should point to an unsigned long int variable. These
901 option bits are those specified in the call to pcre_compile(), modified
902 by any top-level option settings within the pattern itself.
903
904 A pattern is automatically anchored by PCRE if all of its top-level
905 alternatives begin with one of the following:
906
    ^     unless PCRE_MULTILINE is set
    \A    always
    \G    always
    .*    if PCRE_DOTALL is set and there are no back
            references to the subpattern in which .* appears
912
913 For such patterns, the PCRE_ANCHORED bit is set in the options returned
914 by pcre_fullinfo().
915
916 PCRE_INFO_SIZE
917
918 Return the size of the compiled pattern, that is, the value that was
919 passed as the argument to pcre_malloc() when PCRE was getting memory in
920 which to place the compiled data. The fourth argument should point to a
921 size_t variable.
922
923 PCRE_INFO_STUDYSIZE
924
925 Returns the size of the data block pointed to by the study_data field
926 in a pcre_extra block. That is, it is the value that was passed to
927 pcre_malloc() when PCRE was getting memory into which to place the data
928 created by pcre_study(). The fourth argument should point to a size_t
929 variable.
930
931
932 OBSOLETE INFO FUNCTION
933
934 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
935
936 The pcre_info() function is now obsolete because its interface is too
937 restrictive to return all the available data about a compiled pattern.
938 New programs should use pcre_fullinfo() instead. The yield of
939 pcre_info() is the number of capturing subpatterns, or one of the fol-
940 lowing negative numbers:
941
    PCRE_ERROR_NULL       the argument code was NULL
    PCRE_ERROR_BADMAGIC   the "magic number" was not found
944
945 If the optptr argument is not NULL, a copy of the options with which
946 the pattern was compiled is placed in the integer it points to (see
947 PCRE_INFO_OPTIONS above).
948
949 If the pattern is not anchored and the firstcharptr argument is not
950 NULL, it is used to pass back information about the first character of
951 any matched string (see PCRE_INFO_FIRSTBYTE above).
952
953
954 MATCHING A PATTERN
955
956 int pcre_exec(const pcre *code, const pcre_extra *extra,
957 const char *subject, int length, int startoffset,
958 int options, int *ovector, int ovecsize);
959
960 The function pcre_exec() is called to match a subject string against a
961 pre-compiled pattern, which is passed in the code argument. If the pat-
962 tern has been studied, the result of the study should be passed in the
963 extra argument.
964
965 Here is an example of a simple call to pcre_exec():
966
    int rc;
    int ovector[30];
    rc = pcre_exec(
      re,             /* result of pcre_compile() */
      NULL,           /* we didn't study the pattern */
      "some string",  /* the subject string */
      11,             /* the length of the subject string */
      0,              /* start at offset 0 in the subject */
      0,              /* default options */
      ovector,        /* vector for substring information */
      30);            /* number of elements in the vector */
978
979 If the extra argument is not NULL, it must point to a pcre_extra data
980 block. The pcre_study() function returns such a block (when it doesn't
981 return NULL), but you can also create one for yourself, and pass addi-
982 tional information in it. The fields in the block are as follows:
983
984 unsigned long int flags;
985 void *study_data;
986 unsigned long int match_limit;
987 void *callout_data;
988
989 The flags field is a bitmap that specifies which of the other fields
990 are set. The flag bits are:
991
992 PCRE_EXTRA_STUDY_DATA
993 PCRE_EXTRA_MATCH_LIMIT
994 PCRE_EXTRA_CALLOUT_DATA
995
996 Other flag bits should be set to zero. The study_data field is set in
997 the pcre_extra block that is returned by pcre_study(), together with
998 the appropriate flag bit. You should not set this yourself, but you can
999 add to the block by setting the other fields.
1000
1001 The match_limit field provides a means of preventing PCRE from using up
1002 a vast amount of resources when running patterns that are not going to
1003 match, but which have a very large number of possibilities in their
1004 search trees. The classic example is the use of nested unlimited
1005 repeats. Internally, PCRE uses a function called match() which it calls
1006 repeatedly (sometimes recursively). The limit is imposed on the number
1007 of times this function is called during a match, which has the effect
1008 of limiting the amount of recursion and backtracking that can take
1009 place. For patterns that are not anchored, the count starts from zero
1010 for each position in the subject string.
1011
The default limit for the library can be set when PCRE is built; unless
it is changed at that time, the limit is 10 million, which handles all
but the most extreme cases. You can reduce the default by supplying
pcre_exec() with a pcre_extra block in which match_limit is set to a
smaller value, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If
the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1018
The callout_data field is used in conjunction with the "callout"
feature, which is described in the pcrecallout documentation.
1021
The PCRE_ANCHORED option can be passed in the options argument, whose
unused bits must be zero. This limits pcre_exec() to attempting a match
only at the first position tried, that is, at the starting offset.
However, if a pattern was compiled with PCRE_ANCHORED, or turned out to
be anchored by virtue of its contents, it cannot be made unanchored at
matching time.
1027
1028 When PCRE_UTF8 was set at compile time, the validity of the subject as
1029 a UTF-8 string is automatically checked, and the value of startoffset
1030 is also checked to ensure that it points to the start of a UTF-8 char-
1031 acter. If an invalid UTF-8 sequence of bytes is found, pcre_exec()
1032 returns the error PCRE_ERROR_BADUTF8. If startoffset contains an
1033 invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
1034
1035 If you already know that your subject is valid, and you want to skip
1036 these checks for performance reasons, you can set the
1037 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
1038 do this for the second and subsequent calls to pcre_exec() if you are
1039 making repeated calls to find all the matches in a single subject
1040 string. However, you should be sure that the value of startoffset
1041 points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1042 set, the effect of passing an invalid UTF-8 string as a subject, or a
1043 value of startoffset that does not point to the start of a UTF-8 char-
1044 acter, is undefined. Your program may crash.
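
When the checks are skipped, a caller that computes startoffset itself
may want a cheap test that the offset does not fall inside a multi-byte
character. The sketch below relies only on the UTF-8 rule that
continuation bytes have the bit pattern 10xxxxxx. The name
is_utf8_char_start() is hypothetical; it is not a PCRE function, and it
does not validate the subject as a whole.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if offset is the end of the subject or
   points at a byte that is not a UTF-8 continuation byte (10xxxxxx),
   and 0 if it points into the middle of a multi-byte character. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    assert(subject != 0 && offset >= 0 && offset <= length);
    if (offset == length)
        return 1;
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

A test of this kind guards only the offset. The subject itself still has
to be known to be valid before PCRE_NO_UTF8_CHECK is used.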
1045
1046 There are also three further options that can be set only at matching
1047 time:
1048
1049 PCRE_NOTBOL
1050
1051 The first character of the string is not the beginning of a line, so
1052 the circumflex metacharacter should not match before it. Setting this
1053 without PCRE_MULTILINE (at compile time) causes circumflex never to
1054 match.
1055
1056 PCRE_NOTEOL
1057
1058 The end of the string is not the end of a line, so the dollar metachar-
1059 acter should not match it nor (except in multiline mode) a newline
1060 immediately before it. Setting this without PCRE_MULTILINE (at compile
1061 time) causes dollar never to match.
1062
1063 PCRE_NOTEMPTY
1064
1065 An empty string is not considered to be a valid match if this option is
1066 set. If there are alternatives in the pattern, they are tried. If all
1067 the alternatives match the empty string, the entire match fails. For
1068 example, if the pattern
1069
1070 a?b?
1071
1072 is applied to a string not beginning with "a" or "b", it matches the
1073 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
1074 match is not valid, so PCRE searches further into the string for occur-
1075 rences of "a" or "b".
1076
1077 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1078 cial case of a pattern match of the empty string within its split()
1079 function, and when using the /g modifier. It is possible to emulate
1080 Perl's behaviour after matching a null string by first trying the match
1081 again at the same offset with PCRE_NOTEMPTY set, and then if that fails
1082 by advancing the starting offset (see below) and trying an ordinary
1083 match again.
1084
1085 The subject string is passed to pcre_exec() as a pointer in subject, a
1086 length in length, and a starting byte offset in startoffset. Unlike the
1087 pattern string, the subject may contain binary zero bytes. When the
1088 starting offset is zero, the search for a match starts at the beginning
1089 of the subject, and this is by far the most common case.
1090
1091 If the pattern was compiled with the PCRE_UTF8 option, the subject must
1092 be a sequence of bytes that is a valid UTF-8 string, and the starting
1093 offset must point to the beginning of a UTF-8 character. If an invalid
1094 UTF-8 string or offset is passed, an error (either PCRE_ERROR_BADUTF8
1095 or PCRE_ERROR_BADUTF8_OFFSET) is returned, unless the option
1096 PCRE_NO_UTF8_CHECK is set, in which case PCRE's behaviour is not
1097 defined.
1098
1099 A non-zero starting offset is useful when searching for another match
1100 in the same subject by calling pcre_exec() again after a previous suc-
1101 cess. Setting startoffset differs from just passing over a shortened
1102 string and setting PCRE_NOTBOL in the case of a pattern that begins
1103 with any kind of lookbehind. For example, consider the pattern
1104
1105 \Biss\B
1106
which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting
point to discover that it is preceded by a letter.
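
The difference comes down to whether the character before the starting
point is visible to the matcher. The following sketch models the \B test
in plain C, treating letters, digits, and underscore as word characters.
The name not_word_boundary() is hypothetical, locale tables and UTF-8
are ignored, and this is only an approximation of what PCRE does
internally.

```c
#include <assert.h>
#include <ctype.h>

/* Returns 1 if the byte at pos is a word character, 0 otherwise.
   Positions outside the subject count as non-word. */
static int is_word_char(const char *subject, int length, int pos)
{
    if (pos < 0 || pos >= length)
        return 0;
    return isalnum((unsigned char)subject[pos]) || subject[pos] == '_';
}

/* Hypothetical model of \B: returns 1 if pos is NOT a word boundary,
   that is, if the bytes on either side of pos are both word characters
   or both non-word characters. */
int not_word_boundary(const char *subject, int length, int pos)
{
    assert(subject != 0 && pos >= 0 && pos <= length);
    return is_word_char(subject, length, pos - 1) ==
           is_word_char(subject, length, pos);
}
```

With the whole subject "Mississippi" (length 11) and an offset of 4, the
function returns 1, because the preceding "s" can be seen. With the
shortened subject "issippi" and an offset of 0 it returns 0, because
nothing precedes the starting point.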
1117
1118 If a non-zero starting offset is passed when the pattern is anchored,
1119 one attempt to match at the given offset is tried. This can only suc-
1120 ceed if the pattern does not require the match to be at the start of
1121 the subject.
1122
1123 In general, a pattern matches a certain portion of the subject, and in
1124 addition, further substrings from the subject may be picked out by
1125 parts of the pattern. Following the usage in Jeffrey Friedl's book,
1126 this is called "capturing" in what follows, and the phrase "capturing
1127 subpattern" is used for a fragment of a pattern that picks out a sub-
1128 string. PCRE supports several other kinds of parenthesized subpattern
1129 that do not cause substrings to be captured.
1130
1131 Captured substrings are returned to the caller via a vector of integer
1132 offsets whose address is passed in ovector. The number of elements in
1133 the vector is passed in ovecsize. The first two-thirds of the vector is
1134 used to pass back captured substrings, each substring using a pair of
1135 integers. The remaining third of the vector is used as workspace by
1136 pcre_exec() while matching capturing subpatterns, and is not available
1137 for passing back information. The length passed in ovecsize should
1138 always be a multiple of three. If it is not, it is rounded down.
1139
When a match has been successful, information about captured substrings
is returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of a pair is set to the offset of the first character in a
substring, and the second is set to the offset of the first character
after the end of a substring. The first pair, ovector[0] and
ovector[1], identifies the portion of the subject string matched by the
entire pattern. The next pair is used for the first capturing
subpattern, and so on. The value returned by pcre_exec() is the number
of pairs that have been set. If there are no capturing subpatterns, the
return value from a successful match is 1, indicating that just the
first pair of offsets has been set.
1152
1153 Some convenience functions are provided for extracting the captured
1154 substrings as separate strings. These are described in the following
1155 section.
1156
It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc),
subpatterns 1 and 3 are matched, but 2 is not. When this happens, both
offset values corresponding to the unused subpattern are set to -1.
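
Captured substrings can be read straight out of the offset pairs. The
sketch below copies substring n into a caller-supplied buffer and
reports an unset substring separately. The name copy_capture() and its
return codes are hypothetical and are not PCRE's; the library's own
extraction functions are described in a later section.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy captured substring n out of subject using
   the offset pairs in ovector, where rc is the number of pairs set.
   Returns the substring length, -1 if n is out of range or the
   substring is unset (offsets of -1), or -2 if buf is too small.
   These return codes belong to this sketch, not to PCRE. */
int copy_capture(const char *subject, const int *ovector, int rc, int n,
                 char *buf, int bufsize)
{
    int start, end, len;
    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (n < 0 || n >= rc)
        return -1;
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start < 0 || end < 0)
        return -1;                       /* unset substring */
    len = end - start;
    if (len + 1 > bufsize)
        return -2;                       /* no room for the final zero */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

With the hand-written offsets {0,3, 0,1, -1,-1, 1,3} and a count of 4
(the values one would expect for "abc" matched against (a|(z))(bc)),
substring 2 is reported as unset, while substrings 1 and 3 yield "a" and
"bc".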
1162
1163 If a capturing subpattern is matched repeatedly, it is the last portion
1164 of the string that it matched that gets returned.
1165
1166 If the vector is too small to hold all the captured substrings, it is
1167 used as far as possible (up to two-thirds of its length), and the func-
1168 tion returns a value of zero. In particular, if the substring offsets
1169 are not of interest, pcre_exec() may be called with ovector passed as
1170 NULL and ovecsize as zero. However, if the pattern contains back refer-
1171 ences and the ovector isn't big enough to remember the related sub-
1172 strings, PCRE has to get additional memory for use during matching.
1173 Thus it is usually advisable to supply an ovector.
1174
Note that pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT (or the obsolete
pcre_info()) can be used to find out how many capturing subpatterns
there are in a compiled pattern. The smallest size for ovector that
will allow for n captured substrings, in addition to the offsets of the
substring matched by the whole pattern, is (n+1)*3.
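
The sizing rule can be written as two small helpers. This is only a
sketch of the arithmetic stated above; min_ovecsize() and usable_pairs()
are hypothetical names, not PCRE functions.

```c
#include <assert.h>

/* Smallest ovector size, in ints, that can hold the whole-match pair
   plus capture_count captured substrings: two offsets per pair, plus
   the final third of the vector that is used as workspace. */
int min_ovecsize(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that an ovector of the given size can return.
   The size is rounded down to a multiple of three, and two-thirds of
   that is available for pairs, which comes to ovecsize / 3 pairs. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For a pattern with three capturing subpatterns this gives a minimum size
of 12, and an ovector of 30 ints can return at most 10 pairs.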
1179
1180 If pcre_exec() fails, it returns a negative number. The following are
1181 defined in the header file:
1182
1183 PCRE_ERROR_NOMATCH (-1)
1184
1185 The subject string did not match the pattern.
1186
1187 PCRE_ERROR_NULL (-2)
1188
1189 Either code or subject was passed as NULL, or ovector was NULL and
1190 ovecsize was not zero.
1191
1192 PCRE_ERROR_BADOPTION (-3)
1193
1194 An unrecognized bit was set in the options argument.
1195
1196 PCRE_ERROR_BADMAGIC (-4)
1197
1198 PCRE stores a 4-byte "magic number" at the start of the compiled code,
1199 to catch the case when it is passed a junk pointer. This is the error
1200 it gives when the magic number isn't present.
1201
1202 PCRE_ERROR_UNKNOWN_NODE (-5)
1203
1204 While running the pattern match, an unknown item was encountered in the
1205 compiled pattern. This error could be caused by a bug in PCRE or by
1206 overwriting of the compiled pattern.
1207
1208 PCRE_ERROR_NOMEMORY (-6)
1209
1210 If a pattern contains back references, but the ovector that is passed
1211 to pcre_exec() is not big enough to remember the referenced substrings,
1212 PCRE gets a block of memory at the start of matching to use for this
1213 purpose. If the call via pcre_malloc() fails, this error is given. The
1214 memory is freed at the end of matching.
1215
1216 PCRE_ERROR_NOSUBSTRING (-7)
1217
1218 This error is used by the pcre_copy_substring(), pcre_get_substring(),
1219 and pcre_get_substring_list() functions (see below). It is never
1220 returned by pcre_exec().
1221
1222 PCRE_ERROR_MATCHLIMIT (-8)
1223
1224 The recursion and backtracking limit, as specified by the match_limit
1225 field in a pcre_extra structure (or defaulted) was reached. See the
1226 description above.
1227
1228 PCRE_ERROR_CALLOUT (-9)
1229
1230 This error is never generated by pcre_exec() itself. It is provided for
1231 use by callout functions that want to yield a distinctive error code.
1232 See the pcrecallout documentation for details.
1233
1234 PCRE_ERROR_BADUTF8 (-10)
1235
1236 A string that contains an invalid UTF-8 byte sequence was passed as a
1237 subject.
1238
1239 PCRE_ERROR_BADUTF8_OFFSET (-11)
1240
1241 The UTF-8 byte sequence that was passed as a subject was valid, but the
1242 value of startoffset did not point to the beginning of a UTF-8 charac-
1243 ter.
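
A program will often want to turn these codes into messages. The sketch
below maps the values listed above to short texts. The name
exec_error_text() and the message wording are hypothetical, and a real
program would normally compare against the PCRE_ERROR_xxx macros from
pcre.h rather than rely on the numeric values.

```c
#include <assert.h>

/* Messages indexed by the negated error code, using the numeric values
   listed above. Entry 0 covers every non-negative return value. */
static const char *const exec_errors[] = {
    "match succeeded",                      /*   0 or more               */
    "no match",                             /*  -1 PCRE_ERROR_NOMATCH    */
    "NULL argument",                        /*  -2 PCRE_ERROR_NULL       */
    "unrecognized option bit",              /*  -3 PCRE_ERROR_BADOPTION  */
    "bad magic number",                     /*  -4 PCRE_ERROR_BADMAGIC   */
    "unknown item in compiled pattern",     /*  -5 PCRE_ERROR_UNKNOWN_NODE */
    "out of memory",                        /*  -6 PCRE_ERROR_NOMEMORY   */
    "no such substring",                    /*  -7 PCRE_ERROR_NOSUBSTRING */
    "match limit reached",                  /*  -8 PCRE_ERROR_MATCHLIMIT */
    "callout error",                        /*  -9 PCRE_ERROR_CALLOUT    */
    "invalid UTF-8 in subject",             /* -10 PCRE_ERROR_BADUTF8    */
    "start offset inside a UTF-8 character" /* -11 PCRE_ERROR_BADUTF8_OFFSET */
};

/* Hypothetical helper: describe a value returned by pcre_exec(). */
const char *exec_error_text(int rc)
{
    int count = (int)(sizeof exec_errors / sizeof exec_errors[0]);
    assert(count == 12);                /* codes 0 through -11 */
    if (rc >= 0)
        return exec_errors[0];
    if (rc <= -count)
        return "unknown error";
    return exec_errors[-rc];
}
```

Note that a return value of zero still denotes a successful match; it
only indicates that ovector was too small to hold every captured
substring.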
1244
1245
1246 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1247
1248 int pcre_copy_substring(const char *subject, int *ovector,
1249 int stringcount, int stringnumber, char *buffer,
1250 int buffersize);
1251
1252 int pcre_get_substring(const char *subject, int *ovector,
1253 int stringcount, int stringnumber,
1254 const char **stringptr);
1255
1256 int pcre_get_substring_list(const char *subject,
1257 int *ovector, int stringcount, const char ***listptr);
1258
Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured
substrings as new, separate, zero-terminated strings. These functions
identify substrings by number. The next section describes functions for
extracting named substrings. A substring that contains a binary zero is
correctly extracted and has a further zero added on the end, but the
result is not, of course, a C string.
1268
The first three arguments are the same for all three of these
functions: subject is the subject string which has just been
successfully matched, ovector is a pointer to the vector of integer
offsets that was passed to pcre_exec(), and stringcount is the number
of substrings that were captured by the match, including the substring
that matched the entire regular expression. This is the value returned
by pcre_exec() if it is greater than zero. If pcre_exec() returned
zero, indicating that it ran out of space in ovector, the value passed
as stringcount should be the size of the vector divided by three.
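
The choice of stringcount can be captured in a one-line helper. This is
a sketch of the rule just stated; stringcount_from() is a hypothetical
name and not part of PCRE.

```c
#include <assert.h>

/* Hypothetical helper: the value to pass as stringcount, given the
   non-negative value rc returned by pcre_exec() and the ovecsize that
   was passed to it. A negative rc means the match failed, in which
   case there are no substrings to extract. */
int stringcount_from(int rc, int ovecsize)
{
    assert(rc >= 0 && ovecsize >= 0);
    return rc > 0 ? rc : ovecsize / 3;
}
```

For example, a return value of 0 from a call made with an ovector of 30
elements corresponds to a stringcount of 10.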
1278
The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, while
higher values extract the captured substrings. For
pcre_copy_substring(), the string is placed in buffer, whose length is
given by buffersize, while for pcre_get_substring() a new block of
memory is obtained via pcre_malloc, and its address is returned via
stringptr. The yield of the function is the length of the string, not
including the terminating zero, or one of
1288
1289 PCRE_ERROR_NOMEMORY (-6)
1290
1291 The buffer was too small for pcre_copy_substring(), or the attempt to
1292 get memory failed for pcre_get_substring().
1293
1294 PCRE_ERROR_NOSUBSTRING (-7)
1295
1296 There is no substring whose number is stringnumber.
1297
1298 The pcre_get_substring_list() function extracts all available sub-
1299 strings and builds a list of pointers to them. All this is done in a
1300 single block of memory which is obtained via pcre_malloc. The address
1301 of the memory block is returned via listptr, which is also the start of
1302 the list of string pointers. The end of the list is marked by a NULL
1303 pointer. The yield of the function is zero if all went well, or
1304
1305 PCRE_ERROR_NOMEMORY (-6)
1306
1307 if the attempt to get the memory block failed.
1308
When any of these functions encounters a substring that is unset, which
can happen when capturing subpattern number n+1 matches some part of
the subject, but subpattern n has not been used at all, it returns an
empty string. This can be distinguished from a genuine zero-length
substring by inspecting the appropriate offset in ovector, which is
negative for unset substrings.
1315
1316 The two convenience functions pcre_free_substring() and
1317 pcre_free_substring_list() can be used to free the memory returned by a
1318 previous call of pcre_get_substring() or pcre_get_substring_list(),
1319 respectively. They do nothing more than call the function pointed to by
1320 pcre_free, which of course could be called directly from a C program.
1321 However, PCRE is used in some situations where it is linked via a spe-
1322 cial interface to another programming language which cannot use
1323 pcre_free directly; it is for these cases that the functions are pro-
1324 vided.
1325
1326
1327 EXTRACTING CAPTURED SUBSTRINGS BY NAME
1328
1329 int pcre_copy_named_substring(const pcre *code,
1330 const char *subject, int *ovector,
1331 int stringcount, const char *stringname,
1332 char *buffer, int buffersize);
1333
1334 int pcre_get_stringnumber(const pcre *code,
1335 const char *name);
1336
1337 int pcre_get_named_substring(const pcre *code,
1338 const char *subject, int *ovector,
1339 int stringcount, const char *stringname,
1340 const char **stringptr);
1341
To extract a substring by name, you first have to find the associated
number. This can be done by calling pcre_get_stringnumber(). The first
argument is the compiled pattern, and the second is the name. For
example, for this pattern

    ab(?P<xxx>\d+)...

the number of the subpattern called "xxx" is 1. Given the number, you
can then extract the substring directly, or use one of the functions
described in the previous section. For convenience, there are also two
functions that do the whole job.
1353
1354 Most of the arguments of pcre_copy_named_substring() and
1355 pcre_get_named_substring() are the same as those for the functions that
1356 extract by number, and so are not re-described here. There are just two
1357 differences.
1358
1359 First, instead of a substring number, a substring name is given. Sec-
1360 ond, there is an extra argument, given at the start, which is a pointer
1361 to the compiled pattern. This is needed in order to gain access to the
1362 name-to-number translation table.
1363
1364 These functions call pcre_get_stringnumber(), and if it succeeds, they
1365 then call pcre_copy_substring() or pcre_get_substring(), as appropri-
1366 ate.
1367
1368 Last updated: 09 December 2003
1369 Copyright (c) 1997-2003 University of Cambridge.
1370 -----------------------------------------------------------------------------
1371
1379 PCRE CALLOUTS
1380
1381 int (*pcre_callout)(pcre_callout_block *);
1382
1383 PCRE provides a feature called "callout", which is a means of temporar-
1384 ily passing control to the caller of PCRE in the middle of pattern
1385 matching. The caller of PCRE provides an external function by putting
1386 its entry point in the global variable pcre_callout. By default, this
1387 variable contains NULL, which disables all calling out.
1388
1389 Within a regular expression, (?C) indicates the points at which the
1390 external function is to be called. Different callout points can be
1391 identified by putting a number less than 256 after the letter C. The
1392 default value is zero. For example, this pattern has two callout
1393 points:
1394
1395 (?C1)abc(?C2)def
1396
During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. Its only argument is a pointer
to a pcre_callout_block structure. This contains the following fields:
1400
1401 int version;
1402 int callout_number;
1403 int *offset_vector;
1404 const char *subject;
1405 int subject_length;
1406 int start_match;
1407 int current_position;
1408 int capture_top;
1409 int capture_last;
1410 void *callout_data;
1411
1412 The version field is an integer containing the version number of the
1413 block format. The current version is zero. The version number may
1414 change in future if additional fields are added, but the intention is
1415 never to remove any of the existing fields.
1416
1417 The callout_number field contains the number of the callout, as com-
1418 piled into the pattern (that is, the number after ?C).
1419
1420 The offset_vector field is a pointer to the vector of offsets that was
1421 passed by the caller to pcre_exec(). The contents can be inspected in
1422 order to extract substrings that have been matched so far, in the same
1423 way as for extracting substrings after a match has completed.
1424
The subject and subject_length fields contain copies of the values that
were passed to pcre_exec().
1427
1428 The start_match field contains the offset within the subject at which
1429 the current match attempt started. If the pattern is not anchored, the
1430 callout function may be called several times for different starting
1431 points.
1432
1433 The current_position field contains the offset within the subject of
1434 the current match pointer.
1435
1436 The capture_top field contains one more than the number of the highest
1437 numbered captured substring so far. If no substrings have been
1438 captured, the value of capture_top is one.
1439
1440 The capture_last field contains the number of the most recently cap-
1441 tured substring.
1442
1443 The callout_data field contains a value that is passed to pcre_exec()
1444 by the caller specifically so that it can be passed back in callouts.
1445 It is passed in the callout_data field of the pcre_extra data struc-
1446 ture. If no such data was passed, the value of callout_data in a
1447 pcre_callout block is NULL. There is a description of the pcre_extra
1448 structure in the pcreapi documentation.
1449
1450
1451
1452 RETURN VALUES
1453
1454 The callout function returns an integer. If the value is zero, matching
1455 proceeds as normal. If the value is greater than zero, matching fails
1456 at the current point, but backtracking to test other possibilities goes
1457 ahead, just as if a lookahead assertion had failed. If the value is
1458 less than zero, the match is abandoned, and pcre_exec() returns the
1459 value.
1460
1461 Negative values should normally be chosen from the set of
1462 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
1463 dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
1464 reserved for use by callout functions; it will never be used by PCRE
1465 itself.
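
The three kinds of return value can be illustrated with a minimal sketch. So that it compiles without the PCRE headers, it uses a stand-in structure copied from the field list above. In a real program the function would take a pcre_callout_block pointer and its address would be stored in pcre_callout. The names callout_block_sketch, SKETCH_ABANDON, and budget_callout are hypothetical. The use of callout_data as a pointer to an int call budget, and the limit of 8 characters, are assumptions made for illustration only.

```c
#include <stddef.h>

/* Stand-in for pcre_callout_block, reproduced from the field list above. */
typedef struct {
    int version;
    int callout_number;
    int *offset_vector;
    const char *subject;
    int subject_length;
    int start_match;
    int current_position;
    int capture_top;
    int capture_last;
    void *callout_data;
} callout_block_sketch;

/* Arbitrary negative value standing in for PCRE_ERROR_CALLOUT. */
#define SKETCH_ABANDON (-9)

/* Returns 0 to continue, 1 to fail at this point (backtracking goes
   ahead), or a negative value to abandon the whole match. */
int budget_callout(callout_block_sketch *cb)
{
    int *budget = (int *)cb->callout_data;

    if (budget == NULL) return 0;               /* no data passed: carry on */
    if (*budget <= 0) return SKETCH_ABANDON;    /* budget spent: abandon */
    (*budget)--;

    /* At callout 2, reject match attempts longer than 8 characters. */
    if (cb->callout_number == 2 &&
        cb->current_position - cb->start_match > 8)
        return 1;

    return 0;
}
```

A positive return only rejects the current path, so the function may be called again at the same callout point after backtracking. That is why the budget is decremented on every call.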
1466
1467 Last updated: 21 January 2003
1468 Copyright (c) 1997-2003 University of Cambridge.
1469 -----------------------------------------------------------------------------
1470
1471 PCRE(3) PCRE(3)
1472
1473
1474
1475 NAME
1476 PCRE - Perl-compatible regular expressions
1477
1478 DIFFERENCES FROM PERL
1479
1480 This document describes the differences in the ways that PCRE and Perl
1481 handle regular expressions. The differences described here are with
1482 respect to Perl 5.8.
1483
1484 1. PCRE does not have full UTF-8 support. Details of what it does have
1485 are given in the section on UTF-8 support in the main pcre page.
1486
1487 2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
1488 permits them, but they do not mean what you might think. For example,
1489 (?!a){3} does not assert that the next three characters are not "a". It
1490 just asserts that the next character is not "a" three times.
1491
1492 3. Capturing subpatterns that occur inside negative lookahead asser-
1493 tions are counted, but their entries in the offsets vector are never
1494 set. Perl sets its numerical variables from any such patterns that are
1495 matched before the assertion fails to match something (thereby succeed-
1496 ing), but only if the negative lookahead assertion contains just one
1497 branch.
1498
1499 4. Though binary zero characters are supported in the subject string,
1500 they are not allowed in a pattern string because it is passed as a nor-
1501 mal C string, terminated by zero. The escape sequence "\0" can be used
1502 in the pattern to represent a binary zero.
1503
1504 5. The following Perl escape sequences are not supported: \l, \u, \L,
1505 \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
1506 string-handling and are not part of its pattern matching engine. If any
1507 of these are encountered by PCRE, an error is generated.
1508
1509 6. PCRE does support the \Q...\E escape for quoting substrings. Charac-
1510 ters in between are treated as literals. This is slightly different
1511 from Perl in that $ and @ are also handled as literals inside the
1512 quotes. In Perl, they cause variable interpolation (but of course PCRE
1513 does not have variables). Note the following examples:
1514
1515 Pattern            PCRE matches    Perl matches
1516
1517 \Qabc$xyz\E        abc$xyz         abc followed by the
1518                                    contents of $xyz
1519 \Qabc\$xyz\E       abc\$xyz        abc\$xyz
1520 \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
1521
1522 The \Q...\E sequence is recognized both inside and outside character
1523 classes.
1524
1525 7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
1526 constructions. However, there is some experimental support for recur-
1527 sive patterns using the non-Perl items (?R), (?number) and (?P>name).
1528 Also, the PCRE "callout" feature allows an external function to be
1529 called during pattern matching.
1530
1531 8. There are some differences that are concerned with the settings of
1532 captured strings when part of a pattern is repeated. For example,
1533 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
1534 unset, but in PCRE it is set to "b".
1535
1536 9. PCRE provides some extensions to the Perl regular expression
1537 facilities:
1538
1539 (a) Although lookbehind assertions must match fixed length strings,
1540 each alternative branch of a lookbehind assertion can match a different
1541 length of string. Perl requires them all to have the same length.
1542
1543 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
1544 meta-character matches only at the very end of the string.
1545
1546 (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
1547 cial meaning is faulted.
1548
1549 (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
1550 fiers is inverted, that is, by default they are not greedy, but if fol-
1551 lowed by a question mark they are.
1552
1553 (e) PCRE_ANCHORED can be used to force a pattern to be tried only at
1554 the first matching position in the subject string.
1555
1556 (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
1557 TURE options for pcre_exec() have no Perl equivalents.
1558
1559 (g) The (?R), (?number), and (?P>name) constructs allow for recursive
1560 pattern matching (Perl can do this using the (?p{code}) construct,
1561 which PCRE cannot support).
1562
1563 (h) PCRE supports named capturing substrings, using the Python syntax.
1564
1565 (i) PCRE supports the possessive quantifier "++" syntax, taken from
1566 Sun's Java package.
1567
1568 (j) The (R) condition, for testing recursion, is a PCRE extension.
1569
1570 (k) The callout facility is PCRE-specific.
1571
1572 Last updated: 09 December 2003
1573 Copyright (c) 1997-2003 University of Cambridge.
1574 -----------------------------------------------------------------------------
1575
1576 PCRE(3) PCRE(3)
1577
1578
1579
1580 NAME
1581 PCRE - Perl-compatible regular expressions
1582
1583 PCRE REGULAR EXPRESSION DETAILS
1584
1585 The syntax and semantics of the regular expressions supported by PCRE
1586 are described below. Regular expressions are also described in the Perl
1587 documentation and in a number of other books, some of which have copi-
1588 ous examples. Jeffrey Friedl's "Mastering Regular Expressions", pub-
1589 lished by O'Reilly, covers them in great detail. The description here
1590 is intended as reference documentation.
1591
1592 The basic operation of PCRE is on strings of bytes. However, there is
1593 also support for UTF-8 character strings. To use this support you must
1594 build PCRE to include UTF-8 support, and then call pcre_compile() with
1595 the PCRE_UTF8 option. How this affects the pattern matching is men-
1596 tioned in several places below. There is also a summary of UTF-8 fea-
1597 tures in the section on UTF-8 support in the main pcre page.
1598
1599 A regular expression is a pattern that is matched against a subject
1600 string from left to right. Most characters stand for themselves in a
1601 pattern, and match the corresponding characters in the subject. As a
1602 trivial example, the pattern
1603
1604 The quick brown fox
1605
1606 matches a portion of a subject string that is identical to itself. The
1607 power of regular expressions comes from the ability to include alterna-
1608 tives and repetitions in the pattern. These are encoded in the pattern
1609 by the use of meta-characters, which do not stand for themselves but
1610 instead are interpreted in some special way.
1611
1612 There are two different sets of meta-characters: those that are recog-
1613 nized anywhere in the pattern except within square brackets, and those
1614 that are recognized in square brackets. Outside square brackets, the
1615 meta-characters are as follows:
1616
1617 \      general escape character with several uses
1618 ^      assert start of string (or line, in multiline mode)
1619 $      assert end of string (or line, in multiline mode)
1620 .      match any character except newline (by default)
1621 [      start character class definition
1622 |      start of alternative branch
1623 (      start subpattern
1624 )      end subpattern
1625 ?      extends the meaning of (
1626        also 0 or 1 quantifier
1627        also quantifier minimizer
1628 *      0 or more quantifier
1629 +      1 or more quantifier
1630        also "possessive quantifier"
1631 {      start min/max quantifier
1632
1633 Part of a pattern that is in square brackets is called a "character
1634 class". In a character class the only meta-characters are:
1635
1636 \      general escape character
1637 ^      negate the class, but only if the first character
1638 -      indicates character range
1639 [      POSIX character class (only if followed by POSIX
1640        syntax)
1641 ]      terminates the character class
1642
1643 The following sections describe the use of each of the meta-characters.
1644
1645
1646 BACKSLASH
1647
1648 The backslash character has several uses. Firstly, if it is followed by
1649 a non-alphameric character, it takes away any special meaning that
1650 character may have. This use of backslash as an escape character
1651 applies both inside and outside character classes.
1652
1653 For example, if you want to match a * character, you write \* in the
1654 pattern. This escaping action applies whether or not the following
1655 character would otherwise be interpreted as a meta-character, so it is
1656 always safe to precede a non-alphameric with backslash to specify that
1657 it stands for itself. In particular, if you want to match a backslash,
1658 you write \\.
1659
1660 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
1661 the pattern (other than in a character class) and characters between a
1662 # outside a character class and the next newline character are ignored.
1663 An escaping backslash can be used to include a whitespace or # charac-
1664 ter as part of the pattern.
1665
1666 If you want to remove the special meaning from a sequence of charac-
1667 ters, you can do so by putting them between \Q and \E. This is differ-
1668 ent from Perl in that $ and @ are handled as literals in \Q...\E
1669 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
1670 tion. Note the following examples:
1671
1672 Pattern            PCRE matches    Perl matches
1673
1674 \Qabc$xyz\E        abc$xyz         abc followed by the
1675                                    contents of $xyz
1676 \Qabc\$xyz\E       abc\$xyz        abc\$xyz
1677 \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
1678
1679 The \Q...\E sequence is recognized both inside and outside character
1680 classes.
1681
1682 A second use of backslash provides a way of encoding non-printing char-
1683 acters in patterns in a visible manner. There is no restriction on the
1684 appearance of non-printing characters, apart from the binary zero that
1685 terminates a pattern, but when a pattern is being prepared by text
1686 editing, it is usually easier to use one of the following escape
1687 sequences than the binary character it represents:
1688
1689 \a alarm, that is, the BEL character (hex 07)
1690 \cx "control-x", where x is any character
1691 \e escape (hex 1B)
1692 \f formfeed (hex 0C)
1693 \n newline (hex 0A)
1694 \r carriage return (hex 0D)
1695 \t tab (hex 09)
1696 \ddd character with octal code ddd, or backreference
1697 \xhh character with hex code hh
1698 \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1699
1700 The precise effect of \cx is as follows: if x is a lower case letter,
1701 it is converted to upper case. Then bit 6 of the character (hex 40) is
1702 inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
1703 becomes hex 7B.
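
The rule for \cx is small enough to write out directly. This is a sketch that assumes ASCII character codes; the function name control_escape is hypothetical.

```c
/* Value produced by \cx for the character x, following the rule above:
   fold lower case letters to upper case, then invert bit 6 (hex 40).
   Assumes ASCII. */
int control_escape(int x)
{
    if (x >= 'a' && x <= 'z')
        x -= 'a' - 'A';     /* lower case to upper case */
    return x ^ 0x40;        /* invert bit 6 */
}
```

With this definition, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, as in the examples above.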
1704
1705 After \x, from zero to two hexadecimal digits are read (letters can be
1706 in upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
1707 its may appear between \x{ and }, but the value of the character code
1708 must be less than 2**31 (that is, the maximum hexadecimal value is
1709 7FFFFFFF). If characters other than hexadecimal digits appear between
1710 \x{ and }, or if there is no terminating }, this form of escape is not
1711 recognized. Instead, the initial \x will be interpreted as a basic hex-
1712 adecimal escape, with no following digits, giving a byte whose value is
1713 zero.
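
The basic form of \x can be sketched as below. The helper reads at most two hexadecimal digits and yields zero when there are none, which is also what happens when an unrecognized \x{ form falls back to the basic escape. The function names are hypothetical, and the \x{...} form itself is not handled here.

```c
/* Numeric value of one hexadecimal digit, or -1 if c is not one. */
static int hex_digit_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* s points just past "\x". Stores the resulting byte value in *value and
   returns the number of digits consumed (0, 1, or 2). */
int read_hex_escape(const char *s, int *value)
{
    int consumed = 0;
    int digit;

    *value = 0;
    while (consumed < 2 && (digit = hex_digit_value(s[consumed])) >= 0) {
        *value = *value * 16 + digit;
        consumed++;
    }
    return consumed;
}
```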
1714
1715 Characters whose value is less than 256 can be defined by either of the
1716 two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
1717 in the way they are handled. For example, \xdc is exactly the same as
1718 \x{dc}.
1719
1720 After \0 up to two further octal digits are read. In both cases, if
1721 there are fewer than two digits, just those that are present are used.
1722 Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
1723 character (code value 7). Make sure you supply two digits after the
1724 initial zero if the character that follows is itself an octal digit.
1725
1726 The handling of a backslash followed by a digit other than 0 is compli-
1727 cated. Outside a character class, PCRE reads it and any following dig-
1728 its as a decimal number. If the number is less than 10, or if there
1729 have been at least that many previous capturing left parentheses in the
1730 expression, the entire sequence is taken as a back reference. A
1731 description of how this works is given later, following the discussion
1732 of parenthesized subpatterns.
1733
1734 Inside a character class, or if the decimal number is greater than 9
1735 and there have not been that many capturing subpatterns, PCRE re-reads
1736 up to three octal digits following the backslash, and generates a sin-
1737 gle byte from the least significant 8 bits of the value. Any subsequent
1738 digits stand for themselves. For example:
1739
1740 \040   is another way of writing a space
1741 \40    is the same, provided there are fewer than 40
1742        previous capturing subpatterns
1743 \7     is always a back reference
1744 \11    might be a back reference, or another way of
1745        writing a tab
1746 \011   is always a tab
1747 \0113  is a tab followed by the character "3"
1748 \113   might be a back reference, otherwise the
1749        character with octal code 113
1750 \377   might be a back reference, otherwise
1751        the byte consisting entirely of 1 bits
1752 \81    is either a back reference, or a binary zero
1753        followed by the two characters "8" and "1"
1754
1755 Note that octal values of 100 or greater must not be introduced by a
1756 leading zero, because no more than three octal digits are ever read.
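
The decision described above, for a backslash followed by a digit other than 0 outside a character class, can be sketched as a small function. It is an illustration of the stated rule, not PCRE's own code; the names are hypothetical, and the caller is assumed to know how many capturing left parentheses precede the escape.

```c
enum digit_escape_kind { ESCAPE_BACKREF, ESCAPE_OCTAL };

/* s points at the first digit after the backslash (which must not be 0).
   captures_so_far is the number of previous capturing left parentheses.
   On return, *value is the back reference number or the byte value, and
   *consumed is the number of digits that belong to the escape. */
enum digit_escape_kind classify_digit_escape(const char *s,
                                             int captures_so_far,
                                             int *value, int *consumed)
{
    int n = 0;
    int i = 0;

    /* First reading: all following digits, as a decimal number. */
    while (s[i] >= '0' && s[i] <= '9') {
        if (n < 100000) n = n * 10 + (s[i] - '0');  /* guard overflow */
        i++;
    }
    if (n < 10 || n <= captures_so_far) {
        *value = n;
        *consumed = i;
        return ESCAPE_BACKREF;
    }

    /* Second reading: up to three octal digits, low 8 bits only. */
    n = 0;
    for (i = 0; i < 3 && s[i] >= '0' && s[i] <= '7'; i++)
        n = n * 8 + (s[i] - '0');
    *value = n & 0xFF;
    *consumed = i;
    return ESCAPE_OCTAL;
}
```

For \81 with fewer than 81 capturing subpatterns, the second reading finds no octal digits, so the result is a binary zero with no digits consumed, and "8" and "1" remain as literal characters.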
1757
1758 All the sequences that define a single byte value or a single UTF-8
1759 character (in UTF-8 mode) can be used both inside and outside character
1760 classes. In addition, inside a character class, the sequence \b is
1761 interpreted as the backspace character (hex 08). Outside a character
1762 class it has a different meaning (see below).
1763
1764 The third use of backslash is for specifying generic character types:
1765
1766 \d any decimal digit
1767 \D any character that is not a decimal digit
1768 \s any whitespace character
1769 \S any character that is not a whitespace character
1770 \w any "word" character
1771 \W any "non-word" character
1772
1773 Each pair of escape sequences partitions the complete set of characters
1774 into two disjoint sets. Any given character matches one, and only one,
1775 of each pair.
1776
1777 In UTF-8 mode, characters with values greater than 255 never match \d,
1778 \s, or \w, and always match \D, \S, and \W.
1779
1780 For compatibility with Perl, \s does not match the VT character (code
1781 11). This makes it different from the POSIX "space" class. The \s
1782 characters are HT (9), LF (10), FF (12), CR (13), and space (32).
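
The difference from the POSIX "space" class comes down to a single character code, as this sketch shows. The function names are hypothetical and the codes are those listed above.

```c
/* Characters matched by \s: HT, LF, FF, CR, and space (no VT). */
int matches_backslash_s(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Characters in the POSIX "space" class: the same set plus VT (11). */
int matches_posix_space(int c)
{
    return matches_backslash_s(c) || c == 11;
}
```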
1783
1784 A "word" character is any letter or digit or the underscore character,
1785 that is, any character which can be part of a Perl "word". The defini-
1786 tion of letters and digits is controlled by PCRE's character tables,
1787 and may vary if locale-specific matching is taking place (see "Locale
1788 support" in the pcreapi page). For example, in the "fr" (French)
1789 locale, some character codes greater than 128 are used for accented
1790 letters, and these are matched by \w.
1791
1792 These character type sequences can appear both inside and outside char-
1793 acter classes. They each match one character of the appropriate type.
1794 If the current matching point is at the end of the subject string, all
1795 of them fail, since there is no character to match.
1796
1797 The fourth use of backslash is for certain simple assertions. An asser-
1798 tion specifies a condition that has to be met at a particular point in
1799 a match, without consuming any characters from the subject string. The
1800 use of subpatterns for more complicated assertions is described below.
1801 The backslashed assertions are
1802
1803 \b matches at a word boundary
1804 \B matches when not at a word boundary
1805 \A matches at start of subject
1806 \Z matches at end of subject or before newline at end
1807 \z matches at end of subject
1808 \G matches at first matching position in subject
1809
1810 These assertions may not appear in character classes (but note that \b
1811 has a different meaning, namely the backspace character, inside a char-
1812 acter class).
1813
1814 A word boundary is a position in the subject string where the current
1815 character and the previous character do not both match \w or \W (i.e.
1816 one matches \w and the other matches \W), or the start or end of the
1817 string if the first or last character matches \w, respectively.
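
The definition of a word boundary can be sketched as a comparison of the characters on either side of the position, treating the outside of the string as a non-word character. The sketch assumes the default "C" locale, so that isalnum() plus underscore stands in for PCRE's character tables; the function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* A "word" character: a letter, a digit, or underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True if pos (0 <= pos <= length) is a word boundary in subject. */
int is_word_boundary(const char *subject, size_t length, size_t pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);

    /* A boundary exists when exactly one side is a word character. */
    return before != after;
}
```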
1818
1819 The \A, \Z, and \z assertions differ from the traditional circumflex
1820 and dollar (described below) in that they only ever match at the very
1821 start and end of the subject string, whatever options are set. Thus,
1822 they are independent of multiline mode.
1823
1824 They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
1825 startoffset argument of pcre_exec() is non-zero, indicating that match-
1826 ing is to start at a point other than the beginning of the subject, \A
1827 can never match. The difference between \Z and \z is that \Z matches
1828 before a newline that is the last character of the string as well as at
1829 the end of the string, whereas \z matches only at the end.
1830
1831 The \G assertion is true only when the current matching position is at
1832 the start point of the match, as specified by the startoffset argument
1833 of pcre_exec(). It differs from \A when the value of startoffset is
1834 non-zero. By calling pcre_exec() multiple times with appropriate argu-
1835 ments, you can mimic Perl's /g option, and it is in this kind of imple-
1836 mentation where \G can be useful.
1837
1838 Note, however, that PCRE's interpretation of \G, as the start of the
1839 current match, is subtly different from Perl's, which defines it as the
1840 end of the previous match. In Perl, these can be different when the
1841 previously matched string was empty. Because PCRE does just one match
1842 at a time, it cannot reproduce this behaviour.
1843
1844 If all the alternatives of a pattern begin with \G, the expression is
1845 anchored to the starting match position, and the "anchored" flag is set
1846 in the compiled regular expression.
1847
1848
1849 CIRCUMFLEX AND DOLLAR
1850
1851 Outside a character class, in the default matching mode, the circumflex
1852 character is an assertion which is true only if the current matching
1853 point is at the start of the subject string. If the startoffset argu-
1854 ment of pcre_exec() is non-zero, circumflex can never match if the
1855 PCRE_MULTILINE option is unset. Inside a character class, circumflex
1856 has an entirely different meaning (see below).
1857
1858 Circumflex need not be the first character of the pattern if a number
1859 of alternatives are involved, but it should be the first thing in each
1860 alternative in which it appears if the pattern is ever to match that
1861 branch. If all possible alternatives start with a circumflex, that is,
1862 if the pattern is constrained to match only at the start of the sub-
1863 ject, it is said to be an "anchored" pattern. (There are also other
1864 constructs that can cause a pattern to be anchored.)
1865
1866 A dollar character is an assertion which is true only if the current
1867 matching point is at the end of the subject string, or immediately
1868 before a newline character that is the last character in the string (by
1869 default). Dollar need not be the last character of the pattern if a
1870 number of alternatives are involved, but it should be the last item in
1871 any branch in which it appears. Dollar has no special meaning in a
1872 character class.
1873
1874 The meaning of dollar can be changed so that it matches only at the
1875 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
1876 compile time. This does not affect the \Z assertion.
1877
1878 The meanings of the circumflex and dollar characters are changed if the
1879 PCRE_MULTILINE option is set. When this is the case, they match immedi-
1880 ately after and immediately before an internal newline character,
1881 respectively, in addition to matching at the start and end of the sub-
1882 ject string. For example, the pattern /^abc$/ matches the subject
1883 string "def\nabc" in multiline mode, but not otherwise. Consequently,
1884 patterns that are anchored in single line mode because all branches
1885 start with ^ are not anchored in multiline mode, and a match for cir-
1886 cumflex is possible when the startoffset argument of pcre_exec() is
1887 non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE
1888 is set.
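
The positions at which circumflex and dollar can match, under the options just described, can be sketched as two predicates. This is an illustration of the rules in this section only: it ignores PCRE_NOTBOL, PCRE_NOTEOL, and the startoffset argument, and it does not try to settle whether a newline that is the very last character of the subject counts as "internal" for circumflex. The function names and the use of plain int flags in place of option bits are assumptions.

```c
#include <stddef.h>

/* True if ^ can match at pos (0 <= pos <= length). */
int circumflex_matches(const char *subject, size_t length, size_t pos,
                       int multiline)
{
    (void)length;
    if (pos == 0) return 1;                         /* start of subject */
    return multiline && subject[pos - 1] == '\n';   /* just after a newline */
}

/* True if $ can match at pos (0 <= pos <= length). */
int dollar_matches(const char *subject, size_t length, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos == length) return 1;                    /* very end of subject */
    if (multiline) return subject[pos] == '\n';     /* before any newline;
                                                       endonly is ignored */
    if (dollar_endonly) return 0;                   /* only the very end */
    return pos + 1 == length && subject[pos] == '\n';  /* before final newline */
}
```

For the subject "def\nabc", circumflex_matches() accepts position 4 only when multiline is set, which is why /^abc$/ matches that subject in multiline mode but not otherwise.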
1889
1890 Note that the sequences \A, \Z, and \z can be used to match the start
1891 and end of the subject in both modes, and if all branches of a pattern
1892 start with \A it is always anchored, whether PCRE_MULTILINE is set or
1893 not.
1894
1895
1896 FULL STOP (PERIOD, DOT)
1897
1898 Outside a character class, a dot in the pattern matches any one charac-
1899 ter in the subject, including a non-printing character, but not (by
1900 default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
1901 which might be more than one byte long, except (by default) for new-
1902 line. If the PCRE_DOTALL option is set, dots match newlines as well.
1903 The handling of dot is entirely independent of the handling of circum-
1904 flex and dollar, the only relationship being that they both involve
1905 newline characters. Dot has no special meaning in a character class.
1906
1907
1908 MATCHING A SINGLE BYTE
1909
1910 Outside a character class, the escape sequence \C matches any one byte,
1911 both in and out of UTF-8 mode. Unlike a dot, it always matches a new-
1912 line. The feature is provided in Perl in order to match individual
1913 bytes in UTF-8 mode. Because it breaks up UTF-8 characters into indi-
1914 vidual bytes, what remains in the string may be a malformed UTF-8
1915 string. For this reason it is best avoided.
1916
1917 PCRE does not allow \C to appear in lookbehind assertions (see below),
1918 because in UTF-8 mode it makes it impossible to calculate the length of
1919 the lookbehind.
1920
1921
1922 SQUARE BRACKETS
1923
1924 An opening square bracket introduces a character class, terminated by a
1925 closing square bracket. A closing square bracket on its own is not spe-
1926 cial. If a closing square bracket is required as a member of the class,
1927 it should be the first data character in the class (after an initial
1928 circumflex, if present) or escaped with a backslash.
1929
1930 A character class matches a single character in the subject. In UTF-8
1931 mode, the character may occupy more than one byte. A matched character
1932 must be in the set of characters defined by the class, unless the first
1933 character in the class definition is a circumflex, in which case the
1934 subject character must not be in the set defined by the class. If a
1935 circumflex is actually required as a member of the class, ensure it is
1936 not the first character, or escape it with a backslash.
1937
1938 For example, the character class [aeiou] matches any lower case vowel,
1939 while [^aeiou] matches any character that is not a lower case vowel.
1940 Note that a circumflex is just a convenient notation for specifying the
1941 characters which are in the class by enumerating those that are not. It
1942 is not an assertion: it still consumes a character from the subject
1943 string, and fails if the current pointer is at the end of the string.
1944
1945 In UTF-8 mode, characters with values greater than 255 can be included
1946 in a class as a literal string of bytes, or by using the \x{ escaping
1947 mechanism.
1948
1949 When caseless matching is set, any letters in a class represent both
1950 their upper case and lower case versions, so for example, a caseless
1951 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
1952 match "A", whereas a caseful version would. PCRE does not support the
1953 concept of case for characters with values greater than 255.
1954
1955 The newline character is never treated in any special way in character
1956 classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
1957 options is. A class such as [^a] will always match a newline.
1958
1959 The minus (hyphen) character can be used to specify a range of charac-
1960 ters in a character class. For example, [d-m] matches any letter
1961 between d and m, inclusive. If a minus character is required in a
1962 class, it must be escaped with a backslash or appear in a position
1963 where it cannot be interpreted as indicating a range, typically as the
1964 first or last character in the class.
1965
1966 It is not possible to have the literal character "]" as the end charac-
1967 ter of a range. A pattern such as [W-]46] is interpreted as a class of
1968 two characters ("W" and "-") followed by a literal string "46]", so it
1969 would match "W46]" or "-46]". However, if the "]" is escaped with a
1970 backslash it is interpreted as the end of range, so [W-\]46] is inter-
1971 preted as a single class containing a range followed by two separate
1972 characters. The octal or hexadecimal representation of "]" can also be
1973 used to end a range.
1974
1975 Ranges operate in the collating sequence of character values. They can
1976 also be used for characters specified numerically, for example
1977 [\000-\037]. In UTF-8 mode, ranges can include characters whose values
1978 are greater than 255, for example [\x{100}-\x{2ff}].
1979
1980 If a range that includes letters is used when caseless matching is set,
1981 it matches the letters in either case. For example, [W-c] is equivalent
1982 to [][\^_`wxyzabc], matched caselessly, and if character tables for the
1983 "fr" locale are in use, [\xc8-\xcb] matches accented E characters in
1984 both cases.
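
A caseless range test can be sketched by trying the character as given and then in its other case. The sketch uses the C library's case functions in the default "C" locale in place of PCRE's character tables, and expects c to be a value in the range 0 to 255; the function name is hypothetical.

```c
#include <ctype.h>

/* True if c lies in the range lo..hi, or would do so after its case is
   flipped. c, lo, and hi are character codes in the range 0 to 255. */
int in_range_caseless(int c, int lo, int hi)
{
    int other;

    if (c >= lo && c <= hi) return 1;

    if (isupper(c)) other = tolower(c);
    else if (islower(c)) other = toupper(c);
    else return 0;                     /* not a letter: no other case */

    return other >= lo && other <= hi;
}
```

With lo set to 'W' and hi set to 'c', this accepts the letters w, x, y, z, a, b, c in either case together with the characters [ \ ] ^ _ and `, which is the set given in the [W-c] example above.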
1985
1986 The character types \d, \D, \s, \S, \w, and \W may also appear in a
1987 character class, and add the characters that they match to the class.
1988 For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
1989 conveniently be used with the upper case character types to specify a
1990 more restricted set of characters than the matching lower case type.
1991 For example, the class [^\W_] matches any letter or digit, but not
1992 underscore.
1993
1994 All non-alphameric characters other than \, -, ^ (at the start) and the
1995 terminating ] are non-special in character classes, but it does no harm
1996 if they are escaped.
1997
1998
1999 POSIX CHARACTER CLASSES
2000
2001 Perl supports the POSIX notation for character classes, which uses
2002 names enclosed by [: and :] within the enclosing square brackets. PCRE
2003 also supports this notation. For example,
2004
2005 [01[:alpha:]%]
2006
2007 matches "0", "1", any alphabetic character, or "%". The supported class
2008 names are
2009
2010 alnum letters and digits
2011 alpha letters
2012 ascii character codes 0 - 127
2013 blank space or tab only
2014 cntrl control characters
2015 digit decimal digits (same as \d)
2016 graph printing characters, excluding space
2017 lower lower case letters
2018 print printing characters, including space
2019 punct printing characters, excluding letters and digits
2020 space white space (not quite the same as \s)
2021 upper upper case letters
2022 word "word" characters (same as \w)
2023 xdigit hexadecimal digits
2024
2025 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
2026 and space (32). Notice that this list includes the VT character (code
2027 11). This makes "space" different to \s, which does not include VT (for
2028 Perl compatibility).
2029
2030 The name "word" is a Perl extension, and "blank" is a GNU extension
2031 from Perl 5.8. Another Perl extension is negation, which is indicated
2032 by a ^ character after the colon. For example,
2033
2034 [12[:^digit:]]
2035
2036 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
2037 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
2038 these are not supported, and an error is given if they are encountered.
2039
2040 In UTF-8 mode, characters with values greater than 255 do not match any
2041 of the POSIX character classes.
2042
2043
2044 VERTICAL BAR
2045
2046 Vertical bar characters are used to separate alternative patterns. For
2047 example, the pattern
2048
2049 gilbert|sullivan
2050
2051 matches either "gilbert" or "sullivan". Any number of alternatives may
2052 appear, and an empty alternative is permitted (matching the empty
2053 string). The matching process tries each alternative in turn, from
2054 left to right, and the first one that succeeds is used. If the alterna-
2055 tives are within a subpattern (defined below), "succeeds" means match-
2056 ing the rest of the main pattern as well as the alternative in the sub-
2057 pattern.
2058
2059
2060 INTERNAL OPTION SETTING
2061
2062 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
2063 PCRE_EXTENDED options can be changed from within the pattern by a
2064 sequence of Perl option letters enclosed between "(?" and ")". The
2065 option letters are
2066
2067 i for PCRE_CASELESS
2068 m for PCRE_MULTILINE
2069 s for PCRE_DOTALL
2070 x for PCRE_EXTENDED
2071
2072 For example, (?im) sets caseless, multiline matching. It is also possi-
2073 ble to unset these options by preceding the letter with a hyphen, and a
2074 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
2075 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
2076 is also permitted. If a letter appears both before and after the
2077 hyphen, the option is unset.
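
The effect of a sequence of option letters can be sketched as follows. The OPT_ constants are hypothetical stand-ins for the PCRE_ option bits, and the function covers only the four Perl option letters. Letters before the hyphen are collected in one mask and letters after it in another, and the unset mask is applied last, so a letter that appears on both sides ends up unset.

```c
/* Hypothetical stand-ins for the corresponding PCRE_ option bits. */
enum {
    OPT_CASELESS = 1,
    OPT_MULTILINE = 2,
    OPT_DOTALL = 4,
    OPT_EXTENDED = 8
};

/* letters is the text between "(?" and ")", for example "im-sx".
   Returns the updated option bits, or -1 for an unrecognized letter. */
int apply_option_letters(const char *letters, int options)
{
    int set = 0;
    int unset = 0;
    int *target = &set;

    for (; *letters != '\0'; letters++) {
        switch (*letters) {
        case '-': target = &unset; break;   /* following letters unset */
        case 'i': *target |= OPT_CASELESS; break;
        case 'm': *target |= OPT_MULTILINE; break;
        case 's': *target |= OPT_DOTALL; break;
        case 'x': *target |= OPT_EXTENDED; break;
        default: return -1;
        }
    }
    return (options | set) & ~unset;        /* unsetting wins */
}
```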
2078
2079 When an option change occurs at top level (that is, not inside subpat-
2080 tern parentheses), the change applies to the remainder of the pattern
2081 that follows. If the change is placed right at the start of a pattern,
2082 PCRE extracts it into the global options (and it will therefore show up
2083 in data extracted by the pcre_fullinfo() function).
2084
2085 An option change within a subpattern affects only that part of the cur-
2086 rent pattern that follows it, so
2087
2088 (a(?i)b)c
2089
2090 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
2091 used). By this means, options can be made to have different settings
2092 in different parts of the pattern. Any changes made in one alternative
2093 do carry on into subsequent branches within the same subpattern. For
2094 example,
2095
2096 (a(?i)b|c)
2097
2098 matches "ab", "aB", "c", and "C", even though when matching "C" the
2099 first branch is abandoned before the option setting. This is because
2100 the effects of option settings happen at compile time. There would be
2101 some very weird behaviour otherwise.
2102
2103 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
2104 in the same way as the Perl-compatible options by using the characters
2105 U and X respectively. The (?X) flag setting is special in that it must
2106 always occur earlier in the pattern than any of the additional features
2107 it turns on, even when it is at top level. It is best put at the start.
2108
2109
2110 SUBPATTERNS
2111
2112 Subpatterns are delimited by parentheses (round brackets), which can be
2113 nested. Marking part of a pattern as a subpattern does two things:
2114
2115 1. It localizes a set of alternatives. For example, the pattern
2116
2117 cat(aract|erpillar|)
2118
2119 matches one of the words "cat", "cataract", or "caterpillar". Without
2120 the parentheses, it would match "cataract", "erpillar" or the empty
2121 string.
2122
2123 2. It sets up the subpattern as a capturing subpattern (as defined
2124 above). When the whole pattern matches, that portion of the subject
2125 string that matched the subpattern is passed back to the caller via the
2126 ovector argument of pcre_exec(). Opening parentheses are counted from
2127 left to right (starting from 1) to obtain the numbers of the capturing
2128 subpatterns.
2129
2130 For example, if the string "the red king" is matched against the pat-
2131 tern
2132
2133 the ((red|white) (king|queen))
2134
2135 the captured substrings are "red king", "red", and "king", and are num-
2136 bered 1, 2, and 3, respectively.
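
Both effects of plain parentheses can be checked with a small sketch
in Python's re module, assumed here to number and group in the same way
as PCRE for these patterns. The word lists are made up for the test.

```python
import re

# 1. Parentheses localize a set of alternatives.
grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

words = ("cat", "cataract", "caterpillar", "erpillar", "")
grouped_words = [w for w in words if grouped.fullmatch(w)]
ungrouped_words = [w for w in words if ungrouped.fullmatch(w)]

# 2. Opening parentheses are counted from left to right, starting at 1.
m = re.match(r"the ((red|white) (king|queen))", "the red king")
captures = m.groups()
```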
2137
2138 The fact that plain parentheses fulfil two functions is not always
2139 helpful. There are often times when a grouping subpattern is required
2140 without a capturing requirement. If an opening parenthesis is followed
2141 by a question mark and a colon, the subpattern does not do any captur-
2142 ing, and is not counted when computing the number of any subsequent
2143 capturing subpatterns. For example, if the string "the white queen" is
2144 matched against the pattern
2145
2146 the ((?:red|white) (king|queen))
2147
2148 the captured substrings are "white queen" and "queen", and are numbered
2149 1 and 2. The maximum number of capturing subpatterns is 65535, and the
2150 maximum depth of nesting of all subpatterns, both capturing and non-
2151 capturing, is 200.
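
The effect of (?: on the numbering can be seen in a brief sketch, again
using Python's re module as a stand-in that shares this syntax.

```python
import re

m = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# The (?:red|white) group is not counted, so only two substrings
# are captured.
captures = m.groups()
group_count = m.re.groups
```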
2152
2153 As a convenient shorthand, if any option settings are required at the
2154 start of a non-capturing subpattern, the option letters may appear
2155 between the "?" and the ":". Thus the two patterns
2156
2157 (?i:saturday|sunday)
2158 (?:(?i)saturday|sunday)
2159
2160 match exactly the same set of strings. Because alternative branches are
2161 tried from left to right, and options are not reset until the end of
2162 the subpattern is reached, an option setting in one branch does affect
2163 subsequent branches, so the above patterns match "SUNDAY" as well as
2164 "Saturday".
2165
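The shorthand form can be exercised in Python's re module, which
accepts (?i:...) as well. Only the shorthand is shown because Python
treats an option change in the middle of a pattern differently from
PCRE. The day names tested are arbitrary.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option applies to every branch inside the group.
results = {day: weekend.fullmatch(day) is not None
           for day in ("Saturday", "SUNDAY", "sunday", "monday")}

# The option stops applying at the closing parenthesis, so "day" here
# is still matched casefully.
caseless_outside_group = re.fullmatch(r"(?i:sun)day", "SUNDAY") is not None
```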
2166
2167 NAMED SUBPATTERNS
2168
2169 Identifying capturing parentheses by number is simple, but it can be
2170 very hard to keep track of the numbers in complicated regular expres-
2171 sions. Furthermore, if an expression is modified, the numbers may
2172 change. To help with the difficulty, PCRE supports the naming of sub-
2173 patterns, something that Perl does not provide. The Python syntax
2174 (?P<name>...) is used. Names consist of alphanumeric characters and
2175 underscores, and must be unique within a pattern.
2176
2177 Named capturing parentheses are still allocated numbers as well as
2178 names. The PCRE API provides function calls for extracting the name-to-
2179 number translation table from a compiled pattern. For further details
2180 see the pcreapi documentation.
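
Because the naming syntax is borrowed from Python, the idea of a
name-to-number table can be illustrated there. The groupindex
attribute below belongs to Python's re module and is only an analogue
of the table that the PCRE API exposes. The date pattern is invented.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named groups still receive numbers, counted from the left.
name_to_number = dict(date.groupindex)

m = date.fullmatch("2007-02-24")
by_name = m.group("month")
by_number = m.group(name_to_number["month"])
```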
2181
2182
2183 REPETITION
2184
2185 Repetition is specified by quantifiers, which can follow any of the
2186 following items:
2187
2188 a literal data character
2189 the . metacharacter
2190 the \C escape sequence
2191 escapes such as \d that match single characters
2192 a character class
2193 a back reference (see next section)
2194 a parenthesized subpattern (unless it is an assertion)
2195
2196 The general repetition quantifier specifies a minimum and maximum num-
2197 ber of permitted matches, by giving the two numbers in curly brackets
2198 (braces), separated by a comma. The numbers must be less than 65536,
2199 and the first must be less than or equal to the second. For example:
2200
2201 z{2,4}
2202
2203 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
2204 special character. If the second number is omitted, but the comma is
2205 present, there is no upper limit; if the second number and the comma
2206 are both omitted, the quantifier specifies an exact number of required
2207 matches. Thus
2208
2209 [aeiou]{3,}
2210
2211 matches at least 3 successive vowels, but may match many more, while
2212
2213 \d{8}
2214
2215 matches exactly 8 digits. An opening curly bracket that appears in a
2216 position where a quantifier is not allowed, or one that does not match
2217 the syntax of a quantifier, is taken as a literal character. For exam-
2218 ple, {,6} is not a quantifier, but a literal string of four characters.
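
The three forms of the general quantifier can be checked with a short
sketch in Python's re module. The {,6} case is left out on purpose,
because Python does treat it as a quantifier, unlike the behaviour
described here. The helper name full is hypothetical.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repeats.
z_counts = [n for n in range(7) if full(r"z{2,4}", "z" * n)]

# {3,}: at least three, with no upper limit.
vowel_results = [full(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]

# {8}: exactly eight.
digit_results = [full(r"\d{8}", s) for s in ("2007022", "20070224", "200702240")]
```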
2219
2220 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
2221 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
2222 acters, each of which is represented by a two-byte sequence.
2223
2224 The quantifier {0} is permitted, causing the expression to behave as if
2225 the previous item and the quantifier were not present.
2226
2227 For convenience (and historical compatibility) the three most common
2228 quantifiers have single-character abbreviations:
2229
2230 * is equivalent to {0,}
2231 + is equivalent to {1,}
2232 ? is equivalent to {0,1}
2233
2234 It is possible to construct infinite loops by following a subpattern
2235 that can match no characters with a quantifier that has no upper limit,
2236 for example:
2237
2238 (a?)*
2239
2240 Earlier versions of Perl and PCRE used to give an error at compile time
2241 for such patterns. However, because there are cases where this can be
2242 useful, such patterns are now accepted, but if any repetition of the
2243 subpattern does in fact match no characters, the loop is forcibly bro-
2244 ken.
2245
2246 By default, the quantifiers are "greedy", that is, they match as much
2247 as possible (up to the maximum number of permitted times), without
2248 causing the rest of the pattern to fail. The classic example of where
2249 this gives problems is in trying to match comments in C programs. These
2250 appear between the sequences /* and */ and within the sequence, indi-
2251 vidual * and / characters may appear. An attempt to match C comments by
2252 applying the pattern
2253
2254 /\*.*\*/
2255
2256 to the string
2257
2258 /* first comment */ not comment /* second comment */
2259
2260 fails, because it matches the entire string owing to the greediness of
2261 the .* item.
2262
2263 However, if a quantifier is followed by a question mark, it ceases to
2264 be greedy, and instead matches the minimum number of times possible, so
2265 the pattern
2266
2267 /\*.*?\*/
2268
2269 does the right thing with the C comments. The meaning of the various
2270 quantifiers is not otherwise changed, just the preferred number of
2271 matches. Do not confuse this use of question mark with its use as a
2272 quantifier in its own right. Because it has two uses, it can sometimes
2273 appear doubled, as in
2274
2275 \d??\d
2276
2277 which matches one digit by preference, but can match two if that is the
2278 only way the rest of the pattern matches.
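
The difference between greedy and minimizing quantifiers can be seen
in a small sketch. It uses Python's re module, which follows the same
convention for a question mark after a quantifier. The subject string
is a made-up line of C.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing: .*? stops at the first */ it can.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two if the rest of the match
# requires it (here, fullmatch demands the end of the subject).
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```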
2279
2280 If the PCRE_UNGREEDY option is set (an option which is not available in
2281 Perl), the quantifiers are not greedy by default, but individual ones
2282 can be made greedy by following them with a question mark. In other
2283 words, it inverts the default behaviour.
2284
2285 When a parenthesized subpattern is quantified with a minimum repeat
2286 count that is greater than 1 or with a limited maximum, more store is
2287 required for the compiled pattern, in proportion to the size of the
2288 minimum or maximum.
2289
2290 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
2291 alent to Perl's /s) is set, thus allowing the . to match newlines, the
2292 pattern is implicitly anchored, because whatever follows will be tried
2293 against every character position in the subject string, so there is no
2294 point in retrying the overall match at any position after the first.
2295 PCRE normally treats such a pattern as though it were preceded by \A.
2296
2297 In cases where it is known that the subject string contains no new-
2298 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
2299 mization, or alternatively using ^ to indicate anchoring explicitly.
2300
2301 However, there is one situation where the optimization cannot be used.
2302 When .* is inside capturing parentheses that are the subject of a
2303 backreference elsewhere in the pattern, a match at the start may fail,
2304 and a later one succeed. Consider, for example:
2305
2306 (.*)abc\1
2307
2308 If the subject is "xyz123abc123" the match point is the fourth charac-
2309 ter. For this reason, such a pattern is not implicitly anchored.
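
The match point in this example can be confirmed with Python's re
module, used here only as a convenient engine with the same back
reference behaviour. Offsets in Python are counted from zero.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because \1 cannot match;
# the match begins at offset 3, the fourth character.
match_start = m.start()
captured = m.group(1)
whole = m.group()
```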
2310
2311 When a capturing subpattern is repeated, the value captured is the sub-
2312 string that matched the final iteration. For example, after
2313
2314 (tweedle[dume]{3}\s*)+
2315
2316 has matched "tweedledum tweedledee" the value of the captured substring
2317 is "tweedledee". However, if there are nested capturing subpatterns,
2318 the corresponding captured values may have been set in previous itera-
2319 tions. For example, after
2320
2321 /(a|(b))+/
2322
2323 matches "aba" the value of the second captured substring is "b".
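
Both examples can be reproduced in Python's re module, which is assumed
here to keep captured values across iterations in the same way.

```python
import re

# The value captured is the one from the final iteration.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# Group 2 took part only in the second iteration, and its value is
# still set after the third.
m2 = re.match(r"(a|(b))+", "aba")
nested = m2.groups()
```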
2324
2325
2326 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2327
2328 With both maximizing and minimizing repetition, failure of what follows
2329 normally causes the repeated item to be re-evaluated to see if a dif-
2330 ferent number of repeats allows the rest of the pattern to match. Some-
2331 times it is useful to prevent this, either to change the nature of the
2332 match, or to cause it to fail earlier than it otherwise might, when the
2333 author of the pattern knows there is no point in carrying on.
2334
2335 Consider, for example, the pattern \d+foo when applied to the subject
2336 line
2337
2338 123456bar
2339
2340 After matching all 6 digits and then failing to match "foo", the normal
2341 action of the matcher is to try again with only 5 digits matching the
2342 \d+ item, and then with 4, and so on, before ultimately failing.
2343 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
2344 the means for specifying that once a subpattern has matched, it is not
2345 to be re-evaluated in this way.
2346
2347 If we use atomic grouping for the previous example, the matcher would
2348 give up immediately on failing to match "foo" the first time. The nota-
2349 tion is a kind of special parenthesis, starting with (?> as in this
2350 example:
2351
2352 (?>\d+)foo
2353
2354 This kind of parenthesis "locks up" the part of the pattern it con-
2355 tains once it has matched, and a failure further into the pattern is
2356 prevented from backtracking into it. Backtracking past it to previous
2357 items, however, works as normal.
2358
2359 An alternative description is that a subpattern of this type matches
2360 the string of characters that an identical standalone pattern would
2361 match, if anchored at the current point in the subject string.
2362
2363 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
2364 such as the above example can be thought of as a maximizing repeat that
2365 must swallow everything it can. So, while both \d+ and \d+? are pre-
2366 pared to adjust the number of digits they match in order to make the
2367 rest of the pattern match, (?>\d+) can only match an entire sequence of
2368 digits.
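
The "locking up" effect can be sketched in Python, with one caveat:
Python's re module accepts (?>...) only from version 3.11, so the
sketch substitutes a common emulation, a lookahead that captures the
digits followed by a back reference to them. The emulation is an
assumption of this example, not PCRE syntax. The final 6 is written as
[6] so that it is not read as part of the back reference number.

```python
import re

subject = "123456"

# Ordinary greedy repeat: \d+ first takes all six digits, then gives
# back the final "6" so that the literal 6 can match.
backtracking = re.match(r"\d+6", subject)

# Emulated (?>\d+)6: the lookahead captures the whole run of digits
# and is not re-entered, and \1 consumes exactly that run, so nothing
# can be given back and the match fails.
atomic_like = re.match(r"(?=(\d+))\1[6]", subject)
```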
2369
2370 Atomic groups in general can of course contain arbitrarily complicated
2371 subpatterns, and can be nested. However, when the subpattern for an
2372 atomic group is just a single repeated item, as in the example above, a
2373 simpler notation, called a "possessive quantifier" can be used. This
2374 consists of an additional + character following a quantifier. Using
2375 this notation, the previous example can be rewritten as
2376
2377 \d++foo
2378
2379 Possessive quantifiers are always greedy; the setting of the
2380 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
2381 simpler forms of atomic group. However, there is no difference in the
2382 meaning or processing of a possessive quantifier and the equivalent
2383 atomic group.
2384
2385 The possessive quantifier syntax is an extension to the Perl syntax. It
2386 originates in Sun's Java package.
2387
2388 When a pattern contains an unlimited repeat inside a subpattern that
2389 can itself be repeated an unlimited number of times, the use of an
2390 atomic group is the only way to avoid some failing matches taking a
2391 very long time indeed. The pattern
2392
2393 (\D+|<\d+>)*[!?]
2394
2395 matches an unlimited number of substrings that either consist of non-
2396 digits, or digits enclosed in <>, followed by either ! or ?. When it
2397 matches, it runs quickly. However, if it is applied to
2398
2399 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2400
2401 it takes a long time before reporting failure. This is because the
2402 string can be divided between the two repeats in a large number of
2403 ways, and all have to be tried. (The example used [!?] rather than a
2404 single character at the end, because both PCRE and Perl have an opti-
2405 mization that allows for fast failure when a single character is used.
2406 They remember the last single character that is required for a match,
2407 and fail early if it is not present in the string.) If the pattern is
2408 changed to
2409
2410 ((?>\D+)|<\d+>)*[!?]
2411
2412 sequences of non-digits cannot be broken, and failure happens quickly.
2413
2414
2415 BACK REFERENCES
2416
2417 Outside a character class, a backslash followed by a digit greater than
2418 0 (and possibly further digits) is a back reference to a capturing sub-
2419 pattern earlier (that is, to its left) in the pattern, provided there
2420 have been that many previous capturing left parentheses.
2421
2422 However, if the decimal number following the backslash is less than 10,
2423 it is always taken as a back reference, and causes an error only if
2424 there are not that many capturing left parentheses in the entire pat-
2425 tern. In other words, the parentheses that are referenced need not be
2426 to the left of the reference for numbers less than 10. See the section
2427 entitled "Backslash" above for further details of the handling of dig-
2428 its following a backslash.
2429
2430 A back reference matches whatever actually matched the capturing sub-
2431 pattern in the current subject string, rather than anything matching
2432 the subpattern itself (see "Subpatterns as subroutines" below for a way
2433 of doing that). So the pattern
2434
2435 (sens|respons)e and \1ibility
2436
2437 matches "sense and sensibility" and "response and responsibility", but
2438 not "sense and responsibility". If caseful matching is in force at the
2439 time of the back reference, the case of letters is relevant. For exam-
2440 ple,
2441
2442 ((?i)rah)\s+\1
2443
2444 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
2445 original capturing subpattern is matched caselessly.
2446
2447 Back references to named subpatterns use the Python syntax (?P=name).
2448 We could rewrite the above example as follows:
2449
2450 (?P<p1>(?i)rah)\s+(?P=p1)
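
Numbered and named back references can be compared in a short sketch.
Python's re module is used because it shares the (?P<name>...) and
(?P=name) syntax. The group name stem is hypothetical.

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

phrases = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)

# A back reference repeats the text that was matched, not the
# subpattern, so the mixed phrase is rejected by both forms.
numbered_hits = [p for p in phrases if numbered.fullmatch(p)]
named_hits = [p for p in phrases if named.fullmatch(p)]
```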
2451
2452 There may be more than one back reference to the same subpattern. If a
2453 subpattern has not actually been used in a particular match, any back
2454 references to it always fail. For example, the pattern
2455
2456 (a|(bc))\2
2457
2458 always fails if it starts to match "a" rather than "bc". Because there
2459 may be many capturing parentheses in a pattern, all digits following
2460 the backslash are taken as part of a potential back reference number.
2461 If the pattern continues with a digit character, some delimiter must be
2462 used to terminate the back reference. If the PCRE_EXTENDED option is
2463 set, this can be whitespace. Otherwise an empty comment can be used.
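
A sketch of both points, using Python's re module as a stand-in engine.
The digit pattern in the second half is invented to show an empty
comment used as a delimiter.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set only when the "bc" branch is taken.
via_bc = pattern.fullmatch("bcbc") is not None
via_a = pattern.search("aa") is not None

# An empty comment ends the back reference number, so the pattern is
# read as \1 followed by a literal 0 rather than as \10.
delimited = re.fullmatch(r"(\d)\1(?#)0", "770") is not None
```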
2464
2465 A back reference that occurs inside the parentheses to which it refers
2466 fails when the subpattern is first used, so, for example, (a\1) never
2467 matches. However, such references can be useful inside repeated sub-
2468 patterns. For example, the pattern
2469
2470 (a|b\1)+
2471
2472 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
2473 ation of the subpattern, the back reference matches the character
2474 string corresponding to the previous iteration. In order for this to
2475 work, the pattern must be such that the first iteration does not need
2476 to match the back reference. This can be done using alternation, as in
2477 the example above, or by a quantifier with a minimum of zero.
2478
2479
2480 ASSERTIONS
2481
2482 An assertion is a test on the characters following or preceding the
2483 current matching point that does not actually consume any characters.
2484 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
2485 described above. More complicated assertions are coded as subpatterns.
2486 There are two kinds: those that look ahead of the current position in
2487 the subject string, and those that look behind it.
2488
2489 An assertion subpattern is matched in the normal way, except that it
2490 does not cause the current matching position to be changed. Lookahead
2491 assertions start with (?= for positive assertions and (?! for negative
2492 assertions. For example,
2493
2494 \w+(?=;)
2495
2496 matches a word followed by a semicolon, but does not include the semi-
2497 colon in the match, and
2498
2499 foo(?!bar)
2500
2501 matches any occurrence of "foo" that is not followed by "bar". Note
2502 that the apparently similar pattern
2503
2504 (?!foo)bar
2505
2506 does not find an occurrence of "bar" that is preceded by something
2507 other than "foo"; it finds any occurrence of "bar" whatsoever, because
2508 the assertion (?!foo) is always true when the next three characters are
2509 "bar". A lookbehind assertion is needed to achieve this effect.
2510
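The three lookahead patterns can be tried out in Python's re module,
which uses the same (?= and (?! syntax. The subject strings are made up.

```python
import re

# The semicolon is required but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "alpha beta; gamma").group()

# Only the "foo" that is not followed by "bar" is found.
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" inside "foobar": at that point the next
# three characters are "bar", so they cannot be "foo".
still_matches = re.search(r"(?!foo)bar", "foobar")
```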
2511 If you want to force a matching failure at some point in a pattern, the
2512 most convenient way to do it is with (?!) because an empty string
2513 always matches, so an assertion that requires there not to be an empty
2514 string must always fail.
2515
2516 Lookbehind assertions start with (?<= for positive assertions and (?<!
2517 for negative assertions. For example,
2518
2519 (?<!foo)bar
2520
2521 does find an occurrence of "bar" that is not preceded by "foo". The
2522 contents of a lookbehind assertion are restricted such that all the
2523 strings it matches must have a fixed length. However, if there are sev-
2524 eral alternatives, they do not all have to have the same fixed length.
2525 Thus
2526
2527 (?<=bullock|donkey)
2528
2529 is permitted, but
2530
2531 (?<!dogs?|cats?)
2532
2533 causes an error at compile time. Branches that match different length
2534 strings are permitted only at the top level of a lookbehind assertion.
2535 This is an extension compared with Perl (at least for 5.8), which
2536 requires all branches to match the same length of string. An assertion
2537 such as
2538
2539 (?<=ab(c|de))
2540
2541 is not permitted, because its single top-level branch can match two
2542 different lengths, but it is acceptable if rewritten to use two top-
2543 level branches:
2544
2545 (?<=abc|abde)
2546
2547 The implementation of lookbehind assertions is, for each alternative,
2548 to temporarily move the current position back by the fixed width and
2549 then try to match. If there are insufficient characters before the cur-
2550 rent position, the match is deemed to fail.
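
Lookbehind can be sketched in Python's re module with one difference
to keep in mind: Python is stricter than PCRE and rejects alternatives
of different lengths inside a single lookbehind, so the sketch writes
(?<=bullock|donkey) as two separate assertions. Subject strings and
variable names are invented.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")
starts = [m.start() for m in not_after_foo.finditer("foobar bazbar bar")]

# With too few characters before the current position the lookbehind
# itself cannot match, so a negative assertion is satisfied ...
at_subject_start = not_after_foo.match("bar") is not None
# ... and a positive one is not.
positive_fails_at_start = re.match(r"(?<=foo)bar", "bar") is None

# Two fixed-length alternatives, expressed as two assertions.
after_animal = re.compile(r"(?:(?<=bullock)|(?<=donkey)) cart")
animal_hits = [s for s in ("bullock cart", "donkey cart", "hand cart")
               if after_animal.search(s)]
```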
2551
2552 PCRE does not allow the \C escape (which matches a single byte in UTF-8
2553 mode) to appear in lookbehind assertions, because it makes it impossi-
2554 ble to calculate the length of the lookbehind.
2555
2556 Atomic groups can be used in conjunction with lookbehind assertions to
2557 specify efficient matching at the end of the subject string. Consider a
2558 simple pattern such as
2559
2560 abcd$
2561
2562 when applied to a long string that does not match. Because matching
2563 proceeds from left to right, PCRE will look for each "a" in the subject
2564 and then see if what follows matches the rest of the pattern. If the
2565 pattern is specified as
2566
2567 ^.*abcd$
2568
2569 the initial .* matches the entire string at first, but when this fails
2570 (because there is no following "a"), it backtracks to match all but the
2571 last character, then all but the last two characters, and so on. Once
2572 again the search for "a" covers the entire string, from right to left,
2573 so we are no better off. However, if the pattern is written as
2574
2575 ^(?>.*)(?<=abcd)
2576
2577 or, equivalently,
2578
2579 ^.*+(?<=abcd)
2580
2581 there can be no backtracking for the .* item; it can match only the
2582 entire string. The subsequent lookbehind assertion does a single test
2583 on the last four characters. If it fails, the match fails immediately.
2584 For long strings, this approach makes a significant difference to the
2585 processing time.
2586
2587 Several assertions (of any sort) may occur in succession. For example,
2588
2589 (?<=\d{3})(?<!999)foo
2590
2591 matches "foo" preceded by three digits that are not "999". Notice that
2592 each of the assertions is applied independently at the same point in
2593 the subject string. First there is a check that the previous three
2594 characters are all digits, and then there is a check that the same
2595 three characters are not "999". This pattern does not match "foo" pre-
2596 ceded by six characters, the first three of which are digits and the
2597 last three of which are not "999". For example, it does not match
2598 "123abcfoo". A pattern to do that is
2599
2600 (?<=\d{3}...)(?<!999)foo
2601
2602 This time the first assertion looks at the preceding six characters,
2603 checking that the first three are digits, and then the second assertion
2604 checks that the preceding three characters are not "999".
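
The difference between the two patterns can be checked with a small
sketch in Python's re module, which accepts both because every
lookbehind here has a fixed length. The subject strings are invented.

```python
import re

three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")

# Both assertions are applied at the same point, just before "foo".
three_back_hits = [s for s in subjects if three_back.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]
```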
2605
2606 Assertions can be nested in any combination. For example,
2607
2608 (?<=(?<!foo)bar)baz
2609
2610 matches an occurrence of "baz" that is preceded by "bar" which in turn
2611 is not preceded by "foo", while
2612
2613 (?<=\d{3}(?!999)...)foo
2614
2615 is another pattern which matches "foo" preceded by three digits and any
2616 three characters that are not "999".
2617
2618 Assertion subpatterns are not capturing subpatterns, and may not be
2619 repeated, because it makes no sense to assert the same thing several
2620 times. If any kind of assertion contains capturing subpatterns within
2621 it, these are counted for the purposes of numbering the capturing sub-
2622 patterns in the whole pattern. However, substring capturing is carried
2623 out only for positive assertions, because it does not make sense for
2624 negative assertions.
2625
2626
2627 CONDITIONAL SUBPATTERNS
2628
2629 It is possible to cause the matching process to obey a subpattern con-
2630 ditionally or to choose between two alternative subpatterns, depending
2631 on the result of an assertion, or whether a previous capturing
2632 subpattern matched or not. The two possible forms of conditional sub-
2633 pattern are
2634
2635 (?(condition)yes-pattern)
2636 (?(condition)yes-pattern|no-pattern)
2637
2638 If the condition is satisfied, the yes-pattern is used; otherwise the
2639 no-pattern (if present) is used. If there are more than two alterna-
2640 tives in the subpattern, a compile-time error occurs.
2641
2642 There are three kinds of condition. If the text between the parentheses
2643 consists of a sequence of digits, the condition is satisfied if the
2644 capturing subpattern of that number has previously matched. The number
2645 must be greater than zero. Consider the following pattern, which con-
2646 tains non-significant white space to make it more readable (assume the
2647 PCRE_EXTENDED option) and to divide it into three parts for ease of
2648 discussion:
2649
2650 ( \( )? [^()]+ (?(1) \) )
2651
2652 The first part matches an optional opening parenthesis, and if that
2653 character is present, sets it as the first captured substring. The sec-
2654 ond part matches one or more characters that are not parentheses. The
2655 third part is a conditional subpattern that tests whether the first set
2656 of parentheses matched or not. If they did, that is, if the subject started
2657 with an opening parenthesis, the condition is true, and so the yes-pat-
2658 tern is executed and a closing parenthesis is required. Otherwise,
2659 since no-pattern is not present, the subpattern matches nothing. In
2660 other words, this pattern matches a sequence of non-parentheses,
2661 optionally enclosed in parentheses.
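
The numbered form of the condition is also accepted by Python's re
module, so the pattern can be tried there as a sketch. re.VERBOSE
plays the part of PCRE_EXTENDED. The subject strings are made up.

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# A closing parenthesis is required only if the opening one matched.
results = {s: optional_parens.fullmatch(s) is not None
           for s in ("abc", "(abc)", "(abc", "abc)")}
```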
2662
2663 If the condition is the string (R), it is satisfied if a recursive call
2664 to the pattern or subpattern has been made. At "top level", the condi-
2665 tion is false. This is a PCRE extension. Recursive patterns are
2666 described in the next section.
2667
2668 If the condition is not a sequence of digits or (R), it must be an
2669 assertion. This may be a positive or negative lookahead or lookbehind
2670 assertion. Consider this pattern, again containing non-significant
2671 white space, and with the two alternatives on the second line:
2672
2673 (?(?=[^a-z]*[a-z])
2674 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2675
2676 The condition is a positive lookahead assertion that matches an
2677 optional sequence of non-letters followed by a letter. In other words,
2678 it tests for the presence of at least one letter in the subject. If a
2679 letter is found, the subject is matched against the first alternative;
2680 otherwise it is matched against the second. This pattern matches
2681 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2682 letters and dd are digits.
2683
2684
2685 COMMENTS
2686
2687 The sequence (?# marks the start of a comment which continues up to the
2688 next closing parenthesis. Nested parentheses are not permitted. The
2689 characters that make up a comment play no part in the pattern matching
2690 at all.
2691
2692 If the PCRE_EXTENDED option is set, an unescaped # character outside a
2693 character class introduces a comment that continues up to the next new-
2694 line character in the pattern.
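
Both comment styles exist in Python's re module as well, with
re.VERBOSE standing in for PCRE_EXTENDED, so a brief sketch is
possible. The year-and-month pattern is invented.

```python
import re

# (?# ... ) comments take no part in the matching.
inline = re.compile(r"\d{4}(?# four-digit year)-\d{2}(?# two-digit month)")

# With the extended option, # starts a comment that runs to the end
# of the line.
extended = re.compile(r"""
    \d{4}   # four-digit year
    -
    \d{2}   # two-digit month
""", re.VERBOSE)

inline_ok = inline.fullmatch("2007-02") is not None
extended_ok = extended.fullmatch("2007-02") is not None
```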
2695
2696
2697 RECURSIVE PATTERNS
2698
2699 Consider the problem of matching a string in parentheses, allowing for
2700 unlimited nested parentheses. Without the use of recursion, the best
2701 that can be done is to use a pattern that matches up to some fixed
2702 depth of nesting. It is not possible to handle an arbitrary nesting
2703 depth. Perl has provided an experimental facility that allows regular
2704 expressions to recurse (amongst other things). It does this by interpo-
2705 lating Perl code in the expression at run time, and the code can refer
2706 to the expression itself. A Perl pattern to solve the parentheses prob-
2707 lem can be created like this:
2708
2709 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2710
2711 The (?p{...}) item interpolates Perl code at run time, and in this case
2712 refers recursively to the pattern in which it appears. Obviously, PCRE
2713 cannot support the interpolation of Perl code. Instead, it supports
2714 some special syntax for recursion of the entire pattern, and also for
2715 individual subpattern recursion.
2716
2717 The special item that consists of (? followed by a number greater than
2718 zero and a closing parenthesis is a recursive call of the subpattern of
2719 the given number, provided that it occurs inside that subpattern. (If
2720 not, it is a "subroutine" call, which is described in the next sec-
2721 tion.) The special item (?R) is a recursive call of the entire regular
2722 expression.
2723
2724 For example, this PCRE pattern solves the nested parentheses problem
2725 (assume the PCRE_EXTENDED option is set so that white space is
2726 ignored):
2727
2728 \( ( (?>[^()]+) | (?R) )* \)
2729
2730 First it matches an opening parenthesis. Then it matches any number of
2731 substrings which can either be a sequence of non-parentheses, or a
2732 recursive match of the pattern itself (that is, a correctly parenthe-
2733 sized substring). Finally there is a closing parenthesis.
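
Python's re module has no recursion syntax, so the following sketch
does not run the pattern itself. Instead it is a hand-written function
that follows the same three steps, to show what the recursive call
amounts to. The function name match_nested is hypothetical.

```python
def match_nested(subject, pos=0):
    """Return the offset just past a balanced (...) starting at pos, or None."""
    # \(  an opening parenthesis
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    # ( (?>[^()]+) | (?R) )*  any number of the two kinds of substring
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # \)  the closing parenthesis
            return pos + 1
        if ch == "(":
            # (?R): a correctly parenthesized substring
            end = match_nested(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # (?>[^()]+): swallow the whole run of non-parentheses
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None  # the subject ended before the closing parenthesis
```

For instance, match_nested("(ab(cd)ef)") returns 10, the length of the
subject, while an unbalanced subject such as "(a(b)" gives None.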
2734
2735 If this were part of a larger pattern, you would not want to recurse
2736 the entire pattern, so instead you could use this:
2737
2738 ( \( ( (?>[^()]+) | (?1) )* \) )
2739
2740 We have put the pattern into parentheses, and caused the recursion to
2741 refer to them instead of the whole pattern. In a larger pattern, keep-
2742 ing track of parenthesis numbers can be tricky. It may be more conve-
2743 nient to use named parentheses instead. For this, PCRE uses (?P>name),
2744 which is an extension to the Python syntax that PCRE uses for named
2745 parentheses (Perl does not provide named parentheses). We could rewrite
2746 the above example as follows:
2747
2748 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
2749
2750 This particular example pattern contains nested unlimited repeats, and
2751 so the use of atomic grouping for matching strings of non-parentheses
2752 is important when applying the pattern to strings that do not match.
2753 For example, when this pattern is applied to
2754
2755 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2756
2757 it yields "no match" quickly. However, if atomic grouping is not used,
2758 the match runs for a very long time indeed because there are so many
2759 different ways the + and * repeats can carve up the subject, and all
2760 have to be tested before failure can be reported.
2761
2762 At the end of a match, the values set for any capturing subpatterns are
2763 those from the outermost level of the recursion at which the subpattern
2764 value is set. If you want to obtain intermediate values, a callout
2765 function can be used (see below and the pcrecallout documentation). If
2766 the pattern above is matched against
2767
2768 (ab(cd)ef)
2769
2770 the value for the capturing parentheses is "ef", which is the last
2771 value taken on at the top level. If additional parentheses are added,
2772 giving
2773
2774 \( ( ( (?>[^()]+) | (?R) )* ) \)
2775    ^                        ^
2776    ^                        ^
2777
2778 the string they capture is "ab(cd)ef", the contents of the top level
2779 parentheses. If there are more than 15 capturing parentheses in a pat-
2780 tern, PCRE has to obtain extra memory to store data during a recursion,
2781 which it does by using pcre_malloc, freeing it via pcre_free after-
2782 wards. If no memory can be obtained, the match fails with the
2783 PCRE_ERROR_NOMEMORY error.
2784
2785 Do not confuse the (?R) item with the condition (R), which tests for
2786 recursion. Consider this pattern, which matches text in angle brack-
2787 ets, allowing for arbitrary nesting. Only digits are allowed in nested
2788 brackets (that is, when recursing), whereas any characters are permit-
2789 ted at the outer level.
2790
2791 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2792
2793 In this pattern, (?(R) is the start of a conditional subpattern, with
2794 two different alternatives for the recursive and non-recursive cases.
2795 The (?R) item is the actual recursive call.
2796
2797
2798 SUBPATTERNS AS SUBROUTINES
2799
2800 If the syntax for a recursive subpattern reference (either by number or
2801 by name) is used outside the parentheses to which it refers, it oper-
2802 ates like a subroutine in a programming language. An earlier example
2803 pointed out that the pattern
2804
2805 (sens|respons)e and \1ibility
2806
2807 matches "sense and sensibility" and "response and responsibility", but
2808 not "sense and responsibility". If instead the pattern
2809
2810 (sens|respons)e and (?1)ibility
2811
2812 is used, it does match "sense and responsibility" as well as the other
2813 two strings. Such references must, however, follow the subpattern to
2814 which they refer.
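
The difference between the two forms can be sketched in plain C. This is
only a model of the two patterns applied to a whole subject string, not
PCRE code, and the function names are invented for illustration. A back
reference compares against the characters that were captured, whereas a
subroutine call runs the subpattern again.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Try the alternatives of (sens|respons) at s.
   Returns the number of characters matched, or 0 if neither matches. */
static size_t match_stem(const char *s)
{
    if (strncmp(s, "sens", 4) == 0)
        return 4;
    if (strncmp(s, "respons", 7) == 0)
        return 7;
    return 0;
}

/* Model of (sens|respons)e and \1ibility */
int match_backref(const char *s)
{
    size_t n;
    const char *p;

    assert(s != NULL);
    n = match_stem(s);
    p = s + n;
    if (n == 0 || strncmp(p, "e and ", 6) != 0)
        return 0;
    p += 6;
    if (strncmp(p, s, n) != 0)      /* \1: the same characters again */
        return 0;
    p += n;
    return strcmp(p, "ibility") == 0;
}

/* Model of (sens|respons)e and (?1)ibility */
int match_subroutine(const char *s)
{
    size_t n, m;
    const char *p;

    assert(s != NULL);
    n = match_stem(s);
    p = s + n;
    if (n == 0 || strncmp(p, "e and ", 6) != 0)
        return 0;
    p += 6;
    m = match_stem(p);              /* (?1): match the subpattern afresh */
    if (m == 0)
        return 0;
    p += m;
    return strcmp(p, "ibility") == 0;
}
```

Both functions accept "sense and sensibility" and "response and
responsibility", but only match_subroutine accepts "sense and
responsibility".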
2815
2816
2817 CALLOUTS
2818
2819 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2820 Perl code to be obeyed in the middle of matching a regular expression.
2821 This makes it possible, amongst other things, to extract different sub-
2822 strings that match the same pair of parentheses when there is a repeti-
2823 tion.
2824
2825 PCRE provides a similar feature, but of course it cannot obey arbitrary
2826 Perl code. The feature is called "callout". The caller of PCRE provides
2827 an external function by putting its entry point in the global variable
2828 pcre_callout. By default, this variable contains NULL, which disables
2829 all calling out.
2830
2831 Within a regular expression, (?C) indicates the points at which the
2832 external function is to be called. If you want to identify different
2833 callout points, you can put a number less than 256 after the letter C.
2834 The default value is zero. For example, this pattern has two callout
2835 points:
2836
2837 (?C1)abc(?C2)def
2838
2839 During matching, when PCRE reaches a callout point (and pcre_callout is
2840 set), the external function is called. It is provided with the number
2841 of the callout, and, optionally, one item of data originally supplied
2842 by the caller of pcre_exec(). The callout function may cause matching
2843 to backtrack, or to fail altogether. A complete description of the
2844 interface to the callout function is given in the pcrecallout documen-
2845 tation.
2846
2847 Last updated: 03 February 2003
2848 Copyright (c) 1997-2003 University of Cambridge.
2849 -----------------------------------------------------------------------------
2850
2851 PCRE(3) PCRE(3)
2852
2853
2854
2855 NAME
2856 PCRE - Perl-compatible regular expressions
2857
2858 PCRE PERFORMANCE
2859
2860 Certain items that may appear in regular expression patterns are more
2861 efficient than others. It is more efficient to use a character class
2862 like [aeiou] than a set of alternatives such as (a|e|i|o|u). In gen-
2863 eral, the simplest construction that provides the required behaviour is
2864 usually the most efficient. Jeffrey Friedl's book contains a lot of
2865 discussion about optimizing regular expressions for efficient perfor-
2866 mance.
2867
2868 When a pattern begins with .* not in parentheses, or in parentheses
2869 that are not the subject of a backreference, and the PCRE_DOTALL option
2870 is set, the pattern is implicitly anchored by PCRE, since it can match
2871 only at the start of a subject string. However, if PCRE_DOTALL is not
2872 set, PCRE cannot make this optimization, because the . metacharacter
2873 does not then match a newline, and if the subject string contains new-
2874 lines, the pattern may match from the character immediately following
2875 one of them instead of from the very start. For example, the pattern
2876
2877 .*second
2878
2879 matches the subject "first\nand second" (where \n stands for a newline
2880 character), with the match starting at the seventh character. In order
2881 to do this, PCRE has to retry the match starting after every newline in
2882 the subject.
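
The restart behaviour can be modelled in a few lines of C. This sketch
is an illustration only: find_dot_star_second is an invented name, and
the logic is specific to the pattern .*second with PCRE_DOTALL unset. It
tries the match at the start of the subject and then just after each
newline, and returns the zero-based offset at which a match begins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of matching .*second when . does not match a newline.
   Returns the offset of the start of the match, or -1 if none. */
long find_dot_star_second(const char *subject)
{
    const char *start = subject;

    assert(subject != NULL);
    for (;;) {
        const char *nl = strchr(start, '\n');
        size_t linelen = nl ? (size_t)(nl - start) : strlen(start);
        const char *hit = strstr(start, "second");

        /* .* cannot cross a newline, so "second" must lie wholly
           within the line that begins at start. */
        if (hit != NULL && (size_t)(hit - start) + 6 <= linelen)
            return (long)(start - subject);

        if (nl == NULL)
            return -1;          /* no further newline to restart after */
        start = nl + 1;         /* retry just after the newline */
    }
}
```

For the subject "first\nand second" it returns 6, which is the seventh
character.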
2883
2884 If you are using such a pattern with subject strings that do not con-
2885 tain newlines, the best performance is obtained by setting PCRE_DOTALL,
2886 or starting the pattern with ^.* to indicate explicit anchoring. That
2887 saves PCRE from having to scan along the subject looking for a newline
2888 to restart at.
2889
2890 Beware of patterns that contain nested indefinite repeats. These can
2891 take a long time to run when applied to a string that does not match.
2892 Consider the pattern fragment
2893
2894 (a+)*
2895
2896 This can match "aaaa" in 16 different ways, and this number increases
2897 very rapidly as the string gets longer. (The * repeat can match 0, 1,
2898 2, 3, or 4 times, and for each of those cases other than 0, the +
2899 repeats can match different numbers of times.) When the remainder of
2900 the pattern is such that the entire match is going to fail, PCRE has in
2901 principle to try every possible variation, and this can take an
2902 extremely long time.
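
The count can be checked by brute force. The following C sketch (the
function names are invented for illustration) counts, for a subject of
n "a" characters, every prefix that (a+)* could match together with
every way of dividing that prefix between iterations of the outer
repeat.

```c
#include <assert.h>

/* Number of ways to divide exactly n characters into non-empty runs,
   one run per iteration of the outer * repeat. Zero characters can be
   divided in exactly one way: no iterations at all. */
static unsigned long splits(unsigned n)
{
    unsigned long total = 0;
    unsigned first;

    if (n == 0)
        return 1;
    /* The first a+ takes 1..n characters; the rest is divided
       recursively. */
    for (first = 1; first <= n; first++)
        total += splits(n - first);
    return total;
}

/* Number of ways (a+)* can match at the start of n "a" characters:
   every prefix length 0..n, times every division of that prefix. */
unsigned long ways_to_match(unsigned n)
{
    unsigned long total = 0;
    unsigned len;

    assert(n < 8 * sizeof(unsigned long));  /* result must fit */
    for (len = 0; len <= n; len++)
        total += splits(len);
    return total;
}
```

Each extra character in the subject doubles the result, which is why the
time needed to report a failure grows so quickly.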
2903
2904 An optimization catches some of the more simple cases such as
2905
2906 (a+)*b
2907
2908 where a literal character follows. Before embarking on the standard
2909 matching procedure, PCRE checks that there is a "b" later in the sub-
2910 ject string, and if there is not, it fails the match immediately. How-
2911 ever, when there is no following literal this optimization cannot be
2912 used. You can see the difference by comparing the behaviour of
2913
2914 (a+)*\d
2915
2916 with the pattern above. When applied to a whole line of "a" characters,
2917 (a+)*b fails almost instantly, whereas (a+)*\d takes an appreciable
2918 time with strings longer than about 20 characters.
2919
2920 Last updated: 03 February 2003
2921 Copyright (c) 1997-2003 University of Cambridge.
2922 -----------------------------------------------------------------------------
2923
2924 PCRE(3) PCRE(3)
2925
2926
2927
2928 NAME
2929 PCRE - Perl-compatible regular expressions
2930
2931 SYNOPSIS OF POSIX API
2932 #include <pcreposix.h>
2933
2934 int regcomp(regex_t *preg, const char *pattern,
2935 int cflags);
2936
2937 int regexec(regex_t *preg, const char *string,
2938 size_t nmatch, regmatch_t pmatch[], int eflags);
2939
2940 size_t regerror(int errcode, const regex_t *preg,
2941 char *errbuf, size_t errbuf_size);
2942
2943 void regfree(regex_t *preg);
2944
2945
2946 DESCRIPTION
2947
2948 This set of functions provides a POSIX-style API to the PCRE regular
2949 expression package. See the pcreapi documentation for a description of
2950 the native API, which contains additional functionality.
2951
2952 The functions described here are just wrapper functions that ultimately
2953 call the PCRE native API. Their prototypes are defined in the
2954 pcreposix.h header file, and on Unix systems the library itself is
2955 called libpcreposix.a, so it can be accessed by adding -lpcreposix to
2956 the command for linking an application that uses them. Because the POSIX
2957 functions call the native ones, it is also necessary to add -lpcre.
2958
2959 I have implemented only those option bits that can be reasonably mapped
2960 to PCRE native options. In addition, the options REG_EXTENDED and
2961 REG_NOSUB are defined with the value zero. They have no effect, but
2962 since programs that are written to the POSIX interface often use them,
2963 this makes it easier to slot in PCRE as a replacement library. Other
2964 POSIX options are not even defined.
2965
2966 When PCRE is called via these functions, it is only the API that is
2967 POSIX-like in style. The syntax and semantics of the regular expres-
2968 sions themselves are still those of Perl, subject to the setting of
2969 various PCRE options, as described below. "POSIX-like in style" means
2970 that the API approximates to the POSIX definition; it is not fully
2971 POSIX-compatible, and in multi-byte encoding domains it is probably
2972 even less compatible.
2973
2974 The header for these functions is supplied as pcreposix.h to avoid any
2975 potential clash with other POSIX libraries. It can, of course, be
2976 renamed or aliased as regex.h, which is the "correct" name. It provides
2977 two structure types, regex_t for compiled internal forms, and reg-
2978 match_t for returning captured substrings. It also defines some con-
2979 stants whose names start with "REG_"; these are used for setting
2980 options and identifying error codes.
2981
2982
2983 COMPILING A PATTERN
2984
2985 The function regcomp() is called to compile a pattern into an internal
2986 form. The pattern is a C string terminated by a binary zero, and is
2987 passed in the argument pattern. The preg argument is a pointer to a
2988 regex_t structure which is used as a base for storing information about
2989 the compiled expression.
2990
2991 The argument cflags is either zero, or contains one or more of the bits
2992 defined by the following macros:
2993
2994 REG_ICASE
2995
2996 The PCRE_CASELESS option is set when the expression is passed for com-
2997 pilation to the native function.
2998
2999 REG_NEWLINE
3000
3001 The PCRE_MULTILINE option is set when the expression is passed for com-
3002 pilation to the native function. Note that this does not mimic the
3003 defined POSIX behaviour for REG_NEWLINE (see the following section).
3004
3005 In the absence of these flags, no options are passed to the native
3006 function. This means that the regex is compiled with PCRE default
3007 semantics. In particular, the way it handles newline characters in the
3008 subject string is the Perl way, not the POSIX way. Note that setting
3009 PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
3010 It does not affect the way newlines are matched by . (they aren't) or
3011 by a negative class such as [^a] (they are).
3012
3013 The yield of regcomp() is zero on success, and non-zero otherwise. The
3014 preg structure is filled in on success, and one member of the structure
3015 is public: re_nsub contains the number of capturing subpatterns in the
3016 regular expression. Various error codes are defined in the header file.
3017
3018
3019 MATCHING NEWLINE CHARACTERS
3020
3021 This area is not simple, because POSIX and Perl take different views of
3022 things. It is not possible to get PCRE to obey POSIX semantics, but
3023 then PCRE was never intended to be a POSIX engine. The following table
3024 lists the different possibilities for matching newline characters in
3025 PCRE:
3026
3027                           Default   Change with
3028
3029 . matches newline         no        PCRE_DOTALL
3030 newline matches [^a]      yes       not changeable
3031 $ matches \n at end       yes       PCRE_DOLLARENDONLY
3032 $ matches \n in middle    no        PCRE_MULTILINE
3033 ^ matches \n in middle    no        PCRE_MULTILINE
3034
3035 This is the equivalent table for POSIX:
3036
3037                           Default   Change with
3038
3039 . matches newline         yes       REG_NEWLINE
3040 newline matches [^a]      yes       REG_NEWLINE
3041 $ matches \n at end       no        REG_NEWLINE
3042 $ matches \n in middle    no        REG_NEWLINE
3043 ^ matches \n in middle    no        REG_NEWLINE
3044
3045 PCRE's behaviour is the same as Perl's, except that there is no equiva-
3046 lent for PCRE_DOLLARENDONLY in Perl. In both PCRE and Perl, there is no
3047 way to stop newline from matching [^a].
3048
3049 The default POSIX newline handling can be obtained by setting
3050 PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way to make PCRE
3051 behave exactly as for the REG_NEWLINE action.
3052
3053
3054 MATCHING A PATTERN
3055
3056 The function regexec() is called to match a pre-compiled pattern preg
3057 against a given string, which is terminated by a zero byte, subject to
3058 the options in eflags. These can be:
3059
3060 REG_NOTBOL
3061
3062 The PCRE_NOTBOL option is set when calling the underlying PCRE matching
3063 function.
3064
3065 REG_NOTEOL
3066
3067 The PCRE_NOTEOL option is set when calling the underlying PCRE matching
3068 function.
3069
3070 The portion of the string that was matched, and also any captured sub-
3071 strings, are returned via the pmatch argument, which points to an array
3072 of nmatch structures of type regmatch_t, containing the members rm_so
3073 and rm_eo. These contain the offset to the first character of each sub-
3074 string and the offset to the first character after the end of each sub-
3075 string, respectively. The 0th element of the vector relates to the
3076 entire portion of string that was matched; subsequent elements relate
3077 to the capturing subpatterns of the regular expression. Unused entries
3078 in the array have both structure members set to -1.
3079
3080 A successful match yields a zero return; various error codes are
3081 defined in the header file, of which REG_NOMATCH is the "expected"
3082 failure code.
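
As a minimal sketch of the compile, match, and free sequence, the
following helper compiles a pattern caselessly and matches it once. The
name match_once and the USE_PCREPOSIX macro are assumptions made for
this example; without the macro the code builds against the system's own
regex.h, which offers the same calls but POSIX rather than Perl pattern
semantics. REG_EXTENDED is passed because programs written for POSIX
usually do so; with pcreposix.h it is zero and has no effect.

```c
#include <assert.h>
#include <stddef.h>
#ifdef USE_PCREPOSIX
#include <pcreposix.h>  /* PCRE's wrapper: link with -lpcreposix -lpcre */
#else
#include <regex.h>      /* system POSIX regex: same calls, POSIX syntax */
#endif

/* Compile pattern caselessly, match it once against subject, and free
   the compiled form. Returns 0 on a match, otherwise the error code
   from regcomp() or regexec() (REG_NOMATCH if nothing matched). */
int match_once(const char *pattern, const char *subject,
               regmatch_t pmatch[], size_t nmatch)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_ICASE);
    if (rc != 0)
        return rc;              /* compilation failed */

    /* pmatch[0] receives the whole match, pmatch[1..] the captures. */
    rc = regexec(&re, subject, nmatch, pmatch, 0);

    regfree(&re);               /* re must not be used after this */
    return rc;
}
```

With nmatch set to 3 and the pattern (cat|dog) matched against "The Dog
sat", pmatch[0] and pmatch[1] both hold the offsets 4 and 7, and
pmatch[2] holds -1 in both members.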
3083
3084
3085 ERROR MESSAGES
3086
3087 The regerror() function maps a non-zero error code from either regcomp()
3088 or regexec() to a printable message. If preg is not NULL, the error
3089 should have arisen from the use of that structure. A message terminated
3090 by a binary zero is placed in errbuf. The length of the message,
3091 including the zero, is limited to errbuf_size. The yield of the func-
3092 tion is the size of buffer needed to hold the whole message.
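
Because regerror() returns the size of buffer needed, a caller can ask
for the size first and then allocate exactly that much. The sketch below
relies on regerror() ignoring the buffer when errbuf_size is zero, as
POSIX specifies. The name error_text and the USE_PCREPOSIX macro are
assumptions of this example; without the macro it builds against the
system regex.h.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef USE_PCREPOSIX
#include <pcreposix.h>  /* PCRE's wrapper: link with -lpcreposix -lpcre */
#else
#include <regex.h>      /* system POSIX regex: same calls */
#endif

/* Return a freshly allocated message for errcode, or NULL if memory
   cannot be obtained. The caller releases it with free(). */
char *error_text(int errcode, const regex_t *preg)
{
    /* First call: no buffer, just learn the size including the zero. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf;

    assert(need > 0);           /* at least the terminating zero */
    buf = malloc(need);
    if (buf != NULL)
        regerror(errcode, preg, buf, need);   /* second call fills it */
    return buf;
}
```

A typical use is to call error_text(rc, &re) when regcomp() returns a
non-zero rc, print the result, and then free it.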
3093
3094
3095 STORAGE
3096
3097 Compiling a regular expression causes memory to be allocated and asso-
3098 ciated with the preg structure. The function regfree() frees all such
3099 memory, after which preg may no longer be used as a compiled expres-
3100 sion.
3101
3102
3103 AUTHOR
3104
3105 Philip Hazel <ph10@cam.ac.uk>
3106 University Computing Service,
3107 Cambridge CB2 3QG, England.
3108
3109 Last updated: 03 February 2003
3110 Copyright (c) 1997-2003 University of Cambridge.
3111 -----------------------------------------------------------------------------
3112
3113 PCRE(3) PCRE(3)
3114
3115
3116
3117 NAME
3118 PCRE - Perl-compatible regular expressions
3119
3120 PCRE SAMPLE PROGRAM
3121
3122 A simple, complete demonstration program, to get you started with using
3123 PCRE, is supplied in the file pcredemo.c in the PCRE distribution.
3124
3125 The program compiles the regular expression that is its first argument,
3126 and matches it against the subject string in its second argument. No
3127 PCRE options are set, and default character tables are used. If match-
3128 ing succeeds, the program outputs the portion of the subject that
3129 matched, together with the contents of any captured substrings.
3130
3131 If the -g option is given on the command line, the program then goes on
3132 to check for further matches of the same regular expression in the same
3133 subject string. The logic is a little bit tricky because of the possi-
3134 bility of matching an empty string. Comments in the code explain what
3135 is going on.
3136
3137 On a Unix system that has PCRE installed in /usr/local, you can compile
3138 the demonstration program using a command like this:
3139
3140 gcc -o pcredemo pcredemo.c -I/usr/local/include \
3141 -L/usr/local/lib -lpcre
3142
3143 Then you can run simple tests like this:
3144
3145 ./pcredemo 'cat|dog' 'the cat sat on the mat'
3146 ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3147
3148 Note that there is a much more comprehensive test program, called
3149 pcretest, which supports many more facilities for testing regular
3150 expressions and the PCRE library. The pcredemo program is provided as a
3151 simple coding example.
3152
3153 On some operating systems (e.g. Solaris) you may get an error like this
3154 when you try to run pcredemo:
3155
3156 ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
3157 directory
3158
3159 This is caused by the way shared library support works on those sys-
3160 tems. You need to add
3161
3162 -R/usr/local/lib
3163
3164 to the compile command to get round this problem.
3165
3166 Last updated: 28 January 2003
3167 Copyright (c) 1997-2003 University of Cambridge.
3168 -----------------------------------------------------------------------------
3169
