282 |
script, where the optional features are selected or deselected by pro- |
script, where the optional features are selected or deselected by pro- |
283 |
viding options to configure before running the make command. However, |
viding options to configure before running the make command. However, |
284 |
the same options can be selected in both Unix-like and non-Unix-like |
the same options can be selected in both Unix-like and non-Unix-like |
285 |
environments using the GUI facility of CMakeSetup if you are using |
environments using the GUI facility of cmake-gui if you are using CMake |
286 |
CMake instead of configure to build PCRE. |
instead of configure to build PCRE. |
287 |
|
|
288 |
|
There is a lot more information about building PCRE in non-Unix-like |
289 |
|
environments in the file called NON_UNIX_USE, which is part of the PCRE |
290 |
|
distribution. You should consult this file as well as the README file |
291 |
|
if you are building in a non-Unix-like environment. |
292 |
|
|
293 |
The complete list of options for configure (which includes the standard |
The complete list of options for configure (which includes the standard |
294 |
ones such as the selection of the installation directory) can be |
ones such as the selection of the installation directory) can be |
295 |
obtained by running |
obtained by running |
296 |
|
|
297 |
./configure --help |
./configure --help |
298 |
|
|
299 |
The following sections include descriptions of options whose names |
The following sections include descriptions of options whose names |
300 |
begin with --enable or --disable. These settings specify changes to the |
begin with --enable or --disable. These settings specify changes to the |
301 |
defaults for the configure command. Because of the way that configure |
defaults for the configure command. Because of the way that configure |
302 |
works, --enable and --disable always come in pairs, so the complemen- |
works, --enable and --disable always come in pairs, so the complemen- |
303 |
tary option always exists as well, but as it specifies the default, it |
tary option always exists as well, but as it specifies the default, it |
304 |
is not described. |
is not described. |
305 |
|
|
306 |
|
|
321 |
|
|
322 |
--enable-utf8 |
--enable-utf8 |
323 |
|
|
324 |
to the configure command. Of itself, this does not make PCRE treat |
to the configure command. Of itself, this does not make PCRE treat |
325 |
strings as UTF-8. As well as compiling PCRE with this option, you also |
strings as UTF-8. As well as compiling PCRE with this option, you also |
326 |
have to set the PCRE_UTF8 option when you call the pcre_compile() |
have to set the PCRE_UTF8 option when you call the pcre_compile() |
327 |
function. |
function. |
328 |
|
|
329 |
If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE |
If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE |
330 |
expects its input to be either ASCII or UTF-8 (depending on the runtime |
expects its input to be either ASCII or UTF-8 (depending on the runtime |
331 |
option). It is not possible to support both EBCDIC and UTF-8 codes in |
option). It is not possible to support both EBCDIC and UTF-8 codes in |
332 |
the same version of the library. Consequently, --enable-utf8 and |
the same version of the library. Consequently, --enable-utf8 and |
333 |
--enable-ebcdic are mutually exclusive. |
--enable-ebcdic are mutually exclusive. |
334 |
|
|
335 |
|
|
336 |
UNICODE CHARACTER PROPERTY SUPPORT |
UNICODE CHARACTER PROPERTY SUPPORT |
337 |
|
|
338 |
UTF-8 support allows PCRE to process character values greater than 255 |
UTF-8 support allows PCRE to process character values greater than 255 |
339 |
in the strings that it handles. On its own, however, it does not pro- |
in the strings that it handles. On its own, however, it does not pro- |
340 |
vide any facilities for accessing the properties of such characters. If |
vide any facilities for accessing the properties of such characters. If |
341 |
you want to be able to use the pattern escapes \P, \p, and \X, which |
you want to be able to use the pattern escapes \P, \p, and \X, which |
342 |
refer to Unicode character properties, you must add |
refer to Unicode character properties, you must add |
343 |
|
|
344 |
--enable-unicode-properties |
--enable-unicode-properties |
345 |
|
|
346 |
to the configure command. This implies UTF-8 support, even if you have |
to the configure command. This implies UTF-8 support, even if you have |
347 |
not explicitly requested it. |
not explicitly requested it. |
348 |
|
|
349 |
Including Unicode property support adds around 30K of tables to the |
Including Unicode property support adds around 30K of tables to the |
350 |
PCRE library. Only the general category properties such as Lu and Nd |
PCRE library. Only the general category properties such as Lu and Nd |
351 |
are supported. Details are given in the pcrepattern documentation. |
are supported. Details are given in the pcrepattern documentation. |
352 |
|
|
353 |
|
|
354 |
CODE VALUE OF NEWLINE |
CODE VALUE OF NEWLINE |
355 |
|
|
356 |
By default, PCRE interprets the linefeed (LF) character as indicating |
By default, PCRE interprets the linefeed (LF) character as indicating |
357 |
the end of a line. This is the normal newline character on Unix-like |
the end of a line. This is the normal newline character on Unix-like |
358 |
systems. You can compile PCRE to use carriage return (CR) instead, by |
systems. You can compile PCRE to use carriage return (CR) instead, by |
359 |
adding |
adding |
360 |
|
|
361 |
--enable-newline-is-cr |
--enable-newline-is-cr |
362 |
|
|
363 |
to the configure command. There is also a --enable-newline-is-lf |
to the configure command. There is also a --enable-newline-is-lf |
364 |
option, which explicitly specifies linefeed as the newline character. |
option, which explicitly specifies linefeed as the newline character. |
365 |
|
|
366 |
Alternatively, you can specify that line endings are to be indicated by |
Alternatively, you can specify that line endings are to be indicated by |
372 |
|
|
373 |
--enable-newline-is-anycrlf |
--enable-newline-is-anycrlf |
374 |
|
|
375 |
which causes PCRE to recognize any of the three sequences CR, LF, or |
which causes PCRE to recognize any of the three sequences CR, LF, or |
376 |
CRLF as indicating a line ending. Finally, a fifth option, specified by |
CRLF as indicating a line ending. Finally, a fifth option, specified by |
377 |
|
|
378 |
--enable-newline-is-any |
--enable-newline-is-any |
379 |
|
|
380 |
causes PCRE to recognize any Unicode newline sequence. |
causes PCRE to recognize any Unicode newline sequence. |
381 |
|
|
382 |
Whatever line ending convention is selected when PCRE is built can be |
Whatever line ending convention is selected when PCRE is built can be |
383 |
overridden when the library functions are called. At build time it is |
overridden when the library functions are called. At build time it is |
384 |
conventional to use the standard for your operating system. |
conventional to use the standard for your operating system. |
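The line-ending conventions described above can be pictured with a small sketch. This is not PCRE code: the enum names and the function newline_length() are hypothetical, the subject is assumed to be UTF-8 for the "any" case, and the two-character CRLF convention is included on the assumption that it is one of the five options the text counts.

```c
#include <stddef.h>

/* Hypothetical names for the build-time conventions; not PCRE identifiers. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the length in bytes of the line ending that starts at s[i],
   or 0 if no line ending starts there under the given convention. */
size_t newline_length(const unsigned char *s, size_t len, size_t i,
                      enum nl_convention conv)
{
    if (i >= len) return 0;
    int has_next = (i + 1 < len);
    int crlf = (s[i] == '\r' && has_next && s[i + 1] == '\n');

    switch (conv) {
    case NL_CR:   return s[i] == '\r' ? 1 : 0;
    case NL_LF:   return s[i] == '\n' ? 1 : 0;
    case NL_CRLF: return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (s[i] == '\r' || s[i] == '\n') ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        /* LF, CR, VT, FF */
        if (s[i] == '\n' || s[i] == '\r' || s[i] == 0x0B || s[i] == 0x0C)
            return 1;
        /* NEL (U+0085) is C2 85 in UTF-8 */
        if (s[i] == 0xC2 && has_next && s[i + 1] == 0x85) return 2;
        /* LS (U+2028) and PS (U+2029) are E2 80 A8 and E2 80 A9 */
        if (s[i] == 0xE2 && i + 2 < len && s[i + 1] == 0x80 &&
            (s[i + 2] == 0xA8 || s[i + 2] == 0xA9))
            return 3;
        return 0;
    }
    return 0;
}
```

With the subject "a\r\nb", position 1 is a two-byte line ending under NL_ANYCRLF, a one-byte ending under NL_CR, and not a line ending at all under NL_LF.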
385 |
|
|
386 |
|
|
387 |
WHAT \R MATCHES |
WHAT \R MATCHES |
388 |
|
|
389 |
By default, the sequence \R in a pattern matches any Unicode newline |
By default, the sequence \R in a pattern matches any Unicode newline |
390 |
sequence, whatever has been selected as the line ending sequence. If |
sequence, whatever has been selected as the line ending sequence. If |
391 |
you specify |
you specify |
392 |
|
|
393 |
--enable-bsr-anycrlf |
--enable-bsr-anycrlf |
394 |
|
|
395 |
the default is changed so that \R matches only CR, LF, or CRLF. What- |
the default is changed so that \R matches only CR, LF, or CRLF. What- |
396 |
ever is selected when PCRE is built can be overridden when the library |
ever is selected when PCRE is built can be overridden when the library |
397 |
functions are called. |
functions are called. |
398 |
|
|
399 |
|
|
400 |
BUILDING SHARED AND STATIC LIBRARIES |
BUILDING SHARED AND STATIC LIBRARIES |
401 |
|
|
402 |
The PCRE building process uses libtool to build both shared and static |
The PCRE building process uses libtool to build both shared and static |
403 |
Unix libraries by default. You can suppress one of these by adding one |
Unix libraries by default. You can suppress one of these by adding one |
404 |
of |
of |
405 |
|
|
406 |
--disable-shared |
--disable-shared |
412 |
POSIX MALLOC USAGE |
POSIX MALLOC USAGE |
413 |
|
|
414 |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
415 |
umentation), additional working storage is required for holding the |
umentation), additional working storage is required for holding the |
416 |
pointers to capturing substrings, because PCRE requires three integers |
pointers to capturing substrings, because PCRE requires three integers |
417 |
per substring, whereas the POSIX interface provides only two. If the |
per substring, whereas the POSIX interface provides only two. If the |
418 |
number of expected substrings is small, the wrapper function uses space |
number of expected substrings is small, the wrapper function uses space |
419 |
on the stack, because this is faster than using malloc() for each call. |
on the stack, because this is faster than using malloc() for each call. |
420 |
The default threshold above which the stack is no longer used is 10; it |
The default threshold above which the stack is no longer used is 10; it |
427 |
|
|
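The stack-versus-malloc() decision in the POSIX wrapper can be sketched as follows. This is only an illustration of the idea, not the pcreposix source: wrapper_exec(), fake_match(), and STACK_THRESHOLD are hypothetical names, and the threshold of 10 is taken from the default quoted above.

```c
#include <stddef.h>
#include <stdlib.h>

#define STACK_THRESHOLD 10   /* mirrors the default of 10 quoted in the text */

/* Hypothetical stand-in for the matcher: fills three integers per substring. */
static void fake_match(int *ovector, size_t nmatch)
{
    for (size_t i = 0; i < nmatch * 3; i++)
        ovector[i] = -1;
}

/* Returns 1 if the heap was used, 0 if the stack was used,
   and -1 if malloc() failed. */
int wrapper_exec(size_t nmatch)
{
    int small[STACK_THRESHOLD * 3];   /* working storage on the stack */
    int *ovector = small;
    int used_heap = 0;

    if (nmatch > STACK_THRESHOLD) {
        /* Too many substrings for the stack buffer: fall back to the heap. */
        ovector = malloc(nmatch * 3 * sizeof *ovector);
        if (ovector == NULL) return -1;
        used_heap = 1;
    }

    fake_match(ovector, nmatch);
    /* A real wrapper would now copy the offsets it needs into the
       caller's two-integer-per-substring array (assumption). */

    if (used_heap) free(ovector);
    return used_heap;
}
```

Calls that expect at most 10 substrings avoid malloc() entirely; only larger requests pay for a heap allocation on each call.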
428 |
HANDLING VERY LARGE PATTERNS |
HANDLING VERY LARGE PATTERNS |
429 |
|
|
430 |
Within a compiled pattern, offset values are used to point from one |
Within a compiled pattern, offset values are used to point from one |
431 |
part to another (for example, from an opening parenthesis to an alter- |
part to another (for example, from an opening parenthesis to an alter- |
432 |
nation metacharacter). By default, two-byte values are used for these |
nation metacharacter). By default, two-byte values are used for these |
433 |
offsets, leading to a maximum size for a compiled pattern of around |
offsets, leading to a maximum size for a compiled pattern of around |
434 |
64K. This is sufficient to handle all but the most gigantic patterns. |
64K. This is sufficient to handle all but the most gigantic patterns. |
435 |
Nevertheless, some people do want to process enormous patterns, so it |
Nevertheless, some people do want to process enormous patterns, so it |
436 |
is possible to compile PCRE to use three-byte or four-byte offsets by |
is possible to compile PCRE to use three-byte or four-byte offsets by |
437 |
adding a setting such as |
adding a setting such as |
438 |
|
|
439 |
--with-link-size=3 |
--with-link-size=3 |
440 |
|
|
441 |
to the configure command. The value given must be 2, 3, or 4. Using |
to the configure command. The value given must be 2, 3, or 4. Using |
442 |
longer offsets slows down the operation of PCRE because it has to load |
longer offsets slows down the operation of PCRE because it has to load |
443 |
additional bytes when handling them. |
additional bytes when handling them. |
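The arithmetic behind the link size can be sketched in a few lines. The functions below are hypothetical and assume big-endian storage of each offset; they show only why two bytes cap an offset at 65535 (the "around 64K" above) and why wider offsets cost extra byte loads. The limits PCRE actually enforces may be lower than these arithmetic maxima.

```c
#include <stdint.h>

/* Largest offset representable in link_size bytes (2, 3, or 4). */
uint32_t max_offset(int link_size)
{
    return (link_size >= 4) ? UINT32_MAX
                            : ((uint32_t)1 << (8 * link_size)) - 1;
}

/* Store value in link_size bytes, most significant byte first.
   Bits that do not fit are silently dropped. */
void put_offset(unsigned char *p, uint32_t value, int link_size)
{
    for (int i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Load an offset: one byte load and one shift per byte of link size. */
uint32_t get_offset(const unsigned char *p, int link_size)
{
    uint32_t value = 0;
    for (int i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Helper: write an offset and read it back with the same link size. */
uint32_t roundtrip(uint32_t value, int link_size)
{
    unsigned char buf[4];
    put_offset(buf, value, link_size);
    return get_offset(buf, link_size);
}
```

An offset of 70000 survives a round trip with a link size of 3 but is truncated with a link size of 2, which is the situation --with-link-size=3 is meant to avoid.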
444 |
|
|
445 |
|
|
446 |
AVOIDING EXCESSIVE STACK USAGE |
AVOIDING EXCESSIVE STACK USAGE |
447 |
|
|
448 |
When matching with the pcre_exec() function, PCRE implements backtrack- |
When matching with the pcre_exec() function, PCRE implements backtrack- |
449 |
ing by making recursive calls to an internal function called match(). |
ing by making recursive calls to an internal function called match(). |
450 |
In environments where the size of the stack is limited, this can se- |
In environments where the size of the stack is limited, this can se- |
451 |
verely limit PCRE's operation. (The Unix environment does not usually |
verely limit PCRE's operation. (The Unix environment does not usually |
452 |
suffer from this problem, but it may sometimes be necessary to increase |
suffer from this problem, but it may sometimes be necessary to increase |
453 |
the maximum stack size. There is a discussion in the pcrestack docu- |
the maximum stack size. There is a discussion in the pcrestack docu- |
454 |
mentation.) An alternative approach to recursion that uses memory from |
mentation.) An alternative approach to recursion that uses memory from |
455 |
the heap to remember data, instead of using recursive function calls, |
the heap to remember data, instead of using recursive function calls, |
456 |
has been implemented to work round the problem of limited stack size. |
has been implemented to work round the problem of limited stack size. |
457 |
If you want to build a version of PCRE that works this way, add |
If you want to build a version of PCRE that works this way, add |
458 |
|
|
459 |
--disable-stack-for-recursion |
--disable-stack-for-recursion |
460 |
|
|
461 |
to the configure command. With this configuration, PCRE will use the |
to the configure command. With this configuration, PCRE will use the |
462 |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
463 |
ment functions. By default these point to malloc() and free(), but you |
ment functions. By default these point to malloc() and free(), but you |
464 |
can replace the pointers so that your own functions are used. |
can replace the pointers so that your own functions are used. |
465 |
|
|
466 |
Separate functions are provided rather than using pcre_malloc and |
Separate functions are provided rather than using pcre_malloc and |
467 |
pcre_free because the usage is very predictable: the block sizes |
pcre_free because the usage is very predictable: the block sizes |
468 |
requested are always the same, and the blocks are always freed in |
requested are always the same, and the blocks are always freed in |
469 |
reverse order. A calling program might be able to implement optimized |
reverse order. A calling program might be able to implement optimized |
470 |
functions that perform better than malloc() and free(). PCRE runs |
functions that perform better than malloc() and free(). PCRE runs |
471 |
noticeably more slowly when built in this way. This option affects only |
noticeably more slowly when built in this way. This option affects only |
472 |
the pcre_exec() function; it is not relevant for the |
the pcre_exec() function; it is not relevant for the |
473 |
pcre_dfa_exec() function. |
pcre_dfa_exec() function. |
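Because the blocks are always the same size and are freed in reverse order, a replacement allocator can be very simple. The following is a minimal sketch of that idea, assuming a fixed frame size of 256 bytes and a pool of 64 frames; frame_alloc(), frame_free(), and the sizes are all hypothetical, and the code relies on strictly last-in, first-out frees.

```c
#include <stddef.h>
#include <stdlib.h>

#define FRAME_SIZE  256   /* assumed fixed block size */
#define POOL_FRAMES 64    /* frames served from the pool before using malloc() */

/* The union forces suitable alignment for each frame. */
union frame {
    long double ld;
    long long ll;
    void *ptr;
    unsigned char bytes[FRAME_SIZE];
};

static union frame pool[POOL_FRAMES];
static size_t pool_top = 0;           /* number of pool frames in use */

void *frame_alloc(size_t size)
{
    if (size <= FRAME_SIZE && pool_top < POOL_FRAMES)
        return pool[pool_top++].bytes;   /* next frame, no search needed */
    return malloc(size);                 /* pool exhausted or block too large */
}

void frame_free(void *block)
{
    /* With LIFO frees, a pool block is always the most recent pool frame. */
    if (pool_top > 0 && block == (void *)pool[pool_top - 1].bytes)
        pool_top--;
    else
        free(block);                     /* block came from malloc() */
}

size_t frames_in_use(void) { return pool_top; }

/* Allocate three frames, free them in reverse order, and report whether
   the pool returned to empty. Returns 1 on success. */
int lifo_demo(void)
{
    void *a = frame_alloc(FRAME_SIZE);
    void *b = frame_alloc(FRAME_SIZE);
    void *c = frame_alloc(FRAME_SIZE);
    int distinct = (a != b && b != c && a != c);
    frame_free(c);
    frame_free(b);
    frame_free(a);
    return distinct && pool_top == 0;
}
```

In a program linked against a PCRE built with --disable-stack-for-recursion, functions of this shape would be installed by assigning them to pcre_stack_malloc and pcre_stack_free before matching.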
474 |
|
|
475 |
|
|
476 |
LIMITING PCRE RESOURCE USAGE |
LIMITING PCRE RESOURCE USAGE |
477 |
|
|
478 |
Internally, PCRE has a function called match(), which it calls repeat- |
Internally, PCRE has a function called match(), which it calls repeat- |
479 |
edly (sometimes recursively) when matching a pattern with the |
edly (sometimes recursively) when matching a pattern with the |
480 |
pcre_exec() function. By controlling the maximum number of times this |
pcre_exec() function. By controlling the maximum number of times this |
481 |
function may be called during a single matching operation, a limit can |
function may be called during a single matching operation, a limit can |
482 |
be placed on the resources used by a single call to pcre_exec(). The |
be placed on the resources used by a single call to pcre_exec(). The |
483 |
limit can be changed at run time, as described in the pcreapi documen- |
limit can be changed at run time, as described in the pcreapi documen- |
484 |
tation. The default is 10 million, but this can be changed by adding a |
tation. The default is 10 million, but this can be changed by adding a |
485 |
setting such as |
setting such as |
486 |
|
|
487 |
--with-match-limit=500000 |
--with-match-limit=500000 |
488 |
|
|
489 |
to the configure command. This setting has no effect on the |
to the configure command. This setting has no effect on the |
490 |
pcre_dfa_exec() matching function. |
pcre_dfa_exec() matching function. |
491 |
|
|
492 |
In some environments it is desirable to limit the depth of recursive |
In some environments it is desirable to limit the depth of recursive |
493 |
calls of match() more strictly than the total number of calls, in order |
calls of match() more strictly than the total number of calls, in order |
494 |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
495 |
for-recursion is specified) that is used. A second limit controls this; |
for-recursion is specified) that is used. A second limit controls this; |
496 |
it defaults to the value that is set for --with-match-limit, which |
it defaults to the value that is set for --with-match-limit, which |
497 |
imposes no additional constraints. However, you can set a lower limit |
imposes no additional constraints. However, you can set a lower limit |
498 |
by adding, for example, |
by adding, for example, |
499 |
|
|
500 |
--with-match-limit-recursion=10000 |
--with-match-limit-recursion=10000 |
501 |
|
|
502 |
to the configure command. This value can also be overridden at run |
to the configure command. This value can also be overridden at run |
503 |
time. |
time. |
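The two limits can be illustrated with a toy backtracking matcher. This is a sketch, not PCRE's match() function: it supports only literal characters, '.', and '*', is anchored at the start of the text, and uses hypothetical result codes. One counter caps the total number of calls, and a depth argument caps the recursion.

```c
/* Hypothetical result codes for the sketch. */
enum { NO_MATCH = 0, MATCHED = 1, ERR_MATCHLIMIT = -1, ERR_RECURSIONLIMIT = -2 };

struct limits {
    unsigned long calls;      /* calls made so far */
    unsigned long max_calls;  /* plays the role of the match limit */
    unsigned long max_depth;  /* plays the role of the recursion limit */
};

static int match_here(const char *re, const char *text,
                      struct limits *lim, unsigned long depth)
{
    if (++lim->calls > lim->max_calls) return ERR_MATCHLIMIT;
    if (depth > lim->max_depth) return ERR_RECURSIONLIMIT;

    if (re[0] == '\0') return MATCHED;

    if (re[1] == '*') {
        /* Try the rest of the pattern after zero, one, two, ... repeats. */
        const char *t = text;
        for (;;) {
            int rc = match_here(re + 2, t, lim, depth + 1);
            if (rc != NO_MATCH) return rc;      /* a match or a limit error */
            if (*t == '\0' || (re[0] != '.' && re[0] != *t))
                return NO_MATCH;
            t++;
        }
    }

    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return match_here(re + 1, text + 1, lim, depth + 1);
    return NO_MATCH;
}

int limited_match(const char *re, const char *text,
                  unsigned long max_calls, unsigned long max_depth)
{
    struct limits lim = { 0, max_calls, max_depth };
    return match_here(re, text, &lim, 1);
}
```

A pattern such as a*a*a*a*b applied to a long run of "a" characters makes many calls at a shallow depth, so it trips the call limit first, whereas a long literal pattern makes few calls but recurses once per character and trips the depth limit.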
504 |
|
|
505 |
|
|
506 |
CREATING CHARACTER TABLES AT BUILD TIME |
CREATING CHARACTER TABLES AT BUILD TIME |
507 |
|
|
508 |
PCRE uses fixed tables for processing characters whose code values are |
PCRE uses fixed tables for processing characters whose code values are |
509 |
less than 256. By default, PCRE is built with a set of tables that are |
less than 256. By default, PCRE is built with a set of tables that are |
510 |
distributed in the file pcre_chartables.c.dist. These tables are for |
distributed in the file pcre_chartables.c.dist. These tables are for |
511 |
ASCII codes only. If you add |
ASCII codes only. If you add |
512 |
|
|
513 |
--enable-rebuild-chartables |
--enable-rebuild-chartables |
514 |
|
|
515 |
to the configure command, the distributed tables are no longer used. |
to the configure command, the distributed tables are no longer used. |
516 |
Instead, a program called dftables is compiled and run. This outputs |
Instead, a program called dftables is compiled and run. This outputs |
517 |
the source for a new set of tables, created in the default locale of your |
the source for a new set of tables, created in the default locale of your |
518 |
C runtime system. (This method of replacing the tables does not work if |
C runtime system. (This method of replacing the tables does not work if |
519 |
you are cross compiling, because dftables is run on the local host. If |
you are cross compiling, because dftables is run on the local host. If |
520 |
you need to create alternative tables when cross compiling, you will |
you need to create alternative tables when cross compiling, you will |
521 |
have to do so "by hand".) |
have to do so "by hand".) |
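What a table-building program does can be sketched with the standard <ctype.h> functions. The code below is not dftables: the flag names and the two-table layout are hypothetical simplifications. It shows how tables for the 256 code values below 256 depend on the locale in force when the program runs, which is why the program has to run on the host whose locale is wanted.

```c
#include <ctype.h>

/* Hypothetical flag bits; the real tables use their own layout. */
enum { CT_SPACE = 0x01, CT_LETTER = 0x02, CT_DIGIT = 0x04, CT_WORD = 0x10 };

/* Fill a lower-casing table and a character-class table for codes 0..255,
   using whatever locale is current (the "C" locale at program start). */
void build_tables(unsigned char lower[256], unsigned char flags[256])
{
    for (int c = 0; c < 256; c++) {
        lower[c] = (unsigned char)tolower(c);
        flags[c] = 0;
        if (isspace(c)) flags[c] |= CT_SPACE;
        if (isalpha(c)) flags[c] |= CT_LETTER;
        if (isdigit(c)) flags[c] |= CT_DIGIT;
        if (isalnum(c) || c == '_') flags[c] |= CT_WORD;
    }
}

/* Helpers: build the tables and look up one code value (0..255). */
int flags_for(int c)
{
    unsigned char lower[256], flags[256];
    build_tables(lower, flags);
    return flags[c & 0xFF];
}

int lower_for(int c)
{
    unsigned char lower[256], flags[256];
    build_tables(lower, flags);
    return lower[c & 0xFF];
}
```

Calling setlocale() before build_tables() would change the classification of codes above 127, which is the effect the rebuilt tables are meant to capture.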
522 |
|
|
523 |
|
|
524 |
USING EBCDIC CODE |
USING EBCDIC CODE |
525 |
|
|
526 |
PCRE assumes by default that it will run in an environment where the |
PCRE assumes by default that it will run in an environment where the |
527 |
character code is ASCII (or Unicode, which is a superset of ASCII). |
character code is ASCII (or Unicode, which is a superset of ASCII). |
528 |
This is the case for most computer operating systems. PCRE can, how- |
This is the case for most computer operating systems. PCRE can, how- |
529 |
ever, be compiled to run in an EBCDIC environment by adding |
ever, be compiled to run in an EBCDIC environment by adding |
530 |
|
|
531 |
--enable-ebcdic |
--enable-ebcdic |
532 |
|
|
533 |
to the configure command. This setting implies --enable-rebuild-charta- |
to the configure command. This setting implies --enable-rebuild-charta- |
534 |
bles. You should only use it if you know that you are in an EBCDIC |
bles. You should only use it if you know that you are in an EBCDIC |
535 |
environment (for example, an IBM mainframe operating system). The |
environment (for example, an IBM mainframe operating system). The |
536 |
--enable-ebcdic option is incompatible with --enable-utf8. |
--enable-ebcdic option is incompatible with --enable-utf8. |
537 |
|
|
538 |
|
|
546 |
--enable-pcregrep-libbz2 |
--enable-pcregrep-libbz2 |
547 |
|
|
548 |
to the configure command. These options naturally require that the rel- |
to the configure command. These options naturally require that the rel- |
549 |
evant libraries are installed on your system. Configuration will fail |
evant libraries are installed on your system. Configuration will fail |
550 |
if they are not. |
if they are not. |
551 |
|
|
552 |
|
|
556 |
|
|
557 |
--enable-pcretest-libreadline |
--enable-pcretest-libreadline |
558 |
|
|
559 |
to the configure command, pcretest is linked with the libreadline |
to the configure command, pcretest is linked with the libreadline |
560 |
library, and when its input is from a terminal, it reads it using the |
library, and when its input is from a terminal, it reads it using the |
561 |
readline() function. This provides line-editing and history facilities. |
readline() function. This provides line-editing and history facilities. |
562 |
Note that libreadline is GPL-licenced, so if you distribute a binary of |
Note that libreadline is GPL-licenced, so if you distribute a binary of |
563 |
pcretest linked in this way, there may be licensing issues. |
pcretest linked in this way, there may be licensing issues. |
564 |
|
|
565 |
Setting this option causes the -lreadline option to be added to the |
Setting this option causes the -lreadline option to be added to the |
566 |
pcretest build. In many operating environments with a system-installed |
pcretest build. In many operating environments with a system-installed |
567 |
libreadline this is sufficient. However, in some environments (e.g. if |
libreadline this is sufficient. However, in some environments (e.g. if |
568 |
an unmodified distribution version of readline is in use), some extra |
an unmodified distribution version of readline is in use), some extra |
569 |
configuration may be necessary. The INSTALL file for libreadline says |
configuration may be necessary. The INSTALL file for libreadline says |
570 |
this: |
this: |
571 |
|
|
572 |
"Readline uses the termcap functions, but does not link with the |
"Readline uses the termcap functions, but does not link with the |
573 |
termcap or curses library itself, allowing applications which link |
termcap or curses library itself, allowing applications which link |
574 |
with readline the to choose an appropriate library." |
with readline the to choose an appropriate library." |
575 |
|
|
576 |
If your environment has not been set up so that an appropriate library |
If your environment has not been set up so that an appropriate library |
577 |
is automatically included, you may need to add something like |
is automatically included, you may need to add something like |
578 |
|
|
579 |
LIBS="-lncurses" |
LIBS="-lncurses" |
595 |
|
|
596 |
REVISION |
REVISION |
597 |
|
|
598 |
Last updated: 17 March 2009 |
Last updated: 06 September 2009 |
599 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
600 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
601 |
|
|
701 |
at the fourth character of the subject. The algorithm does not automat- |
at the fourth character of the subject. The algorithm does not automat- |
702 |
ically move on to find matches that start at later positions. |
ically move on to find matches that start at later positions. |
703 |
|
|
704 |
|
Although the general principle of this matching algorithm is that it |
705 |
|
scans the subject string only once, without backtracking, there is one |
706 |
|
exception: when a lookbehind assertion is encountered, the preceding |
707 |
|
characters have to be re-inspected. |
708 |
|
|
709 |
There are a number of features of PCRE regular expressions that are not |
There are a number of features of PCRE regular expressions that are not |
710 |
supported by the alternative matching algorithm. They are as follows: |
supported by the alternative matching algorithm. They are as follows: |
711 |
|
|
712 |
1. Because the algorithm finds all possible matches, the greedy or |
1. Because the algorithm finds all possible matches, the greedy or |
713 |
ungreedy nature of repetition quantifiers is not relevant. Greedy and |
ungreedy nature of repetition quantifiers is not relevant. Greedy and |
714 |
ungreedy quantifiers are treated in exactly the same way. However, pos- |
ungreedy quantifiers are treated in exactly the same way. However, pos- |
715 |
sessive quantifiers can make a difference when what follows could also |
sessive quantifiers can make a difference when what follows could also |
716 |
match what is quantified, for example in a pattern like this: |
match what is quantified, for example in a pattern like this: |
717 |
|
|
718 |
^a++\w! |
^a++\w! |
719 |
|
|
720 |
This pattern matches "aaab!" but not "aaa!", which would be matched by |
This pattern matches "aaab!" but not "aaa!", which would be matched by |
721 |
a non-possessive quantifier. Similarly, if an atomic group is present, |
a non-possessive quantifier. Similarly, if an atomic group is present, |
722 |
it is matched as if it were a standalone pattern at the current point, |
it is matched as if it were a standalone pattern at the current point, |
723 |
and the longest match is then "locked in" for the rest of the overall |
and the longest match is then "locked in" for the rest of the overall |
724 |
pattern. |
pattern. |
725 |
|
|
726 |
2. When dealing with multiple paths through the tree simultaneously, it |
2. When dealing with multiple paths through the tree simultaneously, it |
727 |
is not straightforward to keep track of captured substrings for the |
is not straightforward to keep track of captured substrings for the |
728 |
different matching possibilities, and PCRE's implementation of this |
different matching possibilities, and PCRE's implementation of this |
729 |
algorithm does not attempt to do this. This means that no captured sub- |
algorithm does not attempt to do this. This means that no captured sub- |
730 |
strings are available. |
strings are available. |
731 |
|
|
732 |
3. Because no substrings are captured, back references within the pat- |
3. Because no substrings are captured, back references within the pat- |
733 |
tern are not supported, and cause errors if encountered. |
tern are not supported, and cause errors if encountered. |
734 |
|
|
735 |
4. For the same reason, conditional expressions that use a backrefer- |
4. For the same reason, conditional expressions that use a backrefer- |
736 |
ence as the condition or test for a specific group recursion are not |
ence as the condition or test for a specific group recursion are not |
737 |
supported. |
supported. |
738 |
|
|
739 |
5. Because many paths through the tree may be active, the \K escape |
5. Because many paths through the tree may be active, the \K escape |
740 |
sequence, which resets the start of the match when encountered (but may |
sequence, which resets the start of the match when encountered (but may |
741 |
be on some paths and not on others), is not supported. It causes an |
be on some paths and not on others), is not supported. It causes an |
742 |
error if encountered. |
error if encountered. |
743 |
|
|
744 |
6. Callouts are supported, but the value of the capture_top field is |
6. Callouts are supported, but the value of the capture_top field is |
745 |
always 1, and the value of the capture_last field is always -1. |
always 1, and the value of the capture_last field is always -1. |
746 |
|
|
747 |
7. The \C escape sequence, which (in the standard algorithm) matches a |
7. The \C escape sequence, which (in the standard algorithm) matches a |
748 |
single byte, even in UTF-8 mode, is not supported because the alterna- |
single byte, even in UTF-8 mode, is not supported because the alterna- |
749 |
tive algorithm moves through the subject string one character at a |
tive algorithm moves through the subject string one character at a |
750 |
time, for all active paths through the tree. |
time, for all active paths through the tree. |
751 |
|
|
752 |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
753 |
are not supported. (*FAIL) is supported, and behaves like a failing |
are not supported. (*FAIL) is supported, and behaves like a failing |
754 |
negative assertion. |
negative assertion. |
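The possessive-quantifier example in item 1 above can be checked by hand with two small functions. They are hard-coded sketches of the patterns ^a+\w! and ^a++\w!, not a general matcher, and the function names are hypothetical. The first may give back "a" characters so that \w can match one of them; the second never gives any back.

```c
#include <ctype.h>
#include <stddef.h>

static int is_word(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* ^a+\w! : after taking all the a's, give them back one at a time
   until \w and ! can match. */
int match_greedy(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a') n++;            /* a+ takes as many as it can */
    for (size_t k = n; k >= 1; k--)     /* k = number of a's kept by a+ */
        if (is_word(s[k]) && s[k + 1] == '!')
            return 1;
    return 0;
}

/* ^a++\w! : the a's that were taken are never given back. */
int match_possessive(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a') n++;
    if (n == 0) return 0;               /* a++ needs at least one a */
    return is_word(s[n]) && s[n + 1] == '!';
}
```

For "aaab!" both functions succeed. For "aaa!" only match_greedy() succeeds, because it lets a+ keep two characters so that \w can match the third "a".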
755 |
|
|
756 |
|
|
757 |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
758 |
|
|
759 |
Using the alternative matching algorithm provides the following advan- |
Using the alternative matching algorithm provides the following advan- |
760 |
tages: |
tages: |
761 |
|
|
762 |
1. All possible matches (at a single point in the subject) are automat- |
1. All possible matches (at a single point in the subject) are automat- |
763 |
ically found, and in particular, the longest match is found. To find |
ically found, and in particular, the longest match is found. To find |
764 |
more than one match using the standard algorithm, you have to do kludgy |
more than one match using the standard algorithm, you have to do kludgy |
765 |
things with callouts. |
things with callouts. |
766 |
|
|
767 |
2. Because the alternative algorithm scans the subject string just |
2. Because the alternative algorithm scans the subject string just |
768 |
once, and never needs to backtrack, it is possible to pass very long |
once, and never needs to backtrack, it is possible to pass very long |
769 |
subject strings to the matching function in several pieces, checking |
subject strings to the matching function in several pieces, checking |
770 |
for partial matching each time. |
for partial matching each time. |
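Both advantages can be seen in a toy scanner that follows several literal alternatives in parallel from one starting point. This is a sketch of the general idea only, not the pcre_dfa_exec() implementation: the struct, the function names, and the fixed alternatives cat|cater|caterpillar are hypothetical. Each subject character is examined once, every complete match is recorded, and the subject can be supplied in pieces.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ALTS 8

struct scanner {
    const char *alts[MAX_ALTS];  /* literal alternatives */
    int alive[MAX_ALTS];         /* 1 while an alternative still fits the input */
    int nalts;
    size_t consumed;             /* subject characters seen so far */
    size_t longest;              /* longest complete match, 0 if none */
    int nmatches;                /* number of complete matches found */
};

void scanner_init(struct scanner *sc, const char *const *alts, int nalts)
{
    sc->nalts = nalts > MAX_ALTS ? MAX_ALTS : nalts;
    for (int i = 0; i < sc->nalts; i++) {
        sc->alts[i] = alts[i];
        sc->alive[i] = alts[i][0] != '\0';
    }
    sc->consumed = 0;
    sc->longest = 0;
    sc->nmatches = 0;
}

/* Feed the next piece of the subject. Returns the number of alternatives
   still alive; nonzero means more input could extend a partial match. */
int scanner_feed(struct scanner *sc, const char *piece, size_t len)
{
    for (size_t j = 0; j < len; j++) {
        for (int i = 0; i < sc->nalts; i++) {
            if (!sc->alive[i]) continue;
            if (sc->alts[i][sc->consumed] != piece[j]) {
                sc->alive[i] = 0;                    /* this path fails */
            } else if (sc->alts[i][sc->consumed + 1] == '\0') {
                sc->alive[i] = 0;                    /* this path completes */
                sc->nmatches++;
                sc->longest = sc->consumed + 1;
            }
        }
        sc->consumed++;
    }
    int live = 0;
    for (int i = 0; i < sc->nalts; i++)
        live += sc->alive[i];
    return live;
}

/* Helper: match cat|cater|caterpillar against p1 followed by p2, feeding
   the two pieces separately, and return the longest match length. */
size_t longest_in_two_pieces(const char *p1, const char *p2)
{
    static const char *const alts[] = { "cat", "cater", "caterpillar" };
    struct scanner sc;
    scanner_init(&sc, alts, 3);
    scanner_feed(&sc, p1, strlen(p1));
    scanner_feed(&sc, p2, strlen(p2));
    return sc.longest;
}
```

Feeding "cate" and then "rpillars" finds all three matches and reports 11 as the longest, even though the split falls in the middle of two of the alternatives.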
771 |
|
|
772 |
|
|
774 |
|
|
775 |
The alternative algorithm suffers from a number of disadvantages: |
The alternative algorithm suffers from a number of disadvantages: |
776 |
|
|
777 |
1. It is substantially slower than the standard algorithm. This is |
1. It is substantially slower than the standard algorithm. This is |
778 |
partly because it has to search for all possible matches, but is also |
partly because it has to search for all possible matches, but is also |
779 |
because it is less susceptible to optimization. |
because it is less susceptible to optimization. |
780 |
|
|
781 |
2. Capturing parentheses and back references are not supported. |
2. Capturing parentheses and back references are not supported. |
793 |
|
|
794 |
REVISION |
REVISION |
795 |
|
|
796 |
Last updated: 25 August 2009 |
Last updated: 05 September 2009 |
797 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
798 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
799 |
|
|
912 |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
913 |
ble, is also provided. This uses a different algorithm for the match- |
ble, is also provided. This uses a different algorithm for the match- |
914 |
ing. The alternative algorithm finds all possible matches (at a given |
ing. The alternative algorithm finds all possible matches (at a given |
915 |
point in the subject), and scans the subject just once. However, this |
point in the subject), and scans the subject just once (unless there |
916 |
algorithm does not return captured substrings. A description of the two |
are lookbehind assertions). However, this algorithm does not return |
917 |
matching algorithms and their advantages and disadvantages is given in |
captured substrings. A description of the two matching algorithms and |
918 |
the pcrematching documentation. |
their advantages and disadvantages is given in the pcrematching docu- |
919 |
|
mentation. |
920 |
|
|
921 |
In addition to the main compiling and matching functions, there are |
In addition to the main compiling and matching functions, there are |
922 |
convenience functions for extracting captured substrings from a subject |
convenience functions for extracting captured substrings from a subject |
923 |
string that is matched by pcre_exec(). They are: |
string that is matched by pcre_exec(). They are: |
924 |
|
|
933 |
pcre_free_substring() and pcre_free_substring_list() are also provided, |
pcre_free_substring() and pcre_free_substring_list() are also provided, |
934 |
to free the memory used for extracted strings. |
to free the memory used for extracted strings. |
935 |
|
|
936 |
The function pcre_maketables() is used to build a set of character |
The function pcre_maketables() is used to build a set of character |
937 |
tables in the current locale for passing to pcre_compile(), |
tables in the current locale for passing to pcre_compile(), |
938 |
pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is |
pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is |
939 |
provided for specialist use. Most commonly, no special tables are |
provided for specialist use. Most commonly, no special tables are |
940 |
passed, in which case internal tables that are generated when PCRE is |
passed, in which case internal tables that are generated when PCRE is |
941 |
built are used. |
built are used. |
942 |
|
|
943 |
The function pcre_fullinfo() is used to find out information about a |
The function pcre_fullinfo() is used to find out information about a |
944 |
compiled pattern; pcre_info() is an obsolete version that returns only |
compiled pattern; pcre_info() is an obsolete version that returns only |
945 |
some of the available information, but is retained for backwards com- |
some of the available information, but is retained for backwards com- |
946 |
patibility. The function pcre_version() returns a pointer to a string |
patibility. The function pcre_version() returns a pointer to a string |
947 |
containing the version of PCRE and its date of release. |
containing the version of PCRE and its date of release. |
948 |
|
|
949 |
The function pcre_refcount() maintains a reference count in a data |
The function pcre_refcount() maintains a reference count in a data |
950 |
block containing a compiled pattern. This is provided for the benefit |
block containing a compiled pattern. This is provided for the benefit |
951 |
of object-oriented applications. |
of object-oriented applications. |
952 |
|
|
953 |
The global variables pcre_malloc and pcre_free initially contain the |
The global variables pcre_malloc and pcre_free initially contain the |
954 |
entry points of the standard malloc() and free() functions, respec- |
entry points of the standard malloc() and free() functions, respec- |
955 |
tively. PCRE calls the memory management functions via these variables, |
tively. PCRE calls the memory management functions via these variables, |
956 |
so a calling program can replace them if it wishes to intercept the |
so a calling program can replace them if it wishes to intercept the |
957 |
calls. This should be done before calling any PCRE functions. |
calls. This should be done before calling any PCRE functions. |
958 |
|
|
959 |
The global variables pcre_stack_malloc and pcre_stack_free are also |
The global variables pcre_stack_malloc and pcre_stack_free are also |
960 |
indirections to memory management functions. These special functions |
indirections to memory management functions. These special functions |
961 |
are used only when PCRE is compiled to use the heap for remembering |
are used only when PCRE is compiled to use the heap for remembering |
962 |
data, instead of recursive function calls, when running the pcre_exec() |
data, instead of recursive function calls, when running the pcre_exec() |
963 |
function. See the pcrebuild documentation for details of how to do |
function. See the pcrebuild documentation for details of how to do |
964 |
this. It is a non-standard way of building PCRE, for use in environ- |
this. It is a non-standard way of building PCRE, for use in environ- |
965 |
ments that have limited stacks. Because of the greater use of memory |
ments that have limited stacks. Because of the greater use of memory |
966 |
management, it runs more slowly. Separate functions are provided so |
management, it runs more slowly. Separate functions are provided so |
967 |
that special-purpose external code can be used for this case. When |
that special-purpose external code can be used for this case. When |
968 |
used, these functions are always called in a stack-like manner (last |
used, these functions are always called in a stack-like manner (last |
969 |
obtained, first freed), and always for memory blocks of the same size. |
obtained, first freed), and always for memory blocks of the same size. |
970 |
There is a discussion about PCRE's stack usage in the pcrestack docu- |
There is a discussion about PCRE's stack usage in the pcrestack docu- |
971 |
mentation. |
mentation. |
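
The stack-like, same-size behaviour described above is what makes a very
simple allocator possible. The following is a minimal sketch, not part of
PCRE: the names frame_alloc and frame_free are hypothetical, and the code
assumes a single thread and that every request really is for the same size.
It keeps freed blocks on a list and hands the most recently freed one back
first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper names; nothing here is part of the PCRE API. */
typedef struct frame_node {
    struct frame_node *next;
} frame_node;

static frame_node *free_list = NULL;  /* released blocks, newest first */
static size_t frame_size = 0;         /* the one block size seen so far */

/* Same shape as malloc(), so it could be assigned to pcre_stack_malloc. */
void *frame_alloc(size_t size)
{
    if (size < sizeof(frame_node)) size = sizeof(frame_node);
    if (frame_size == 0) frame_size = size;
    assert(size == frame_size);       /* relies on "always the same size" */
    if (free_list != NULL) {
        frame_node *n = free_list;    /* reuse the most recently freed block */
        free_list = n->next;
        return n;
    }
    return malloc(size);
}

/* Same shape as free(), so it could be assigned to pcre_stack_free. */
void frame_free(void *block)
{
    frame_node *n = block;
    if (n == NULL) return;
    n->next = free_list;              /* keep the block for later reuse */
    free_list = n;
}
```

Because the pointers are shared by all threads, a multithreaded program
would need to add locking or per-thread lists to a sketch like this.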
972 |
|
|
973 |
The global variable pcre_callout initially contains NULL. It can be set |
The global variable pcre_callout initially contains NULL. It can be set |
974 |
by the caller to a "callout" function, which PCRE will then call at |
by the caller to a "callout" function, which PCRE will then call at |
975 |
specified points during a matching operation. Details are given in the |
specified points during a matching operation. Details are given in the |
976 |
pcrecallout documentation. |
pcrecallout documentation. |
977 |
|
|
978 |
|
|
979 |
NEWLINES |
NEWLINES |
980 |
|
|
981 |
PCRE supports five different conventions for indicating line breaks in |
PCRE supports five different conventions for indicating line breaks in |
982 |
strings: a single CR (carriage return) character, a single LF (line- |
strings: a single CR (carriage return) character, a single LF (line- |
983 |
feed) character, the two-character sequence CRLF, any of the three pre- |
feed) character, the two-character sequence CRLF, any of the three pre- |
984 |
ceding, or any Unicode newline sequence. The Unicode newline sequences |
ceding, or any Unicode newline sequence. The Unicode newline sequences |
985 |
are the three just mentioned, plus the single characters VT (vertical |
are the three just mentioned, plus the single characters VT (vertical |
986 |
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line |
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line |
987 |
separator, U+2028), and PS (paragraph separator, U+2029). |
separator, U+2028), and PS (paragraph separator, U+2029). |
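
As an illustration of the "any of the three preceding" convention, the
sketch below measures the line break at a given position in a byte string.
It is a hypothetical helper (anycrlf_len is not a PCRE function) and it
deliberately ignores the additional Unicode sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the line break that starts at
   s[pos] when CR, LF, and CRLF are all recognized; 0 if there is none. */
size_t anycrlf_len(const char *s, size_t len, size_t pos)
{
    assert(pos <= len);
    if (pos == len) return 0;                 /* end of subject */
    if (s[pos] == '\r')                       /* CR, possibly followed by LF */
        return (pos + 1 < len && s[pos + 1] == '\n') ? 2 : 1;
    return s[pos] == '\n' ? 1 : 0;            /* lone LF, or no line break */
}
```

Treating CRLF as one two-character unit rather than two separate line
breaks is the detail that matters when a match position is advanced past
a newline.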
988 |
|
|
989 |
Each of the first three conventions is used by at least one operating |
Each of the first three conventions is used by at least one operating |
990 |
system as its standard newline sequence. When PCRE is built, a default |
system as its standard newline sequence. When PCRE is built, a default |
991 |
can be specified. If none is given, the default is LF, the Unix stan- |
can be specified. If none is given, the default is LF, the Unix stan- |
992 |
dard. When PCRE is run, the default can be overridden, either when a |
dard. When PCRE is run, the default can be overridden, either when a |
993 |
pattern is compiled, or when it is matched. |
pattern is compiled, or when it is matched. |
994 |
|
|
995 |
At compile time, the newline convention can be specified by the options |
At compile time, the newline convention can be specified by the options |
996 |
argument of pcre_compile(), or it can be specified by special text at |
argument of pcre_compile(), or it can be specified by special text at |
997 |
the start of the pattern itself; this overrides any other settings. See |
the start of the pattern itself; this overrides any other settings. See |
998 |
the pcrepattern page for details of the special character sequences. |
the pcrepattern page for details of the special character sequences. |
999 |
|
|
1000 |
In the PCRE documentation the word "newline" is used to mean "the char- |
In the PCRE documentation the word "newline" is used to mean "the char- |
1001 |
acter or pair of characters that indicate a line break". The choice of |
acter or pair of characters that indicate a line break". The choice of |
1002 |
newline convention affects the handling of the dot, circumflex, and |
newline convention affects the handling of the dot, circumflex, and |
1003 |
dollar metacharacters, the handling of #-comments in /x mode, and, when |
dollar metacharacters, the handling of #-comments in /x mode, and, when |
1004 |
CRLF is a recognized line ending sequence, the match position advance- |
CRLF is a recognized line ending sequence, the match position advance- |
1005 |
ment for a non-anchored pattern. There is more detail about this in the |
ment for a non-anchored pattern. There is more detail about this in the |
1006 |
section on pcre_exec() options below. |
section on pcre_exec() options below. |
1007 |
|
|
1008 |
The choice of newline convention does not affect the interpretation of |
The choice of newline convention does not affect the interpretation of |
1009 |
the \n or \r escape sequences, nor does it affect what \R matches, |
the \n or \r escape sequences, nor does it affect what \R matches, |
1010 |
which is controlled in a similar way, but by separate options. |
which is controlled in a similar way, but by separate options. |
1011 |
|
|
1012 |
|
|
1013 |
MULTITHREADING |
MULTITHREADING |
1014 |
|
|
1015 |
The PCRE functions can be used in multi-threading applications, with |
The PCRE functions can be used in multi-threading applications, with |
1016 |
the proviso that the memory management functions pointed to by |
the proviso that the memory management functions pointed to by |
1017 |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
1018 |
callout function pointed to by pcre_callout, are shared by all threads. |
callout function pointed to by pcre_callout, are shared by all threads. |
1019 |
|
|
1020 |
The compiled form of a regular expression is not altered during match- |
The compiled form of a regular expression is not altered during match- |
1021 |
ing, so the same compiled pattern can safely be used by several threads |
ing, so the same compiled pattern can safely be used by several threads |
1022 |
at once. |
at once. |
1023 |
|
|
1025 |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
1026 |
|
|
1027 |
The compiled form of a regular expression can be saved and re-used at a |
The compiled form of a regular expression can be saved and re-used at a |
1028 |
later time, possibly by a different program, and even on a host other |
later time, possibly by a different program, and even on a host other |
1029 |
than the one on which it was compiled. Details are given in the |
than the one on which it was compiled. Details are given in the |
1030 |
pcreprecompile documentation. However, compiling a regular expression |
pcreprecompile documentation. However, compiling a regular expression |
1031 |
with one version of PCRE for use with a different version is not guar- |
with one version of PCRE for use with a different version is not guar- |
1032 |
anteed to work and may cause crashes. |
anteed to work and may cause crashes. |
1033 |
|
|
1034 |
|
|
1036 |
|
|
1037 |
int pcre_config(int what, void *where); |
int pcre_config(int what, void *where); |
1038 |
|
|
1039 |
The function pcre_config() makes it possible for a PCRE client to dis- |
The function pcre_config() makes it possible for a PCRE client to dis- |
1040 |
cover which optional features have been compiled into the PCRE library. |
cover which optional features have been compiled into the PCRE library. |
1041 |
The pcrebuild documentation has more details about these optional fea- |
The pcrebuild documentation has more details about these optional fea- |
1042 |
tures. |
tures. |
1043 |
|
|
1044 |
The first argument for pcre_config() is an integer, specifying which |
The first argument for pcre_config() is an integer, specifying which |
1045 |
information is required; the second argument is a pointer to a variable |
information is required; the second argument is a pointer to a variable |
1046 |
into which the information is placed. The following information is |
into which the information is placed. The following information is |
1047 |
available: |
available: |
1048 |
|
|
1049 |
PCRE_CONFIG_UTF8 |
PCRE_CONFIG_UTF8 |
1050 |
|
|
1051 |
The output is an integer that is set to one if UTF-8 support is avail- |
The output is an integer that is set to one if UTF-8 support is avail- |
1052 |
able; otherwise it is set to zero. |
able; otherwise it is set to zero. |
1053 |
|
|
1054 |
PCRE_CONFIG_UNICODE_PROPERTIES |
PCRE_CONFIG_UNICODE_PROPERTIES |
1055 |
|
|
1056 |
The output is an integer that is set to one if support for Unicode |
The output is an integer that is set to one if support for Unicode |
1057 |
character properties is available; otherwise it is set to zero. |
character properties is available; otherwise it is set to zero. |
1058 |
|
|
1059 |
PCRE_CONFIG_NEWLINE |
PCRE_CONFIG_NEWLINE |
1060 |
|
|
1061 |
The output is an integer whose value specifies the default character |
The output is an integer whose value specifies the default character |
1062 |
sequence that is recognized as meaning "newline". The five values that |
sequence that is recognized as meaning "newline". The five values that |
1063 |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, |
1064 |
and -1 for ANY. Though they are derived from ASCII, the same values |
and -1 for ANY. Though they are derived from ASCII, the same values |
1065 |
are returned in EBCDIC environments. The default should normally corre- |
are returned in EBCDIC environments. The default should normally corre- |
1066 |
spond to the standard sequence for your operating system. |
spond to the standard sequence for your operating system. |
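
The values listed above can be read as ASCII codes: 10 is LF, 13 is CR, and
3338 equals 13*256 + 10, that is, CR followed by LF packed into one integer.
The sketch below classifies a value on that assumption. The enum and both
function names are hypothetical and are not defined by pcre.h.

```c
#include <assert.h>

/* Hypothetical names for the documented newline values. */
typedef enum {
    NL_LF, NL_CR, NL_CRLF, NL_ANYCRLF, NL_ANY, NL_UNKNOWN
} newline_kind;

/* Pack a two-character sequence the way 3338 packs CR and LF. */
int pack_newline(int first, int second)
{
    assert(first > 0 && first < 256 && second > 0 && second < 256);
    return (first << 8) | second;
}

/* Map an integer, such as one obtained for PCRE_CONFIG_NEWLINE,
   to one of the documented conventions. */
newline_kind classify_newline(int value)
{
    switch (value) {
    case 10:   return NL_LF;
    case 13:   return NL_CR;
    case 3338: return NL_CRLF;      /* (13 << 8) | 10 */
    case -2:   return NL_ANYCRLF;
    case -1:   return NL_ANY;
    default:   return NL_UNKNOWN;
    }
}
```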
1067 |
|
|
1068 |
PCRE_CONFIG_BSR |
PCRE_CONFIG_BSR |
1069 |
|
|
1070 |
The output is an integer whose value indicates what character sequences |
The output is an integer whose value indicates what character sequences |
1071 |
the \R escape sequence matches by default. A value of 0 means that \R |
the \R escape sequence matches by default. A value of 0 means that \R |
1072 |
matches any Unicode line ending sequence; a value of 1 means that \R |
matches any Unicode line ending sequence; a value of 1 means that \R |
1073 |
matches only CR, LF, or CRLF. The default can be overridden when a pat- |
matches only CR, LF, or CRLF. The default can be overridden when a pat- |
1074 |
tern is compiled or matched. |
tern is compiled or matched. |
1075 |
|
|
1076 |
PCRE_CONFIG_LINK_SIZE |
PCRE_CONFIG_LINK_SIZE |
1077 |
|
|
1078 |
The output is an integer that contains the number of bytes used for |
The output is an integer that contains the number of bytes used for |
1079 |
internal linkage in compiled regular expressions. The value is 2, 3, or |
internal linkage in compiled regular expressions. The value is 2, 3, or |
1080 |
4. Larger values allow larger regular expressions to be compiled, at |
4. Larger values allow larger regular expressions to be compiled, at |
1081 |
the expense of slower matching. The default value of 2 is sufficient |
the expense of slower matching. The default value of 2 is sufficient |
1082 |
for all but the most massive patterns, since it allows the compiled |
for all but the most massive patterns, since it allows the compiled |
1083 |
pattern to be up to 64K in size. |
pattern to be up to 64K in size. |
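
The 64K figure follows from the width of the offsets: two bytes can
distinguish 2^16 = 65536 positions. The sketch below does the same
arithmetic for the other link sizes. The function name is hypothetical,
and the results are only what the offset width could address; the limits
PCRE actually enforces for link sizes 3 and 4 may be lower.

```c
#include <assert.h>

/* Hypothetical helper: number of distinct offsets that fit in
   link_size bytes (an upper bound, not a documented PCRE limit). */
unsigned long long link_capacity(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return 1ULL << (8 * link_size);
}
```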
1084 |
|
|
1085 |
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD |
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD |
1086 |
|
|
1087 |
The output is an integer that contains the threshold above which the |
The output is an integer that contains the threshold above which the |
1088 |
POSIX interface uses malloc() for output vectors. Further details are |
POSIX interface uses malloc() for output vectors. Further details are |
1089 |
given in the pcreposix documentation. |
given in the pcreposix documentation. |
1090 |
|
|
1091 |
PCRE_CONFIG_MATCH_LIMIT |
PCRE_CONFIG_MATCH_LIMIT |
1092 |
|
|
1093 |
The output is a long integer that gives the default limit for the num- |
The output is a long integer that gives the default limit for the num- |
1094 |
ber of internal matching function calls in a pcre_exec() execution. |
ber of internal matching function calls in a pcre_exec() execution. |
1095 |
Further details are given with pcre_exec() below. |
Further details are given with pcre_exec() below. |
1096 |
|
|
1097 |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
1098 |
|
|
1099 |
The output is a long integer that gives the default limit for the depth |
The output is a long integer that gives the default limit for the depth |
1100 |
of recursion when calling the internal matching function in a |
of recursion when calling the internal matching function in a |
1101 |
pcre_exec() execution. Further details are given with pcre_exec() |
pcre_exec() execution. Further details are given with pcre_exec() |
1102 |
below. |
below. |
1103 |
|
|
1104 |
PCRE_CONFIG_STACKRECURSE |
PCRE_CONFIG_STACKRECURSE |
1105 |
|
|
1106 |
The output is an integer that is set to one if internal recursion when |
The output is an integer that is set to one if internal recursion when |
1107 |
running pcre_exec() is implemented by recursive function calls that use |
running pcre_exec() is implemented by recursive function calls that use |
1108 |
the stack to remember their state. This is the usual way that PCRE is |
the stack to remember their state. This is the usual way that PCRE is |
1109 |
compiled. The output is zero if PCRE was compiled to use blocks of data |
compiled. The output is zero if PCRE was compiled to use blocks of data |
1110 |
on the heap instead of recursive function calls. In this case, |
on the heap instead of recursive function calls. In this case, |
1111 |
pcre_stack_malloc and pcre_stack_free are called to manage memory |
pcre_stack_malloc and pcre_stack_free are called to manage memory |
1112 |
blocks on the heap, thus avoiding the use of the stack. |
blocks on the heap, thus avoiding the use of the stack. |
1113 |
|
|
1114 |
|
|
1125 |
|
|
1126 |
Either of the functions pcre_compile() or pcre_compile2() can be called |
Either of the functions pcre_compile() or pcre_compile2() can be called |
1127 |
to compile a pattern into an internal form. The only difference between |
to compile a pattern into an internal form. The only difference between |
1128 |
the two interfaces is that pcre_compile2() has an additional argument, |
the two interfaces is that pcre_compile2() has an additional argument, |
1129 |
errorcodeptr, via which a numerical error code can be returned. |
errorcodeptr, via which a numerical error code can be returned. |
1130 |
|
|
1131 |
The pattern is a C string terminated by a binary zero, and is passed in |
The pattern is a C string terminated by a binary zero, and is passed in |
1132 |
the pattern argument. A pointer to a single block of memory that is |
the pattern argument. A pointer to a single block of memory that is |
1133 |
obtained via pcre_malloc is returned. This contains the compiled code |
obtained via pcre_malloc is returned. This contains the compiled code |
1134 |
and related data. The pcre type is defined for the returned block; this |
and related data. The pcre type is defined for the returned block; this |
1135 |
is a typedef for a structure whose contents are not externally defined. |
is a typedef for a structure whose contents are not externally defined. |
1136 |
It is up to the caller to free the memory (via pcre_free) when it is no |
It is up to the caller to free the memory (via pcre_free) when it is no |
1137 |
longer required. |
longer required. |
1138 |
|
|
1139 |
Although the compiled code of a PCRE regex is relocatable, that is, it |
Although the compiled code of a PCRE regex is relocatable, that is, it |
1140 |
does not depend on memory location, the complete pcre data block is not |
does not depend on memory location, the complete pcre data block is not |
1141 |
fully relocatable, because it may contain a copy of the tableptr argu- |
fully relocatable, because it may contain a copy of the tableptr argu- |
1142 |
ment, which is an address (see below). |
ment, which is an address (see below). |
1143 |
|
|
1144 |
The options argument contains various bit settings that affect the com- |
The options argument contains various bit settings that affect the com- |
1145 |
pilation. It should be zero if no options are required. The available |
pilation. It should be zero if no options are required. The available |
1146 |
options are described below. Some of them (in particular, those that |
options are described below. Some of them (in particular, those that |
1147 |
are compatible with Perl, but also some others) can also be set and |
are compatible with Perl, but also some others) can also be set and |
1148 |
unset from within the pattern (see the detailed description in the |
unset from within the pattern (see the detailed description in the |
1149 |
pcrepattern documentation). For those options that can be different in |
pcrepattern documentation). For those options that can be different in |
1150 |
different parts of the pattern, the contents of the options argument |
different parts of the pattern, the contents of the options argument |
1151 |
specifies their initial settings at the start of compilation and execu- |
specifies their initial settings at the start of compilation and execu- |
1152 |
tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the |
tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the |
1153 |
time of matching as well as at compile time. |
time of matching as well as at compile time. |
1154 |
|
|
1155 |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
1156 |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
1157 |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
1158 |
sage. This is a static string that is part of the library. You must not |
sage. This is a static string that is part of the library. You must not |
1159 |
try to free it. The offset from the start of the pattern to the charac- |
try to free it. The offset from the start of the pattern to the charac- |
1160 |
ter where the error was discovered is placed in the variable pointed to |
ter where the error was discovered is placed in the variable pointed to |
1161 |
by erroffset, which must not be NULL. If it is, an immediate error is |
by erroffset, which must not be NULL. If it is, an immediate error is |
1162 |
given. |
given. |
1163 |
|
|
1164 |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
1165 |
codeptr argument is not NULL, a non-zero error code number is returned |
codeptr argument is not NULL, a non-zero error code number is returned |
1166 |
via this argument in the event of an error. This is in addition to the |
via this argument in the event of an error. This is in addition to the |
1167 |
textual error message. Error codes and messages are listed below. |
textual error message. Error codes and messages are listed below. |
1168 |
|
|
1169 |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
1170 |
character tables that are built when PCRE is compiled, using the |
character tables that are built when PCRE is compiled, using the |
1171 |
default C locale. Otherwise, tableptr must be an address that is the |
default C locale. Otherwise, tableptr must be an address that is the |
1172 |
result of a call to pcre_maketables(). This value is stored with the |
result of a call to pcre_maketables(). This value is stored with the |
1173 |
compiled pattern, and used again by pcre_exec(), unless another table |
compiled pattern, and used again by pcre_exec(), unless another table |
1174 |
pointer is passed to it. For more discussion, see the section on locale |
pointer is passed to it. For more discussion, see the section on locale |
1175 |
support below. |
support below. |
1176 |
|
|
1177 |
This code fragment shows a typical straightforward call to pcre_com- |
This code fragment shows a typical straightforward call to pcre_com- |
1178 |
pile(): |
pile(): |
1179 |
|
|
1180 |
pcre *re; |
pcre *re; |
1187 |
&erroffset, /* for error offset */ |
&erroffset, /* for error offset */ |
1188 |
NULL); /* use default character tables */ |
NULL); /* use default character tables */ |
1189 |
|
|
1190 |
The following names for option bits are defined in the pcre.h header |
The following names for option bits are defined in the pcre.h header |
1191 |
file: |
file: |
1192 |
|
|
1193 |
PCRE_ANCHORED |
PCRE_ANCHORED |
1194 |
|
|
1195 |
If this bit is set, the pattern is forced to be "anchored", that is, it |
If this bit is set, the pattern is forced to be "anchored", that is, it |
1196 |
is constrained to match only at the first matching point in the string |
is constrained to match only at the first matching point in the string |
1197 |
that is being searched (the "subject string"). This effect can also be |
that is being searched (the "subject string"). This effect can also be |
1198 |
achieved by appropriate constructs in the pattern itself, which is the |
achieved by appropriate constructs in the pattern itself, which is the |
1199 |
only way to do it in Perl. |
only way to do it in Perl. |
1200 |
|
|
1201 |
PCRE_AUTO_CALLOUT |
PCRE_AUTO_CALLOUT |
1202 |
|
|
1203 |
If this bit is set, pcre_compile() automatically inserts callout items, |
If this bit is set, pcre_compile() automatically inserts callout items, |
1204 |
all with number 255, before each pattern item. For discussion of the |
all with number 255, before each pattern item. For discussion of the |
1205 |
callout facility, see the pcrecallout documentation. |
callout facility, see the pcrecallout documentation. |
1206 |
|
|
1207 |
PCRE_BSR_ANYCRLF |
PCRE_BSR_ANYCRLF |
1208 |
PCRE_BSR_UNICODE |
PCRE_BSR_UNICODE |
1209 |
|
|
1210 |
These options (which are mutually exclusive) control what the \R escape |
These options (which are mutually exclusive) control what the \R escape |
1211 |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
1212 |
or to match any Unicode newline sequence. The default is specified when |
or to match any Unicode newline sequence. The default is specified when |
1213 |
PCRE is built. It can be overridden from within the pattern, or by set- |
PCRE is built. It can be overridden from within the pattern, or by set- |
1214 |
ting an option when a compiled pattern is matched. |
ting an option when a compiled pattern is matched. |
1215 |
|
|
1216 |
PCRE_CASELESS |
PCRE_CASELESS |
1217 |
|
|
1218 |
If this bit is set, letters in the pattern match both upper and lower |
If this bit is set, letters in the pattern match both upper and lower |
1219 |
case letters. It is equivalent to Perl's /i option, and it can be |
case letters. It is equivalent to Perl's /i option, and it can be |
1220 |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
1221 |
always understands the concept of case for characters whose values are |
always understands the concept of case for characters whose values are |
1222 |
less than 128, so caseless matching is always possible. For characters |
less than 128, so caseless matching is always possible. For characters |
1223 |
with higher values, the concept of case is supported if PCRE is com- |
with higher values, the concept of case is supported if PCRE is com- |
1224 |
piled with Unicode property support, but not otherwise. If you want to |
piled with Unicode property support, but not otherwise. If you want to |
1225 |
use caseless matching for characters 128 and above, you must ensure |
use caseless matching for characters 128 and above, you must ensure |
1226 |
that PCRE is compiled with Unicode property support as well as with |
that PCRE is compiled with Unicode property support as well as with |
1227 |
UTF-8 support. |
UTF-8 support. |
1228 |
|
|
1229 |
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
1230 |
|
|
1231 |
If this bit is set, a dollar metacharacter in the pattern matches only |
If this bit is set, a dollar metacharacter in the pattern matches only |
1232 |
at the end of the subject string. Without this option, a dollar also |
at the end of the subject string. Without this option, a dollar also |
1233 |
matches immediately before a newline at the end of the string (but not |
matches immediately before a newline at the end of the string (but not |
1234 |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
1235 |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
1236 |
Perl, and no way to set it within a pattern. |
Perl, and no way to set it within a pattern. |
1237 |
|
|
1238 |
PCRE_DOTALL |
PCRE_DOTALL |
1239 |
|
|
1240 |
If this bit is set, a dot metacharacter in the pattern matches all char- |
If this bit is set, a dot metacharacter in the pattern matches all char- |
1241 |
acters, including those that indicate newline. Without it, a dot does |
acters, including those that indicate newline. Without it, a dot does |
1242 |
not match when the current position is at a newline. This option is |
not match when the current position is at a newline. This option is |
1243 |
equivalent to Perl's /s option, and it can be changed within a pattern |
equivalent to Perl's /s option, and it can be changed within a pattern |
1244 |
by a (?s) option setting. A negative class such as [^a] always matches |
by a (?s) option setting. A negative class such as [^a] always matches |
1245 |
newline characters, independent of the setting of this option. |
newline characters, independent of the setting of this option. |
1246 |
|
|
1247 |
PCRE_DUPNAMES |
PCRE_DUPNAMES |
1248 |
|
|
1249 |
If this bit is set, names used to identify capturing subpatterns need |
If this bit is set, names used to identify capturing subpatterns need |
1250 |
not be unique. This can be helpful for certain types of pattern when it |
not be unique. This can be helpful for certain types of pattern when it |
1251 |
is known that only one instance of the named subpattern can ever be |
is known that only one instance of the named subpattern can ever be |
1252 |
matched. There are more details of named subpatterns below; see also |
matched. There are more details of named subpatterns below; see also |
1253 |
the pcrepattern documentation. |
the pcrepattern documentation. |
1254 |
|
|
1255 |
PCRE_EXTENDED |
PCRE_EXTENDED |
1256 |
|
|
1257 |
If this bit is set, whitespace data characters in the pattern are |
If this bit is set, whitespace data characters in the pattern are |
1258 |
totally ignored except when escaped or inside a character class. White- |
totally ignored except when escaped or inside a character class. White- |
1259 |
space does not include the VT character (code 11). In addition, charac- |
space does not include the VT character (code 11). In addition, charac- |
1260 |
ters between an unescaped # outside a character class and the next new- |
ters between an unescaped # outside a character class and the next new- |
1261 |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
1262 |
option, and it can be changed within a pattern by a (?x) option set- |
option, and it can be changed within a pattern by a (?x) option set- |
1263 |
ting. |
ting. |
1264 |
|
|
1265 |
This option makes it possible to include comments inside complicated |
This option makes it possible to include comments inside complicated |
1266 |
patterns. Note, however, that this applies only to data characters. |
patterns. Note, however, that this applies only to data characters. |
1267 |
Whitespace characters may never appear within special character |
Whitespace characters may never appear within special character |
1268 |
sequences in a pattern, for example within the sequence (?( which |
sequences in a pattern, for example within the sequence (?( which |
1269 |
introduces a conditional subpattern. |
introduces a conditional subpattern. |
1270 |
|
|
1271 |
PCRE_EXTRA |
PCRE_EXTRA |
1272 |
|
|
1273 |
This option was invented in order to turn on additional functionality |
This option was invented in order to turn on additional functionality |
1274 |
of PCRE that is incompatible with Perl, but it is currently of very |
of PCRE that is incompatible with Perl, but it is currently of very |
1275 |
little use. When set, any backslash in a pattern that is followed by a |
little use. When set, any backslash in a pattern that is followed by a |
1276 |
letter that has no special meaning causes an error, thus reserving |
letter that has no special meaning causes an error, thus reserving |
1277 |
these combinations for future expansion. By default, as in Perl, a |
these combinations for future expansion. By default, as in Perl, a |
1278 |
backslash followed by a letter with no special meaning is treated as a |
backslash followed by a letter with no special meaning is treated as a |
1279 |
literal. (Perl can, however, be persuaded to give a warning for this.) |
literal. (Perl can, however, be persuaded to give a warning for this.) |
1280 |
There are at present no other features controlled by this option. It |
There are at present no other features controlled by this option. It |
1281 |
can also be set by a (?X) option setting within a pattern. |
can also be set by a (?X) option setting within a pattern. |
1282 |
|
|
1283 |
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
1284 |
|
|
1285 |
If this option is set, an unanchored pattern is required to match |
If this option is set, an unanchored pattern is required to match |
1286 |
before or at the first newline in the subject string, though the |
before or at the first newline in the subject string, though the |
1287 |
matched text may continue over the newline. |
matched text may continue over the newline. |
1288 |
|
|
1289 |
PCRE_JAVASCRIPT_COMPAT |
PCRE_JAVASCRIPT_COMPAT |
1290 |
|
|
1291 |
If this option is set, PCRE's behaviour is changed in some ways so that |
If this option is set, PCRE's behaviour is changed in some ways so that |
1292 |
it is compatible with JavaScript rather than Perl. The changes are as |
it is compatible with JavaScript rather than Perl. The changes are as |
1293 |
follows: |
follows: |
1294 |
|
|
1295 |
(1) A lone closing square bracket in a pattern causes a compile-time |
(1) A lone closing square bracket in a pattern causes a compile-time |
1296 |
error, because this is illegal in JavaScript (by default it is treated |
error, because this is illegal in JavaScript (by default it is treated |
1297 |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
1298 |
option is set. |
option is set. |
1299 |
|
|
1300 |
(2) At run time, a back reference to an unset subpattern group matches |
(2) At run time, a back reference to an unset subpattern group matches |
1301 |
an empty string (by default this causes the current matching alterna- |
an empty string (by default this causes the current matching alterna- |
1302 |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
1303 |
set (assuming it can find an "a" in the subject), whereas it fails by |
set (assuming it can find an "a" in the subject), whereas it fails by |
1304 |
default, for Perl compatibility. |
default, for Perl compatibility. |
1305 |
|
|
1306 |
PCRE_MULTILINE |
PCRE_MULTILINE |
1307 |
|
|
1308 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
1309 |
line of characters (even if it actually contains newlines). The "start |
line of characters (even if it actually contains newlines). The "start |
1310 |
of line" metacharacter (^) matches only at the start of the string, |
of line" metacharacter (^) matches only at the start of the string, |
1311 |
while the "end of line" metacharacter ($) matches only at the end of |
while the "end of line" metacharacter ($) matches only at the end of |
1312 |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
1313 |
is set). This is the same as Perl. |
is set). This is the same as Perl. |
1314 |
|
|
1315 |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
1316 |
constructs match immediately following or immediately before internal |
constructs match immediately following or immediately before internal |
1317 |
newlines in the subject string, respectively, as well as at the very |
newlines in the subject string, respectively, as well as at the very |
1318 |
start and end. This is equivalent to Perl's /m option, and it can be |
start and end. This is equivalent to Perl's /m option, and it can be |
1319 |
changed within a pattern by a (?m) option setting. If there are no new- |
changed within a pattern by a (?m) option setting. If there are no new- |
1320 |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
1321 |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
1322 |
|
|
1323 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1326 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
1327 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1328 |
|
|
1329 |
These options override the default newline definition that was chosen |
These options override the default newline definition that was chosen |
1330 |
when PCRE was built. Setting the first or the second specifies that a |
when PCRE was built. Setting the first or the second specifies that a |
1331 |
newline is indicated by a single character (CR or LF, respectively). |
newline is indicated by a single character (CR or LF, respectively). |
1332 |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
1333 |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
1334 |
that any of the three preceding sequences should be recognized. Setting |
that any of the three preceding sequences should be recognized. Setting |
1335 |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
1336 |
recognized. The Unicode newline sequences are the three just mentioned, |
recognized. The Unicode newline sequences are the three just mentioned, |
1337 |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
1338 |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
1339 |
(paragraph separator, U+2029). The last two are recognized only in |
(paragraph separator, U+2029). The last two are recognized only in |
1340 |
UTF-8 mode. |
UTF-8 mode. |
1341 |
|
|
1342 |
The newline setting in the options word uses three bits that are |
The newline setting in the options word uses three bits that are |
1343 |
treated as a number, giving eight possibilities. Currently only six are |
treated as a number, giving eight possibilities. Currently only six are |
1344 |
used (default plus the five values above). This means that if you set |
used (default plus the five values above). This means that if you set |
1345 |
more than one newline option, the combination may or may not be sensi- |
more than one newline option, the combination may or may not be sensi- |
1346 |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
1347 |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
1348 |
cause an error. |
cause an error. |
1349 |
|
|
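The three-bit newline field described above can be sketched in C as
follows. The NL_* names and the newline_setting_name() helper are
hypothetical, and the numeric values are an assumption modelled on the
definitions in pcre.h; check your own header before relying on them.

```c
#include <stddef.h>

/* Hypothetical local copies of the newline option values. The numbers
   are assumed to mirror pcre.h; they are not taken from this text. */
enum {
  NL_CR      = 0x00100000,
  NL_LF      = 0x00200000,
  NL_CRLF    = 0x00300000,
  NL_ANY     = 0x00400000,
  NL_ANYCRLF = 0x00500000,
  NL_MASK    = 0x00700000   /* the three bits treated as one number */
};

/* Return a short name for the newline setting held in an options word,
   or NULL if the three bits form one of the unused combinations. */
const char *newline_setting_name(unsigned long options)
{
  switch (options & NL_MASK) {
    case 0:          return "default";
    case NL_CR:      return "CR";
    case NL_LF:      return "LF";
    case NL_CRLF:    return "CRLF";
    case NL_ANY:     return "ANY";
    case NL_ANYCRLF: return "ANYCRLF";
    default:         return NULL;   /* unused number */
  }
}
```

Under these assumed values, NL_CR | NL_LF yields the same number as
NL_CRLF, whereas NL_LF | NL_ANY yields a number that no option uses, so
the helper returns NULL for it.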
1350 |
The only time that a line break is specially recognized when compiling |
The only time that a line break is specially recognized when compiling |
1351 |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
1352 |
character class is encountered. This indicates a comment that lasts |
character class is encountered. This indicates a comment that lasts |
1353 |
until after the next line break sequence. In other circumstances, line |
until after the next line break sequence. In other circumstances, line |
1354 |
break sequences are treated as literal data, except that in |
break sequences are treated as literal data, except that in |
1355 |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
1356 |
and are therefore ignored. |
and are therefore ignored. |
1357 |
|
|
1361 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
1362 |
|
|
1363 |
If this option is set, it disables the use of numbered capturing paren- |
If this option is set, it disables the use of numbered capturing paren- |
1364 |
theses in the pattern. Any opening parenthesis that is not followed by |
theses in the pattern. Any opening parenthesis that is not followed by |
1365 |
? behaves as if it were followed by ?: but named parentheses can still |
? behaves as if it were followed by ?: but named parentheses can still |
1366 |
be used for capturing (and they acquire numbers in the usual way). |
be used for capturing (and they acquire numbers in the usual way). |
1367 |
There is no equivalent of this option in Perl. |
There is no equivalent of this option in Perl. |
1368 |
|
|
1369 |
PCRE_UNGREEDY |
PCRE_UNGREEDY |
1370 |
|
|
1371 |
This option inverts the "greediness" of the quantifiers so that they |
This option inverts the "greediness" of the quantifiers so that they |
1372 |
are not greedy by default, but become greedy if followed by "?". It is |
are not greedy by default, but become greedy if followed by "?". It is |
1373 |
not compatible with Perl. It can also be set by a (?U) option setting |
not compatible with Perl. It can also be set by a (?U) option setting |
1374 |
within the pattern. |
within the pattern. |
1375 |
|
|
1376 |
PCRE_UTF8 |
PCRE_UTF8 |
1377 |
|
|
1378 |
This option causes PCRE to regard both the pattern and the subject as |
This option causes PCRE to regard both the pattern and the subject as |
1379 |
strings of UTF-8 characters instead of single-byte character strings. |
strings of UTF-8 characters instead of single-byte character strings. |
1380 |
However, it is available only when PCRE is built to include UTF-8 sup- |
However, it is available only when PCRE is built to include UTF-8 sup- |
1381 |
port. If not, the use of this option provokes an error. Details of how |
port. If not, the use of this option provokes an error. Details of how |
1382 |
this option changes the behaviour of PCRE are given in the section on |
this option changes the behaviour of PCRE are given in the section on |
1383 |
UTF-8 support in the main pcre page. |
UTF-8 support in the main pcre page. |
1384 |
|
|
1385 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1386 |
|
|
1387 |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
1388 |
automatically checked. There is a discussion about the validity of |
automatically checked. There is a discussion about the validity of |
1389 |
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
1390 |
bytes is found, pcre_compile() returns an error. If you already know |
bytes is found, pcre_compile() returns an error. If you already know |
1391 |
that your pattern is valid, and you want to skip this check for perfor- |
that your pattern is valid, and you want to skip this check for perfor- |
1392 |
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
1393 |
set, the effect of passing an invalid UTF-8 string as a pattern is |
set, the effect of passing an invalid UTF-8 string as a pattern is |
1394 |
undefined. It may cause your program to crash. Note that this option |
undefined. It may cause your program to crash. Note that this option |
1395 |
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
1396 |
UTF-8 validity checking of subject strings. |
UTF-8 validity checking of subject strings. |
1397 |
|
|
1398 |
|
|
1399 |
COMPILATION ERROR CODES |
COMPILATION ERROR CODES |
1400 |
|
|
1401 |
The following table lists the error codes that may be returned by |
The following table lists the error codes that may be returned by |
1402 |
pcre_compile2(), along with the error messages that may be returned by |
pcre_compile2(), along with the error messages that may be returned by |
1403 |
both compiling functions. As PCRE has developed, some error codes have |
both compiling functions. As PCRE has developed, some error codes have |
1404 |
fallen out of use. To avoid confusion, they have not been re-used. |
fallen out of use. To avoid confusion, they have not been re-used. |
1405 |
|
|
1406 |
0 no error |
0 no error |
1456 |
50 [this code is not in use] |
50 [this code is not in use] |
1457 |
51 octal value is greater than \377 (not in UTF-8 mode) |
51 octal value is greater than \377 (not in UTF-8 mode) |
1458 |
52 internal error: overran compiling workspace |
52 internal error: overran compiling workspace |
1459 |
53 internal error: previously-checked referenced subpattern not |
53 internal error: previously-checked referenced subpattern not |
1460 |
found |
found |
1461 |
54 DEFINE group contains more than one branch |
54 DEFINE group contains more than one branch |
1462 |
55 repeating a DEFINE group is not allowed |
55 repeating a DEFINE group is not allowed |
1471 |
63 digit expected after (?+ |
63 digit expected after (?+ |
1472 |
64 ] is an invalid data character in JavaScript compatibility mode |
64 ] is an invalid data character in JavaScript compatibility mode |
1473 |
|
|
1474 |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
1475 |
values may be used if the limits were changed when PCRE was built. |
values may be used if the limits were changed when PCRE was built. |
1476 |
|
|
1477 |
|
|
1480 |
pcre_extra *pcre_study(const pcre *code, int options, |
pcre_extra *pcre_study(const pcre *code, int options, |
1481 |
const char **errptr); |
const char **errptr); |
1482 |
|
|
1483 |
If a compiled pattern is going to be used several times, it is worth |
If a compiled pattern is going to be used several times, it is worth |
1484 |
spending more time analyzing it in order to speed up the time taken for |
spending more time analyzing it in order to speed up the time taken for |
1485 |
matching. The function pcre_study() takes a pointer to a compiled pat- |
matching. The function pcre_study() takes a pointer to a compiled pat- |
1486 |
tern as its first argument. If studying the pattern produces additional |
tern as its first argument. If studying the pattern produces additional |
1487 |
information that will help speed up matching, pcre_study() returns a |
information that will help speed up matching, pcre_study() returns a |
1488 |
pointer to a pcre_extra block, in which the study_data field points to |
pointer to a pcre_extra block, in which the study_data field points to |
1489 |
the results of the study. |
the results of the study. |
1490 |
|
|
1491 |
The returned value from pcre_study() can be passed directly to |
The returned value from pcre_study() can be passed directly to |
1492 |
pcre_exec(). However, a pcre_extra block also contains other fields |
pcre_exec(). However, a pcre_extra block also contains other fields |
1493 |
that can be set by the caller before the block is passed; these are |
that can be set by the caller before the block is passed; these are |
1494 |
described below in the section on matching a pattern. |
described below in the section on matching a pattern. |
1495 |
|
|
1496 |
If studying the pattern does not produce any additional information |
If studying the pattern does not produce any additional information |
1497 |
pcre_study() returns NULL. In that circumstance, if the calling program |
pcre_study() returns NULL. In that circumstance, if the calling program |
1498 |
wants to pass any of the other fields to pcre_exec(), it must set up |
wants to pass any of the other fields to pcre_exec(), it must set up |
1499 |
its own pcre_extra block. |
its own pcre_extra block. |
1500 |
|
|
1501 |
The second argument of pcre_study() contains option bits. At present, |
The second argument of pcre_study() contains option bits. At present, |
1502 |
no options are defined, and this argument should always be zero. |
no options are defined, and this argument should always be zero. |
1503 |
|
|
1504 |
The third argument for pcre_study() is a pointer for an error message. |
The third argument for pcre_study() is a pointer for an error message. |
1505 |
If studying succeeds (even if no data is returned), the variable it |
If studying succeeds (even if no data is returned), the variable it |
1506 |
points to is set to NULL. Otherwise it is set to point to a textual |
points to is set to NULL. Otherwise it is set to point to a textual |
1507 |
error message. This is a static string that is part of the library. You |
error message. This is a static string that is part of the library. You |
1508 |
must not try to free it. You should test the error pointer for NULL |
must not try to free it. You should test the error pointer for NULL |
1509 |
after calling pcre_study(), to be sure that it has run successfully. |
after calling pcre_study(), to be sure that it has run successfully. |
1510 |
|
|
1511 |
This is a typical call to pcre_study(): |
This is a typical call to pcre_study(): |
1517 |
&error); /* set to NULL or points to a message */ |
&error); /* set to NULL or points to a message */ |
1518 |
|
|
1519 |
At present, studying a pattern is useful only for non-anchored patterns |
At present, studying a pattern is useful only for non-anchored patterns |
1520 |
that do not have a single fixed starting character. A bitmap of possi- |
that do not have a single fixed starting character. A bitmap of possi- |
1521 |
ble starting bytes is created. |
ble starting bytes is created. |
1522 |
|
|
1523 |
|
|
1524 |
LOCALE SUPPORT |
LOCALE SUPPORT |
1525 |
|
|
1526 |
PCRE handles caseless matching, and determines whether characters are |
PCRE handles caseless matching, and determines whether characters are |
1527 |
letters, digits, or whatever, by reference to a set of tables, indexed |
letters, digits, or whatever, by reference to a set of tables, indexed |
1528 |
by character value. When running in UTF-8 mode, this applies only to |
by character value. When running in UTF-8 mode, this applies only to |
1529 |
characters with codes less than 128. Higher-valued codes never match |
characters with codes less than 128. Higher-valued codes never match |
1530 |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
1531 |
with Unicode character property support. The use of locales with Uni- |
with Unicode character property support. The use of locales with Uni- |
1532 |
code is discouraged. If you are handling characters with codes greater |
code is discouraged. If you are handling characters with codes greater |
1533 |
than 128, you should either use UTF-8 and Unicode, or use locales, but |
than 128, you should either use UTF-8 and Unicode, or use locales, but |
1534 |
not try to mix the two. |
not try to mix the two. |
1535 |
|
|
1536 |
PCRE contains an internal set of tables that are used when the final |
PCRE contains an internal set of tables that are used when the final |
1537 |
argument of pcre_compile() is NULL. These are sufficient for many |
argument of pcre_compile() is NULL. These are sufficient for many |
1538 |
applications. Normally, the internal tables recognize only ASCII char- |
applications. Normally, the internal tables recognize only ASCII char- |
1539 |
acters. However, when PCRE is built, it is possible to cause the inter- |
acters. However, when PCRE is built, it is possible to cause the inter- |
1540 |
nal tables to be rebuilt in the default "C" locale of the local system, |
nal tables to be rebuilt in the default "C" locale of the local system, |
1541 |
which may cause them to be different. |
which may cause them to be different. |
1542 |
|
|
1543 |
The internal tables can always be overridden by tables supplied by the |
The internal tables can always be overridden by tables supplied by the |
1544 |
application that calls PCRE. These may be created in a different locale |
application that calls PCRE. These may be created in a different locale |
1545 |
from the default. As more and more applications change to using Uni- |
from the default. As more and more applications change to using Uni- |
1546 |
code, the need for this locale support is expected to die away. |
code, the need for this locale support is expected to die away. |
1547 |
|
|
1548 |
External tables are built by calling the pcre_maketables() function, |
External tables are built by calling the pcre_maketables() function, |
1549 |
which has no arguments, in the relevant locale. The result can then be |
which has no arguments, in the relevant locale. The result can then be |
1550 |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
1551 |
example, to build and use tables that are appropriate for the French |
example, to build and use tables that are appropriate for the French |
1552 |
locale (where accented characters with values greater than 128 are |
locale (where accented characters with values greater than 128 are |
1553 |
treated as letters), the following code could be used: |
treated as letters), the following code could be used: |
1554 |
|
|
1555 |
setlocale(LC_CTYPE, "fr_FR"); |
setlocale(LC_CTYPE, "fr_FR"); |
1556 |
tables = pcre_maketables(); |
tables = pcre_maketables(); |
1557 |
re = pcre_compile(..., tables); |
re = pcre_compile(..., tables); |
1558 |
|
|
1559 |
The locale name "fr_FR" is used on Linux and other Unix-like systems; |
The locale name "fr_FR" is used on Linux and other Unix-like systems; |
1560 |
if you are using Windows, the name for the French locale is "french". |
if you are using Windows, the name for the French locale is "french". |
1561 |
|
|
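To make the idea of a locale-dependent table "indexed by character
value" concrete, here is a minimal C sketch that builds a simplified
256-entry letter table with the standard isalpha() function. It is only
an illustration: build_letter_table() is a hypothetical helper, and the
layout is not the one that pcre_maketables() actually produces.

```c
#include <ctype.h>

/* Fill a simplified table, indexed by character value, recording which
   of the 256 byte values count as letters in the current LC_CTYPE
   locale. Hypothetical helper; not PCRE's real table format. */
void build_letter_table(unsigned char table[256])
{
  int c;
  for (c = 0; c < 256; c++)
    table[c] = isalpha(c) ? 1 : 0;
}
```

A program starts in the "C" locale, where only A-Z and a-z are letters.
If setlocale() is called first with a locale that uses a single-byte
encoding with accented letters, entries above 127 may be set as well,
which is the effect the French example relies on.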
1562 |
When pcre_maketables() runs, the tables are built in memory that is |
When pcre_maketables() runs, the tables are built in memory that is |
1563 |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
1564 |
that the memory containing the tables remains available for as long as |
that the memory containing the tables remains available for as long as |
1565 |
it is needed. |
it is needed. |
1566 |
|
|
1567 |
The pointer that is passed to pcre_compile() is saved with the compiled |
The pointer that is passed to pcre_compile() is saved with the compiled |
1568 |
pattern, and the same tables are used via this pointer by pcre_study() |
pattern, and the same tables are used via this pointer by pcre_study() |
1569 |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
1570 |
tern, compilation, studying and matching all happen in the same locale, |
tern, compilation, studying and matching all happen in the same locale, |
1571 |
but different patterns can be compiled in different locales. |
but different patterns can be compiled in different locales. |
1572 |
|
|
1573 |
It is possible to pass a table pointer or NULL (indicating the use of |
It is possible to pass a table pointer or NULL (indicating the use of |
1574 |
the internal tables) to pcre_exec(). Although not intended for this |
the internal tables) to pcre_exec(). Although not intended for this |
1575 |
purpose, this facility could be used to match a pattern in a different |
purpose, this facility could be used to match a pattern in a different |
1576 |
locale from the one in which it was compiled. Passing table pointers at |
locale from the one in which it was compiled. Passing table pointers at |
1577 |
run time is discussed below in the section on matching a pattern. |
run time is discussed below in the section on matching a pattern. |
1578 |
|
|
1582 |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
1583 |
int what, void *where); |
int what, void *where); |
1584 |
|
|
1585 |
The pcre_fullinfo() function returns information about a compiled pat- |
The pcre_fullinfo() function returns information about a compiled pat- |
1586 |
tern. It replaces the obsolete pcre_info() function, which is neverthe- |
tern. It replaces the obsolete pcre_info() function, which is neverthe- |
1587 |
less retained for backwards compatibility (and is documented below). |
less retained for backwards compatibility (and is documented below). |
1588 |
|
|
1589 |
The first argument for pcre_fullinfo() is a pointer to the compiled |
The first argument for pcre_fullinfo() is a pointer to the compiled |
1590 |
pattern. The second argument is the result of pcre_study(), or NULL if |
pattern. The second argument is the result of pcre_study(), or NULL if |
1591 |
the pattern was not studied. The third argument specifies which piece |
the pattern was not studied. The third argument specifies which piece |
1592 |
of information is required, and the fourth argument is a pointer to a |
of information is required, and the fourth argument is a pointer to a |
1593 |
variable to receive the data. The yield of the function is zero for |
variable to receive the data. The yield of the function is zero for |
1594 |
success, or one of the following negative numbers: |
success, or one of the following negative numbers: |
1595 |
|
|
1596 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
1598 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
1599 |
PCRE_ERROR_BADOPTION the value of what was invalid |
PCRE_ERROR_BADOPTION the value of what was invalid |
1600 |
|
|
1601 |
The "magic number" is placed at the start of each compiled pattern as |
The "magic number" is placed at the start of each compiled pattern as |
1602 |
a simple check against passing an arbitrary memory pointer. Here is a |
a simple check against passing an arbitrary memory pointer. Here is a |
1603 |
typical call of pcre_fullinfo(), to obtain the length of the compiled |
typical call of pcre_fullinfo(), to obtain the length of the compiled |
1604 |
pattern: |
pattern: |
1605 |
|
|
1606 |
int rc; |
int rc; |
1611 |
PCRE_INFO_SIZE, /* what is required */ |
PCRE_INFO_SIZE, /* what is required */ |
1612 |
&length); /* where to put the data */ |
&length); /* where to put the data */ |
1613 |
|
|
1614 |
The possible values for the third argument are defined in pcre.h, and |
The possible values for the third argument are defined in pcre.h, and |
1615 |
are as follows: |
are as follows: |
1616 |
|
|
1617 |
PCRE_INFO_BACKREFMAX |
PCRE_INFO_BACKREFMAX |
1618 |
|
|
1619 |
Return the number of the highest back reference in the pattern. The |
Return the number of the highest back reference in the pattern. The |
1620 |
fourth argument should point to an int variable. Zero is returned if |
fourth argument should point to an int variable. Zero is returned if |
1621 |
there are no back references. |
there are no back references. |
1622 |
|
|
1623 |
PCRE_INFO_CAPTURECOUNT |
PCRE_INFO_CAPTURECOUNT |
1624 |
|
|
1625 |
Return the number of capturing subpatterns in the pattern. The fourth |
Return the number of capturing subpatterns in the pattern. The fourth |
1626 |
argument should point to an int variable. |
argument should point to an int variable. |
1627 |
|
|
1628 |
PCRE_INFO_DEFAULT_TABLES |
PCRE_INFO_DEFAULT_TABLES |
1629 |
|
|
1630 |
Return a pointer to the internal default character tables within PCRE. |
Return a pointer to the internal default character tables within PCRE. |
1631 |
The fourth argument should point to an unsigned char * variable. This |
The fourth argument should point to an unsigned char * variable. This |
1632 |
information call is provided for internal use by the pcre_study() func- |
information call is provided for internal use by the pcre_study() func- |
1633 |
tion. External callers can cause PCRE to use its internal tables by |
tion. External callers can cause PCRE to use its internal tables by |
1634 |
passing a NULL table pointer. |
passing a NULL table pointer. |
1635 |
|
|
1636 |
PCRE_INFO_FIRSTBYTE |
PCRE_INFO_FIRSTBYTE |
1637 |
|
|
1638 |
Return information about the first byte of any matched string, for a |
Return information about the first byte of any matched string, for a |
1639 |
non-anchored pattern. The fourth argument should point to an int vari- |
non-anchored pattern. The fourth argument should point to an int vari- |
1640 |
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name |
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name |
1641 |
is still recognized for backwards compatibility.) |
is still recognized for backwards compatibility.) |
1642 |
|
|
1643 |
If there is a fixed first byte, for example, from a pattern such as |
If there is a fixed first byte, for example, from a pattern such as |
1644 |
(cat|cow|coyote), its value is returned. Otherwise, if either |
(cat|cow|coyote), its value is returned. Otherwise, if either |
1645 |
|
|
1646 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
1647 |
branch starts with "^", or |
branch starts with "^", or |
1648 |
|
|
1649 |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
1650 |
set (if it were set, the pattern would be anchored), |
set (if it were set, the pattern would be anchored), |
1651 |
|
|
1652 |
-1 is returned, indicating that the pattern matches only at the start |
-1 is returned, indicating that the pattern matches only at the start |
1653 |
of a subject string or after any newline within the string. Otherwise |
of a subject string or after any newline within the string. Otherwise |
1654 |
-2 is returned. For anchored patterns, -2 is returned. |
-2 is returned. For anchored patterns, -2 is returned. |
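The three kinds of result described for PCRE_INFO_FIRSTBYTE can be told apart with a small helper. This is only a sketch: classify_firstbyte and the firstbyte_kind names are hypothetical and not part of PCRE; the helper merely interprets the int that the information call stores.

```c
/* Hypothetical helper, not part of PCRE: interpret the int value
   obtained via PCRE_INFO_FIRSTBYTE. */
typedef enum {
    FIRSTBYTE_FIXED,       /* non-negative: a fixed first byte was recorded */
    FIRSTBYTE_LINE_START,  /* -1: matches only at the subject start or after a newline */
    FIRSTBYTE_NONE         /* -2: nothing recorded (also the result for anchored patterns) */
} firstbyte_kind;

firstbyte_kind classify_firstbyte(int value)
{
    if (value >= 0)
        return FIRSTBYTE_FIXED;
    if (value == -1)
        return FIRSTBYTE_LINE_START;
    return FIRSTBYTE_NONE;
}
```

For a pattern such as (cat|cow|coyote) the stored value would be the byte "c", which the helper reports as FIRSTBYTE_FIXED.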
1655 |
|
|
1656 |
PCRE_INFO_FIRSTTABLE |
PCRE_INFO_FIRSTTABLE |
1657 |
|
|
1658 |
If the pattern was studied, and this resulted in the construction of a |
If the pattern was studied, and this resulted in the construction of a |
1659 |
256-bit table indicating a fixed set of bytes for the first byte in any |
256-bit table indicating a fixed set of bytes for the first byte in any |
1660 |
matching string, a pointer to the table is returned. Otherwise NULL is |
matching string, a pointer to the table is returned. Otherwise NULL is |
1661 |
returned. The fourth argument should point to an unsigned char * vari- |
returned. The fourth argument should point to an unsigned char * vari- |
1662 |
able. |
able. |
1663 |
|
|
1664 |
PCRE_INFO_HASCRORLF |
PCRE_INFO_HASCRORLF |
1665 |
|
|
1666 |
Return 1 if the pattern contains any explicit matches for CR or LF |
Return 1 if the pattern contains any explicit matches for CR or LF |
1667 |
characters, otherwise 0. The fourth argument should point to an int |
characters, otherwise 0. The fourth argument should point to an int |
1668 |
variable. An explicit match is either a literal CR or LF character, or |
variable. An explicit match is either a literal CR or LF character, or |
1669 |
\r or \n. |
\r or \n. |
1670 |
|
|
1671 |
PCRE_INFO_JCHANGED |
PCRE_INFO_JCHANGED |
1672 |
|
|
1673 |
Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
1674 |
otherwise 0. The fourth argument should point to an int variable. (?J) |
otherwise 0. The fourth argument should point to an int variable. (?J) |
1675 |
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively. |
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively. |
1676 |
|
|
1677 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
1678 |
|
|
1679 |
Return the value of the rightmost literal byte that must exist in any |
Return the value of the rightmost literal byte that must exist in any |
1680 |
matched string, other than at its start, if such a byte has been |
matched string, other than at its start, if such a byte has been |
1681 |
recorded. The fourth argument should point to an int variable. If there |
recorded. The fourth argument should point to an int variable. If there |
1682 |
is no such byte, -1 is returned. For anchored patterns, a last literal |
is no such byte, -1 is returned. For anchored patterns, a last literal |
1683 |
byte is recorded only if it follows something of variable length. For |
byte is recorded only if it follows something of variable length. For |
1684 |
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for |
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for |
1685 |
/^a\dz\d/ the returned value is -1. |
/^a\dz\d/ the returned value is -1. |
1686 |
|
|
1688 |
PCRE_INFO_NAMEENTRYSIZE |
PCRE_INFO_NAMEENTRYSIZE |
1689 |
PCRE_INFO_NAMETABLE |
PCRE_INFO_NAMETABLE |
1690 |
|
|
1691 |
PCRE supports the use of named as well as numbered capturing parenthe- |
PCRE supports the use of named as well as numbered capturing parenthe- |
1692 |
ses. The names are just an additional way of identifying the parenthe- |
ses. The names are just an additional way of identifying the parenthe- |
1693 |
ses, which still acquire numbers. Several convenience functions such as |
ses, which still acquire numbers. Several convenience functions such as |
1694 |
pcre_get_named_substring() are provided for extracting captured sub- |
pcre_get_named_substring() are provided for extracting captured sub- |
1695 |
strings by name. It is also possible to extract the data directly, by |
strings by name. It is also possible to extract the data directly, by |
1696 |
first converting the name to a number in order to access the correct |
first converting the name to a number in order to access the correct |
1697 |
pointers in the output vector (described with pcre_exec() below). To do |
pointers in the output vector (described with pcre_exec() below). To do |
1698 |
the conversion, you need to use the name-to-number map, which is |
the conversion, you need to use the name-to-number map, which is |
1699 |
described by these three values. |
described by these three values. |
1700 |
|
|
1701 |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
1702 |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
1703 |
of each entry; both of these return an int value. The entry size |
of each entry; both of these return an int value. The entry size |
1704 |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
1705 |
a pointer to the first entry of the table (a pointer to char). The |
a pointer to the first entry of the table (a pointer to char). The |
1706 |
first two bytes of each entry are the number of the capturing parenthe- |
first two bytes of each entry are the number of the capturing parenthe- |
1707 |
sis, most significant byte first. The rest of the entry is the corre- |
sis, most significant byte first. The rest of the entry is the corre- |
1708 |
sponding name, zero terminated. The names are in alphabetical order. |
sponding name, zero terminated. The names are in alphabetical order. |
1709 |
When PCRE_DUPNAMES is set, duplicate names are in order of their paren- |
When PCRE_DUPNAMES is set, duplicate names are in order of their paren- |
1710 |
theses numbers. For example, consider the following pattern (assume |
theses numbers. For example, consider the following pattern (assume |
1711 |
PCRE_EXTENDED is set, so white space - including newlines - is |
PCRE_EXTENDED is set, so white space - including newlines - is |
1712 |
ignored): |
ignored): |
1713 |
|
|
1714 |
(?<date> (?<year>(\d\d)?\d\d) - |
(?<date> (?<year>(\d\d)?\d\d) - |
1715 |
(?<month>\d\d) - (?<day>\d\d) ) |
(?<month>\d\d) - (?<day>\d\d) ) |
1716 |
|
|
1717 |
There are four named subpatterns, so the table has four entries, and |
There are four named subpatterns, so the table has four entries, and |
1718 |
each entry in the table is eight bytes long. The table is as follows, |
each entry in the table is eight bytes long. The table is as follows, |
1719 |
with non-printing bytes shown in hexadecimal, and undefined bytes shown |
with non-printing bytes shown in hexadecimal, and undefined bytes shown |
1720 |
as ??: |
as ??: |
1721 |
|
|
1724 |
00 04 m o n t h 00 |
00 04 m o n t h 00 |
1725 |
00 02 y e a r 00 ?? |
00 02 y e a r 00 ?? |
1726 |
|
|
1727 |
When writing code to extract data from named subpatterns using the |
When writing code to extract data from named subpatterns using the |
1728 |
name-to-number map, remember that the length of the entries is likely |
name-to-number map, remember that the length of the entries is likely |
1729 |
to be different for each compiled pattern. |
to be different for each compiled pattern. |
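As an illustration of the table layout described above, the following sketch walks a name-to-number table and returns the number for a given name. The function find_group_number and the array example_table are hypothetical names introduced here, not PCRE functions. The example table is written out for the date pattern, with the numbers for "date" and "day" obtained by counting opening parentheses in that pattern, and with the undefined bytes arbitrarily set to zero. The code uses unsigned char so that the two number bytes combine without sign problems.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. The table holds name_count
   entries of entry_size bytes each. Bytes 0 and 1 of an entry hold the
   parenthesis number, most significant byte first; the zero-terminated
   name follows. Returns the number, or -1 if the name is absent. With
   duplicate names, the first (lowest-numbered) entry is returned. */
int find_group_number(const unsigned char *table, int name_count,
                      int entry_size, const char *name)
{
    for (int i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Table for the date pattern: four entries of eight bytes, in
   alphabetical order. Undefined bytes are filled with zero here. */
const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

In a real program the table pointer, the entry count, and the entry size would come from the three information calls rather than from a fixed array. Because the names are sorted, a binary search would also work; the linear scan is used only to keep the sketch short.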
1730 |
|
|
1731 |
PCRE_INFO_OKPARTIAL |
PCRE_INFO_OKPARTIAL |
1732 |
|
|
1733 |
Return 1 if the pattern can be used for partial matching, otherwise 0. |
Return 1 if the pattern can be used for partial matching with |
1734 |
The fourth argument should point to an int variable. From release 8.00, |
pcre_exec(), otherwise 0. The fourth argument should point to an int |
1735 |
this always returns 1, because the restrictions that previously applied |
variable. From release 8.00, this always returns 1, because the |
1736 |
to partial matching have been lifted. The pcrepartial documentation |
restrictions that previously applied to partial matching have been |
1737 |
gives details of partial matching. |
lifted. The pcrepartial documentation gives details of partial match- |
1738 |
|
ing. |
1739 |
|
|
1740 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
1741 |
|
|
1921 |
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
1922 |
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
1923 |
|
|
1924 |
The pcre_callout field is used in conjunction with the "callout" fea- |
The callout_data field is used in conjunction with the "callout" fea- |
1925 |
ture, which is described in the pcrecallout documentation. |
ture, and is described in the pcrecallout documentation. |
1926 |
|
|
1927 |
The tables field is used to pass a character tables pointer to |
The tables field is used to pass a character tables pointer to |
1928 |
pcre_exec(); this overrides the value that is stored with the compiled |
pcre_exec(); this overrides the value that is stored with the compiled |
1939 |
|
|
1940 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
1941 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
1942 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE, |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, |
1943 |
PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD. |
PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and |
1944 |
|
PCRE_PARTIAL_HARD. |
1945 |
|
|
1946 |
PCRE_ANCHORED |
PCRE_ANCHORED |
1947 |
|
|
1948 |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
1949 |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
1950 |
turned out to be anchored by virtue of its contents, it cannot be made |
turned out to be anchored by virtue of its contents, it cannot be made |
1951 |
unanchored at matching time. |
unanchored at matching time. |
1952 |
|
|
1953 |
PCRE_BSR_ANYCRLF |
PCRE_BSR_ANYCRLF |
1954 |
PCRE_BSR_UNICODE |
PCRE_BSR_UNICODE |
1955 |
|
|
1956 |
These options (which are mutually exclusive) control what the \R escape |
These options (which are mutually exclusive) control what the \R escape |
1957 |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
1958 |
or to match any Unicode newline sequence. These options override the |
or to match any Unicode newline sequence. These options override the |
1959 |
choice that was made or defaulted when the pattern was compiled. |
choice that was made or defaulted when the pattern was compiled. |
1960 |
|
|
1961 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1964 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
1965 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1966 |
|
|
1967 |
These options override the newline definition that was chosen or |
These options override the newline definition that was chosen or |
1968 |
defaulted when the pattern was compiled. For details, see the descrip- |
defaulted when the pattern was compiled. For details, see the descrip- |
1969 |
tion of pcre_compile() above. During matching, the newline choice |
tion of pcre_compile() above. During matching, the newline choice |
1970 |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
1971 |
ters. It may also alter the way the match position is advanced after a |
ters. It may also alter the way the match position is advanced after a |
1972 |
match failure for an unanchored pattern. |
match failure for an unanchored pattern. |
1973 |
|
|
1974 |
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is |
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is |
1975 |
set, and a match attempt for an unanchored pattern fails when the cur- |
set, and a match attempt for an unanchored pattern fails when the cur- |
1976 |
rent position is at a CRLF sequence, and the pattern contains no |
rent position is at a CRLF sequence, and the pattern contains no |
1977 |
explicit matches for CR or LF characters, the match position is |
explicit matches for CR or LF characters, the match position is |
1978 |
advanced by two characters instead of one, in other words, to after the |
advanced by two characters instead of one, in other words, to after the |
1979 |
CRLF. |
CRLF. |
1980 |
|
|
1981 |
The above rule is a compromise that makes the most common cases work as |
The above rule is a compromise that makes the most common cases work as |
1982 |
expected. For example, if the pattern is .+A (and the PCRE_DOTALL |
expected. For example, if the pattern is .+A (and the PCRE_DOTALL |
1983 |
option is not set), it does not match the string "\r\nA" because, after |
option is not set), it does not match the string "\r\nA" because, after |
1984 |
failing at the start, it skips both the CR and the LF before retrying. |
failing at the start, it skips both the CR and the LF before retrying. |
1985 |
However, the pattern [\r\n]A does match that string, because it con- |
However, the pattern [\r\n]A does match that string, because it con- |
1986 |
tains an explicit CR or LF reference, and so advances only by one char- |
tains an explicit CR or LF reference, and so advances only by one char- |
1987 |
acter after the first failure. |
acter after the first failure. |
1988 |
|
|
1989 |
An explicit match for CR or LF is either a literal appearance of one of |
An explicit match for CR or LF is either a literal appearance of one of |
1990 |
those characters, or one of the \r or \n escape sequences. Implicit |
those characters, or one of the \r or \n escape sequences. Implicit |
1991 |
matches such as [^X] do not count, nor does \s (which includes CR and |
matches such as [^X] do not count, nor does \s (which includes CR and |
1992 |
LF in the characters that it matches). |
LF in the characters that it matches). |
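The advance rule described above can be sketched as a small function. This is an illustration of the rule as stated, not PCRE's internal code: next_start_offset is a hypothetical name, the subject is assumed to consist of single-byte characters, and the has_crorlf argument is assumed to carry the same meaning as the value given by PCRE_INFO_HASCRORLF.

```c
#include <stddef.h>

/* Hypothetical helper: where the next match attempt starts after an
   unanchored pattern has failed at offset pos.
   crlf_is_newline: non-zero if CRLF is a valid newline (CRLF, ANYCRLF, or ANY).
   has_crorlf: non-zero if the pattern contains an explicit CR or LF match. */
size_t next_start_offset(const char *subject, size_t length, size_t pos,
                         int crlf_is_newline, int has_crorlf)
{
    if (crlf_is_newline && !has_crorlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* skip the whole CRLF sequence */
    return pos + 1;       /* ordinary advance by one character */
}
```

With the subject "\r\nA" and a pattern that has no explicit CR or LF, a failure at offset 0 moves the next attempt to offset 2; if the pattern does contain an explicit CR or LF, the next attempt is at offset 1.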
1993 |
|
|
1994 |
Notwithstanding the above, anomalous effects may still occur when CRLF |
Notwithstanding the above, anomalous effects may still occur when CRLF |
1995 |
is a valid newline sequence and explicit \r or \n escapes appear in the |
is a valid newline sequence and explicit \r or \n escapes appear in the |
1996 |
pattern. |
pattern. |
1997 |
|
|
1998 |
PCRE_NOTBOL |
PCRE_NOTBOL |
1999 |
|
|
2000 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
2001 |
the beginning of a line, so the circumflex metacharacter should not |
the beginning of a line, so the circumflex metacharacter should not |
2002 |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
2003 |
causes circumflex never to match. This option affects only the behav- |
causes circumflex never to match. This option affects only the behav- |
2004 |
iour of the circumflex metacharacter. It does not affect \A. |
iour of the circumflex metacharacter. It does not affect \A. |
2005 |
|
|
2006 |
PCRE_NOTEOL |
PCRE_NOTEOL |
2007 |
|
|
2008 |
This option specifies that the end of the subject string is not the end |
This option specifies that the end of the subject string is not the end |
2009 |
of a line, so the dollar metacharacter should not match it nor (except |
of a line, so the dollar metacharacter should not match it nor (except |
2010 |
in multiline mode) a newline immediately before it. Setting this with- |
in multiline mode) a newline immediately before it. Setting this with- |
2011 |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
2012 |
option affects only the behaviour of the dollar metacharacter. It does |
option affects only the behaviour of the dollar metacharacter. It does |
2013 |
not affect \Z or \z. |
not affect \Z or \z. |
2014 |
|
|
2015 |
PCRE_NOTEMPTY |
PCRE_NOTEMPTY |
2016 |
|
|
2017 |
An empty string is not considered to be a valid match if this option is |
An empty string is not considered to be a valid match if this option is |
2018 |
set. If there are alternatives in the pattern, they are tried. If all |
set. If there are alternatives in the pattern, they are tried. If all |
2019 |
the alternatives match the empty string, the entire match fails. For |
the alternatives match the empty string, the entire match fails. For |
2020 |
example, if the pattern |
example, if the pattern |
2021 |
|
|
2022 |
a?b? |
a?b? |
2023 |
|
|
2024 |
is applied to a string not beginning with "a" or "b", it matches the |
is applied to a string not beginning with "a" or "b", it matches an |
2025 |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
2026 |
match is not valid, so PCRE searches further into the string for occur- |
match is not valid, so PCRE searches further into the string for occur- |
2027 |
rences of "a" or "b". |
rences of "a" or "b". |
2028 |
|
|
2029 |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
PCRE_NOTEMPTY_ATSTART |
2030 |
cial case of a pattern match of the empty string within its split() |
|
2031 |
function, and when using the /g modifier. It is possible to emulate |
This is like PCRE_NOTEMPTY, except that an empty string match that is |
2032 |
Perl's behaviour after matching a null string by first trying the match |
not at the start of the subject is permitted. If the pattern is |
2033 |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
anchored, such a match can occur only if the pattern contains \K. |
2034 |
if that fails by advancing the starting offset (see below) and trying |
|
2035 |
an ordinary match again. There is some code that demonstrates how to do |
Perl has no direct equivalent of PCRE_NOTEMPTY or |
2036 |
this in the pcredemo sample program. |
PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern |
2037 |
|
match of the empty string within its split() function, and when using |
2038 |
|
the /g modifier. It is possible to emulate Perl's behaviour after |
2039 |
|
matching a null string by first trying the match again at the same off- |
2040 |
|
set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that |
2041 |
|
fails, by advancing the starting offset (see below) and trying an ordi- |
2042 |
|
nary match again. There is some code that demonstrates how to do this |
2043 |
|
in the pcredemo sample program. |
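The retry scheme described above can be sketched without PCRE by putting the matcher behind a function pointer. Everything named here is hypothetical: the OPT_ values are stand-ins for the real option bits, match_fn is an assumed interface that a wrapper around pcre_exec() could satisfy, and match_optional_a is a toy matcher for the pattern a? that exists only so the loop can be exercised. The sketch omits the CRLF and UTF-8 handling that a complete program would need when advancing the offset.

```c
/* Stand-ins for PCRE option bits; the values are illustrative only. */
enum { OPT_ANCHORED = 1, OPT_NOTEMPTY_ATSTART = 2 };

/* Assumed matcher interface: returns 1 and sets ov[0] and ov[1] to the
   start and end offsets of the match, or returns -1 for no match. */
typedef int (*match_fn)(const char *subject, int length, int start,
                        int options, int *ov);

/* Count successive matches in the manner of Perl's /g modifier. */
int count_global_matches(match_fn match, const char *subject, int length)
{
    int ov[2];
    int start = 0, options = 0, count = 0;

    for (;;) {
        if (match(subject, length, start, options, ov) < 0) {
            if (options == 0)
                break;              /* an ordinary attempt failed: finished */
            start++;                /* the retry failed: step past the empty match */
            options = 0;
            if (start > length)
                break;
            continue;
        }
        count++;
        start = ov[1];
        if (ov[0] == ov[1]) {       /* the match was an empty string */
            if (ov[0] == length)
                break;              /* empty match at the end of the subject */
            options = OPT_ANCHORED | OPT_NOTEMPTY_ATSTART;
        } else {
            options = 0;
        }
    }
    return count;
}

/* Toy matcher for the pattern a? that honours the two stand-in options. */
int match_optional_a(const char *subject, int length, int start,
                     int options, int *ov)
{
    for (int p = start; p <= length; p++) {
        if (p < length && subject[p] == 'a') {
            ov[0] = p; ov[1] = p + 1;
            return 1;
        }
        if (!((options & OPT_NOTEMPTY_ATSTART) && p == start)) {
            ov[0] = p; ov[1] = p;   /* empty match is permitted here */
            return 1;
        }
        if (options & OPT_ANCHORED)
            break;
    }
    return -1;
}
```

For the subject "bab" the loop finds an empty string at offset 0, "a" at offset 1, and empty strings at offsets 2 and 3.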
2044 |
|
|
2045 |
PCRE_NO_START_OPTIMIZE |
PCRE_NO_START_OPTIMIZE |
2046 |
|
|
2086 |
returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, |
returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, |
2087 |
matching continues by testing any other alternatives. Only if they all |
matching continues by testing any other alternatives. Only if they all |
2088 |
fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH). |
fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH). |
2089 |
The portion of the string that provided the partial match is set as the |
The portion of the string that was inspected when the partial match was |
2090 |
first matching string. There is a more detailed discussion in the |
found is set as the first matching string. There is a more detailed |
2091 |
pcrepartial documentation. |
discussion in the pcrepartial documentation. |
2092 |
|
|
2093 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
2094 |
|
|
2504 |
characteristics to the normal algorithm, and is not compatible with |
characteristics to the normal algorithm, and is not compatible with |
2505 |
Perl. Some of the features of PCRE patterns are not supported. Never- |
Perl. Some of the features of PCRE patterns are not supported. Never- |
2506 |
theless, there are times when this kind of matching can be useful. For |
theless, there are times when this kind of matching can be useful. For |
2507 |
a discussion of the two matching algorithms, see the pcrematching docu- |
a discussion of the two matching algorithms, and a list of features |
2508 |
mentation. |
that pcre_dfa_exec() does not support, see the pcrematching documenta- |
2509 |
|
tion. |
2510 |
|
|
2511 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
2512 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
2513 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
2514 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
2515 |
repeated here. |
repeated here. |
2516 |
|
|
2517 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
2518 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
2519 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
2520 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
2521 |
lot of potential matches. |
lot of potential matches. |
2522 |
|
|
2523 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
2539 |
|
|
2540 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
2541 |
|
|
2542 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
2543 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
2544 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, |
2545 |
PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and |
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR- |
2546 |
PCRE_DFA_RESTART. All but the last four of these are exactly the same |
TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
2547 |
as for pcre_exec(), so their description is not repeated here. |
four of these are exactly the same as for pcre_exec(), so their |
2548 |
|
description is not repeated here. |
2549 |
|
|
2550 |
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_HARD |
2551 |
PCRE_PARTIAL_SOFT |
PCRE_PARTIAL_SOFT |
2559 |
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end |
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end |
2560 |
of the subject is reached, there have been no complete matches, but |
of the subject is reached, there have been no complete matches, but |
2561 |
there is still at least one matching possibility. The portion of the |
there is still at least one matching possibility. The portion of the |
2562 |
string that provided the longest partial match is set as the first |
string that was inspected when the longest partial match was found is |
2563 |
matching string in both cases. |
set as the first matching string in both cases. |
2564 |
|
|
2565 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
2566 |
|
|
2666 |
|
|
2667 |
REVISION |
REVISION |
2668 |
|
|
2669 |
Last updated: 01 September 2009 |
Last updated: 11 September 2009 |
2670 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
2671 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2672 |
|
|
2891 |
is built with Unicode character property support. The properties that |
is built with Unicode character property support. The properties that |
2892 |
can be tested with \p and \P are limited to the general category prop- |
can be tested with \p and \P are limited to the general category prop- |
2893 |
erties such as Lu and Nd, script names such as Greek or Han, and the |
erties such as Lu and Nd, script names such as Greek or Han, and the |
2894 |
derived properties Any and L&. |
derived properties Any and L&. PCRE does support the Cs (surrogate) |
2895 |
|
property, which Perl does not; the Perl documentation says "Because |
2896 |
|
Perl hides the need for the user to understand the internal representa- |
2897 |
|
tion of Unicode characters, there is no need to implement the somewhat |
2898 |
|
messy concept of surrogates." |
2899 |
|
|
2900 |
7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
2901 |
ters in between are treated as literals. This is slightly different |
ters in between are treated as literals. This is slightly different |
2915 |
|
|
2916 |
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code}) |
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code}) |
2917 |
constructions. However, there is support for recursive patterns. This |
constructions. However, there is support for recursive patterns. This |
2918 |
is not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE |
is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE |
2919 |
"callout" feature allows an external function to be called during pat- |
"callout" feature allows an external function to be called during pat- |
2920 |
tern matching. See the pcrecallout documentation for details. |
tern matching. See the pcrecallout documentation for details. |
2921 |
|
|
2922 |
9. Subpatterns that are called recursively or as "subroutines" are |
9. Subpatterns that are called recursively or as "subroutines" are |
2923 |
always treated as atomic groups in PCRE. This is like Python, but |
always treated as atomic groups in PCRE. This is like Python, but |
2924 |
unlike Perl. |
unlike Perl. There is a discussion of an example that explains this in |
2925 |
|
more detail in the section on recursion differences from Perl in the |
2926 |
|
pcrecompat page. |
2927 |
|
|
2928 |
10. There are some differences that are concerned with the settings of |
10. There are some differences that are concerned with the settings of |
2929 |
captured strings when part of a pattern is repeated. For example, |
captured strings when part of a pattern is repeated. For example, |
2932 |
|
|
2933 |
11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), |
11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), |
2934 |
(*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in |
(*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in |
2935 |
the forms without an argument. PCRE does not support (*MARK). If |
the forms without an argument. PCRE does not support (*MARK). |
|
(*ACCEPT) is within capturing parentheses, PCRE does not set that cap- |
|
|
ture group; this is different to Perl. |
|
2936 |
|
|
2937 |
12. PCRE provides some extensions to the Perl regular expression facil- |
12. PCRE provides some extensions to the Perl regular expression facil- |
2938 |
ities. Perl 5.10 will include new features that are not in earlier |
ities. Perl 5.10 will include new features that are not in earlier |
2957 |
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
2958 |
tried only at the first matching position in the subject string. |
tried only at the first matching position in the subject string. |
2959 |
|
|
2960 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP- |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, |
2961 |
TURE options for pcre_exec() have no Perl equivalents. |
and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva- |
2962 |
|
lents. |
2963 |
|
|
2964 |
(g) The \R escape sequence can be restricted to match only CR, LF, or |
(g) The \R escape sequence can be restricted to match only CR, LF, or |
2965 |
CRLF by the PCRE_BSR_ANYCRLF option. |
CRLF by the PCRE_BSR_ANYCRLF option. |
2966 |
|
|
2967 |
(h) The callout facility is PCRE-specific. |
(h) The callout facility is PCRE-specific. |
2971 |
(j) Patterns compiled by PCRE can be saved and re-used at a later time, |
(j) Patterns compiled by PCRE can be saved and re-used at a later time, |
2972 |
even on different hosts that have the other endianness. |
even on different hosts that have the other endianness. |
2973 |
|
|
2974 |
(k) The alternative matching function (pcre_dfa_exec()) matches in a |
(k) The alternative matching function (pcre_dfa_exec()) matches in a |
2975 |
different way and is not Perl-compatible. |
different way and is not Perl-compatible. |
2976 |
|
|
2977 |
(l) PCRE recognizes some special sequences such as (*CR) at the start |
(l) PCRE recognizes some special sequences such as (*CR) at the start |
2978 |
of a pattern that set overall options that cannot be changed within the |
of a pattern that set overall options that cannot be changed within the |
2979 |
pattern. |
pattern. |
2980 |
|
|
2988 |
|
|
2989 |
REVISION |
REVISION |
2990 |
|
|
2991 |
Last updated: 25 August 2009 |
Last updated: 18 September 2009 |
2992 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
2993 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2994 |
|
|
3507 |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
3508 |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
3509 |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
3510 |
the pcreapi page). |
the pcreapi page). Perl does not support the Cs property. |
3511 |
|
|
3512 |
The long synonyms for these properties that Perl supports (such as |
The long synonyms for property names that Perl supports (such as |
3513 |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
3514 |
any of these properties with "Is". |
any of these properties with "Is". |
3515 |
|
|
4734 |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
4735 |
it supports special syntax for recursion of the entire pattern, and |
it supports special syntax for recursion of the entire pattern, and |
4736 |
also for individual subpattern recursion. After its introduction in |
also for individual subpattern recursion. After its introduction in |
4737 |
PCRE and Python, this kind of recursion was introduced into Perl at |
PCRE and Python, this kind of recursion was subsequently introduced |
4738 |
release 5.10. |
into Perl at release 5.10. |
4739 |
|
|
4740 |
A special item that consists of (? followed by a number greater than |
A special item that consists of (? followed by a number greater than |
4741 |
zero and a closing parenthesis is a recursive call of the subpattern of |
zero and a closing parenthesis is a recursive call of the subpattern of |
4744 |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
4745 |
regular expression. |
regular expression. |
4746 |
|
|
4747 |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
This PCRE pattern solves the nested parentheses problem (assume the |
|
always treated as an atomic group. That is, once it has matched some of |
|
|
the subject string, it is never re-entered, even if it contains untried |
|
|
alternatives and there is a subsequent matching failure. |
|
|
|
|
|
This PCRE pattern solves the nested parentheses problem (assume the |
|
4748 |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
4749 |
|
|
4750 |
\( ( (?>[^()]+) | (?R) )* \) |
\( ( (?>[^()]+) | (?R) )* \) |
4751 |
|
|
4752 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
4753 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
4754 |
recursive match of the pattern itself (that is, a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
4755 |
sized substring). Finally there is a closing parenthesis. |
sized substring). Finally there is a closing parenthesis. |
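
The three steps just described can be sketched as an ordinary recursive function. This is only an illustration of what the recursive pattern accepts, not of how PCRE implements it; the function name match_balanced is hypothetical, and unlike an unanchored pattern the function tests only at the given start position.

```python
def match_balanced(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or -1."""
    # \(  : an opening parenthesis is required first.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # ( [^()]+ | (?R) )*  : non-parentheses, or a nested balanced group.
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            pos = match_balanced(subject, pos)  # the recursive call
            if pos == -1:
                return -1
        else:
            pos += 1
    # \)  : a closing parenthesis is required last.
    if pos < len(subject):
        return pos + 1
    return -1


end = match_balanced("(ab(cd)ef)")   # 10: the whole string is consumed
bad = match_balanced("(aaa()")       # -1: the outer group is never closed
```

The recursive call plays the role of (?R): it consumes one complete nested group and hands back the position at which the enclosing level continues.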
4756 |
|
|
4757 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
4758 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
4759 |
|
|
4760 |
( \( ( (?>[^()]+) | (?1) )* \) ) |
( \( ( (?>[^()]+) | (?1) )* \) ) |
4761 |
|
|
4762 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
4763 |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
4764 |
|
|
4765 |
In a larger pattern, keeping track of parenthesis numbers can be |
In a larger pattern, keeping track of parenthesis numbers can be |
4766 |
tricky. This is made easier by the use of relative references. (A Perl |
tricky. This is made easier by the use of relative references. (A Perl |
4767 |
5.10 feature.) Instead of (?1) in the pattern above you can write |
5.10 feature.) Instead of (?1) in the pattern above you can write |
4768 |
(?-2) to refer to the second most recently opened parentheses preceding |
(?-2) to refer to the second most recently opened parentheses preceding |
4769 |
the recursion. In other words, a negative number counts capturing |
the recursion. In other words, a negative number counts capturing |
4770 |
parentheses leftwards from the point at which it is encountered. |
parentheses leftwards from the point at which it is encountered. |
4771 |
|
|
4772 |
It is also possible to refer to subsequently opened parentheses, by |
It is also possible to refer to subsequently opened parentheses, by |
4773 |
writing references such as (?+2). However, these cannot be recursive |
writing references such as (?+2). However, these cannot be recursive |
4774 |
because the reference is not inside the parentheses that are refer- |
because the reference is not inside the parentheses that are refer- |
4775 |
enced. They are always "subroutine" calls, as described in the next |
enced. They are always "subroutine" calls, as described in the next |
4776 |
section. |
section. |
4777 |
|
|
4778 |
An alternative approach is to use named parentheses instead. The Perl |
An alternative approach is to use named parentheses instead. The Perl |
4779 |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
4780 |
supported. We could rewrite the above example as follows: |
supported. We could rewrite the above example as follows: |
4781 |
|
|
4782 |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
4783 |
|
|
4784 |
If there is more than one subpattern with the same name, the earliest |
If there is more than one subpattern with the same name, the earliest |
4785 |
one is used. |
one is used. |
4786 |
|
|
4787 |
This particular example pattern that we have been looking at contains |
This particular example pattern that we have been looking at contains |
4788 |
nested unlimited repeats, and so the use of atomic grouping for match- |
nested unlimited repeats, and so the use of atomic grouping for match- |
4789 |
ing strings of non-parentheses is important when applying the pattern |
ing strings of non-parentheses is important when applying the pattern |
4790 |
to strings that do not match. For example, when this pattern is applied |
to strings that do not match. For example, when this pattern is applied |
4791 |
to |
to |
4792 |
|
|
4793 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
4794 |
|
|
4795 |
it yields "no match" quickly. However, if atomic grouping is not used, |
it yields "no match" quickly. However, if atomic grouping is not used, |
4796 |
the match runs for a very long time indeed because there are so many |
the match runs for a very long time indeed because there are so many |
4797 |
different ways the + and * repeats can carve up the subject, and all |
different ways the + and * repeats can carve up the subject, and all |
4798 |
have to be tested before failure can be reported. |
have to be tested before failure can be reported. |
4799 |
|
|
4800 |
At the end of a match, the values set for any capturing subpatterns are |
At the end of a match, the values set for any capturing subpatterns are |
4801 |
those from the outermost level of the recursion at which the subpattern |
those from the outermost level of the recursion at which the subpattern |
4802 |
value is set. If you want to obtain intermediate values, a callout |
value is set. If you want to obtain intermediate values, a callout |
4803 |
function can be used (see below and the pcrecallout documentation). If |
function can be used (see below and the pcrecallout documentation). If |
4804 |
the pattern above is matched against |
the pattern above is matched against |
4805 |
|
|
4806 |
(ab(cd)ef) |
(ab(cd)ef) |
4807 |
|
|
4808 |
the value for the capturing parentheses is "ef", which is the last |
the value for the capturing parentheses is "ef", which is the last |
4809 |
value taken on at the top level. If additional parentheses are added, |
value taken on at the top level. If additional parentheses are added, |
4810 |
giving |
giving |
4811 |
|
|
4812 |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
4813 |
^ ^ |
^ ^ |
4814 |
^ ^ |
^ ^ |
4815 |
|
|
4816 |
the string they capture is "ab(cd)ef", the contents of the top level |
the string they capture is "ab(cd)ef", the contents of the top level |
4817 |
parentheses. If there are more than 15 capturing parentheses in a pat- |
parentheses. If there are more than 15 capturing parentheses in a pat- |
4818 |
tern, PCRE has to obtain extra memory to store data during a recursion, |
tern, PCRE has to obtain extra memory to store data during a recursion, |
4819 |
which it does by using pcre_malloc, freeing it via pcre_free after- |
which it does by using pcre_malloc, freeing it via pcre_free after- |
4820 |
wards. If no memory can be obtained, the match fails with the |
wards. If no memory can be obtained, the match fails with the |
4821 |
PCRE_ERROR_NOMEMORY error. |
PCRE_ERROR_NOMEMORY error. |
4822 |
|
|
4823 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
4824 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
4825 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
4826 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
4827 |
ted at the outer level. |
ted at the outer level. |
4828 |
|
|
4829 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
4830 |
|
|
4831 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
4832 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
4833 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
4834 |
|
|
4835 |
|
Recursion difference from Perl |
4836 |
|
|
4837 |
|
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
4838 |
|
always treated as an atomic group. That is, once it has matched some of |
4839 |
|
the subject string, it is never re-entered, even if it contains untried |
4840 |
|
alternatives and there is a subsequent matching failure. This can be |
4841 |
|
illustrated by the following pattern, which purports to match a palin- |
4842 |
|
dromic string that contains an odd number of characters (for example, |
4843 |
|
"a", "aba", "abcba", "abcdcba"): |
4844 |
|
|
4845 |
|
^(.|(.)(?1)\2)$ |
4846 |
|
|
4847 |
|
The idea is that it either matches a single character, or two identical |
4848 |
|
characters surrounding a sub-palindrome. In Perl, this pattern works; |
4849 |
|
in PCRE it does not if the subject string is longer than three characters. |
4850 |
|
Consider the subject string "abcba": |
4851 |
|
|
4852 |
|
At the top level, the first character is matched, but as it is not at |
4853 |
|
the end of the string, the first alternative fails; the second alterna- |
4854 |
|
tive is taken and the recursion kicks in. The recursive call to subpat- |
4855 |
|
tern 1 successfully matches the next character ("b"). (Note that the |
4856 |
|
beginning and end of line tests are not part of the recursion.) |
4857 |
|
|
4858 |
|
Back at the top level, the next character ("c") is compared with what |
4859 |
|
subpattern 2 matched, which was "a". This fails. Because the recursion |
4860 |
|
is treated as an atomic group, there are now no backtracking points, |
4861 |
|
and so the entire match fails. (Perl is able, at this point, to re- |
4862 |
|
enter the recursion and try the second alternative.) However, if the |
4863 |
|
pattern is written with the alternatives in the other order, things are |
4864 |
|
different: |
4865 |
|
|
4866 |
|
^((.)(?1)\2|.)$ |
4867 |
|
|
4868 |
|
This time, the recursing alternative is tried first, and continues to |
4869 |
|
recurse until it runs out of characters, at which point the recursion |
4870 |
|
fails. But this time we do have another alternative to try at the |
4871 |
|
higher level. That is the big difference: in the previous case the |
4872 |
|
remaining alternative is at a deeper recursion level, which PCRE cannot |
4873 |
|
use. |
4874 |
|
|
4875 |
|
To change the pattern so that it matches all palindromic strings, not just |
4876 |
|
those with an odd number of characters, it is tempting to change the |
4877 |
|
pattern to this: |
4878 |
|
|
4879 |
|
^((.)(?1)\2|.?)$ |
4880 |
|
|
4881 |
|
Again, this works in Perl, but not in PCRE, and for the same reason. |
4882 |
|
When a deeper recursion has matched a single character, it cannot be |
4883 |
|
entered again in order to match an empty string. The solution is to |
4884 |
|
separate the two cases, and write out the odd and even cases as alter- |
4885 |
|
natives at the higher level: |
4886 |
|
|
4887 |
|
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
4888 |
|
|
4889 |
|
If you want to match typical palindromic phrases, the pattern has to |
4890 |
|
ignore all non-word characters, which can be done like this: |
4891 |
|
|
4892 |
|
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
4893 |
|
|
4894 |
|
If run with the PCRE_CASELESS option, this pattern matches phrases such |
4895 |
|
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
4896 |
|
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
4897 |
|
ing into sequences of non-word characters. Without this, PCRE takes a |
4898 |
|
great deal longer (ten times or more) to match typical phrases, and |
4899 |
|
Perl takes so long that you think it has gone into a loop. |
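
As a rough cross-check of what the phrase pattern is meant to accept, the same test can be written without a regular expression engine doing the recursion. This is a sketch under stated assumptions: the helper name is_palindromic_phrase is hypothetical, Python's re module is used only to strip non-word characters, and lower-casing stands in for PCRE_CASELESS.

```python
import re


def is_palindromic_phrase(text):
    # Remove non-word characters (the job of \W*+ in the pattern) and
    # fold case (the job of PCRE_CASELESS), then compare with the reverse.
    letters = re.sub(r"\W", "", text, flags=re.ASCII).lower()
    return letters == letters[::-1]


panama = is_palindromic_phrase("A man, a plan, a canal: Panama!")
other = is_palindromic_phrase("Not a palindrome")
```

Such a helper is handy for generating expected results when testing the pattern itself against a list of phrases.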
4900 |
|
|
4901 |
|
|
4902 |
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
4903 |
|
|
4904 |
If the syntax for a recursive subpattern reference (either by number or |
If the syntax for a recursive subpattern reference (either by number or |
4905 |
by name) is used outside the parentheses to which it refers, it oper- |
by name) is used outside the parentheses to which it refers, it oper- |
4906 |
ates like a subroutine in a programming language. The "called" subpat- |
ates like a subroutine in a programming language. The "called" subpat- |
4907 |
tern may be defined before or after the reference. A numbered reference |
tern may be defined before or after the reference. A numbered reference |
4908 |
can be absolute or relative, as in these examples: |
can be absolute or relative, as in these examples: |
4909 |
|
|
4915 |
|
|
4916 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
4917 |
|
|
4918 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
4919 |
not "sense and responsibility". If instead the pattern |
not "sense and responsibility". If instead the pattern |
4920 |
|
|
4921 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
4922 |
|
|
4923 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
4924 |
two strings. Another example is given in the discussion of DEFINE |
two strings. Another example is given in the discussion of DEFINE |
4925 |
above. |
above. |
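
The difference between the two patterns can be tried out with a short script. Python's standard re module is not PCRE and does not accept (?1), so the subroutine call is emulated here by writing the subpattern text out again at the call site; that substitution is an assumption made for illustration only.

```python
import re

# Back reference: the second part must repeat what group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again, so
# either alternative may match the second time.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

backref_results = [bool(backref.fullmatch(s)) for s in subjects]
subroutine_results = [bool(subroutine_like.fullmatch(s)) for s in subjects]
```

Only the third subject separates the two forms: the back reference rejects it, while the repeated subpattern accepts it.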
4926 |
|
|
4927 |
Like recursive subpatterns, a "subroutine" call is always treated as an |
Like recursive subpatterns, a "subroutine" call is always treated as an |
4928 |
atomic group. That is, once it has matched some of the subject string, |
atomic group. That is, once it has matched some of the subject string, |
4929 |
it is never re-entered, even if it contains untried alternatives and |
it is never re-entered, even if it contains untried alternatives and |
4930 |
there is a subsequent matching failure. |
there is a subsequent matching failure. |
4931 |
|
|
4932 |
When a subpattern is used as a subroutine, processing options such as |
When a subpattern is used as a subroutine, processing options such as |
4933 |
case-independence are fixed when the subpattern is defined. They cannot |
case-independence are fixed when the subpattern is defined. They cannot |
4934 |
be changed for different calls. For example, consider this pattern: |
be changed for different calls. For example, consider this pattern: |
4935 |
|
|
4936 |
(abc)(?i:(?-1)) |
(abc)(?i:(?-1)) |
4937 |
|
|
4938 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
4939 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
4940 |
|
|
4941 |
|
|
4942 |
ONIGURUMA SUBROUTINE SYNTAX |
ONIGURUMA SUBROUTINE SYNTAX |
4943 |
|
|
4944 |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
4945 |
name or a number enclosed either in angle brackets or single quotes, is |
name or a number enclosed either in angle brackets or single quotes, is |
4946 |
an alternative syntax for referencing a subpattern as a subroutine, |
an alternative syntax for referencing a subpattern as a subroutine, |
4947 |
possibly recursively. Here are two of the examples used above, rewrit- |
possibly recursively. Here are two of the examples used above, rewrit- |
4948 |
ten using this syntax: |
ten using this syntax: |
4949 |
|
|
4950 |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
4951 |
(sens|respons)e and \g'1'ibility |
(sens|respons)e and \g'1'ibility |
4952 |
|
|
4953 |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
4954 |
plus or a minus sign it is taken as a relative reference. For example: |
plus or a minus sign it is taken as a relative reference. For example: |
4955 |
|
|
4956 |
(abc)(?i:\g<-1>) |
(abc)(?i:\g<-1>) |
4957 |
|
|
4958 |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
4959 |
synonymous. The former is a back reference; the latter is a subroutine |
synonymous. The former is a back reference; the latter is a subroutine |
4960 |
call. |
call. |
4961 |
|
|
4962 |
|
|
4963 |
CALLOUTS |
CALLOUTS |
4964 |
|
|
4965 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
4966 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
4967 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
4968 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
4969 |
tion. |
tion. |
4970 |
|
|
4971 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
4972 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
4973 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
4974 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
4975 |
all calling out. |
all calling out. |
4976 |
|
|
4977 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
4978 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
4979 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
4980 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
4981 |
points: |
points: |
4982 |
|
|
4983 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
4984 |
|
|
4985 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
4986 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
4987 |
numbered 255. |
numbered 255. |
4988 |
|
|
4989 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
4990 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
4991 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
4992 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
4993 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
4994 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
4995 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
4996 |
|
|
4997 |
|
|
4998 |
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
4999 |
|
|
5000 |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
5001 |
which are described in the Perl documentation as "experimental and sub- |
which are described in the Perl documentation as "experimental and sub- |
5002 |
ject to change or removal in a future version of Perl". It goes on to |
ject to change or removal in a future version of Perl". It goes on to |
5003 |
say: "Their usage in production code should be noted to avoid problems |
say: "Their usage in production code should be noted to avoid problems |
5004 |
during upgrades." The same remarks apply to the PCRE features described |
during upgrades." The same remarks apply to the PCRE features described |
5005 |
in this section. |
in this section. |
5006 |
|
|
5007 |
Since these verbs are specifically related to backtracking, most of |
Since these verbs are specifically related to backtracking, most of |
5008 |
them can be used only when the pattern is to be matched using |
them can be used only when the pattern is to be matched using |
5009 |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
5010 |
(*FAIL), which behaves like a failing negative assertion, they cause an |
(*FAIL), which behaves like a failing negative assertion, they cause an |
5011 |
error if encountered by pcre_dfa_exec(). |
error if encountered by pcre_dfa_exec(). |
5012 |
|
|
5013 |
|
If any of these verbs are used in an assertion subpattern, their effect |
5014 |
|
is confined to that subpattern; it does not extend to the surrounding |
5015 |
|
pattern. Note that assertion subpatterns are processed as anchored at |
5016 |
|
the point where they are tested. |
5017 |
|
|
5018 |
The new verbs make use of what was previously invalid syntax: an open- |
The new verbs make use of what was previously invalid syntax: an open- |
5019 |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
5020 |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
5029 |
|
|
5030 |
This verb causes the match to end successfully, skipping the remainder |
This verb causes the match to end successfully, skipping the remainder |
5031 |
of the pattern. When inside a recursion, only the innermost pattern is |
of the pattern. When inside a recursion, only the innermost pattern is |
5032 |
ended immediately. PCRE differs from Perl in what happens if the |
ended immediately. If the (*ACCEPT) is inside capturing parentheses, |
5033 |
(*ACCEPT) is inside capturing parentheses. In Perl, the data so far is |
the data so far is captured. (This feature was added to PCRE at release |
5034 |
captured: in PCRE no data is captured. For example: |
8.00.) For example: |
5035 |
|
|
5036 |
A(A|B(*ACCEPT)|C)D |
A((?:A|B(*ACCEPT)|C)D) |
5037 |
|
|
5038 |
This matches "AB", "AAD", or "ACD", but when it matches "AB", no data |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
5039 |
is captured. |
tured by the outer parentheses. |
5040 |
|
|
5041 |
(*FAIL) or (*F) |
(*FAIL) or (*F) |
5042 |
|
|
5132 |
|
|
5133 |
REVISION |
REVISION |
5134 |
|
|
5135 |
Last updated: 11 April 2009 |
Last updated: 18 September 2009 |
5136 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
5137 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5138 |
|
|
5546 |
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but |
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but |
5547 |
matching continues as normal, and other alternatives in the pattern are |
matching continues as normal, and other alternatives in the pattern are |
5548 |
tried. If no complete match can be found, pcre_exec() returns |
tried. If no complete match can be found, pcre_exec() returns |
5549 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH, and if there are at |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least |
5550 |
least two slots in the offsets vector, they are filled in with the off- |
two slots in the offsets vector, the first of them is set to the offset |
5551 |
sets of the longest string that partially matched. Consider this pat- |
of the earliest character that was inspected when the partial match was |
5552 |
tern: |
found. For convenience, the second offset points to the end of the |
5553 |
|
string so that a substring can easily be extracted. |
5554 |
|
|
5555 |
|
For the majority of patterns, the first offset identifies the start of |
5556 |
|
the partially matched string. However, for patterns that contain look- |
5557 |
|
behind assertions, or \K, or begin with \b or \B, earlier characters |
5558 |
|
have been inspected while carrying out the match. For example: |
5559 |
|
|
5560 |
|
/(?<=abc)123/ |
5561 |
|
|
5562 |
|
This pattern matches "123", but only if it is preceded by "abc". If the |
5563 |
|
subject string is "xyzabc12", the offsets after a partial match are for |
5564 |
|
the substring "abc12", because all these characters are needed if |
5565 |
|
another match is tried with extra characters added. |
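
The reason for returning the earlier characters can be seen by carrying the reported substring over to the next segment. The sketch below uses Python's re module, which has no partial matching, so the offsets 3 and 8 (covering "abc12" in "xyzabc12") are written in by hand as an assumption about what the partial match reports.

```python
import re

pattern = re.compile(r"(?<=abc)123")

first_segment = "xyzabc12"
ovector = (3, 8)  # assumed offsets after the partial match: "abc12"

# Keep everything from the earliest inspected character onwards.
retained = first_segment[ovector[0]:ovector[1]]

# Retry once the next character of the subject has arrived.
found_with_context = pattern.search(retained + "3")

# Keeping only "12" loses the text that the lookbehind needs.
found_without_context = pattern.search(first_segment[6:] + "3")
```

With the retained text "abc12" the retry finds "123" at offsets 3 to 6; with only "12" retained the lookbehind has nothing to inspect and the retry fails.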
5566 |
|
|
5567 |
|
If there is more than one partial match, the first one that was found |
5568 |
|
provides the data that is returned. Consider this pattern: |
5569 |
|
|
5570 |
/123\w+X|dogY/ |
/123\w+X|dogY/ |
5571 |
|
|
5573 |
natives fail to match, but the end of the subject is reached during |
natives fail to match, but the end of the subject is reached during |
5574 |
matching, so PCRE_ERROR_PARTIAL is returned instead of |
matching, so PCRE_ERROR_PARTIAL is returned instead of |
5575 |
PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying |
PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying |
5576 |
"123dog" as the longest partial match that was found. (In this example, |
"123dog" as the first partial match that was found. (In this example, |
5577 |
there are two partial matches, because "dog" on its own partially |
there are two partial matches, because "dog" on its own partially |
5578 |
matches the second alternative.) |
matches the second alternative.) |
5579 |
|
|
5617 |
there have been no complete matches. Otherwise, the complete matches |
there have been no complete matches. Otherwise, the complete matches |
5618 |
are returned. However, if PCRE_PARTIAL_HARD is set, a partial match |
are returned. However, if PCRE_PARTIAL_HARD is set, a partial match |
5619 |
takes precedence over any complete matches. The portion of the string |
takes precedence over any complete matches. The portion of the string |
5620 |
that provided the longest partial match is set as the first matching |
that was inspected when the longest partial match was found is set as |
5621 |
string, provided there are at least two slots in the offsets vector. |
the first matching string, provided there are at least two slots in the |
5622 |
|
offsets vector. |
5623 |
|
|
5624 |
Because pcre_dfa_exec() always searches for all possible matches, and |
Because pcre_dfa_exec() always searches for all possible matches, and |
5625 |
there is no difference between greedy and ungreedy repetition, its be- |
there is no difference between greedy and ungreedy repetition, its be- |
5626 |
haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con- |
haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con- |
5627 |
sider the string "dog" matched against the ungreedy pattern shown |
sider the string "dog" matched against the ungreedy pattern shown |
5628 |
above: |
above: |
5629 |
|
|
5630 |
/dog(sbody)??/ |
/dog(sbody)??/ |
5631 |
|
|
5632 |
Whereas pcre_exec() stops as soon as it finds the complete match for |
Whereas pcre_exec() stops as soon as it finds the complete match for |
5633 |
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and |
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and |
5634 |
so returns that when PCRE_PARTIAL_HARD is set. |
so returns that when PCRE_PARTIAL_HARD is set. |
5635 |
|
|
5636 |
|
|
5637 |
PARTIAL MATCHING AND WORD BOUNDARIES |
PARTIAL MATCHING AND WORD BOUNDARIES |
5638 |
|
|
5639 |
If a pattern ends with one of the sequences \b or \B, which test for word |
If a pattern ends with one of the sequences \b or \B, which test for word |
5640 |
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter- |
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter- |
5641 |
intuitive results. Consider this pattern: |
intuitive results. Consider this pattern: |
5642 |
|
|
5643 |
/\bcat\b/ |
/\bcat\b/ |
5644 |
|
|
5645 |
This matches "cat", provided there is a word boundary at either end. If |
This matches "cat", provided there is a word boundary at either end. If |
5646 |
the subject string is "the cat", the comparison of the final "t" with a |
the subject string is "the cat", the comparison of the final "t" with a |
5647 |
following character cannot take place, so a partial match is found. |
following character cannot take place, so a partial match is found. |
5648 |
However, pcre_exec() carries on with normal matching, which matches \b |
However, pcre_exec() carries on with normal matching, which matches \b |
5649 |
at the end of the subject when the last character is a letter, thus |
at the end of the subject when the last character is a letter, thus |
5650 |
finding a complete match. The result, therefore, is not PCRE_ERROR_PAR- |
finding a complete match. The result, therefore, is not PCRE_ERROR_PAR- |
5651 |
TIAL. The same thing happens with pcre_dfa_exec(), because it also |
TIAL. The same thing happens with pcre_dfa_exec(), because it also |
5652 |
finds the complete match. |
finds the complete match. |
5653 |
|
|
5654 |
Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, |
Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, |
5655 |
because then the partial match takes precedence. |
because then the partial match takes precedence. |
5656 |
|
|
5657 |
|
|
5658 |
FORMERLY RESTRICTED PATTERNS |
FORMERLY RESTRICTED PATTERNS |
5659 |
|
|
5660 |
For releases of PCRE prior to 8.00, because of the way certain internal |
For releases of PCRE prior to 8.00, because of the way certain internal |
5661 |
optimizations were implemented in the pcre_exec() function, the |
optimizations were implemented in the pcre_exec() function, the |
5662 |
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be |
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be |
5663 |
used with all patterns. From release 8.00 onwards, the restrictions no |
used with all patterns. From release 8.00 onwards, the restrictions no |
5664 |
longer apply, and partial matching with pcre_exec() can be requested |
longer apply, and partial matching with pcre_exec() can be requested |
5665 |
for any pattern. |
for any pattern. |
5666 |
|
|
5667 |
Items that were formerly restricted were repeated single characters and |
Items that were formerly restricted were repeated single characters and |
5668 |
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did |
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did |
5669 |
not conform to the restrictions, pcre_exec() returned the error code |
not conform to the restrictions, pcre_exec() returned the error code |
5670 |
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The |
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The |
5671 |
PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled |
PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled |
5672 |
pattern can be used for partial matching now always returns 1. |
pattern can be used for partial matching now always returns 1. |
5673 |
|
|
5674 |
|
|
5675 |
EXAMPLE OF PARTIAL MATCHING USING PCRETEST |
EXAMPLE OF PARTIAL MATCHING USING PCRETEST |
5676 |
|
|
5677 |
If the escape sequence \P is present in a pcretest data line, the |
If the escape sequence \P is present in a pcretest data line, the |
5678 |
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of |
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of |
5679 |
pcretest that uses the date example quoted above: |
pcretest that uses the date example quoted above: |
5680 |
|
|
5681 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
5691 |
data> j\P |
data> j\P |
5692 |
No match |
No match |
5693 |
|
|
5694 |
The first data string is matched completely, so pcretest shows the |
The first data string is matched completely, so pcretest shows the |
5695 |
matched substrings. The remaining four strings do not match the com- |
matched substrings. The remaining four strings do not match the com- |
5696 |
plete pattern, but the first two are partial matches. Similar output is |
plete pattern, but the first two are partial matches. Similar output is |
5697 |
obtained when pcre_dfa_exec() is used. |
obtained when pcre_dfa_exec() is used. |
5698 |
|
|
5699 |
If the escape sequence \P is present more than once in a pcretest data |
If the escape sequence \P is present more than once in a pcretest data |
5700 |
line, the PCRE_PARTIAL_HARD option is set for the match. |
line, the PCRE_PARTIAL_HARD option is set for the match. |
5701 |
|
|
5702 |
|
|
5703 |
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() |
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() |
5704 |
|
|
5705 |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
5706 |
ble to continue the match by providing additional subject data and |
ble to continue the match by providing additional subject data and |
5707 |
calling pcre_dfa_exec() again with the same compiled regular expres- |
calling pcre_dfa_exec() again with the same compiled regular expres- |
5708 |
sion, this time setting the PCRE_DFA_RESTART option. You must pass the |
sion, this time setting the PCRE_DFA_RESTART option. You must pass the |
5709 |
same working space as before, because this is where details of the pre- |
same working space as before, because this is where details of the pre- |
5710 |
vious partial match are stored. Here is an example using pcretest, |
vious partial match are stored. Here is an example using pcretest, |
5711 |
using the \R escape sequence to set the PCRE_DFA_RESTART option (\D |
using the \R escape sequence to set the PCRE_DFA_RESTART option (\D |
5712 |
specifies the use of pcre_dfa_exec()): |
specifies the use of pcre_dfa_exec()): |
5713 |
|
|
5714 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
5717 |
data> n05\R\D |
data> n05\R\D |
5718 |
0: n05 |
0: n05 |
5719 |
|
|
5720 |
The first call has "23ja" as the subject, and requests partial match- |
The first call has "23ja" as the subject, and requests partial match- |
5721 |
ing; the second call has "n05" as the subject for the continued |
ing; the second call has "n05" as the subject for the continued |
5722 |
(restarted) match. Notice that when the match is complete, only the |
(restarted) match. Notice that when the match is complete, only the |
5723 |
last part is shown; PCRE does not retain the previously partially- |
last part is shown; PCRE does not retain the previously partially- |
5724 |
matched string. It is up to the calling program to do that if it needs |
matched string. It is up to the calling program to do that if it needs |
5725 |
to. |
to. |
5726 |
|
|
5727 |
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with |
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with |
5728 |
PCRE_DFA_RESTART to continue partial matching over multiple segments. |
PCRE_DFA_RESTART to continue partial matching over multiple segments. |
5729 |
This facility can be used to pass very long subject strings to |
This facility can be used to pass very long subject strings to |
5730 |
pcre_dfa_exec(). |
pcre_dfa_exec(). |
5731 |
|
|
5732 |
|
|
5733 |
MULTI-SEGMENT MATCHING WITH pcre_exec() |
MULTI-SEGMENT MATCHING WITH pcre_exec() |
5734 |
|
|
5735 |
From release 8.00, pcre_exec() can also be used to do multi-segment |
From release 8.00, pcre_exec() can also be used to do multi-segment |
5736 |
matching. Unlike pcre_dfa_exec(), it is not possible to restart the |
matching. Unlike pcre_dfa_exec(), it is not possible to restart the |
5737 |
previous match with a new segment of data. Instead, new data must be |
previous match with a new segment of data. Instead, new data must be |
5738 |
added to the previous subject string, and the entire match re-run, |
added to the previous subject string, and the entire match re-run, |
5739 |
starting from the point where the partial match occurred. Earlier data |
starting from the point where the partial match occurred. Earlier data |
5740 |
can be discarded. Consider an unanchored pattern that matches dates: |
can be discarded. Consider an unanchored pattern that matches dates: |
5741 |
|
|
5742 |
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
5744 |
Partial match: 23ja |
Partial match: 23ja |
5745 |
|
|
5746 |
At this stage, an application could discard the text preceding "23ja", |
At this stage, an application could discard the text preceding "23ja", |
5747 |
add on text from the next segment, and call pcre_exec() again. Unlike |
add on text from the next segment, and call pcre_exec() again. Unlike |
5748 |
pcre_dfa_exec(), the entire matching string must always be available, |
pcre_dfa_exec(), the entire matching string must always be available, |
5749 |
and the complete matching process occurs for each call, so more memory |
and the complete matching process occurs for each call, so more memory |
5750 |
and more processing time are needed. |
and more processing time are needed. |
5751 |
|
|
5752 |
|
Note: If the pattern contains lookbehind assertions, or \K, or starts |
5753 |
|
with \b or \B, the string that is returned for a partial match will |
5754 |
|
include characters that precede the partially matched string itself, |
5755 |
|
because these must be retained when adding on more characters for a |
5756 |
|
subsequent matching attempt. |
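
The buffer handling described above for pcre_exec() is left to the calling program. The following is a minimal sketch of that step only, under stated assumptions: carry_over() is a hypothetical helper that is not part of PCRE, and keep_from stands for the offset at which the returned partial match (including any retained preceding characters) begins. The call to pcre_exec() itself is not shown.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE.
 * buf holds len bytes of subject data, and keep_from is the offset where
 * the partial match (plus any retained context) starts. The helper drops
 * buf[0..keep_from), appends the next segment, and returns a new
 * zero-terminated buffer, storing its length in *new_len.
 * On allocation failure it returns NULL and leaves buf untouched. */
char *carry_over(char *buf, size_t len, size_t keep_from,
                 const char *next, size_t next_len, size_t *new_len)
{
    size_t kept = len - keep_from;            /* bytes that must survive */
    char *out = malloc(kept + next_len + 1);

    if (out == NULL)
        return NULL;
    memcpy(out, buf + keep_from, kept);       /* tail of the old subject */
    memcpy(out + kept, next, next_len);       /* newly arrived segment   */
    out[kept + next_len] = '\0';
    *new_len = kept + next_len;
    free(buf);                                /* earlier data is discarded */
    return out;
}
```

The returned buffer would then be passed as the whole subject to the next matching call, which reruns the match from its start.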
5757 |
|
|
5758 |
|
|
5759 |
ISSUES WITH MULTI-SEGMENT MATCHING |
ISSUES WITH MULTI-SEGMENT MATCHING |
5760 |
|
|
5761 |
Certain types of pattern may give problems with multi-segment matching, |
Certain types of pattern may give problems with multi-segment matching, |
5762 |
whichever matching function is used. |
whichever matching function is used. |
5763 |
|
|
5764 |
1. If the pattern contains tests for the beginning or end of a line, |
1. If the pattern contains tests for the beginning or end of a line, |
5765 |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
5766 |
ate, when the subject string for any call does not contain the begin- |
ate, when the subject string for any call does not contain the begin- |
5767 |
ning or end of a line. |
ning or end of a line. |
5768 |
|
|
5769 |
2. If the pattern contains backward assertions (including \b or \B), |
2. Lookbehind assertions at the start of a pattern are catered for in |
5770 |
you need to arrange for some overlap in the subject strings to allow |
the offsets that are returned for a partial match. However, in theory, |
5771 |
for them to be correctly tested at the start of each substring. For |
a lookbehind assertion later in the pattern could require even earlier |
5772 |
example, using pcre_dfa_exec(), you could pass the subject in chunks |
characters to be inspected, and it might not have been reached when a |
5773 |
that are 500 bytes long, but in a buffer of 700 bytes, with the start- |
partial match occurs. This is probably an extremely unlikely case; you |
5774 |
ing offset set to 200 and the previous 200 bytes at the start of the |
could guard against it to a certain extent by always including extra |
5775 |
buffer. |
characters at the start. |
5776 |
|
|
5777 |
3. Matching a subject string that is split into multiple segments may |
3. Matching a subject string that is split into multiple segments may |
5778 |
not always produce exactly the same result as matching over one single |
not always produce exactly the same result as matching over one single |
5779 |
long string, especially when PCRE_PARTIAL_SOFT is used. The section |
long string, especially when PCRE_PARTIAL_SOFT is used. The section |
5780 |
"Partial Matching and Word Boundaries" above describes an issue that |
"Partial Matching and Word Boundaries" above describes an issue that |
5781 |
arises if the pattern ends with \b or \B. Another kind of difference |
arises if the pattern ends with \b or \B. Another kind of difference |
5782 |
may occur when there are multiple matching possibilities, because a |
may occur when there are multiple matching possibilities, because a |
5783 |
partial match result is given only when there are no completed matches. |
partial match result is given only when there are no completed matches. |
5784 |
This means that as soon as the shortest match has been found, continua- |
This means that as soon as the shortest match has been found, continua- |
5785 |
tion to a new subject segment is no longer possible. Consider again |
tion to a new subject segment is no longer possible. Consider again |
5786 |
this pcretest example: |
this pcretest example: |
5787 |
|
|
5788 |
re> /dog(sbody)?/ |
re> /dog(sbody)?/ |
5796 |
0: dogsbody |
0: dogsbody |
5797 |
1: dog |
1: dog |
5798 |
|
|
5799 |
The first data line passes the string "dogsb" to pcre_exec(), setting |
The first data line passes the string "dogsb" to pcre_exec(), setting |
5800 |
the PCRE_PARTIAL_SOFT option. Although the string is a partial match |
the PCRE_PARTIAL_SOFT option. Although the string is a partial match |
5801 |
for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the |
for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the |
5802 |
shorter string "dog" is a complete match. Similarly, when the subject |
shorter string "dog" is a complete match. Similarly, when the subject |
5803 |
is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being |
is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being |
5804 |
the first two) the match stops when "dog" has been found, and it is not |
the first two) the match stops when "dog" has been found, and it is not |
5805 |
possible to continue. On the other hand, if "dogsbody" is presented as |
possible to continue. On the other hand, if "dogsbody" is presented as |
5806 |
a single string, pcre_dfa_exec() finds both matches. |
a single string, pcre_dfa_exec() finds both matches. |
5807 |
|
|
5808 |
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD |
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD |
5809 |
when matching multi-segment data. The example above then behaves dif- |
when matching multi-segment data. The example above then behaves dif- |
5810 |
ferently: |
ferently: |
5811 |
|
|
5812 |
re> /dog(sbody)?/ |
re> /dog(sbody)?/ |
5819 |
|
|
5820 |
|
|
5821 |
4. Patterns that contain alternatives at the top level which do not all |
4. Patterns that contain alternatives at the top level which do not all |
5822 |
start with the same pattern item may not work as expected when |
start with the same pattern item may not work as expected when |
5823 |
pcre_dfa_exec() is used. For example, consider this pattern: |
pcre_dfa_exec() is used. For example, consider this pattern: |
5824 |
|
|
5825 |
1234|3789 |
1234|3789 |
5826 |
|
|
5827 |
If the first part of the subject is "ABC123", a partial match of the |
If the first part of the subject is "ABC123", a partial match of the |
5828 |
first alternative is found at offset 3. There is no partial match for |
first alternative is found at offset 3. There is no partial match for |
5829 |
the second alternative, because such a match does not start at the same |
the second alternative, because such a match does not start at the same |
5830 |
point in the subject string. Attempting to continue with the string |
point in the subject string. Attempting to continue with the string |
5831 |
"7890" does not yield a match because only those alternatives that |
"7890" does not yield a match because only those alternatives that |
5832 |
match at one point in the subject are remembered. The problem arises |
match at one point in the subject are remembered. The problem arises |
5833 |
because the start of the second alternative matches within the first |
because the start of the second alternative matches within the first |
5834 |
alternative. There is no problem with anchored patterns or patterns |
alternative. There is no problem with anchored patterns or patterns |
5835 |
such as: |
such as: |
5836 |
|
|
5837 |
1234|ABCD |
1234|ABCD |
5838 |
|
|
5839 |
where no string can be a partial match for both alternatives. This is |
where no string can be a partial match for both alternatives. This is |
5840 |
not a problem if pcre_exec() is used, because the entire match has to |
not a problem if pcre_exec() is used, because the entire match has to |
5841 |
be rerun each time: |
be rerun each time: |
5842 |
|
|
5843 |
re> /1234|3789/ |
re> /1234|3789/ |
5856 |
|
|
5857 |
REVISION |
REVISION |
5858 |
|
|
5859 |
Last updated: 31 August 2009 |
Last updated: 05 September 2009 |
5860 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
5861 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5862 |
|
|
6178 |
easier to slot in PCRE as a replacement library. Other POSIX options |
easier to slot in PCRE as a replacement library. Other POSIX options |
6179 |
are not even defined. |
are not even defined. |
6180 |
|
|
6181 |
|
There are also some other options that are not defined by POSIX. These |
6182 |
|
have been added at the request of users who want to make use of certain |
6183 |
|
PCRE-specific features via the POSIX calling interface. |
6184 |
|
|
6185 |
When PCRE is called via these functions, it is only the API that is |
When PCRE is called via these functions, it is only the API that is |
6186 |
POSIX-like in style. The syntax and semantics of the regular expres- |
POSIX-like in style. The syntax and semantics of the regular expres- |
6187 |
sions themselves are still those of Perl, subject to the setting of |
sions themselves are still those of Perl, subject to the setting of |
6236 |
ing, the nmatch and pmatch arguments are ignored, and no captured |
ing, the nmatch and pmatch arguments are ignored, and no captured |
6237 |
strings are returned. |
strings are returned. |
6238 |
|
|
6239 |
|
REG_UNGREEDY |
6240 |
|
|
6241 |
|
The PCRE_UNGREEDY option is set when the regular expression is passed |
6242 |
|
for compilation to the native function. Note that REG_UNGREEDY is not |
6243 |
|
part of the POSIX standard. |
6244 |
|
|
6245 |
REG_UTF8 |
REG_UTF8 |
6246 |
|
|
6247 |
The PCRE_UTF8 option is set when the regular expression is passed for |
The PCRE_UTF8 option is set when the regular expression is passed for |
6254 |
semantics. In particular, the way it handles newline characters in the |
semantics. In particular, the way it handles newline characters in the |
6255 |
subject string is the Perl way, not the POSIX way. Note that setting |
subject string is the Perl way, not the POSIX way. Note that setting |
6256 |
PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE. |
PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE. |
6257 |
It does not affect the way newlines are matched by . (they aren't) or |
It does not affect the way newlines are matched by . (they are not) or |
6258 |
by a negative class such as [^a] (they are). |
by a negative class such as [^a] (they are). |
6259 |
|
|
6260 |
The yield of regcomp() is zero on success, and non-zero otherwise. The |
The yield of regcomp() is zero on success, and non-zero otherwise. The |
6341 |
matched strings is returned. The nmatch and pmatch arguments of |
matched strings is returned. The nmatch and pmatch arguments of |
6342 |
regexec() are ignored. |
regexec() are ignored. |
6343 |
|
|
6344 |
|
If the value of nmatch is zero, or if the value of pmatch is NULL, no data |
6345 |
|
about any matched strings is returned. |
6346 |
|
|
6347 |
Otherwise, the portion of the string that was matched, and also any cap- |
Otherwise, the portion of the string that was matched, and also any cap- |
6348 |
tured substrings, are returned via the pmatch argument, which points to |
tured substrings, are returned via the pmatch argument, which points to |
6349 |
an array of nmatch structures of type regmatch_t, containing the mem- |
an array of nmatch structures of type regmatch_t, containing the mem- |
6350 |
bers rm_so and rm_eo. These contain the offset to the first character |
bers rm_so and rm_eo. These contain the offset to the first character |
6351 |
of each substring and the offset to the first character after the end |
of each substring and the offset to the first character after the end |
6352 |
of each substring, respectively. The 0th element of the vector relates |
of each substring, respectively. The 0th element of the vector relates |
6353 |
to the entire portion of string that was matched; subsequent elements |
to the entire portion of string that was matched; subsequent elements |
6354 |
relate to the capturing subpatterns of the regular expression. Unused |
relate to the capturing subpatterns of the regular expression. Unused |
6355 |
entries in the array have both structure members set to -1. |
entries in the array have both structure members set to -1. |
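
The way rm_so and rm_eo are reported can be sketched as follows. This is an illustration only: it is written against the standard POSIX <regex.h> header so that it builds without PCRE, on the assumption that a build using PCRE's wrapper would include its own POSIX header instead. The pattern and the helper names match_offsets() and print_offsets() are made up for the example.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: match "letters-digits" and report offsets via pmatch.
 * Returns 0 on success, or the non-zero code from regcomp() or regexec(). */
int match_offsets(const char *subject, regmatch_t pmatch[4])
{
    regex_t preg;
    int rc = regcomp(&preg, "([a-z]+)-([0-9]+)", REG_EXTENDED);

    if (rc != 0)
        return rc;
    rc = regexec(&preg, subject, 4, pmatch, 0);
    regfree(&preg);                 /* release memory tied to preg */
    return rc;
}

/* Print every used entry; unused entries have both members set to -1. */
void print_offsets(const char *subject, const regmatch_t *pmatch, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pmatch[i].rm_so == -1)
            continue;
        printf("%zu: %.*s\n", i,
               (int)(pmatch[i].rm_eo - pmatch[i].rm_so),
               subject + pmatch[i].rm_so);
    }
}
```

Element 0 describes the whole match and elements 1 and 2 the two capturing subpatterns. A fourth slot is passed deliberately so that an unused entry can be observed.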
6356 |
|
|
6357 |
A successful match yields a zero return; various error codes are |
A successful match yields a zero return; various error codes are |
6358 |
defined in the header file, of which REG_NOMATCH is the "expected" |
defined in the header file, of which REG_NOMATCH is the "expected" |
6359 |
failure code. |
failure code. |
6360 |
|
|
6361 |
|
|
6362 |
ERROR MESSAGES |
ERROR MESSAGES |
6363 |
|
|
6364 |
The regerror() function maps a non-zero errorcode from either regcomp() |
The regerror() function maps a non-zero errorcode from either regcomp() |
6365 |
or regexec() to a printable message. If preg is not NULL, the error |
or regexec() to a printable message. If preg is not NULL, the error |
6366 |
should have arisen from the use of that structure. A message terminated |
should have arisen from the use of that structure. A message terminated |
6367 |
by a binary zero is placed in errbuf. The length of the message, |
by a binary zero is placed in errbuf. The length of the message, |
6368 |
including the zero, is limited to errbuf_size. The yield of the func- |
including the zero, is limited to errbuf_size. The yield of the func- |
6369 |
tion is the size of buffer needed to hold the whole message. |
tion is the size of buffer needed to hold the whole message. |
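
Because the yield of regerror() is the buffer size needed for the whole message, one common approach is to call it twice: once to learn the size and once to fetch the text. The sketch below assumes the standard POSIX <regex.h> behaviour in which a zero errbuf_size means errbuf is not written to. regex_error_message() is a hypothetical helper name.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd, zero-terminated message for
 * errcode, or NULL if allocation fails. The caller frees the result. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    /* First call: nothing is written, only the required size is returned
     * (the size includes the terminating zero). */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *buf = malloc(needed);

    if (buf != NULL)
        regerror(errcode, preg, buf, needed);   /* second call: fetch text */
    return buf;
}
```

Sizing the buffer this way avoids truncating the message, at the cost of a second call.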
6370 |
|
|
6371 |
|
|
6372 |
MEMORY USAGE |
MEMORY USAGE |
6373 |
|
|
6374 |
Compiling a regular expression causes memory to be allocated and asso- |
Compiling a regular expression causes memory to be allocated and asso- |
6375 |
ciated with the preg structure. The function regfree() frees all such |
ciated with the preg structure. The function regfree() frees all such |
6376 |
memory, after which preg may no longer be used as a compiled expres- |
memory, after which preg may no longer be used as a compiled expres- |
6377 |
sion. |
sion. |
6378 |
|
|
6379 |
|
|
6386 |
|
|
6387 |
REVISION |
REVISION |
6388 |
|
|
6389 |
Last updated: 15 August 2009 |
Last updated: 02 September 2009 |
6390 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
6391 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6392 |
|
|