PCRE MATCHING ALGORITHMS

@@ -46,7 +46,7 @@ <something> <something else> <something further> there are three possible answers. The standard algorithm finds only one of -them, whereas the DFA algorithm finds all three. +them, whereas the alternative algorithm finds all three.

REGULAR EXPRESSIONS AS TREES

@@ -54,8 +54,8 @@ as a tree structure. An unlimited repetition in the pattern makes the tree of infinite size, but it is still a tree. Matching the pattern to a given subject string (from a given starting point) can be thought of as a search of the tree. -There are two standard ways to search a tree: depth-first and breadth-first, -and these correspond to the two matching algorithms provided by PCRE. +There are two ways to search a tree: depth-first and breadth-first, and these +correspond to the two matching algorithms provided by PCRE.

THE STANDARD MATCHING ALGORITHM

@@ -83,14 +83,15 @@ matched by portions of the pattern in parentheses. This provides support for capturing parentheses and back references.

-THE DFA MATCHING ALGORITHM

+

THE ALTERNATIVE MATCHING ALGORITHM

-DFA stands for "deterministic finite automaton", but you do not need to -understand the origins of that name. This algorithm conducts a breadth-first -search of the tree. Starting from the first matching point in the subject, it -scans the subject string from left to right, once, character by character, and -as it does this, it remembers all the paths through the tree that represent -valid matches. +This algorithm conducts a breadth-first search of the tree. Starting from the +first matching point in the subject, it scans the subject string from left to +right, once, character by character, and as it does this, it remembers all the +paths through the tree that represent valid matches. In Friedl's terminology, +this is a kind of "DFA algorithm", though it is not implemented as a +traditional finite state machine (it keeps multiple states active +simultaneously).

The scan continues until either the end of the subject is reached, or there are @@ -114,12 +115,21 @@

There are a number of features of PCRE regular expressions that are not -supported by the DFA matching algorithm. They are as follows: +supported by the alternative matching algorithm. They are as follows:

1. Because the algorithm finds all possible matches, the greedy or ungreedy nature of repetition quantifiers is not relevant. Greedy and ungreedy -quantifiers are treated in exactly the same way. +quantifiers are treated in exactly the same way. However, possessive +quantifiers can make a difference when what follows could also match what is +quantified, for example in a pattern like this: +

+ ^a++\w! ++This pattern matches "aaab!" but not "aaa!", which would be matched by a +non-possessive quantifier. Similarly, if an atomic group is present, it is +matched as if it were a standalone pattern at the current point, and the +longest match is then "locked in" for the rest of the overall pattern.

2. When dealing with multiple paths through the tree simultaneously, it is not @@ -133,7 +143,7 @@

4. For the same reason, conditional expressions that use a backreference as the -condition are not supported. +condition or test for a specific group recursion are not supported.

5. Callouts are supported, but the value of the *capture_top* field is
@@ -142,13 +152,13 @@

6. The \C escape sequence, which (in the standard algorithm) matches a single -byte, even in UTF-8 mode, is not supported because the DFA algorithm moves -through the subject string one character at a time, for all active paths +byte, even in UTF-8 mode, is not supported because the alternative algorithm +moves through the subject string one character at a time, for all active paths through the tree.

-ADVANTAGES OF THE DFA ALGORITHM

+

ADVANTAGES OF THE ALTERNATIVE ALGORITHM

-Using the DFA matching algorithm provides the following advantages: +Using the alternative matching algorithm provides the following advantages:

1. All possible matches (at a single point in the subject) are automatically @@ -159,17 +169,18 @@

2. There is much better support for partial matching. The restrictions on the content of the pattern that apply when using the standard algorithm for partial -matching do not apply to the DFA algorithm. For non-anchored patterns, the -starting position of a partial match is available. +matching do not apply to the alternative algorithm. For non-anchored patterns, +the starting position of a partial match is available.

-3. Because the DFA algorithm scans the subject string just once, and never -needs to backtrack, it is possible to pass very long subject strings to the -matching function in several pieces, checking for partial matching each time. +3. Because the alternative algorithm scans the subject string just once, and +never needs to backtrack, it is possible to pass very long subject strings to +the matching function in several pieces, checking for partial matching each +time.

-DISADVANTAGES OF THE DFA ALGORITHM

+

DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

-The DFA algorithm suffers from a number of disadvantages: +The alternative algorithm suffers from a number of disadvantages:

1. It is substantially slower than the standard algorithm. This is partly @@ -180,13 +191,13 @@ 2. Capturing parentheses and back references are not supported.

-3. The "atomic group" feature of PCRE regular expressions is supported, but -does not provide the advantage that it does for the standard algorithm. +3. Although atomic groups are supported, their use does not provide the +performance advantage that it does for the standard algorithm.

-Last updated: 28 February 2005
+Last updated: 24 November 2006

-Copyright © 1997-2005 University of Cambridge.
+Copyright © 1997-2006 University of Cambridge.

Return to the PCRE index page.