--- code/trunk/doc/html/pcrematching.html 2007/02/24 21:40:45 77 +++ code/trunk/doc/html/pcrematching.html 2007/04/16 13:25:10 148 @@ -16,9 +16,11 @@
@@ -46,7 +48,7 @@ <something> <something else> <something further> there are three possible answers. The standard algorithm finds only one of -them, whereas the DFA algorithm finds all three. +them, whereas the alternative algorithm finds all three.
@@ -54,13 +56,13 @@ as a tree structure. An unlimited repetition in the pattern makes the tree of infinite size, but it is still a tree. Matching the pattern to a given subject string (from a given starting point) can be thought of as a search of the tree. -There are two standard ways to search a tree: depth-first and breadth-first, -and these correspond to the two matching algorithms provided by PCRE. +There are two ways to search a tree: depth-first and breadth-first, and these +correspond to the two matching algorithms provided by PCRE.
-In the terminology of Jeffrey Friedl's book \fIMastering Regular -Expressions\fP, the standard algorithm is an "NFA algorithm". It conducts a +In the terminology of Jeffrey Friedl's book "Mastering Regular +Expressions", the standard algorithm is an "NFA algorithm". It conducts a depth-first search of the pattern tree. That is, it proceeds along a single path through the tree, checking that the subject matches what is required. When there is a mismatch, the algorithm tries any alternatives at the current point, @@ -83,14 +85,15 @@ matched by portions of the pattern in parentheses. This provides support for capturing parentheses and back references.
--DFA stands for "deterministic finite automaton", but you do not need to -understand the origins of that name. This algorithm conducts a breadth-first -search of the tree. Starting from the first matching point in the subject, it -scans the subject string from left to right, once, character by character, and -as it does this, it remembers all the paths through the tree that represent -valid matches. +This algorithm conducts a breadth-first search of the tree. Starting from the +first matching point in the subject, it scans the subject string from left to +right, once, character by character, and as it does this, it remembers all the +paths through the tree that represent valid matches. In Friedl's terminology, +this is a kind of "DFA algorithm", though it is not implemented as a +traditional finite state machine (it keeps multiple states active +simultaneously).
The scan continues until either the end of the subject is reached, or there are @@ -114,12 +117,21 @@
There are a number of features of PCRE regular expressions that are not -supported by the DFA matching algorithm. They are as follows: +supported by the alternative matching algorithm. They are as follows:
1. Because the algorithm finds all possible matches, the greedy or ungreedy nature of repetition quantifiers is not relevant. Greedy and ungreedy -quantifiers are treated in exactly the same way. +quantifiers are treated in exactly the same way. However, possessive +quantifiers can make a difference when what follows could also match what is +quantified, for example in a pattern like this: +
+ ^a++\w! ++This pattern matches "aaab!" but not "aaa!", which would be matched by a +non-possessive quantifier. Similarly, if an atomic group is present, it is +matched as if it were a standalone pattern at the current point, and the +longest match is then "locked in" for the rest of the overall pattern.
2. When dealing with multiple paths through the tree simultaneously, it is not @@ -133,7 +145,7 @@
4. For the same reason, conditional expressions that use a backreference as the -condition are not supported. +condition or test for a specific group recursion are not supported.
5. Callouts are supported, but the value of the capture_top field is @@ -142,13 +154,13 @@
6. The \C escape sequence, which (in the standard algorithm) matches a single -byte, even in UTF-8 mode, is not supported because the DFA algorithm moves -through the subject string one character at a time, for all active paths +byte, even in UTF-8 mode, is not supported because the alternative algorithm +moves through the subject string one character at a time, for all active paths through the tree.
--Using the DFA matching algorithm provides the following advantages: +Using the alternative matching algorithm provides the following advantages:
1. All possible matches (at a single point in the subject) are automatically @@ -159,17 +171,18 @@
2. There is much better support for partial matching. The restrictions on the content of the pattern that apply when using the standard algorithm for partial -matching do not apply to the DFA algorithm. For non-anchored patterns, the -starting position of a partial match is available. +matching do not apply to the alternative algorithm. For non-anchored patterns, +the starting position of a partial match is available.
-3. Because the DFA algorithm scans the subject string just once, and never -needs to backtrack, it is possible to pass very long subject strings to the -matching function in several pieces, checking for partial matching each time. +3. Because the alternative algorithm scans the subject string just once, and +never needs to backtrack, it is possible to pass very long subject strings to +the matching function in several pieces, checking for partial matching each +time.
--The DFA algorithm suffers from a number of disadvantages: +The alternative algorithm suffers from a number of disadvantages:
1. It is substantially slower than the standard algorithm. This is partly @@ -180,13 +193,24 @@ 2. Capturing parentheses and back references are not supported.
-3. The "atomic group" feature of PCRE regular expressions is supported, but -does not provide the advantage that it does for the standard algorithm. +3. Although atomic groups are supported, their use does not provide the +performance advantage that it does for the standard algorithm. +
+
+Philip Hazel
+
+University Computing Service
+
+Cambridge CB2 3QH, England.
+
-Last updated: 28 February 2005
+Last updated: 06 March 2007
+
+Copyright © 1997-2007 University of Cambridge.
-Copyright © 1997-2005 University of Cambridge.
Return to the PCRE index page.