--- code/trunk/doc/html/pcrematching.html 2007/02/24 21:41:34 91 +++ code/trunk/doc/html/pcrematching.html 2007/04/16 13:25:10 148 @@ -16,9 +16,11 @@
• PCRE MATCHING ALGORITHMS
• REGULAR EXPRESSIONS AS TREES
• THE STANDARD MATCHING ALGORITHM -
• THE DFA MATCHING ALGORITHM -
• ADVANTAGES OF THE DFA ALGORITHM -
• DISADVANTAGES OF THE DFA ALGORITHM +
• THE ALTERNATIVE MATCHING ALGORITHM +
• ADVANTAGES OF THE ALTERNATIVE ALGORITHM +
• DISADVANTAGES OF THE ALTERNATIVE ALGORITHM +
• AUTHOR +
• REVISION
PCRE MATCHING ALGORITHMS

@@ -46,7 +48,7 @@ <something> <something else> <something further> there are three possible answers. The standard algorithm finds only one of -them, whereas the DFA algorithm finds all three. +them, whereas the alternative algorithm finds all three.

REGULAR EXPRESSIONS AS TREES

@@ -59,8 +61,8 @@

THE STANDARD MATCHING ALGORITHM

-In the terminology of Jeffrey Friedl's book \fIMastering Regular -Expressions\fP, the standard algorithm is an "NFA algorithm". It conducts a +In the terminology of Jeffrey Friedl's book "Mastering Regular +Expressions", the standard algorithm is an "NFA algorithm". It conducts a depth-first search of the pattern tree. That is, it proceeds along a single path through the tree, checking that the subject matches what is required. When there is a mismatch, the algorithm tries any alternatives at the current point, @@ -83,14 +85,15 @@ matched by portions of the pattern in parentheses. This provides support for capturing parentheses and back references.

-
THE DFA MATCHING ALGORITHM
+
THE ALTERNATIVE MATCHING ALGORITHM

-DFA stands for "deterministic finite automaton", but you do not need to -understand the origins of that name. This algorithm conducts a breadth-first -search of the tree. Starting from the first matching point in the subject, it -scans the subject string from left to right, once, character by character, and -as it does this, it remembers all the paths through the tree that represent -valid matches. +This algorithm conducts a breadth-first search of the tree. Starting from the +first matching point in the subject, it scans the subject string from left to +right, once, character by character, and as it does this, it remembers all the +paths through the tree that represent valid matches. In Friedl's terminology, +this is a kind of "DFA algorithm", though it is not implemented as a +traditional finite state machine (it keeps multiple states active +simultaneously).

The scan continues until either the end of the subject is reached, or there are @@ -114,12 +117,21 @@

There are a number of features of PCRE regular expressions that are not -supported by the DFA matching algorithm. They are as follows: +supported by the alternative matching algorithm. They are as follows:

1. Because the algorithm finds all possible matches, the greedy or ungreedy nature of repetition quantifiers is not relevant. Greedy and ungreedy -quantifiers are treated in exactly the same way. +quantifiers are treated in exactly the same way. However, possessive +quantifiers can make a difference when what follows could also match what is +quantified, for example in a pattern like this: +

```+  ^a++\w!
+```
+This pattern matches "aaab!" but not "aaa!", which would be matched by a +non-possessive quantifier. Similarly, if an atomic group is present, it is +matched as if it were a standalone pattern at the current point, and the +longest match is then "locked in" for the rest of the overall pattern.

2. When dealing with multiple paths through the tree simultaneously, it is not @@ -133,7 +145,7 @@

4. For the same reason, conditional expressions that use a backreference as the -condition are not supported. +condition or test for a specific group recursion are not supported.

5. Callouts are supported, but the value of the capture_top field is @@ -142,13 +154,13 @@

6. The \C escape sequence, which (in the standard algorithm) matches a single -byte, even in UTF-8 mode, is not supported because the DFA algorithm moves -through the subject string one character at a time, for all active paths +byte, even in UTF-8 mode, is not supported because the alternative algorithm +moves through the subject string one character at a time, for all active paths through the tree.

-
+

-Using the DFA matching algorithm provides the following advantages: +Using the alternative matching algorithm provides the following advantages:

1. All possible matches (at a single point in the subject) are automatically @@ -159,17 +171,18 @@

2. There is much better support for partial matching. The restrictions on the content of the pattern that apply when using the standard algorithm for partial -matching do not apply to the DFA algorithm. For non-anchored patterns, the -starting position of a partial match is available. +matching do not apply to the alternative algorithm. For non-anchored patterns, +the starting position of a partial match is available.

-3. Because the DFA algorithm scans the subject string just once, and never -needs to backtrack, it is possible to pass very long subject strings to the -matching function in several pieces, checking for partial matching each time. +3. Because the alternative algorithm scans the subject string just once, and +never needs to backtrack, it is possible to pass very long subject strings to +the matching function in several pieces, checking for partial matching each +time.

-
+

-The DFA algorithm suffers from a number of disadvantages: +The alternative algorithm suffers from a number of disadvantages:

1. It is substantially slower than the standard algorithm. This is partly @@ -180,13 +193,24 @@ 2. Capturing parentheses and back references are not supported.

-3. The "atomic group" feature of PCRE regular expressions is supported, but -does not provide the advantage that it does for the standard algorithm. +3. Although atomic groups are supported, their use does not provide the +performance advantage that it does for the standard algorithm. +

+
AUTHOR
+

+Philip Hazel +
+University Computing Service +
+Cambridge CB2 3QH, England. +

+
REVISION

-Last updated: 06 June 2006 +Last updated: 06 March 2007 +