--- code/trunk/doc/html/pcrematching.html 2007/02/24 21:40:45 77 +++ code/trunk/doc/html/pcrematching.html 2007/04/16 13:25:10 148 @@ -16,9 +16,11 @@
  • PCRE MATCHING ALGORITHMS
  • REGULAR EXPRESSIONS AS TREES
  • THE STANDARD MATCHING ALGORITHM -
  • THE DFA MATCHING ALGORITHM -
  • ADVANTAGES OF THE DFA ALGORITHM -
  • DISADVANTAGES OF THE DFA ALGORITHM +
  • THE ALTERNATIVE MATCHING ALGORITHM +
  • ADVANTAGES OF THE ALTERNATIVE ALGORITHM +
  • DISADVANTAGES OF THE ALTERNATIVE ALGORITHM +
  • AUTHOR +
  • REVISION
    PCRE MATCHING ALGORITHMS

    @@ -46,7 +48,7 @@ <something> <something else> <something further> there are three possible answers. The standard algorithm finds only one of -them, whereas the DFA algorithm finds all three. +them, whereas the alternative algorithm finds all three.


    REGULAR EXPRESSIONS AS TREES

    @@ -54,13 +56,13 @@ as a tree structure. An unlimited repetition in the pattern makes the tree of infinite size, but it is still a tree. Matching the pattern to a given subject string (from a given starting point) can be thought of as a search of the tree. -There are two standard ways to search a tree: depth-first and breadth-first, -and these correspond to the two matching algorithms provided by PCRE. +There are two ways to search a tree: depth-first and breadth-first, and these +correspond to the two matching algorithms provided by PCRE.


    THE STANDARD MATCHING ALGORITHM

    -In the terminology of Jeffrey Friedl's book \fIMastering Regular -Expressions\fP, the standard algorithm is an "NFA algorithm". It conducts a +In the terminology of Jeffrey Friedl's book "Mastering Regular +Expressions", the standard algorithm is an "NFA algorithm". It conducts a depth-first search of the pattern tree. That is, it proceeds along a single path through the tree, checking that the subject matches what is required. When there is a mismatch, the algorithm tries any alternatives at the current point, @@ -83,14 +85,15 @@ matched by portions of the pattern in parentheses. This provides support for capturing parentheses and back references.

    -
    THE DFA MATCHING ALGORITHM
    +
    THE ALTERNATIVE MATCHING ALGORITHM

    -DFA stands for "deterministic finite automaton", but you do not need to -understand the origins of that name. This algorithm conducts a breadth-first -search of the tree. Starting from the first matching point in the subject, it -scans the subject string from left to right, once, character by character, and -as it does this, it remembers all the paths through the tree that represent -valid matches. +This algorithm conducts a breadth-first search of the tree. Starting from the +first matching point in the subject, it scans the subject string from left to +right, once, character by character, and as it does this, it remembers all the +paths through the tree that represent valid matches. In Friedl's terminology, +this is a kind of "DFA algorithm", though it is not implemented as a +traditional finite state machine (it keeps multiple states active +simultaneously).

    The scan continues until either the end of the subject is reached, or there are @@ -114,12 +117,21 @@

    There are a number of features of PCRE regular expressions that are not -supported by the DFA matching algorithm. They are as follows: +supported by the alternative matching algorithm. They are as follows:

    1. Because the algorithm finds all possible matches, the greedy or ungreedy nature of repetition quantifiers is not relevant. Greedy and ungreedy -quantifiers are treated in exactly the same way. +quantifiers are treated in exactly the same way. However, possessive +quantifiers can make a difference when what follows could also match what is +quantified, for example in a pattern like this: +

    +  ^a++\w!
    +
    +This pattern matches "aaab!" but not "aaa!", which would be matched by a +non-possessive quantifier. Similarly, if an atomic group is present, it is +matched as if it were a standalone pattern at the current point, and the +longest match is then "locked in" for the rest of the overall pattern.

    2. When dealing with multiple paths through the tree simultaneously, it is not @@ -133,7 +145,7 @@

    4. For the same reason, conditional expressions that use a backreference as the -condition are not supported. +condition or test for a specific group recursion are not supported.

    5. Callouts are supported, but the value of the capture_top field is @@ -142,13 +154,13 @@

    6. The \C escape sequence, which (in the standard algorithm) matches a single -byte, even in UTF-8 mode, is not supported because the DFA algorithm moves -through the subject string one character at a time, for all active paths +byte, even in UTF-8 mode, is not supported because the alternative algorithm +moves through the subject string one character at a time, for all active paths through the tree.

    -
    ADVANTAGES OF THE DFA ALGORITHM
    +
    ADVANTAGES OF THE ALTERNATIVE ALGORITHM

    -Using the DFA matching algorithm provides the following advantages: +Using the alternative matching algorithm provides the following advantages:

    1. All possible matches (at a single point in the subject) are automatically @@ -159,17 +171,18 @@

    2. There is much better support for partial matching. The restrictions on the content of the pattern that apply when using the standard algorithm for partial -matching do not apply to the DFA algorithm. For non-anchored patterns, the -starting position of a partial match is available. +matching do not apply to the alternative algorithm. For non-anchored patterns, +the starting position of a partial match is available.

    -3. Because the DFA algorithm scans the subject string just once, and never -needs to backtrack, it is possible to pass very long subject strings to the -matching function in several pieces, checking for partial matching each time. +3. Because the alternative algorithm scans the subject string just once, and +never needs to backtrack, it is possible to pass very long subject strings to +the matching function in several pieces, checking for partial matching each +time.

    -
    DISADVANTAGES OF THE DFA ALGORITHM
    +
    DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

    -The DFA algorithm suffers from a number of disadvantages: +The alternative algorithm suffers from a number of disadvantages:

    1. It is substantially slower than the standard algorithm. This is partly @@ -180,13 +193,24 @@ 2. Capturing parentheses and back references are not supported.

    -3. The "atomic group" feature of PCRE regular expressions is supported, but -does not provide the advantage that it does for the standard algorithm. +3. Although atomic groups are supported, their use does not provide the +performance advantage that it does for the standard algorithm. +

    +
    AUTHOR
    +

    +Philip Hazel +
    +University Computing Service +
    +Cambridge CB2 3QH, England. +

    +
    REVISION

    -Last updated: 28 February 2005 +Last updated: 06 March 2007 +
    +Copyright © 1997-2007 University of Cambridge.
    -Copyright © 1997-2005 University of Cambridge.

    Return to the PCRE index page.