/[pcre]/code/trunk/maint/Tech.Notes
ViewVC logotype

Diff of /code/trunk/maint/Tech.Notes

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 75 by nigel, Sat Feb 24 21:40:37 2007 UTC revision 91 by nigel, Sat Feb 24 21:41:34 2007 UTC
# Line 1  Line 1 
1  Technical Notes about PCRE  Technical Notes about PCRE
2  --------------------------  --------------------------
3    
4    These are very rough technical notes that record potentially useful information
5    about PCRE internals.
6    
7  Historical note 1  Historical note 1
8  -----------------  -----------------
9    
# Line 21  the pattern, as is expected in Unix and Line 24  the pattern, as is expected in Unix and
24  Historical note 2  Historical note 2
25  -----------------  -----------------
26    
27  By contrast, the code originally written by Henry Spencer and subsequently  By contrast, the code originally written by Henry Spencer (which was
28  heavily modified for Perl actually compiles the expression twice: once in a  subsequently heavily modified for Perl) compiles the expression twice: once in
29  dummy mode in order to find out how much store will be needed, and then for  a dummy mode in order to find out how much store will be needed, and then for
30  real. The execution function operates by backtracking and maximizing (or,  real. (The Perl version probably doesn't do this any more; I'm talking about
31  optionally, minimizing in Perl) the amount of the subject that matches  the original library.) The execution function operates by backtracking and
32  individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's  maximizing (or, optionally, minimizing in Perl) the amount of the subject that
33  terminology.  matches individual wild portions of the pattern. This is an "NFA algorithm" in
34    Friedl's terminology.
35    
36  OK, here's the real stuff  OK, here's the real stuff
37  -------------------------  -------------------------
38    
39  For the set of functions that forms PCRE (which are unrelated to those  For the set of functions that form the "basic" PCRE library (which are
40  mentioned above), I tried at first to invent an algorithm that used an amount  unrelated to those mentioned above), I tried at first to invent an algorithm
41  of store bounded by a multiple of the number of characters in the pattern, to  that used an amount of store bounded by a multiple of the number of characters
42  save on compiling time. However, because of the greater complexity in Perl  in the pattern, to save on compiling time. However, because of the greater
43  regular expressions, I couldn't do this. In any case, a first pass through the  complexity in Perl regular expressions, I couldn't do this. In any case, a
44  pattern is needed, for a number of reasons. PCRE works by running a very  first pass through the pattern is needed, for a number of reasons. PCRE works
45  degenerate first pass to calculate a maximum store size, and then a second pass  by running a very degenerate first pass to calculate a maximum store size, and
46  to do the real compile - which may use a bit less than the predicted amount of  then a second pass to do the real compile - which may use a bit less than the
47  store. The idea is that this is going to turn out faster because the first pass  predicted amount of store. The idea is that this is going to turn out faster
48  is degenerate and the second pass can just store stuff straight into the  because the first pass is degenerate and the second pass can just store stuff
49  vector. It does make the compiling functions bigger, of course, but they have  straight into the vector, which it knows is big enough. It does make the
50  got quite big anyway to handle all the Perl stuff.  compiling functions bigger, of course, but they have become quite big anyway to
51    handle all the Perl stuff.
52    
53    Traditional matching function
54    -----------------------------
55    
56    The "traditional", and original, matching function is called pcre_exec(), and
57    it implements an NFA algorithm, similar to the original Henry Spencer algorithm
58    and the way that Perl works. Not surprising, since it is intended to be as
59    compatible with Perl as possible. This is the function most users of PCRE will
60    use most of the time.
61    
62    Supplementary matching function
63    -------------------------------
64    
65    From PCRE 6.0, there is also a supplementary matching function called
66    pcre_dfa_exec(). This implements a DFA matching algorithm that searches
67    simultaneously for all possible matches that start at one point in the subject
68    string. (Going back to my roots: see Historical Note 1 above.) This function
69    intreprets the same compiled pattern data as pcre_exec(); however, not all the
70    facilities are available, and those that are do not always work in quite the
71    same way. See the user documentation for details.
72    
73    Format of compiled patterns
74    ---------------------------
75    
76  The compiled form of a pattern is a vector of bytes, containing items of  The compiled form of a pattern is a vector of bytes, containing items of
77  variable length. The first byte in an item is an opcode, and the length of the  variable length. The first byte in an item is an opcode, and the length of the
# Line 133  Match by Unicode property Line 161  Match by Unicode property
161    
162  OP_PROP and OP_NOTPROP are used for positive and negative matches of a  OP_PROP and OP_NOTPROP are used for positive and negative matches of a
163  character by testing its Unicode property (the \p and \P escape sequences).  character by testing its Unicode property (the \p and \P escape sequences).
164  Each is followed by a single byte that encodes the desired property value.  Each is followed by two bytes that encode the desired property as a type and a
165    value.
166    
167  Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by two  Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
168  bytes: OP_PROP or OP_NOTPROP and then the desired property value.  three bytes: OP_PROP or OP_NOTPROP and then the desired property type and
169    value.
170    
171    
172  Matching literal characters  Matching literal characters
# Line 223  number. This opcode is ignored while mat Line 253  number. This opcode is ignored while mat
253  the bracket itself. (They could have all been done like this, but I was making  the bracket itself. (They could have all been done like this, but I was making
254  minimal changes.)  minimal changes.)
255    
256  A bracket opcode is followed by two bytes which give the offset to the next  A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
257  alternative OP_ALT or, if there aren't any branches, to the matching OP_KET  next alternative OP_ALT or, if there aren't any branches, to the matching
258  opcode. Each OP_ALT is followed by two bytes giving the offset to the next one,  OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
259  or to the OP_KET opcode.  the next one, or to the OP_KET opcode.
260    
261  OP_KET is used for subpatterns that do not repeat indefinitely, while  OP_KET is used for subpatterns that do not repeat indefinitely, while
262  OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or  OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
263  maximally respectively. All three are followed by two bytes giving (as a  maximally respectively. All three are followed by LINK_SIZE bytes giving (as a
264  positive number) the offset back to the matching OP_BRA opcode.  positive number) the offset back to the matching OP_BRA opcode.
265    
266  If a subpattern is quantified such that it is permitted to match zero times, it  If a subpattern is quantified such that it is permitted to match zero times, it
# Line 285  Recursion Line 315  Recursion
315    
316  Recursion either matches the current regex, or some subexpression. The opcode  Recursion either matches the current regex, or some subexpression. The opcode
317  OP_RECURSE is followed by an value which is the offset to the starting bracket  OP_RECURSE is followed by an value which is the offset to the starting bracket
318  from the start of the whole pattern.  from the start of the whole pattern. From release 6.5, OP_RECURSE is
319    automatically wrapped inside OP_ONCE brackets (because otherwise some patterns
320    broke it). OP_RECURSE is also used for "subroutine" calls, even though they
321    are not strictly a recursion.
322    
323    
324  Callout  Callout
# Line 312  at compile time, and so does not cause a Line 345  at compile time, and so does not cause a
345  data.  data.
346    
347  Philip Hazel  Philip Hazel
348  September 2004  June 2006

Legend:
Removed from v.75  
changed lines
  Added in v.91

  ViewVC Help
Powered by ViewVC 1.1.5