/[pcre]/code/trunk/Tech.Notes
ViewVC logotype

Diff of /code/trunk/Tech.Notes

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 22 by nigel, Sat Feb 24 21:38:21 2007 UTC revision 23 by nigel, Sat Feb 24 21:38:41 2007 UTC
# Line 18  expected in Unix and Perl-style regular Line 18  expected in Unix and Perl-style regular
18  By contrast, the code originally written by Henry Spencer and subsequently  By contrast, the code originally written by Henry Spencer and subsequently
19  heavily modified for Perl actually compiles the expression twice: once in a  heavily modified for Perl actually compiles the expression twice: once in a
20  dummy mode in order to find out how much store will be needed, and then for  dummy mode in order to find out how much store will be needed, and then for
21  real. The execution function operates by backtracking and maximizing (or  real. The execution function operates by backtracking and maximizing (or,
22  minimizing in Perl) the amount of the subject that matches individual wild  optionally, minimizing in Perl) the amount of the subject that matches
23  portions of the pattern. This is a "NFA algorithm".  individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's
24    terminology.
25    
26  For this set of functions, I tried at first to invent an algorithm that used an  For this set of functions, I tried at first to invent an algorithm that used an
27  amount of store bounded by a multiple of the number of characters in the  amount of store bounded by a multiple of the number of characters in the
28  pattern, to save on compiling time. However, because of the greater complexity  pattern, to save on compiling time. However, because of the greater complexity
29  in Perl regular expressions, I couldn't do this. In any case, a first pass  in Perl regular expressions, I couldn't do this. In any case, a first pass
30  through the pattern is needed, in order to find internal flag settings like  through the pattern is needed, in order to find internal flag settings like
31  (?i). So it works by running a very degenerate first pass to calculate a  (?i) at top level. So it works by running a very degenerate first pass to
32  maximum store size, and then a second pass to do the real compile - which may  calculate a maximum store size, and then a second pass to do the real compile -
33  use a bit less than the predicted amount of store. The idea is that this is  which may use a bit less than the predicted amount of store. The idea is that
34  going to turn out faster because the first pass is degenerate and the second  this is going to turn out faster because the first pass is degenerate and the
35  can just store stuff straight into the vector. It does make the compiling  second can just store stuff straight into the vector. It does make the
36  functions bigger, of course, but they have got quite big anyway to handle all  compiling functions bigger, of course, but they have got quite big anyway to
37  the Perl stuff.  handle all the Perl stuff.
38    
39  The compiled form of a pattern is a vector of bytes, containing items of  The compiled form of a pattern is a vector of bytes, containing items of
40  variable length. The first byte in an item is an opcode, and the length of the  variable length. The first byte in an item is an opcode, and the length of the
# Line 57  These items are all just one byte long Line 58  These items are all just one byte long
58    OP_WHITESPACE          \s    OP_WHITESPACE          \s
59    OP_NOT_WORDCHAR        \W    OP_NOT_WORDCHAR        \W
60    OP_WORDCHAR            \w    OP_WORDCHAR            \w
61    OP_CUT                 analogue of Prolog's "cut"    OP_EODN                match end of data or \n at end: \Z
62    OP_EOD                 match end of data: \Z    OP_EOD                 match end of data: \z
63    OP_DOLL                $ (end of data, or before \n in multiline)    OP_DOLL                $ (end of data, or before \n in multiline)
64    
65    
# Line 196  minimum. In effect, (abc){2,5} becomes ( Line 197  minimum. In effect, (abc){2,5} becomes (
197  Assertions  Assertions
198  ----------  ----------
199    
200  Assertions are just like other subpatterns, but starting with one of the  Forward assertions are just like other subpatterns, but starting with one of
201  opcodes OP_ASSERT or OP_ASSERT_NOT.  the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
202    OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
203    is OP_REVERSE, followed by a two byte count of the number of characters to move
204    back. A separate count is present in each alternative of a lookbehind
205    assertion, allowing them to have different fixed lengths.
206    
207    
208  Once-only subpatterns  Once-only subpatterns
# Line 207  These are also just like other subpatter Line 212  These are also just like other subpatter
212  OP_ONCE.  OP_ONCE.
213    
214    
215    Conditional subpatterns
216    -----------------------
217    
218    These are like other subpatterns, but they start with the opcode OP_COND. If
219    the condition is a back reference, this is stored at the start of the
220    subpattern using the opcode OP_CREF followed by one byte containing the
221    reference number. Otherwise, a conditional subpattern will always start with
222    one of the assertions.
223    
224    
225    Changing options
226    ----------------
227    
228    If any of the /i, /m, or /s options are changed within a parenthesized group,
229    an OP_OPT opcode is compiled, followed by one byte containing the new settings
230    of these flags. If there are several alternatives in a group, there is an
231    occurrence of OP_OPT at the start of all those following the first options
232    change, to set appropriate options for the start of the alternative.
233    Immediately after the end of the group there is another such item to reset the
234    flags to their previous values. Other changes of flag within the pattern can be
235    handled entirely at compile time, and so do not cause anything to be put into
236    the compiled data.
237    
238    
239  Philip Hazel  Philip Hazel
240  December 1997  September 1998

Legend:
Removed from v.22  
changed lines
  Added in v.23

  ViewVC Help
Powered by ViewVC 1.1.5