/[pcre]/code/trunk/HACKING
ViewVC logotype

Diff of /code/trunk/HACKING

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 181 by ph10, Wed Jun 13 14:55:18 2007 UTC revision 550 by ph10, Sun Oct 10 16:24:11 2010 UTC
# Line 4  Technical Notes about PCRE Line 4  Technical Notes about PCRE
4  These are very rough technical notes that record potentially useful information  These are very rough technical notes that record potentially useful information
5  about PCRE internals.  about PCRE internals.
6    
7    
8  Historical note 1  Historical note 1
9  -----------------  -----------------
10    
# Line 22  the one matching the longest subset of t Line 23  the one matching the longest subset of t
23  not necessarily maximize the individual wild portions of the pattern, as is  not necessarily maximize the individual wild portions of the pattern, as is
24  expected in Unix and Perl-style regular expressions.  expected in Unix and Perl-style regular expressions.
25    
26    
27  Historical note 2  Historical note 2
28  -----------------  -----------------
29    
# Line 34  maximizing (or, optionally, minimizing i Line 36  maximizing (or, optionally, minimizing i
36  matches individual wild portions of the pattern. This is an "NFA algorithm" in  matches individual wild portions of the pattern. This is an "NFA algorithm" in
37  Friedl's terminology.  Friedl's terminology.
38    
39    
40  OK, here's the real stuff  OK, here's the real stuff
41  -------------------------  -------------------------
42    
# Line 44  in the pattern, to save on compiling tim Line 47  in the pattern, to save on compiling tim
47  complexity in Perl regular expressions, I couldn't do this. In any case, a  complexity in Perl regular expressions, I couldn't do this. In any case, a
48  first pass through the pattern is helpful for other reasons.  first pass through the pattern is helpful for other reasons.
49    
50    
51  Computing the memory requirement: how it was  Computing the memory requirement: how it was
52  --------------------------------------------  --------------------------------------------
53    
# Line 54  idea was that this would turn out faster Line 58  idea was that this would turn out faster
58  the first pass is degenerate and the second pass can just store stuff straight  the first pass is degenerate and the second pass can just store stuff straight
59  into the vector, which it knows is big enough.  into the vector, which it knows is big enough.
60    
61    
62  Computing the memory requirement: how it is  Computing the memory requirement: how it is
63  -------------------------------------------  -------------------------------------------
64    
# Line 67  many tests of the mode that might slow i Line 72  many tests of the mode that might slow i
72  functions to work this way. This got rid of about 600 lines of source. It  functions to work this way. This got rid of about 600 lines of source. It
73  should make future maintenance and development easier. As this was such a major  should make future maintenance and development easier. As this was such a major
74  change, I never released 6.8, instead upping the number to 7.0 (other quite  change, I never released 6.8, instead upping the number to 7.0 (other quite
75  major changes are also present in the 7.0 release).  major changes were also present in the 7.0 release).
76    
77  A side effect of this work is that the previous limit of 200 on the nesting  A side effect of this work was that the previous limit of 200 on the nesting
78  depth of parentheses was removed. However, there is a downside: pcre_compile()  depth of parentheses was removed. However, there is a downside: pcre_compile()
79  runs more slowly than before (30% or more, depending on the pattern) because it  runs more slowly than before (30% or more, depending on the pattern) because it
80  is doing a full analysis of the pattern. My hope is that this is not a big  is doing a full analysis of the pattern. My hope was that this would not be a
81  issue.  big issue, and in the event, nobody has commented on it.
82    
83    
84  Traditional matching function  Traditional matching function
85  -----------------------------  -----------------------------
86    
87  The "traditional", and original, matching function is called pcre_exec(), and  The "traditional", and original, matching function is called pcre_exec(), and
88  it implements an NFA algorithm, similar to the original Henry Spencer algorithm  it implements an NFA algorithm, similar to the original Henry Spencer algorithm
89  and the way that Perl works. Not surprising, since it is intended to be as  and the way that Perl works. This is not surprising, since it is intended to be
90  compatible with Perl as possible. This is the function most users of PCRE will  as compatible with Perl as possible. This is the function most users of PCRE
91  use most of the time.  will use most of the time.
92    
93    
94  Supplementary matching function  Supplementary matching function
95  -------------------------------  -------------------------------
# Line 109  variable length. The first byte in an it Line 116  variable length. The first byte in an it
116  item is either implicit in the opcode or contained in the data bytes that  item is either implicit in the opcode or contained in the data bytes that
117  follow it.  follow it.
118    
119  In many cases below "two-byte" data values are specified. This is in fact just  In many cases below LINK_SIZE data values are specified for offsets within the
120  a default when the number is an offset within the compiled pattern. PCRE can be  compiled pattern. The default value for LINK_SIZE is 2, but PCRE can be
121  compiled to use 3-byte or 4-byte values for these offsets (impairing the  compiled to use 3-byte or 4-byte values for these offsets (impairing the
122  performance). This is necessary only when patterns whose compiled length is  performance). This is necessary only when patterns whose compiled length is
123  greater than 64K are going to be processed. In this description, we assume the  greater than 64K are going to be processed. In this description, we assume the
124  "normal" compilation options. "Two-byte" data values that are counts (e.g. for  "normal" compilation options. Data values that are counts (e.g. for
125  quantifiers) are always just two bytes.  quantifiers) are always just two bytes long.
126    
127  A list of all the opcodes follows:  A list of the opcodes follows:
128    
129  Opcodes with no following data  Opcodes with no following data
130  ------------------------------  ------------------------------
# Line 125  Opcodes with no following data Line 132  Opcodes with no following data
132  These items are all just one byte long  These items are all just one byte long
133    
134    OP_END                 end of pattern    OP_END                 end of pattern
135    OP_ANY                 match any character    OP_ANY                 match any one character other than newline
136      OP_ALLANY              match any one character, including newline
137    OP_ANYBYTE             match any single byte, even in UTF-8 mode    OP_ANYBYTE             match any single byte, even in UTF-8 mode
138    OP_SOD                 match start of data: \A    OP_SOD                 match start of data: \A
139    OP_SOM,                start of match (subject + offset): \G    OP_SOM,                start of match (subject + offset): \G
# Line 149  These items are all just one byte long Line 157  These items are all just one byte long
157    OP_EXTUNI              match an extended Unicode character    OP_EXTUNI              match an extended Unicode character
158    OP_ANYNL               match any Unicode newline sequence    OP_ANYNL               match any Unicode newline sequence
159    
160      OP_ACCEPT              ) These are Perl 5.10's "backtracking control
161      OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
162      OP_FAIL                ) parentheses, it may be preceded by one or more
163      OP_PRUNE               ) OP_CLOSE, followed by a 2-byte number,
164      OP_SKIP                ) indicating which parentheses must be closed.
165    
166    
167    Backtracking control verbs with data
168    ------------------------------------
169    
170    OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
171    start of the current branch.
172    
173    OP_MARK is followed by the mark name, preceded by a one-byte length, and
174    followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
175    the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first
176    two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE
177    offset value.
178    
179    
180  Repeating single characters  Repeating single characters
181  ---------------------------  ---------------------------
# Line 311  maximally respectively. All three are fo Line 338  maximally respectively. All three are fo
338  positive number) the offset back to the matching bracket opcode.  positive number) the offset back to the matching bracket opcode.
339    
340  If a subpattern is quantified such that it is permitted to match zero times, it  If a subpattern is quantified such that it is permitted to match zero times, it
341  is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte  is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
342  opcodes which tell the matcher that skipping this subpattern entirely is a  single-byte opcodes that tell the matcher that skipping the following
343  valid branch.  subpattern entirely is a valid branch. In the case of the first two, not
344    skipping the pattern is also valid (greedy and non-greedy). The third is used
345    when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
346    because it may be called as a subroutine from elsewhere in the regex.
347    
348  A subpattern with an indefinite maximum repetition is replicated in the  A subpattern with an indefinite maximum repetition is replicated in the
349  compiled data its minimum number of times (or once with OP_BRAZERO if the  compiled data its minimum number of times (or once with OP_BRAZERO if the
# Line 361  These are like other subpatterns, but th Line 391  These are like other subpatterns, but th
391  OP_SCOND for one that might match an empty string in an unbounded repeat. If  OP_SCOND for one that might match an empty string in an unbounded repeat. If
392  the condition is a back reference, this is stored at the start of the  the condition is a back reference, this is stored at the start of the
393  subpattern using the opcode OP_CREF followed by two bytes containing the  subpattern using the opcode OP_CREF followed by two bytes containing the
394  reference number. If the condition is "in recursion" (coded as "(?(R)"), or "in  reference number. OP_NCREF is used instead if the reference was generated by
395  recursion of group x" (coded as "(?(Rx)"), the group number is stored at the  name (so that the runtime code knows to check for duplicate names).
396  start of the subpattern using the opcode OP_RREF, and a value of zero for "the  
397  whole pattern". For a DEFINE condition, just the single byte OP_DEF is used (it  If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
398  has no associated data). Otherwise, a conditional subpattern always starts with  group x" (coded as "(?(Rx)"), the group number is stored at the start of the
399  one of the assertions.  subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of
400    zero for "the whole pattern". For a DEFINE condition, just the single byte
401    OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern
402    always starts with one of the assertions.
403    
404    
405  Recursion  Recursion
# Line 404  at compile time, and so does not cause a Line 437  at compile time, and so does not cause a
437  data.  data.
438    
439  Philip Hazel  Philip Hazel
440  June 2007  October 2010

Legend:
Removed from v.181  
changed lines
  Added in v.550

  ViewVC Help
Powered by ViewVC 1.1.5