Contents of /code/trunk/maint/MultiStage2.py
Revision 1688 - (show annotations)
Fri Feb 24 17:30:30 2017 UTC (3 years, 9 months ago) by ph10
File MIME type: text/x-python
File size: 22049 byte(s)
Fix Unicode property crash for 32-bit characters greater than 0x10ffff.
#! /usr/bin/python

# WARNING! This is a python 2 script.

# Multistage table builder
# (c) Peter Kankowski, 2008

##############################################################################
# This script was submitted to the PCRE project by Peter Kankowski as part of
# the upgrading of Unicode property support. The new code speeds up property
# matching many times. The script is for the use of PCRE maintainers, to
# generate the pcre_ucd.c file that contains a digested form of the Unicode
# data tables.
#
# The script should be run in the maint subdirectory, using the command
#
#   ./MultiStage2.py >../pcre_ucd.c
#
# It requires four Unicode data tables, DerivedGeneralCategory.txt,
# GraphemeBreakProperty.txt, Scripts.txt, and CaseFolding.txt, to be in the
# Unicode.tables subdirectory. The first of these is found in the "extracted"
# subdirectory of the Unicode database (UCD) on the Unicode web site; the
# second is in the "auxiliary" subdirectory; the other two are directly in the
# UCD directory.
#
# Minor modifications made to this script:
#   Added #! line at start
#   Removed tabs
#   Made it work with Python 2.4 by rewriting two statements that needed 2.5
#   Consequent code tidy
#   Adjusted data file names to take from the Unicode.tables directory
#   Adjusted global table names by prefixing _pcre_.
#   Commented out stuff relating to the casefolding table, which isn't used;
#     removed completely in 2012.
#   Corrected size calculation
#   Add #ifndef SUPPORT_UCP to use dummy tables when no UCP support is needed.
#
# Major modifications made to this script:
#   Added code to add a grapheme break property field to records.
#
#   Added code to search for sets of more than two characters that must match
#   each other caselessly. A new table is output containing these sets, and
#   offsets into the table are added to the main output records. This new
#   code scans CaseFolding.txt instead of UnicodeData.txt.
#
# The main tables generated by this script are used by macros defined in
# pcre_internal.h. They look up Unicode character properties using short
# sequences of code that contains no branches, which makes for greater speed.
#
# Conceptually, there is a table of records (of type ucd_record), containing a
# script number, character type, grapheme break type, offset to caseless
# matching set, and offset to the character's other case for every character.
# However, a real table covering all Unicode characters would be far too big.
# It can be efficiently compressed by observing that many characters have the
# same record, and many blocks of characters (taking 128 characters in a block)
# have the same set of records as other blocks. This leads to a 2-stage lookup
# process.
#
# This script constructs four tables. The ucd_caseless_sets table contains
# lists of characters that all match each other caselessly. Each list is
# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than
# any valid character. The first list is empty; this is used for characters
# that are not part of any list.
#
# The ucd_records table contains one instance of every unique record that is
# required. The ucd_stage1 table is indexed by a character's block number, and
# yields what is in effect a "virtual" block number. The ucd_stage2 table is a
# table of "virtual" blocks; each block is indexed by the offset of a character
# within its own block, and the result is the offset of the required record.
# Example: lowercase "a" (U+0061) is in block 0
#          lookup 0 in stage1 table yields 0
#          lookup 97 in the first table in stage2 yields 16
#          record 16 is { 33, 5, 11, 0, -32 }
#            33 = ucp_Latin    => Latin script
#             5 = ucp_Ll       => Lower case letter
#            11 = ucp_gbOther  => Grapheme break property "Other"
#             0                => not part of a caseless set
#           -32                => Other case is U+0041
#
# Almost all lowercase latin characters resolve to the same record. One or two
# are different because they are part of a multi-character caseless set (for
# example, k, K and the Kelvin symbol are such a set).
#
# Example: hiragana letter A (U+3042) is in block 96 (0x60)
#          lookup 96 in stage1 table yields 88
#          lookup 66 in the 88th table in stage2 yields 467
#          record 467 is { 26, 7, 11, 0, 0 }
#            26 = ucp_Hiragana => Hiragana script
#             7 = ucp_Lo       => Other letter
#            11 = ucp_gbOther  => Grapheme break property "Other"
#             0                => not part of a caseless set
#             0                => No other case
#
# In these examples, no other blocks resolve to the same "virtual" block, as it
# happens, but plenty of other blocks do share "virtual" blocks.
#
# There is a fourth table, maintained by hand, which translates from the
# individual character types such as ucp_Cc to the general types like ucp_C.
#
# Philip Hazel, 03 July 2008
#
# 01-March-2010:     Updated list of scripts for Unicode 5.2.0
# 30-April-2011:     Updated list of scripts for Unicode 6.0.0
# July-2012:         Updated list of scripts for Unicode 6.1.0
# 20-August-2012:    Added scan of GraphemeBreakProperty.txt and added a new
#                      field in the record to hold the value. Luckily, the
#                      structure had a hole in it, so the resulting table is
#                      not much bigger than before.
# 18-September-2012: Added code for multiple caseless sets. This uses the
#                      final hole in the structure.
# 30-September-2012: Added RegionalIndicator break property from Unicode 6.2.0
##############################################################################


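The two-stage compression and lookup described in the comments above can be sketched in a few lines. This is an illustrative miniature (written for Python 3, unlike the script itself) with made-up property values; only the indexing arithmetic mirrors what the macros using the generated pcre_ucd.c tables do.

```python
# Miniature of the 2-stage lookup: a flat property table is split into
# 128-entry blocks; identical blocks are stored once in stage2, and stage1
# maps real block numbers to "virtual" block numbers.
BLOCK_SIZE = 128

# Toy property table for 4 blocks; blocks 0 and 2 are identical on purpose.
full = [0] * 128 + [1] * 128 + [0] * 128 + [2] * 128

def compress(table, block_size):
    blocks, stage1, stage2 = {}, [], []
    for i in range(0, len(table), block_size):
        block = tuple(table[i:i + block_size])
        if block not in blocks:
            blocks[block] = len(stage2) // block_size  # new virtual block
            stage2.extend(block)
        stage1.append(blocks[block])
    return stage1, stage2

stage1, stage2 = compress(full, BLOCK_SIZE)

def lookup(c):
    # stage1 yields the virtual block; the offset within the block is reused
    return stage2[stage1[c >> 7] * BLOCK_SIZE + (c & 127)]

assert all(lookup(c) == full[c] for c in range(len(full)))
assert len(stage2) == 3 * BLOCK_SIZE   # blocks 0 and 2 share storage
```

The shifts `c >> 7` and `c & 127` correspond to the 128-character block size used in the comment above; the real script searches over several block sizes and records the cheapest (see the optimum-block-size loop further down).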
import re
import string
import sys

MAX_UNICODE = 0x110000
NOTACHAR = 0xffffffff

# Parse a line of Scripts.txt, GraphemeBreakProperty.txt or DerivedGeneralCategory.txt
def make_get_names(enum):
    return lambda chardata: enum.index(chardata[1])

# Parse a line of CaseFolding.txt
def get_other_case(chardata):
    if chardata[1] == 'C' or chardata[1] == 'S':
        return int(chardata[2], 16) - int(chardata[0], 16)
    return 0


# Read the whole table in memory
def read_table(file_name, get_value, default_value):
    file = open(file_name, 'r')
    table = [default_value] * MAX_UNICODE
    for line in file:
        line = re.sub(r'#.*', '', line)
        chardata = map(string.strip, line.split(';'))
        if len(chardata) <= 1:
            continue
        value = get_value(chardata)
        m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', chardata[0])
        char = int(m.group(1), 16)
        if m.group(3) is None:
            last = char
        else:
            last = int(m.group(3), 16)
        for i in range(char, last + 1):
            # It is important not to overwrite a previously set value
            # because in the CaseFolding file there are lines to be ignored
            # (returning the default value of 0) which often come after a
            # line which has already set data.
            if table[i] == default_value:
                table[i] = value
    file.close()
    return table

# Get the smallest possible C language type for the values
def get_type_size(table):
    type_size = [("pcre_uint8", 1), ("pcre_uint16", 2), ("pcre_uint32", 4),
                 ("signed char", 1), ("pcre_int16", 2), ("pcre_int32", 4)]
    limits = [(0, 255), (0, 65535), (0, 4294967295),
              (-128, 127), (-32768, 32767), (-2147483648, 2147483647)]
    minval = min(table)
    maxval = max(table)
    for num, (minlimit, maxlimit) in enumerate(limits):
        if minlimit <= minval and maxval <= maxlimit:
            return type_size[num]
    else:
        raise OverflowError, "Too large to fit into C types"

def get_tables_size(*tables):
    total_size = 0
    for table in tables:
        type, size = get_type_size(table)
        total_size += size * len(table)
    return total_size

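To illustrate the selection get_type_size performs, here is a Python 3 sketch of the same logic (the type names are the script's own; the helper name `smallest_type` is invented for this example):

```python
# Pick the narrowest C type whose range covers all values, trying unsigned
# types first, then signed ones -- the same order the script uses.
def smallest_type(values):
    limits = [("pcre_uint8", (0, 255)), ("pcre_uint16", (0, 65535)),
              ("pcre_uint32", (0, 4294967295)),
              ("signed char", (-128, 127)),
              ("pcre_int16", (-32768, 32767)),
              ("pcre_int32", (-2147483648, 2147483647))]
    lo, hi = min(values), max(values)
    for name, (mn, mx) in limits:
        if mn <= lo and hi <= mx:
            return name
    raise OverflowError("Too large to fit into C types")

assert smallest_type([0, 200]) == "pcre_uint8"
assert smallest_type([-32, 100]) == "signed char"
assert smallest_type([0, 70000]) == "pcre_uint32"
```

Note the ordering consequence: a table containing any negative value can never use an unsigned type, so the other-case offsets (which include negatives like -32) end up in a signed type.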
# Compress the table into the two stages
def compress_table(table, block_size):
    blocks = {}  # Dictionary for finding identical blocks
    stage1 = []  # Stage 1 table contains block numbers (indices into stage 2 table)
    stage2 = []  # Stage 2 table contains the blocks with property values
    table = tuple(table)
    for i in range(0, len(table), block_size):
        block = table[i:i+block_size]
        start = blocks.get(block)
        if start is None:
            # Allocate a new block
            start = len(stage2) / block_size
            stage2 += block
            blocks[block] = start
        stage1.append(start)

    return stage1, stage2

# Print a table
def print_table(table, table_name, block_size = None):
    type, size = get_type_size(table)
    ELEMS_PER_LINE = 16

    s = "const %s %s[] = { /* %d bytes" % (type, table_name, size * len(table))
    if block_size:
        s += ", block = %d" % block_size
    print s + " */"
    table = tuple(table)
    if block_size is None:
        fmt = "%3d," * ELEMS_PER_LINE + " /* U+%04X */"
        mult = MAX_UNICODE / len(table)
        for i in range(0, len(table), ELEMS_PER_LINE):
            print fmt % (table[i:i+ELEMS_PER_LINE] + (i * mult,))
    else:
        if block_size > ELEMS_PER_LINE:
            el = ELEMS_PER_LINE
        else:
            el = block_size
        fmt = "%3d," * el + "\n"
        if block_size > ELEMS_PER_LINE:
            fmt = fmt * (block_size / ELEMS_PER_LINE)
        for i in range(0, len(table), block_size):
            print ("/* block %d */\n" + fmt) % ((i / block_size,) + table[i:i+block_size])
    print "};\n"

# Extract the unique combinations of properties into records
def combine_tables(*tables):
    records = {}
    index = []
    for t in zip(*tables):
        i = records.get(t)
        if i is None:
            i = records[t] = len(records)
        index.append(i)
    return index, records

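What combine_tables does, in miniature: zip the parallel per-character property tables into tuples, keep one copy of each distinct tuple, and index every character by its tuple's record number. A Python 3 sketch (the helper name `combine` is invented for this example):

```python
# Deduplicate parallel property tables into unique records plus an index.
def combine(*tables):
    records, index = {}, []
    for t in zip(*tables):
        if t not in records:
            records[t] = len(records)   # record number = insertion order
        index.append(records[t])
    return index, records

# Two toy properties for three characters; the first two characters share
# the same (script, category) pair, so they share record 0.
idx, recs = combine([1, 1, 2], [5, 5, 9])
assert idx == [0, 0, 1]
assert recs == {(1, 5): 0, (2, 9): 1}
```

This is the step that makes the two-stage compression so effective: the stage tables only need to store small record numbers, and the wide records themselves are stored once each.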
def get_record_size_struct(records):
    size = 0
    structure = '/* When recompiling tables with a new Unicode version, please check the\n' + \
        'types in this structure definition from pcre_internal.h (the actual\n' + \
        'field names will be different):\n\ntypedef struct {\n'
    for i in range(len(records[0])):
        record_slice = map(lambda record: record[i], records)
        slice_type, slice_size = get_type_size(record_slice)
        # add padding: round up to the nearest power of slice_size
        size = (size + slice_size - 1) & -slice_size
        size += slice_size
        structure += '%s property_%d;\n' % (slice_type, i)

    # round up to the first item of the next structure in array
    record_slice = map(lambda record: record[0], records)
    slice_type, slice_size = get_type_size(record_slice)
    size = (size + slice_size - 1) & -slice_size

    structure += '} ucd_record;\n*/\n\n'
    return size, structure

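The padding lines above rely on a standard trick: for a power-of-two alignment a, `(size + a - 1) & -a` rounds size up to the next multiple of a, because -a in two's complement is a mask that clears the low log2(a) bits. A small Python 3 check (`round_up` is a name invented for this illustration):

```python
# Round size up to the next multiple of align, where align is a power of 2.
# -align has all bits set except the low log2(align) bits, so the AND clears
# exactly the remainder left after adding align - 1.
def round_up(size, align):
    return (size + align - 1) & -align

assert round_up(5, 4) == 8     # 5 -> next multiple of 4
assert round_up(8, 4) == 8     # already aligned: unchanged
assert round_up(0, 2) == 0
assert round_up(9, 8) == 16
```

This mirrors how a C compiler pads struct members, which is why the computed size agrees with sizeof(ucd_record).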
def test_record_size():
    tests = [ \
        ( [(3,), (6,), (6,), (1,)], 1 ), \
        ( [(300,), (600,), (600,), (100,)], 2 ), \
        ( [(25, 3), (6, 6), (34, 6), (68, 1)], 2 ), \
        ( [(300, 3), (6, 6), (340, 6), (690, 1)], 4 ), \
        ( [(3, 300), (6, 6), (6, 340), (1, 690)], 4 ), \
        ( [(300, 300), (6, 6), (6, 340), (1, 690)], 4 ), \
        ( [(3, 100000), (6, 6), (6, 123456), (1, 690)], 8 ), \
        ( [(100000, 300), (6, 6), (123456, 6), (1, 690)], 8 ), \
    ]
    for test in tests:
        size, struct = get_record_size_struct(test[0])
        assert(size == test[1])
        #print struct

def print_records(records, record_size):
    print 'const ucd_record PRIV(ucd_records)[] = { ' + \
        '/* %d bytes, record size %d */' % (len(records) * record_size, record_size)
    records = zip(records.keys(), records.values())
    records.sort(None, lambda x: x[1])
    for i, record in enumerate(records):
        print (' {' + '%6d, ' * len(record[0]) + '}, /* %3d */') % (record[0] + (i,))
    print '};\n'

script_names = ['Arabic', 'Armenian', 'Bengali', 'Bopomofo', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal', \
  'Cherokee', 'Common', 'Coptic', 'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Ethiopic', 'Georgian', \
  'Glagolitic', 'Gothic', 'Greek', 'Gujarati', 'Gurmukhi', 'Han', 'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', \
  'Inherited', 'Kannada', 'Katakana', 'Kharoshthi', 'Khmer', 'Lao', 'Latin', 'Limbu', 'Linear_B', 'Malayalam', \
  'Mongolian', 'Myanmar', 'New_Tai_Lue', 'Ogham', 'Old_Italic', 'Old_Persian', 'Oriya', 'Osmanya', 'Runic', \
  'Shavian', 'Sinhala', 'Syloti_Nagri', 'Syriac', 'Tagalog', 'Tagbanwa', 'Tai_Le', 'Tamil', 'Telugu', 'Thaana', \
  'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic', 'Yi', \
  # New for Unicode 5.0
  'Balinese', 'Cuneiform', 'Nko', 'Phags_Pa', 'Phoenician', \
  # New for Unicode 5.1
  'Carian', 'Cham', 'Kayah_Li', 'Lepcha', 'Lycian', 'Lydian', 'Ol_Chiki', 'Rejang', 'Saurashtra', 'Sundanese', 'Vai', \
  # New for Unicode 5.2
  'Avestan', 'Bamum', 'Egyptian_Hieroglyphs', 'Imperial_Aramaic', \
  'Inscriptional_Pahlavi', 'Inscriptional_Parthian', \
  'Javanese', 'Kaithi', 'Lisu', 'Meetei_Mayek', \
  'Old_South_Arabian', 'Old_Turkic', 'Samaritan', 'Tai_Tham', 'Tai_Viet', \
  # New for Unicode 6.0.0
  'Batak', 'Brahmi', 'Mandaic', \
  # New for Unicode 6.1.0
  'Chakma', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao', 'Sharada', 'Sora_Sompeng', 'Takri',
  # New for Unicode 7.0.0
  'Bassa_Vah', 'Caucasian_Albanian', 'Duployan', 'Elbasan', 'Grantha', 'Khojki', 'Khudawadi',
  'Linear_A', 'Mahajani', 'Manichaean', 'Mende_Kikakui', 'Modi', 'Mro', 'Nabataean',
  'Old_North_Arabian', 'Old_Permic', 'Pahawh_Hmong', 'Palmyrene', 'Psalter_Pahlavi',
  'Pau_Cin_Hau', 'Siddham', 'Tirhuta', 'Warang_Citi'
  ]

category_names = ['Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
  'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps',
  'Sc', 'Sk', 'Sm', 'So', 'Zl', 'Zp', 'Zs' ]

break_property_names = ['CR', 'LF', 'Control', 'Extend', 'Prepend',
  'SpacingMark', 'L', 'V', 'T', 'LV', 'LVT', 'Regional_Indicator', 'Other' ]

test_record_size()

script = read_table('Unicode.tables/Scripts.txt', make_get_names(script_names), script_names.index('Common'))
category = read_table('Unicode.tables/DerivedGeneralCategory.txt', make_get_names(category_names), category_names.index('Cn'))
break_props = read_table('Unicode.tables/GraphemeBreakProperty.txt', make_get_names(break_property_names), break_property_names.index('Other'))
other_case = read_table('Unicode.tables/CaseFolding.txt', get_other_case, 0)


# This block of code was added by PH in September 2012. I am not a Python
# programmer, so the style is probably dreadful, but it does the job. It scans
# the other_case table to find sets of more than two characters that must all
# match each other caselessly. Later in this script a table of these sets is
# written out. However, we have to do this work here in order to compute the
# offsets in the table that are inserted into the main table.

# The CaseFolding.txt file lists pairs, but the common logic for reading data
# sets only one value, so first we go through the table and set "return"
# offsets for those that are not already set.

for c in range(0x10ffff):
    if other_case[c] != 0 and other_case[c + other_case[c]] == 0:
        other_case[c + other_case[c]] = -other_case[c]

# Now scan again and create equivalence sets.

sets = []

for c in range(0x10ffff):
    o = c + other_case[c]

    # Trigger when this character's other case does not point back here. We
    # now have three characters that are case-equivalent.

    if other_case[o] != -other_case[c]:
        t = o + other_case[o]

        # Scan the existing sets to see if any of the three characters are
        # already part of a set. If so, unite the existing set with the new
        # set.

        appended = 0
        for s in sets:
            found = 0
            for x in s:
                if x == c or x == o or x == t:
                    found = 1

            # Add new characters to an existing set

            if found:
                found = 0
                for y in [c, o, t]:
                    for x in s:
                        if x == y:
                            found = 1
                    if not found:
                        s.append(y)
                appended = 1

        # If we have not added to an existing set, create a new one.

        if not appended:
            sets.append([c, o, t])

# End of loop looking for caseless sets.

# Now scan the sets and set appropriate offsets for the characters.

caseless_offsets = [0] * MAX_UNICODE

offset = 1
for s in sets:
    for x in s:
        caseless_offsets[x] = offset
    offset += len(s) + 1

# End of block of code for creating offsets for caseless matching sets.

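To see how these offsets will be used, here is a miniature (Python 3) of the ucd_caseless_sets layout that the script prints further down, using the k/K/Kelvin-sign set mentioned in the header comments. The construction below is a simplified sketch, not the script's own code:

```python
# The printed table starts with a lone NOTACHAR, so offset 0 means "not part
# of a caseless set"; each set is stored sorted and NOTACHAR-terminated, and
# every member's record carries the offset of the set's first element.
NOTACHAR = 0xffffffff
sets = [[0x6b, 0x4b, 0x212a]]          # k, K, KELVIN SIGN

table = [NOTACHAR]
offsets = {}
for s in sets:
    off = len(table)                   # offset stored in the main record
    for x in sorted(s):
        offsets[x] = off
        table.append(x)
    table.append(NOTACHAR)

assert table == [NOTACHAR, 0x4b, 0x6b, 0x212a, NOTACHAR]

# Matching walks from the stored offset to the NOTACHAR terminator.
run, i = [], offsets[0x6b]
while table[i] != NOTACHAR:
    run.append(table[i])
    i += 1
assert run == [0x4b, 0x6b, 0x212a]
```

This matches the offset arithmetic above: the first set starts at index 1, and each subsequent set starts len(s) + 1 entries later because of the terminator.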

# Combine the tables

table, records = combine_tables(script, category, break_props,
    caseless_offsets, other_case)

record_size, record_struct = get_record_size_struct(records.keys())

# Find the optimum block size for the two-stage table
min_size = sys.maxint
for block_size in [2 ** i for i in range(5,10)]:
    size = len(records) * record_size
    stage1, stage2 = compress_table(table, block_size)
    size += get_tables_size(stage1, stage2)
    #print "/* block size %5d => %5d bytes */" % (block_size, size)
    if size < min_size:
        min_size = size
        min_stage1, min_stage2 = stage1, stage2
        min_block_size = block_size

print "/* This module is generated by the maint/MultiStage2.py script."
print "Do not modify it by hand. Instead modify the script and run it"
print "to regenerate this code."
print
print "As well as being part of the PCRE library, this module is #included"
print "by the pcretest program, which redefines the PRIV macro to change"
print "table names from _pcre_xxx to xxxx, thereby avoiding name clashes"
print "with the library. At present, just one of these tables is actually"
print "needed. */"
print
print "#ifndef PCRE_INCLUDED"
print
print "#ifdef HAVE_CONFIG_H"
print "#include \"config.h\""
print "#endif"
print
print "#include \"pcre_internal.h\""
print
print "#endif /* PCRE_INCLUDED */"
print
print "/* Unicode character database. */"
print "/* This file was autogenerated by the MultiStage2.py script. */"
print "/* Total size: %d bytes, block size: %d. */" % (min_size, min_block_size)
print
print "/* The tables herein are needed only when UCP support is built"
print "into PCRE. This module should not be referenced otherwise, so"
print "it should not matter whether it is compiled or not. However"
print "a comment was received about space saving - maybe the guy linked"
print "all the modules rather than using a library - so we include a"
print "condition to cut out the tables when not needed. But don't leave"
print "a totally empty module because some compilers barf at that."
print "Instead, just supply small dummy tables. */"
print
print "#ifndef SUPPORT_UCP"
print "const ucd_record PRIV(ucd_records)[] = {{0,0,0,0,0 }};"
print "const pcre_uint8 PRIV(ucd_stage1)[] = {0};"
print "const pcre_uint16 PRIV(ucd_stage2)[] = {0};"
print "const pcre_uint32 PRIV(ucd_caseless_sets)[] = {0};"
print "#else"
print
print "/* If the 32-bit library is run in non-32-bit mode, character values"
print "greater than 0x10ffff may be encountered. For these we set up a"
print "special record. */"
print
print "#ifdef COMPILE_PCRE32"
print "const ucd_record PRIV(dummy_ucd_record)[] = {{"
print "  ucp_Common,    /* script */"
print "  ucp_Cn,        /* type unassigned */"
print "  ucp_gbOther,   /* grapheme break property */"
print "  0,             /* case set */"
print "  0,             /* other case */"
print "  }};"
print "#endif"
print
print record_struct

# --- Added by PH: output the table of caseless character sets ---

print "const pcre_uint32 PRIV(ucd_caseless_sets)[] = {"
print "  NOTACHAR,"
for s in sets:
    s = sorted(s)
    for x in s:
        print '  0x%04x,' % x,
    print ' NOTACHAR,'
print '};'
print

# ------

print "/* When #included in pcretest, we don't need this large table. */"
print
print "#ifndef PCRE_INCLUDED"
print
print_records(records, record_size)
print_table(min_stage1, 'PRIV(ucd_stage1)')
print_table(min_stage2, 'PRIV(ucd_stage2)', min_block_size)
print "#if UCD_BLOCK_SIZE != %d" % min_block_size
print "#error Please correct UCD_BLOCK_SIZE in pcre_internal.h"
print "#endif"
print "#endif /* SUPPORT_UCP */"
print
print "#endif /* PCRE_INCLUDED */"

"""

# Three-stage tables:

# Find the optimum block size for 3-stage table
min_size = sys.maxint
for stage3_block in [2 ** i for i in range(2,6)]:
    stage_i, stage3 = compress_table(table, stage3_block)
    for stage2_block in [2 ** i for i in range(5,10)]:
        size = len(records) * 4
        stage1, stage2 = compress_table(stage_i, stage2_block)
        size += get_tables_size(stage1, stage2, stage3)
        # print "/* %5d / %3d => %5d bytes */" % (stage2_block, stage3_block, size)
        if size < min_size:
            min_size = size
            min_stage1, min_stage2, min_stage3 = stage1, stage2, stage3
            min_stage2_block, min_stage3_block = stage2_block, stage3_block

print "/* Total size: %d bytes" % min_size */
print_records(records)
print_table(min_stage1, 'ucd_stage1')
print_table(min_stage2, 'ucd_stage2', min_stage2_block)
print_table(min_stage3, 'ucd_stage3', min_stage3_block)

"""

Properties

Name Value
svn:executable *
