Contents of /code/trunk/pcre_dfa_exec.c

Revision 1033
Mon Sep 10 11:02:48 2012 UTC, by ph10
File MIME type: text/plain
File size: 126727 byte(s)
General spring-clean of EBCDIC-related issues in the code, which had decayed 
over time. Also the documentation. Added one test that can be run in an ASCII
world to do a little testing of EBCDIC-related things. 
1 /*************************************************
2 * Perl-Compatible Regular Expressions *
3 *************************************************/
4
5 /* PCRE is a library of functions to support regular expressions whose syntax
6 and semantics are as close as possible to those of the Perl 5 language (but see
7 below for why this module is different).
8
9 Written by Philip Hazel
10 Copyright (c) 1997-2012 University of Cambridge
11
12 -----------------------------------------------------------------------------
13 Redistribution and use in source and binary forms, with or without
14 modification, are permitted provided that the following conditions are met:
15
16 * Redistributions of source code must retain the above copyright notice,
17 this list of conditions and the following disclaimer.
18
19 * Redistributions in binary form must reproduce the above copyright
20 notice, this list of conditions and the following disclaimer in the
21 documentation and/or other materials provided with the distribution.
22
23 * Neither the name of the University of Cambridge nor the names of its
24 contributors may be used to endorse or promote products derived from
25 this software without specific prior written permission.
26
27 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
28 AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
29 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
30 ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
31 LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
32 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
33 SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
34 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
35 CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
36 ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
37 POSSIBILITY OF SUCH DAMAGE.
38 -----------------------------------------------------------------------------
39 */
40
41 /* This module contains the external function pcre_dfa_exec(), which is an
42 alternative matching function that uses a sort of DFA algorithm (not a true
43 FSM). This is NOT Perl-compatible, but it has advantages in certain
44 applications. */
45
46
47 /* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
48 the performance of his patterns greatly. I could not use it as it stood, as it
49 was not thread safe, and made assumptions about pattern sizes. Also, it caused
50 test 7 to loop, and test 9 to crash with a segfault.
51
52 The issue is the check for duplicate states, which is done by a simple linear
53 search up the state list. (Grep for "duplicate" below to find the code.) For
54 many patterns, there will never be many states active at one time, so a simple
55 linear search is fine. In patterns that have many active states, it might be a
56 bottleneck. The suggested code used an indexing scheme to remember which states
57 had previously been used for each character, and avoided the linear search when
58 it knew there was no chance of a duplicate. This was implemented when adding
59 states to the state lists.
60
61 I wrote some thread-safe, not-limited code to try something similar at the time
62 of checking for duplicates (instead of when adding states), using index vectors
63 on the stack. It did give a 13% improvement with one specially constructed
64 pattern for certain subject strings, but on other strings and on many of the
65 simpler patterns in the test suite it did worse. The major problem, I think,
66 was the extra time to initialize the index. This had to be done for each call
67 of internal_dfa_exec(). (The supplied patch used a static vector, initialized
68 only once - I suspect this was the cause of the problems with the tests.)
69
70 Overall, I concluded that the gains in some cases did not outweigh the losses
71 in others, so I abandoned this code. */
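/* Illustration only, not part of the module: a minimal sketch of the linear
duplicate check that the note above describes. The names demo_state and
is_duplicate_state are hypothetical; the real test is done inline in
internal_dfa_exec() on stateblock entries. */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the module's stateblock; only the two fields
   that the duplicate test compares are included. */
typedef struct demo_state {
  int offset;   /* offset of the opcode within the compiled pattern */
  int count;    /* repeat count reached so far */
} demo_state;

/* Return 1 if states[i] has the same offset and count as any earlier entry
   states[0..i-1], otherwise 0. The cost grows linearly with i, which is why
   the note calls it a possible bottleneck when many states are active. */
static int
is_duplicate_state(const demo_state *states, int i)
{
  int j;
  assert(states != NULL && i >= 0);
  for (j = 0; j < i; j++)
    {
    if (states[j].offset == states[i].offset &&
        states[j].count == states[i].count)
      return 1;
    }
  return 0;
}
```

/* With few active states the loop above is cheap, which matches the note's
conclusion that an index would have to be rebuilt on every call and so may
cost more than it saves. */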
72
73
74
75 #ifdef HAVE_CONFIG_H
76 #include "config.h"
77 #endif
78
79 #define NLBLOCK md /* Block containing newline information */
80 #define PSSTART start_subject /* Field containing processed string start */
81 #define PSEND end_subject /* Field containing processed string end */
82
83 #include "pcre_internal.h"
84
85
86 /* For use to indent debugging output */
87
88 #define SP " "
89
90
91 /*************************************************
92 * Code parameters and static tables *
93 *************************************************/
94
95 /* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
96 into others, under special conditions. A gap of 20 between the blocks should be
97 enough. The resulting opcodes don't have to be less than 256 because they are
98 never stored, so we push them well clear of the normal opcodes. */
99
100 #define OP_PROP_EXTRA 300
101 #define OP_EXTUNI_EXTRA 320
102 #define OP_ANYNL_EXTRA 340
103 #define OP_HSPACE_EXTRA 360
104 #define OP_VSPACE_EXTRA 380
105
106
107 /* This table identifies those opcodes that are followed immediately by a
108 character that is to be tested in some way. This makes it possible to
109 centralize the loading of these characters. In the case of Type * etc, the
110 "character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
111 small value. Non-zero values in the table are the offsets from the opcode where
112 the character is to be found. ***NOTE*** If the start of this table is
113 modified, the three tables that follow must also be modified. */
114
115 static const pcre_uint8 coptable[] = {
116 0, /* End */
117 0, 0, 0, 0, 0, /* \A, \G, \K, \B, \b */
118 0, 0, 0, 0, 0, 0, /* \D, \d, \S, \s, \W, \w */
119 0, 0, 0, /* Any, AllAny, Anybyte */
120 0, 0, /* \P, \p */
121 0, 0, 0, 0, 0, /* \R, \H, \h, \V, \v */
122 0, /* \X */
123 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
124 1, /* Char */
125 1, /* Chari */
126 1, /* not */
127 1, /* noti */
128 /* Positive single-char repeats */
129 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
130 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto, minupto */
131 1+IMM2_SIZE, /* exact */
132 1, 1, 1, 1+IMM2_SIZE, /* *+, ++, ?+, upto+ */
133 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
134 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto I, minupto I */
135 1+IMM2_SIZE, /* exact I */
136 1, 1, 1, 1+IMM2_SIZE, /* *+I, ++I, ?+I, upto+I */
137 /* Negative single-char repeats - only for chars < 256 */
138 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
139 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto, minupto */
140 1+IMM2_SIZE, /* NOT exact */
141 1, 1, 1, 1+IMM2_SIZE, /* NOT *+, ++, ?+, upto+ */
142 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
143 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto I, minupto I */
144 1+IMM2_SIZE, /* NOT exact I */
145 1, 1, 1, 1+IMM2_SIZE, /* NOT *+I, ++I, ?+I, upto+I */
146 /* Positive type repeats */
147 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
148 1+IMM2_SIZE, 1+IMM2_SIZE, /* Type upto, minupto */
149 1+IMM2_SIZE, /* Type exact */
150 1, 1, 1, 1+IMM2_SIZE, /* Type *+, ++, ?+, upto+ */
151 /* Character class & ref repeats */
152 0, 0, 0, 0, 0, 0, /* *, *?, +, +?, ?, ?? */
153 0, 0, /* CRRANGE, CRMINRANGE */
154 0, /* CLASS */
155 0, /* NCLASS */
156 0, /* XCLASS - variable length */
157 0, /* REF */
158 0, /* REFI */
159 0, /* RECURSE */
160 0, /* CALLOUT */
161 0, /* Alt */
162 0, /* Ket */
163 0, /* KetRmax */
164 0, /* KetRmin */
165 0, /* KetRpos */
166 0, /* Reverse */
167 0, /* Assert */
168 0, /* Assert not */
169 0, /* Assert behind */
170 0, /* Assert behind not */
171 0, 0, /* ONCE, ONCE_NC */
172 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
173 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
174 0, 0, /* CREF, NCREF */
175 0, 0, /* RREF, NRREF */
176 0, /* DEF */
177 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
178 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
179 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
180 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
181 0, 0 /* CLOSE, SKIPZERO */
182 };
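
/* Illustration only, not part of the module: a small sketch of how a table
such as coptable lets one piece of code fetch the operand character for every
opcode. The four-opcode set, DEMO_IMM2_SIZE and demo_operand are hypothetical
and much smaller than the real opcode list. */

```c
#include <assert.h>

#define DEMO_IMM2_SIZE 2   /* assumed size of an inline 16-bit repeat count */

/* Hypothetical miniature opcode set, in table order. */
enum { DEMO_END, DEMO_CHAR, DEMO_STAR, DEMO_UPTO, DEMO_TABLE_LENGTH };

/* Offset from the opcode to its operand character; 0 means "no operand". */
static const unsigned char demo_coptable[] = {
  0,                     /* DEMO_END */
  1,                     /* DEMO_CHAR: character follows the opcode */
  1,                     /* DEMO_STAR: character follows the opcode */
  1 + DEMO_IMM2_SIZE     /* DEMO_UPTO: count first, then the character */
};

/* Return the operand character of the opcode at code[0], or -1 if the
   opcode has none. One lookup replaces a per-opcode switch. */
static int
demo_operand(const unsigned char *code)
{
  int off;
  assert(code[0] < DEMO_TABLE_LENGTH);
  off = demo_coptable[code[0]];
  return (off > 0)? code[off] : -1;
}
```

/* A zero entry doubles as the "nothing to load" flag, so an offset of zero
can never be a real operand position. */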
183
184 /* This table identifies those opcodes that inspect a character. It is used to
185 remember the fact that a character could have been inspected when the end of
186 the subject is reached. ***NOTE*** If the start of this table is modified, the
187 two tables that follow must also be modified. */
188
189 static const pcre_uint8 poptable[] = {
190 0, /* End */
191 0, 0, 0, 1, 1, /* \A, \G, \K, \B, \b */
192 1, 1, 1, 1, 1, 1, /* \D, \d, \S, \s, \W, \w */
193 1, 1, 1, /* Any, AllAny, Anybyte */
194 1, 1, /* \P, \p */
195 1, 1, 1, 1, 1, /* \R, \H, \h, \V, \v */
196 1, /* \X */
197 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
198 1, /* Char */
199 1, /* Chari */
200 1, /* not */
201 1, /* noti */
202 /* Positive single-char repeats */
203 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
204 1, 1, 1, /* upto, minupto, exact */
205 1, 1, 1, 1, /* *+, ++, ?+, upto+ */
206 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
207 1, 1, 1, /* upto I, minupto I, exact I */
208 1, 1, 1, 1, /* *+I, ++I, ?+I, upto+I */
209 /* Negative single-char repeats - only for chars < 256 */
210 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
211 1, 1, 1, /* NOT upto, minupto, exact */
212 1, 1, 1, 1, /* NOT *+, ++, ?+, upto+ */
213 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
214 1, 1, 1, /* NOT upto I, minupto I, exact I */
215 1, 1, 1, 1, /* NOT *+I, ++I, ?+I, upto+I */
216 /* Positive type repeats */
217 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
218 1, 1, 1, /* Type upto, minupto, exact */
219 1, 1, 1, 1, /* Type *+, ++, ?+, upto+ */
220 /* Character class & ref repeats */
221 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
222 1, 1, /* CRRANGE, CRMINRANGE */
223 1, /* CLASS */
224 1, /* NCLASS */
225 1, /* XCLASS - variable length */
226 0, /* REF */
227 0, /* REFI */
228 0, /* RECURSE */
229 0, /* CALLOUT */
230 0, /* Alt */
231 0, /* Ket */
232 0, /* KetRmax */
233 0, /* KetRmin */
234 0, /* KetRpos */
235 0, /* Reverse */
236 0, /* Assert */
237 0, /* Assert not */
238 0, /* Assert behind */
239 0, /* Assert behind not */
240 0, 0, /* ONCE, ONCE_NC */
241 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
242 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
243 0, 0, /* CREF, NCREF */
244 0, 0, /* RREF, NRREF */
245 0, /* DEF */
246 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
247 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
248 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
249 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
250 0, 0 /* CLOSE, SKIPZERO */
251 };
252
253 /* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W,
254 and \w */
255
256 static const pcre_uint8 toptable1[] = {
257 0, 0, 0, 0, 0, 0,
258 ctype_digit, ctype_digit,
259 ctype_space, ctype_space,
260 ctype_word, ctype_word,
261 0, 0 /* OP_ANY, OP_ALLANY */
262 };
263
264 static const pcre_uint8 toptable2[] = {
265 0, 0, 0, 0, 0, 0,
266 ctype_digit, 0,
267 ctype_space, 0,
268 ctype_word, 0,
269 1, 1 /* OP_ANY, OP_ALLANY */
270 };
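
/* Illustration only, not part of the module: a sketch of how a mask table and
an XOR table of this shape can test a character type and its negation with one
expression. The expression itself is an assumption inferred from the table
contents, because the code that uses toptable1 and toptable2 is not shown at
this point. The bit values, the type numbering and all demo_ names are
hypothetical, and <ctype.h> stands in for the PCRE character tables. */

```c
#include <assert.h>
#include <ctype.h>

#define DEMO_SPACE 0x01   /* hypothetical ctype bits */
#define DEMO_DIGIT 0x04
#define DEMO_WORD  0x10

/* Hypothetical type codes, ordered as negated/plain pairs like \D, \d. */
enum { DEMO_NOT_DIGIT, DEMO_IS_DIGIT, DEMO_NOT_SPACE, DEMO_IS_SPACE,
       DEMO_NOT_WORD, DEMO_IS_WORD, DEMO_ANYCHAR, DEMO_NTYPES };

/* Mask table: which ctype bit the type cares about. */
static const unsigned char demo_top1[] = {
  DEMO_DIGIT, DEMO_DIGIT, DEMO_SPACE, DEMO_SPACE, DEMO_WORD, DEMO_WORD, 0 };

/* XOR table: the same bit for negated types (flips the result), 0 for plain
   types, and 1 for "any character" so that the result is always non-zero. */
static const unsigned char demo_top2[] = {
  DEMO_DIGIT, 0, DEMO_SPACE, 0, DEMO_WORD, 0, 1 };

/* Build the ctype bits for one character. */
static unsigned char
demo_ctype(unsigned char c)
{
  unsigned char bits = 0;
  if (isdigit(c)) bits |= DEMO_DIGIT;
  if (isspace(c)) bits |= DEMO_SPACE;
  if (isalnum(c) || c == '_') bits |= DEMO_WORD;
  return bits;
}

/* Return 1 if character c satisfies the given type, otherwise 0. */
static int
demo_type_matches(int type, unsigned char c)
{
  assert(type >= 0 && type < DEMO_NTYPES);
  return ((demo_ctype(c) & demo_top1[type]) ^ demo_top2[type]) != 0;
}
```

/* The point of the pairing is that a type and its negation share one code
path: only the XOR entry differs between them. */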
271
272
273 /* Structure for holding data about a particular state, which is in effect the
274 current data for an active path through the match tree. It must consist
275 entirely of ints because the working vector we are passed, and which we put
276 these structures in, is a vector of ints. */
277
278 typedef struct stateblock {
279 int offset; /* Offset to opcode */
280 int count; /* Count for repeats */
281 int data; /* Some use extra data */
282 } stateblock;
283
284 #define INTS_PER_STATEBLOCK (int)(sizeof(stateblock)/sizeof(int))
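
/* Illustration only, not part of the module: a sketch of how a caller-supplied
vector of ints can be divided into two equal state vectors, following the
arithmetic used at the start of internal_dfa_exec() (two ints reserved at the
front, the rest split in half in whole blocks). The demo_ names are
hypothetical. */

```c
#include <assert.h>

/* Hypothetical copy of the three-int state record. */
typedef struct demo_block {
  int offset;
  int count;
  int data;
} demo_block;

#define DEMO_INTS_PER_BLOCK (int)(sizeof(demo_block)/sizeof(int))

/* Number of blocks available in EACH of the two vectors, given a workspace
   of wscount ints whose first two ints are reserved for bookkeeping. */
static int
demo_blocks_per_vector(int wscount)
{
  assert(wscount >= 2);
  wscount -= 2;
  return (wscount - (wscount % (DEMO_INTS_PER_BLOCK * 2))) /
         (2 * DEMO_INTS_PER_BLOCK);
}

/* Point *active and *fresh at the two vectors inside the workspace. */
static void
demo_split(int *workspace, int wscount, demo_block **active,
  demo_block **fresh)
{
  int n = demo_blocks_per_vector(wscount);
  *active = (demo_block *)(workspace + 2);
  *fresh = *active + n;
}
```

/* Because the record consists only of ints, the cast from int * needs no
extra alignment, which is the reason given above for the all-int rule. */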
285
286
287 #ifdef PCRE_DEBUG
288 /*************************************************
289 * Print character string *
290 *************************************************/
291
292 /* Character string printing function for debugging.
293
294 Arguments:
295 p points to string
296 length number of bytes
297 f where to print
298
299 Returns: nothing
300 */
301
302 static void
303 pchars(const pcre_uchar *p, int length, FILE *f)
304 {
305 int c;
306 while (length-- > 0)
307 {
308 if (isprint(c = *(p++)))
309 fprintf(f, "%c", c);
310 else
311 fprintf(f, "\\x%02x", c);
312 }
313 }
314 #endif
315
316
317
318 /*************************************************
319 * Execute a Regular Expression - DFA engine *
320 *************************************************/
321
322 /* This internal function applies a compiled pattern to a subject string,
323 starting at a given point, using a DFA engine. This function is called from the
324 external one, possibly multiple times if the pattern is not anchored. The
325 function calls itself recursively for some kinds of subpattern.
326
327 Arguments:
328 md the match_data block with fixed information
329 this_start_code the opening bracket of this subexpression's code
330 current_subject where we currently are in the subject string
331 start_offset start offset in the subject string
332 offsets vector to contain the matching string offsets
333 offsetcount size of same
334 workspace vector of workspace
335 wscount size of same
336 rlevel function call recursion level
337
338 Returns: > 0 => number of match offset pairs placed in offsets
339 = 0 => offsets overflowed; longest matches are present
340 -1 => failed to match
341 < -1 => some kind of unexpected problem
342
343 The following macros are used for adding states to the two state vectors (one
344 for the current character, one for the following character). */
345
346 #define ADD_ACTIVE(x,y) \
347 if (active_count++ < wscount) \
348 { \
349 next_active_state->offset = (x); \
350 next_active_state->count = (y); \
351 next_active_state++; \
352 DPRINTF(("%.*sADD_ACTIVE(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
353 } \
354 else return PCRE_ERROR_DFA_WSSIZE
355
356 #define ADD_ACTIVE_DATA(x,y,z) \
357 if (active_count++ < wscount) \
358 { \
359 next_active_state->offset = (x); \
360 next_active_state->count = (y); \
361 next_active_state->data = (z); \
362 next_active_state++; \
363 DPRINTF(("%.*sADD_ACTIVE_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
364 } \
365 else return PCRE_ERROR_DFA_WSSIZE
366
367 #define ADD_NEW(x,y) \
368 if (new_count++ < wscount) \
369 { \
370 next_new_state->offset = (x); \
371 next_new_state->count = (y); \
372 next_new_state++; \
373 DPRINTF(("%.*sADD_NEW(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
374 } \
375 else return PCRE_ERROR_DFA_WSSIZE
376
377 #define ADD_NEW_DATA(x,y,z) \
378 if (new_count++ < wscount) \
379 { \
380 next_new_state->offset = (x); \
381 next_new_state->count = (y); \
382 next_new_state->data = (z); \
383 next_new_state++; \
384 DPRINTF(("%.*sADD_NEW_DATA(%d,%d,%d) line %d\n", rlevel*2-2, SP, \
385 (x), (y), (z), __LINE__)); \
386 } \
387 else return PCRE_ERROR_DFA_WSSIZE
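
/* Illustration only, not part of the module: the same bounded-append pattern
as the ADD_ macros above, written as a function so that it can be tried in
isolation. The names demo_entry, demo_list, demo_add and the error value
DEMO_ERR_FULL are hypothetical; the real macros return
PCRE_ERROR_DFA_WSSIZE directly from the enclosing function. */

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ERR_FULL (-1)   /* hypothetical "workspace too small" code */

typedef struct demo_entry {
  int offset;
  int count;
} demo_entry;

typedef struct demo_list {
  demo_entry *next;   /* where the next entry will be written */
  int used;           /* number of append attempts so far */
  int capacity;       /* number of entries that fit */
} demo_list;

/* Append (offset, count) if there is room and return 0; otherwise return
   DEMO_ERR_FULL. As in the macros, the counter is incremented even when the
   append fails, since the caller gives up at that point anyway. */
static int
demo_add(demo_list *list, int offset, int count)
{
  assert(list != NULL);
  if (list->used++ < list->capacity)
    {
    list->next->offset = offset;
    list->next->count = count;
    list->next++;
    return 0;
    }
  return DEMO_ERR_FULL;
}
```

/* The macro form exists so that the failure case can return from the calling
function; a helper function like this one would need an explicit check at
every call site instead. */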
388
389 /* And now, here is the code */
390
391 static int
392 internal_dfa_exec(
393 dfa_match_data *md,
394 const pcre_uchar *this_start_code,
395 const pcre_uchar *current_subject,
396 int start_offset,
397 int *offsets,
398 int offsetcount,
399 int *workspace,
400 int wscount,
401 int rlevel)
402 {
403 stateblock *active_states, *new_states, *temp_states;
404 stateblock *next_active_state, *next_new_state;
405
406 const pcre_uint8 *ctypes, *lcc, *fcc;
407 const pcre_uchar *ptr;
408 const pcre_uchar *end_code, *first_op;
409
410 dfa_recursion_info new_recursive;
411
412 int active_count, new_count, match_count;
413
414 /* Some fields in the md block are frequently referenced, so we load them into
415 independent variables in the hope that this will perform better. */
416
417 const pcre_uchar *start_subject = md->start_subject;
418 const pcre_uchar *end_subject = md->end_subject;
419 const pcre_uchar *start_code = md->start_code;
420
421 #ifdef SUPPORT_UTF
422 BOOL utf = (md->poptions & PCRE_UTF8) != 0;
423 #else
424 BOOL utf = FALSE;
425 #endif
426
427 BOOL reset_could_continue = FALSE;
428
429 rlevel++;
430 offsetcount &= (-2);
431
432 wscount -= 2;
433 wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
434 (2 * INTS_PER_STATEBLOCK);
435
436 DPRINTF(("\n%.*s---------------------\n"
437 "%.*sCall to internal_dfa_exec f=%d\n",
438 rlevel*2-2, SP, rlevel*2-2, SP, rlevel));
439
440 ctypes = md->tables + ctypes_offset;
441 lcc = md->tables + lcc_offset;
442 fcc = md->tables + fcc_offset;
443
444 match_count = PCRE_ERROR_NOMATCH; /* A negative number */
445
446 active_states = (stateblock *)(workspace + 2);
447 next_new_state = new_states = active_states + wscount;
448 new_count = 0;
449
450 first_op = this_start_code + 1 + LINK_SIZE +
451 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
452 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
453 ? IMM2_SIZE:0);
454
455 /* The first thing in any (sub) pattern is a bracket of some sort. Push all
456 the alternative states onto the list, and find out where the end is. This
457 makes it possible to use this function recursively, when we want to stop at a
458 matching internal ket rather than at the end.
459
460 If the first opcode in the first alternative is OP_REVERSE, we are dealing with
461 a backward assertion. In that case, we have to find out the maximum amount to
462 move back, and set up each alternative appropriately. */
463
464 if (*first_op == OP_REVERSE)
465 {
466 int max_back = 0;
467 int gone_back;
468
469 end_code = this_start_code;
470 do
471 {
472 int back = GET(end_code, 2+LINK_SIZE);
473 if (back > max_back) max_back = back;
474 end_code += GET(end_code, 1);
475 }
476 while (*end_code == OP_ALT);
477
478 /* If we can't go back the amount required for the longest lookbehind
479 pattern, go back as far as we can; some alternatives may still be viable. */
480
481 #ifdef SUPPORT_UTF
482 /* In character mode we have to step back character by character */
483
484 if (utf)
485 {
486 for (gone_back = 0; gone_back < max_back; gone_back++)
487 {
488 if (current_subject <= start_subject) break;
489 current_subject--;
490 ACROSSCHAR(current_subject > start_subject, *current_subject, current_subject--);
491 }
492 }
493 else
494 #endif
495
496 /* In byte-mode we can do this quickly. */
497
498 {
499 gone_back = (current_subject - max_back < start_subject)?
500 (int)(current_subject - start_subject) : max_back;
501 current_subject -= gone_back;
502 }
503
504 /* Save the earliest consulted character */
505
506 if (current_subject < md->start_used_ptr)
507 md->start_used_ptr = current_subject;
508
509 /* Now we can process the individual branches. */
510
511 end_code = this_start_code;
512 do
513 {
514 int back = GET(end_code, 2+LINK_SIZE);
515 if (back <= gone_back)
516 {
517 int bstate = (int)(end_code - start_code + 2 + 2*LINK_SIZE);
518 ADD_NEW_DATA(-bstate, 0, gone_back - back);
519 }
520 end_code += GET(end_code, 1);
521 }
522 while (*end_code == OP_ALT);
523 }
524
525 /* This is the code for a "normal" subpattern (not a backward assertion). The
526 start of a whole pattern is always one of these. If we are at the top level,
527 we may be asked to restart matching from the same point that we reached for a
528 previous partial match. We still have to scan through the top-level branches to
529 find the end state. */
530
531 else
532 {
533 end_code = this_start_code;
534
535 /* Restarting */
536
537 if (rlevel == 1 && (md->moptions & PCRE_DFA_RESTART) != 0)
538 {
539 do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
540 new_count = workspace[1];
541 if (!workspace[0])
542 memcpy(new_states, active_states, new_count * sizeof(stateblock));
543 }
544
545 /* Not restarting */
546
547 else
548 {
549 int length = 1 + LINK_SIZE +
550 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
551 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
552 ? IMM2_SIZE:0);
553 do
554 {
555 ADD_NEW((int)(end_code - start_code + length), 0);
556 end_code += GET(end_code, 1);
557 length = 1 + LINK_SIZE;
558 }
559 while (*end_code == OP_ALT);
560 }
561 }
562
563 workspace[0] = 0; /* Bit indicating which vector is current */
564
565 DPRINTF(("%.*sEnd state = %d\n", rlevel*2-2, SP, (int)(end_code - start_code)));
566
567 /* Loop for scanning the subject */
568
569 ptr = current_subject;
570 for (;;)
571 {
572 int i, j;
573 int clen, dlen;
574 unsigned int c, d;
575 int forced_fail = 0;
576 BOOL partial_newline = FALSE;
577 BOOL could_continue = reset_could_continue;
578 reset_could_continue = FALSE;
579
580 /* Make the new state list into the active state list and empty the
581 new state list. */
582
583 temp_states = active_states;
584 active_states = new_states;
585 new_states = temp_states;
586 active_count = new_count;
587 new_count = 0;
588
589 workspace[0] ^= 1; /* Remember for the restarting feature */
590 workspace[1] = active_count;
591
592 #ifdef PCRE_DEBUG
593 printf("%.*sNext character: rest of subject = \"", rlevel*2-2, SP);
594 pchars(ptr, STRLEN_UC(ptr), stdout);
595 printf("\"\n");
596
597 printf("%.*sActive states: ", rlevel*2-2, SP);
598 for (i = 0; i < active_count; i++)
599 printf("%d/%d ", active_states[i].offset, active_states[i].count);
600 printf("\n");
601 #endif
602
603 /* Set the pointers for adding new states */
604
605 next_active_state = active_states + active_count;
606 next_new_state = new_states;
607
608 /* Load the current character from the subject outside the loop, as many
609 different states may want to look at it, and we assume that at least one
610 will. */
611
612 if (ptr < end_subject)
613 {
614 clen = 1; /* Number of data items in the character */
615 #ifdef SUPPORT_UTF
616 if (utf) { GETCHARLEN(c, ptr, clen); } else
617 #endif /* SUPPORT_UTF */
618 c = *ptr;
619 }
620 else
621 {
622 clen = 0; /* This indicates the end of the subject */
623 c = NOTACHAR; /* This value should never actually be used */
624 }
625
626 /* Scan up the active states and act on each one. The result of an action
627 may be to add more states to the currently active list (e.g. on hitting a
628 parenthesis) or it may be to put states on the new list, for considering
629 when we move the character pointer on. */
630
631 for (i = 0; i < active_count; i++)
632 {
633 stateblock *current_state = active_states + i;
634 BOOL caseless = FALSE;
635 const pcre_uchar *code;
636 int state_offset = current_state->offset;
637 int count, codevalue, rrc;
638
639 #ifdef PCRE_DEBUG
640 printf ("%.*sProcessing state %d c=", rlevel*2-2, SP, state_offset);
641 if (clen == 0) printf("EOL\n");
642 else if (c > 32 && c < 127) printf("'%c'\n", c);
643 else printf("0x%02x\n", c);
644 #endif
645
646 /* A negative offset is a special case meaning "hold off going to this
647 (negated) state until the number of characters in the data field have
648 been skipped". If the could_continue flag was passed over from a previous
649 state, arrange for it to be passed on. */
650
651 if (state_offset < 0)
652 {
653 if (current_state->data > 0)
654 {
655 DPRINTF(("%.*sSkipping this character\n", rlevel*2-2, SP));
656 ADD_NEW_DATA(state_offset, current_state->count,
657 current_state->data - 1);
658 if (could_continue) reset_could_continue = TRUE;
659 continue;
660 }
661 else
662 {
663 current_state->offset = state_offset = -state_offset;
664 }
665 }
666
667 /* Check for a duplicate state with the same count, and skip if found.
668 See the note at the head of this module about the possibility of improving
669 performance here. */
670
671 for (j = 0; j < i; j++)
672 {
673 if (active_states[j].offset == state_offset &&
674 active_states[j].count == current_state->count)
675 {
676 DPRINTF(("%.*sDuplicate state: skipped\n", rlevel*2-2, SP));
677 goto NEXT_ACTIVE_STATE;
678 }
679 }
680
681 /* The state offset is the offset to the opcode */
682
683 code = start_code + state_offset;
684 codevalue = *code;
685
686 /* If this opcode inspects a character, but we are at the end of the
687 subject, remember the fact for use when testing for a partial match. */
688
689 if (clen == 0 && poptable[codevalue] != 0)
690 could_continue = TRUE;
691
692 /* If this opcode is followed by an inline character, load it. It is
693 tempting to test for the presence of a subject character here, but that
694 is wrong, because sometimes zero repetitions of the subject are
695 permitted.
696
697 We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
698 argument that is not a data character - but is always one byte long because
699 the values are small. We have to take special action to deal with \P, \p,
700 \H, \h, \V, \v and \X in this case. To keep the other cases fast, convert
701 these ones to new opcodes. */
702
703 if (coptable[codevalue] > 0)
704 {
705 dlen = 1;
706 #ifdef SUPPORT_UTF
707 if (utf) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
708 #endif /* SUPPORT_UTF */
709 d = code[coptable[codevalue]];
710 if (codevalue >= OP_TYPESTAR)
711 {
712 switch(d)
713 {
714 case OP_ANYBYTE: return PCRE_ERROR_DFA_UITEM;
715 case OP_NOTPROP:
716 case OP_PROP: codevalue += OP_PROP_EXTRA; break;
717 case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
718 case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
719 case OP_NOT_HSPACE:
720 case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
721 case OP_NOT_VSPACE:
722 case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
723 default: break;
724 }
725 }
726 }
727 else
728 {
729 dlen = 0; /* Not strictly necessary, but compilers moan */
730 d = NOTACHAR; /* if these variables are not set. */
731 }
732
733
734 /* Now process the individual opcodes */
735
736 switch (codevalue)
737 {
738 /* ========================================================================== */
739 /* These cases are never obeyed. This is a fudge that causes a compile-
740 time error if the vectors coptable or poptable, which are indexed by
741 opcode, are not the correct length. It seems to be the only way to do
742 such a check at compile time, as the sizeof() operator does not work
743 in the C preprocessor. */
744
745 case OP_TABLE_LENGTH:
746 case OP_TABLE_LENGTH +
747 ((sizeof(coptable) == OP_TABLE_LENGTH) &&
748 (sizeof(poptable) == OP_TABLE_LENGTH)):
749 break;
750
751 /* ========================================================================== */
752 /* Reached a closing bracket. If not at the end of the pattern, carry
753 on with the next opcode. For repeating opcodes, also add the repeat
754 state. Note that KETRPOS will always be encountered at the end of the
755 subpattern, because the possessive subpattern repeats are always handled
756 using recursive calls. Thus, it never adds any new states.
757
758 At the end of the (sub)pattern, unless we have an empty string and
759 PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
760 start of the subject, save the match data, shifting up all previous
761 matches so we always have the longest first. */
762
763 case OP_KET:
764 case OP_KETRMIN:
765 case OP_KETRMAX:
766 case OP_KETRPOS:
767 if (code != end_code)
768 {
769 ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
770 if (codevalue != OP_KET)
771 {
772 ADD_ACTIVE(state_offset - GET(code, 1), 0);
773 }
774 }
775 else
776 {
777 if (ptr > current_subject ||
778 ((md->moptions & PCRE_NOTEMPTY) == 0 &&
779 ((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
780 current_subject > start_subject + md->start_offset)))
781 {
782 if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
783 else if (match_count > 0 && ++match_count * 2 > offsetcount)
784 match_count = 0;
785 count = ((match_count == 0)? offsetcount : match_count * 2) - 2;
786 if (count > 0) memmove(offsets + 2, offsets, count * sizeof(int));
787 if (offsetcount >= 2)
788 {
789 offsets[0] = (int)(current_subject - start_subject);
790 offsets[1] = (int)(ptr - start_subject);
791 DPRINTF(("%.*sSet matched string = \"%.*s\"\n", rlevel*2-2, SP,
792 offsets[1] - offsets[0], (char *)current_subject));
793 }
794 if ((md->moptions & PCRE_DFA_SHORTEST) != 0)
795 {
796 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
797 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel,
798 match_count, rlevel*2-2, SP));
799 return match_count;
800 }
801 }
802 }
803 break;
804
805 /* ========================================================================== */
806 /* These opcodes add to the current list of states without looking
807 at the current character. */
808
809 /*-----------------------------------------------------------------*/
810 case OP_ALT:
811 do { code += GET(code, 1); } while (*code == OP_ALT);
812 ADD_ACTIVE((int)(code - start_code), 0);
813 break;
814
815 /*-----------------------------------------------------------------*/
816 case OP_BRA:
817 case OP_SBRA:
818 do
819 {
820 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
821 code += GET(code, 1);
822 }
823 while (*code == OP_ALT);
824 break;
825
826 /*-----------------------------------------------------------------*/
827 case OP_CBRA:
828 case OP_SCBRA:
829 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE + IMM2_SIZE), 0);
830 code += GET(code, 1);
831 while (*code == OP_ALT)
832 {
833 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
834 code += GET(code, 1);
835 }
836 break;
837
838 /*-----------------------------------------------------------------*/
839 case OP_BRAZERO:
840 case OP_BRAMINZERO:
841 ADD_ACTIVE(state_offset + 1, 0);
842 code += 1 + GET(code, 2);
843 while (*code == OP_ALT) code += GET(code, 1);
844 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
845 break;
846
847 /*-----------------------------------------------------------------*/
848 case OP_SKIPZERO:
849 code += 1 + GET(code, 2);
850 while (*code == OP_ALT) code += GET(code, 1);
851 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
852 break;
853
854 /*-----------------------------------------------------------------*/
855 case OP_CIRC:
856 if (ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0)
857 { ADD_ACTIVE(state_offset + 1, 0); }
858 break;
859
860 /*-----------------------------------------------------------------*/
861 case OP_CIRCM:
862 if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
863 (ptr != end_subject && WAS_NEWLINE(ptr)))
864 { ADD_ACTIVE(state_offset + 1, 0); }
865 break;
866
867 /*-----------------------------------------------------------------*/
868 case OP_EOD:
869 if (ptr >= end_subject)
870 {
871 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
872 could_continue = TRUE;
873 else { ADD_ACTIVE(state_offset + 1, 0); }
874 }
875 break;
876
877 /*-----------------------------------------------------------------*/
878 case OP_SOD:
879 if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
880 break;
881
882 /*-----------------------------------------------------------------*/
883 case OP_SOM:
884 if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
885 break;
886
887
888 /* ========================================================================== */
889 /* These opcodes inspect the next subject character, and sometimes
890 the previous one as well, but do not have an argument. The variable
891 clen contains the length of the current character and is zero if we are
892 at the end of the subject. */
893
894 /*-----------------------------------------------------------------*/
895 case OP_ANY:
896 if (clen > 0 && !IS_NEWLINE(ptr))
897 {
898 if (ptr + 1 >= md->end_subject &&
899 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
900 NLBLOCK->nltype == NLTYPE_FIXED &&
901 NLBLOCK->nllen == 2 &&
902 c == NLBLOCK->nl[0])
903 {
904 could_continue = partial_newline = TRUE;
905 }
906 else
907 {
908 ADD_NEW(state_offset + 1, 0);
909 }
910 }
911 break;
912
913 /*-----------------------------------------------------------------*/
914 case OP_ALLANY:
915 if (clen > 0)
916 { ADD_NEW(state_offset + 1, 0); }
917 break;
918
919 /*-----------------------------------------------------------------*/
920 case OP_EODN:
921 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
922 could_continue = TRUE;
923 else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
924 { ADD_ACTIVE(state_offset + 1, 0); }
925 break;
926
927 /*-----------------------------------------------------------------*/
928 case OP_DOLL:
929 if ((md->moptions & PCRE_NOTEOL) == 0)
930 {
931 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
932 could_continue = TRUE;
933 else if (clen == 0 ||
934 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
935 (ptr == end_subject - md->nllen)
936 ))
937 { ADD_ACTIVE(state_offset + 1, 0); }
938 else if (ptr + 1 >= md->end_subject &&
939 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
940 NLBLOCK->nltype == NLTYPE_FIXED &&
941 NLBLOCK->nllen == 2 &&
942 c == NLBLOCK->nl[0])
943 {
944 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
945 {
946 reset_could_continue = TRUE;
947 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
948 }
949 else could_continue = partial_newline = TRUE;
950 }
951 }
952 break;
953
954 /*-----------------------------------------------------------------*/
955 case OP_DOLLM:
956 if ((md->moptions & PCRE_NOTEOL) == 0)
957 {
958 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
959 could_continue = TRUE;
960 else if (clen == 0 ||
961 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
962 { ADD_ACTIVE(state_offset + 1, 0); }
963 else if (ptr + 1 >= md->end_subject &&
964 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
965 NLBLOCK->nltype == NLTYPE_FIXED &&
966 NLBLOCK->nllen == 2 &&
967 c == NLBLOCK->nl[0])
968 {
969 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
970 {
971 reset_could_continue = TRUE;
972 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
973 }
974 else could_continue = partial_newline = TRUE;
975 }
976 }
977 else if (IS_NEWLINE(ptr))
978 { ADD_ACTIVE(state_offset + 1, 0); }
979 break;
980
981 /*-----------------------------------------------------------------*/
982
983 case OP_DIGIT:
984 case OP_WHITESPACE:
985 case OP_WORDCHAR:
986 if (clen > 0 && c < 256 &&
987 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
988 { ADD_NEW(state_offset + 1, 0); }
989 break;
990
991 /*-----------------------------------------------------------------*/
992 case OP_NOT_DIGIT:
993 case OP_NOT_WHITESPACE:
994 case OP_NOT_WORDCHAR:
995 if (clen > 0 && (c >= 256 ||
996 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
997 { ADD_NEW(state_offset + 1, 0); }
998 break;
999
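/* The positive and negated class opcodes above share one expression,
((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0. The sketch
below shows one way a mask table and an XOR table can give that behaviour.
The class bits, the X_* opcode numbers, the table contents, and the names
class_bits() and class_matches() are assumptions made for illustration; they
are not PCRE's actual values or functions. */

```c
#include <ctype.h>

/* Hypothetical class bits and opcode numbers; PCRE's real values differ. */
enum { CT_DIGIT = 0x01, CT_SPACE = 0x02, CT_WORD = 0x04 };
enum { X_NOT_DIGIT, X_DIGIT, X_NOT_SPACE, X_SPACE, X_NOT_WORD, X_WORD, X_COUNT };

/* Class bit selected by each opcode. */
static const unsigned char mask_table[X_COUNT] = {
  CT_DIGIT, CT_DIGIT, CT_SPACE, CT_SPACE, CT_WORD, CT_WORD
};

/* Value XORed into the masked bit: the mask itself for negated opcodes
   (which inverts the result) and zero for positive ones. */
static const unsigned char flip_table[X_COUNT] = {
  CT_DIGIT, 0, CT_SPACE, 0, CT_WORD, 0
};

/* Stand-in for a ctypes[] lookup, built from <ctype.h>; valid for c < 256. */
static unsigned char class_bits(unsigned int c)
{
  unsigned char bits = 0;
  if (isdigit((int)c)) bits |= CT_DIGIT;
  if (isspace((int)c)) bits |= CT_SPACE;
  if (isalnum((int)c) || c == '_') bits |= CT_WORD;
  return bits;
}

/* Nonzero if character c satisfies opcode op. Characters of 256 or more
   have no table entry, so only the negated opcodes accept them. */
int class_matches(unsigned int c, int op)
{
  if (c >= 256) return flip_table[op] != 0;
  return ((class_bits(c) & mask_table[op]) ^ flip_table[op]) != 0;
}
```

/* With these tables a single expression serves both polarities, so the
positive and negated opcodes need no separate branches. */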
1000 /*-----------------------------------------------------------------*/
1001 case OP_WORD_BOUNDARY:
1002 case OP_NOT_WORD_BOUNDARY:
1003 {
1004 int left_word, right_word;
1005
1006 if (ptr > start_subject)
1007 {
1008 const pcre_uchar *temp = ptr - 1;
1009 if (temp < md->start_used_ptr) md->start_used_ptr = temp;
1010 #ifdef SUPPORT_UTF
1011 if (utf) { BACKCHAR(temp); }
1012 #endif
1013 GETCHARTEST(d, temp);
1014 #ifdef SUPPORT_UCP
1015 if ((md->poptions & PCRE_UCP) != 0)
1016 {
1017 if (d == '_') left_word = TRUE; else
1018 {
1019 int cat = UCD_CATEGORY(d);
1020 left_word = (cat == ucp_L || cat == ucp_N);
1021 }
1022 }
1023 else
1024 #endif
1025 left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
1026 }
1027 else left_word = FALSE;
1028
1029 if (clen > 0)
1030 {
1031 #ifdef SUPPORT_UCP
1032 if ((md->poptions & PCRE_UCP) != 0)
1033 {
1034 if (c == '_') right_word = TRUE; else
1035 {
1036 int cat = UCD_CATEGORY(c);
1037 right_word = (cat == ucp_L || cat == ucp_N);
1038 }
1039 }
1040 else
1041 #endif
1042 right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
1043 }
1044 else right_word = FALSE;
1045
1046 if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
1047 { ADD_ACTIVE(state_offset + 1, 0); }
1048 }
1049 break;
1050
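/* The word-boundary case above reduces to comparing two flags: whether the
previous character is a word character and whether the current one is. A
minimal sketch of that comparison for a plain 8-bit subject follows. The
function names and the use of <ctype.h> in place of the ctypes[] table are
assumptions for illustration, and the Unicode property path is left out. */

```c
#include <ctype.h>
#include <stddef.h>

/* Returns exactly 0 or 1 so that the two flags can be compared directly. */
static int is_word(unsigned char ch)
{
  return isalnum(ch) || ch == '_';
}

/* Nonzero if the assertion holds at offset pos in the subject s of length
   len. negate == 0 tests \b (a boundary); negate != 0 tests \B. */
int word_boundary_holds(const char *s, size_t len, size_t pos, int negate)
{
  /* Nothing before the start or after the end counts as a word character. */
  int left_word  = pos > 0   && is_word((unsigned char)s[pos - 1]);
  int right_word = pos < len && is_word((unsigned char)s[pos]);

  /* Equal flags mean "not a boundary", which is what \B asks for. */
  return (left_word == right_word) == (negate != 0);
}
```

/* For the subject "ab cd", offsets 0, 2, 3 and 5 are boundaries and offsets
1 and 4 are not. */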
1051
1052 /*-----------------------------------------------------------------*/
1053 /* Check the next character by Unicode property. We will get here only
1054 if the support is in the binary; otherwise a compile-time error occurs.
1055 */
1056
1057 #ifdef SUPPORT_UCP
1058 case OP_PROP:
1059 case OP_NOTPROP:
1060 if (clen > 0)
1061 {
1062 BOOL OK;
1063 const ucd_record * prop = GET_UCD(c);
1064 switch(code[1])
1065 {
1066 case PT_ANY:
1067 OK = TRUE;
1068 break;
1069
1070 case PT_LAMP:
1071 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1072 prop->chartype == ucp_Lt;
1073 break;
1074
1075 case PT_GC:
1076 OK = PRIV(ucp_gentype)[prop->chartype] == code[2];
1077 break;
1078
1079 case PT_PC:
1080 OK = prop->chartype == code[2];
1081 break;
1082
1083 case PT_SC:
1084 OK = prop->script == code[2];
1085 break;
1086
1087 /* These are specials for combination cases. */
1088
1089 case PT_ALNUM:
1090 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1091 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1092 break;
1093
1094 case PT_SPACE: /* Perl space */
1095 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1096 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1097 break;
1098
1099 case PT_PXSPACE: /* POSIX space */
1100 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1101 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1102 c == CHAR_FF || c == CHAR_CR;
1103 break;
1104
1105 case PT_WORD:
1106 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1107 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1108 c == CHAR_UNDERSCORE;
1109 break;
1110
1111 /* Should never occur, but keep compilers from grumbling. */
1112
1113 default:
1114 OK = codevalue != OP_PROP;
1115 break;
1116 }
1117
1118 if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1119 }
1120 break;
1121 #endif
1122
1123
1124
1125 /* ========================================================================== */
1126 /* These opcodes likewise inspect the subject character, but have an
1127 argument that is not a data character. It is one of these opcodes:
1128 OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_WHITESPACE,
1129 OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
1130
1131 case OP_TYPEPLUS:
1132 case OP_TYPEMINPLUS:
1133 case OP_TYPEPOSPLUS:
1134 count = current_state->count; /* Already matched */
1135 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1136 if (clen > 0)
1137 {
1138 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1139 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1140 NLBLOCK->nltype == NLTYPE_FIXED &&
1141 NLBLOCK->nllen == 2 &&
1142 c == NLBLOCK->nl[0])
1143 {
1144 could_continue = partial_newline = TRUE;
1145 }
1146 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1147 (c < 256 &&
1148 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1149 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1150 {
1151 if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1152 {
1153 active_count--; /* Remove non-match possibility */
1154 next_active_state--;
1155 }
1156 count++;
1157 ADD_NEW(state_offset, count);
1158 }
1159 }
1160 break;
1161
1162 /*-----------------------------------------------------------------*/
1163 case OP_TYPEQUERY:
1164 case OP_TYPEMINQUERY:
1165 case OP_TYPEPOSQUERY:
1166 ADD_ACTIVE(state_offset + 2, 0);
1167 if (clen > 0)
1168 {
1169 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1170 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1171 NLBLOCK->nltype == NLTYPE_FIXED &&
1172 NLBLOCK->nllen == 2 &&
1173 c == NLBLOCK->nl[0])
1174 {
1175 could_continue = partial_newline = TRUE;
1176 }
1177 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1178 (c < 256 &&
1179 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1180 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1181 {
1182 if (codevalue == OP_TYPEPOSQUERY)
1183 {
1184 active_count--; /* Remove non-match possibility */
1185 next_active_state--;
1186 }
1187 ADD_NEW(state_offset + 2, 0);
1188 }
1189 }
1190 break;
1191
1192 /*-----------------------------------------------------------------*/
1193 case OP_TYPESTAR:
1194 case OP_TYPEMINSTAR:
1195 case OP_TYPEPOSSTAR:
1196 ADD_ACTIVE(state_offset + 2, 0);
1197 if (clen > 0)
1198 {
1199 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1200 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1201 NLBLOCK->nltype == NLTYPE_FIXED &&
1202 NLBLOCK->nllen == 2 &&
1203 c == NLBLOCK->nl[0])
1204 {
1205 could_continue = partial_newline = TRUE;
1206 }
1207 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1208 (c < 256 &&
1209 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1210 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1211 {
1212 if (codevalue == OP_TYPEPOSSTAR)
1213 {
1214 active_count--; /* Remove non-match possibility */
1215 next_active_state--;
1216 }
1217 ADD_NEW(state_offset, 0);
1218 }
1219 }
1220 break;
1221
1222 /*-----------------------------------------------------------------*/
1223 case OP_TYPEEXACT:
1224 count = current_state->count; /* Number already matched */
1225 if (clen > 0)
1226 {
1227 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1228 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1229 NLBLOCK->nltype == NLTYPE_FIXED &&
1230 NLBLOCK->nllen == 2 &&
1231 c == NLBLOCK->nl[0])
1232 {
1233 could_continue = partial_newline = TRUE;
1234 }
1235 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1236 (c < 256 &&
1237 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1238 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1239 {
1240 if (++count >= GET2(code, 1))
1241 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 1, 0); }
1242 else
1243 { ADD_NEW(state_offset, count); }
1244 }
1245 }
1246 break;
1247
1248 /*-----------------------------------------------------------------*/
1249 case OP_TYPEUPTO:
1250 case OP_TYPEMINUPTO:
1251 case OP_TYPEPOSUPTO:
1252 ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0);
1253 count = current_state->count; /* Number already matched */
1254 if (clen > 0)
1255 {
1256 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1257 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1258 NLBLOCK->nltype == NLTYPE_FIXED &&
1259 NLBLOCK->nllen == 2 &&
1260 c == NLBLOCK->nl[0])
1261 {
1262 could_continue = partial_newline = TRUE;
1263 }
1264 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1265 (c < 256 &&
1266 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1267 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1268 {
1269 if (codevalue == OP_TYPEPOSUPTO)
1270 {
1271 active_count--; /* Remove non-match possibility */
1272 next_active_state--;
1273 }
1274 if (++count >= GET2(code, 1))
1275 { ADD_NEW(state_offset + 2 + IMM2_SIZE, 0); }
1276 else
1277 { ADD_NEW(state_offset, count); }
1278 }
1279 }
1280 break;
1281
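/* The counted repeats above (OP_TYPEEXACT and the OP_TYPEUPTO variants) keep
the number of repeats matched so far in the state's count field. Each matching
character either re-adds the same state with a larger count or, once the limit
is reached, moves on to the following opcode. A minimal sketch of that step
follows. The rep_state type, advance_repeat() and its parameters are
hypothetical names for illustration and do not appear in PCRE. */

```c
/* Hypothetical state record: offset of the opcode in the compiled pattern
   and the number of repeats already matched. */
typedef struct { int offset; int count; } rep_state;

/* Compute the state to add after one more matching character. max is the
   repeat limit stored with the opcode and next_offset is the offset of the
   opcode that follows the repeated item. */
rep_state advance_repeat(rep_state cur, int max, int next_offset)
{
  rep_state next;
  if (cur.count + 1 >= max)
  {
    next.offset = next_offset;   /* Limit reached: leave the repeat */
    next.count = 0;
  }
  else
  {
    next.offset = cur.offset;    /* Stay on the repeat opcode */
    next.count = cur.count + 1;
  }
  return next;
}
```

/* Starting from offset 10 with a limit of 3, three matching characters give
the states (10,1), (10,2) and then (next_offset,0). */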
1282 /* ========================================================================== */
1283 /* These are virtual opcodes that are used when something like
1284 OP_TYPEPLUS has OP_PROP, OP_NOTPROP, OP_ANYNL, OP_EXTUNI, OP_HSPACE,
1285 OP_NOT_HSPACE, OP_VSPACE, or OP_NOT_VSPACE as its argument. Handling them
1286 separately keeps the code above fast for the other cases. The argument is in d. */
1287
1288 #ifdef SUPPORT_UCP
1289 case OP_PROP_EXTRA + OP_TYPEPLUS:
1290 case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1291 case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1292 count = current_state->count; /* Already matched */
1293 if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1294 if (clen > 0)
1295 {
1296 BOOL OK;
1297 const ucd_record * prop = GET_UCD(c);
1298 switch(code[2])
1299 {
1300 case PT_ANY:
1301 OK = TRUE;
1302 break;
1303
1304 case PT_LAMP:
1305 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1306 prop->chartype == ucp_Lt;
1307 break;
1308
1309 case PT_GC:
1310 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1311 break;
1312
1313 case PT_PC:
1314 OK = prop->chartype == code[3];
1315 break;
1316
1317 case PT_SC:
1318 OK = prop->script == code[3];
1319 break;
1320
1321 /* These are specials for combination cases. */
1322
1323 case PT_ALNUM:
1324 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1325 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1326 break;
1327
1328 case PT_SPACE: /* Perl space */
1329 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1330 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1331 break;
1332
1333 case PT_PXSPACE: /* POSIX space */
1334 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1335 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1336 c == CHAR_FF || c == CHAR_CR;
1337 break;
1338
1339 case PT_WORD:
1340 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1341 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1342 c == CHAR_UNDERSCORE;
1343 break;
1344
1345 /* Should never occur, but keep compilers from grumbling. */
1346
1347 default:
1348 OK = d != OP_PROP;
1349 break;
1350 }
1351
1352 if (OK == (d == OP_PROP))
1353 {
1354 if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1355 {
1356 active_count--; /* Remove non-match possibility */
1357 next_active_state--;
1358 }
1359 count++;
1360 ADD_NEW(state_offset, count);
1361 }
1362 }
1363 break;
1364
1365 /*-----------------------------------------------------------------*/
1366 case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1367 case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1368 case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1369 count = current_state->count; /* Already matched */
1370 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1371 if (clen > 0)
1372 {
1373 int lgb, rgb;
1374 const pcre_uchar *nptr = ptr + clen;
1375 int ncount = 0;
1376 if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1377 {
1378 active_count--; /* Remove non-match possibility */
1379 next_active_state--;
1380 }
1381 lgb = UCD_GRAPHBREAK(c);
1382 while (nptr < end_subject)
1383 {
1384 dlen = 1;
1385 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1386 rgb = UCD_GRAPHBREAK(d);
1387 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1388 ncount++;
1389 lgb = rgb;
1390 nptr += dlen;
1391 }
1392 count++;
1393 ADD_NEW_DATA(-state_offset, count, ncount);
1394 }
1395 break;
1396 #endif
1397
1398 /*-----------------------------------------------------------------*/
1399 case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1400 case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1401 case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1402 count = current_state->count; /* Already matched */
1403 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1404 if (clen > 0)
1405 {
1406 int ncount = 0;
1407 switch (c)
1408 {
1409 case CHAR_VT:
1410 case CHAR_FF:
1411 case CHAR_NEL:
1412 #ifndef EBCDIC
1413 case 0x2028:
1414 case 0x2029:
1415 #endif /* Not EBCDIC */
1416 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1417 goto ANYNL01;
1418
1419 case CHAR_CR:
1420 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1421 /* Fall through */
1422
1423 ANYNL01:
1424 case CHAR_LF:
1425 if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1426 {
1427 active_count--; /* Remove non-match possibility */
1428 next_active_state--;
1429 }
1430 count++;
1431 ADD_NEW_DATA(-state_offset, count, ncount);
1432 break;
1433
1434 default:
1435 break;
1436 }
1437 }
1438 break;
1439
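/* The OP_ANYNL case above accepts any newline sequence and uses ncount to
skip the LF of a CR LF pair, so the pair counts as one repeat. A minimal
sketch of the same decision for an 8-bit, non-UTF, ASCII-based subject
follows. The function name and the bsr_anycrlf parameter are assumptions for
illustration; the sketch returns the total length matched rather than the
extra count, and U+2028 and U+2029 are omitted because they cannot occur in
such a subject. */

```c
#include <stddef.h>

/* Returns the number of code units matched by a newline sequence at s[pos],
   or 0 if there is none. A nonzero bsr_anycrlf limits the match to CR, LF
   and CR LF, in the manner of PCRE_BSR_ANYCRLF. */
size_t match_newline_seq(const unsigned char *s, size_t len, size_t pos,
                         int bsr_anycrlf)
{
  if (pos >= len) return 0;
  switch (s[pos])
  {
    case 0x0b:                 /* VT */
    case 0x0c:                 /* FF */
    case 0x85:                 /* NEL in an ASCII-based code page */
      return bsr_anycrlf ? 0 : 1;

    case 0x0d:                 /* CR, possibly followed by LF */
      return (pos + 1 < len && s[pos + 1] == 0x0a) ? 2 : 1;

    case 0x0a:                 /* LF */
      return 1;

    default:
      return 0;
  }
}
```

/* For the subject "\r\nX" this gives 2 at offset 0, 1 at offset 1 and 0 at
offset 2. */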
1440 /*-----------------------------------------------------------------*/
1441 case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1442 case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1443 case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1444 count = current_state->count; /* Already matched */
1445 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1446 if (clen > 0)
1447 {
1448 BOOL OK;
1449 switch (c)
1450 {
1451 case CHAR_LF:
1452 case CHAR_VT:
1453 case CHAR_FF:
1454 case CHAR_CR:
1455 case CHAR_NEL:
1456 #ifndef EBCDIC
1457 case 0x2028:
1458 case 0x2029:
1459 #endif /* Not EBCDIC */
1460 OK = TRUE;
1461 break;
1462
1463 default:
1464 OK = FALSE;
1465 break;
1466 }
1467
1468 if (OK == (d == OP_VSPACE))
1469 {
1470 if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1471 {
1472 active_count--; /* Remove non-match possibility */
1473 next_active_state--;
1474 }
1475 count++;
1476 ADD_NEW_DATA(-state_offset, count, 0);
1477 }
1478 }
1479 break;
1480
1481 /*-----------------------------------------------------------------*/
1482 case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1483 case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1484 case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1485 count = current_state->count; /* Already matched */
1486 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1487 if (clen > 0)
1488 {
1489 BOOL OK;
1490 switch (c)
1491 {
1492 case CHAR_HT:
1493 case CHAR_SPACE:
1494 #ifndef EBCDIC
1495 case 0xa0: /* NBSP */
1496 case 0x1680: /* OGHAM SPACE MARK */
1497 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1498 case 0x2000: /* EN QUAD */
1499 case 0x2001: /* EM QUAD */
1500 case 0x2002: /* EN SPACE */
1501 case 0x2003: /* EM SPACE */
1502 case 0x2004: /* THREE-PER-EM SPACE */
1503 case 0x2005: /* FOUR-PER-EM SPACE */
1504 case 0x2006: /* SIX-PER-EM SPACE */
1505 case 0x2007: /* FIGURE SPACE */
1506 case 0x2008: /* PUNCTUATION SPACE */
1507 case 0x2009: /* THIN SPACE */
1508 case 0x200A: /* HAIR SPACE */
1509 case 0x202f: /* NARROW NO-BREAK SPACE */
1510 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1511 case 0x3000: /* IDEOGRAPHIC SPACE */
1512 #endif /* Not EBCDIC */
1513 OK = TRUE;
1514 break;
1515
1516 default:
1517 OK = FALSE;
1518 break;
1519 }
1520
1521 if (OK == (d == OP_HSPACE))
1522 {
1523 if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1524 {
1525 active_count--; /* Remove non-match possibility */
1526 next_active_state--;
1527 }
1528 count++;
1529 ADD_NEW_DATA(-state_offset, count, 0);
1530 }
1531 }
1532 break;
1533
1534 /*-----------------------------------------------------------------*/
1535 #ifdef SUPPORT_UCP
1536 case OP_PROP_EXTRA + OP_TYPEQUERY:
1537 case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1538 case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
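1539 count = 4; /* Offset added to state_offset on a match: skip the item */
1540 goto QS1;
1541
1542 case OP_PROP_EXTRA + OP_TYPESTAR:
1543 case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1544 case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1545 count = 0; /* Offset of zero on a match: stay in this state to repeat */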
1546
1547 QS1:
1548
1549 ADD_ACTIVE(state_offset + 4, 0);
1550 if (clen > 0)
1551 {
1552 BOOL OK;
1553 const ucd_record * prop = GET_UCD(c);
1554 switch(code[2])
1555 {
1556 case PT_ANY:
1557 OK = TRUE;
1558 break;
1559
1560 case PT_LAMP:
1561 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1562 prop->chartype == ucp_Lt;
1563 break;
1564
1565 case PT_GC:
1566 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1567 break;
1568
1569 case PT_PC:
1570 OK = prop->chartype == code[3];
1571 break;
1572
1573 case PT_SC:
1574 OK = prop->script == code[3];
1575 break;
1576
1577 /* These are specials for combination cases. */
1578
1579 case PT_ALNUM:
1580 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1581 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1582 break;
1583
1584 case PT_SPACE: /* Perl space */
1585 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1586 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1587 break;
1588
1589 case PT_PXSPACE: /* POSIX space */
1590 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1591 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1592 c == CHAR_FF || c == CHAR_CR;
1593 break;
1594
1595 case PT_WORD:
1596 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1597 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1598 c == CHAR_UNDERSCORE;
1599 break;
1600
1601 /* Should never occur, but keep compilers from grumbling. */
1602
1603 default:
1604 OK = d != OP_PROP;
1605 break;
1606 }
1607
1608 if (OK == (d == OP_PROP))
1609 {
1610 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1611 codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1612 {
1613 active_count--; /* Remove non-match possibility */
1614 next_active_state--;
1615 }
1616 ADD_NEW(state_offset + count, 0);
1617 }
1618 }
1619 break;
1620
1621 /*-----------------------------------------------------------------*/
1622 case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1623 case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1624 case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1625 count = 2;
1626 goto QS2;
1627
1628 case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1629 case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1630 case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1631 count = 0;
1632
1633 QS2:
1634
1635 ADD_ACTIVE(state_offset + 2, 0);
1636 if (clen > 0)
1637 {
1638 int lgb, rgb;
1639 const pcre_uchar *nptr = ptr + clen;
1640 int ncount = 0;
1641 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1642 codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1643 {
1644 active_count--; /* Remove non-match possibility */
1645 next_active_state--;
1646 }
1647 lgb = UCD_GRAPHBREAK(c);
1648 while (nptr < end_subject)
1649 {
1650 dlen = 1;
1651 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1652 rgb = UCD_GRAPHBREAK(d);
1653 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1654 ncount++;
1655 lgb = rgb;
1656 nptr += dlen;
1657 }
1658 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1659 }
1660 break;
1661 #endif
1662
1663 /*-----------------------------------------------------------------*/
1664 case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1665 case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1666 case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1667 count = 2;
1668 goto QS3;
1669
1670 case OP_ANYNL_EXTRA + OP_TYPESTAR:
1671 case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1672 case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1673 count = 0;
1674
1675 QS3:
1676 ADD_ACTIVE(state_offset + 2, 0);
1677 if (clen > 0)
1678 {
1679 int ncount = 0;
1680 switch (c)
1681 {
1682 case CHAR_VT:
1683 case CHAR_FF:
1684 case CHAR_NEL:
1685 #ifndef EBCDIC
1686 case 0x2028:
1687 case 0x2029:
1688 #endif /* Not EBCDIC */
1689 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1690 goto ANYNL02;
1691
1692 case CHAR_CR:
1693 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1694 /* Fall through */
1695
1696 ANYNL02:
1697 case CHAR_LF:
1698 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1699 codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1700 {
1701 active_count--; /* Remove non-match possibility */
1702 next_active_state--;
1703 }
1704 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1705 break;
1706
1707 default:
1708 break;
1709 }
1710 }
1711 break;
1712
1713 /*-----------------------------------------------------------------*/
1714 case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1715 case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1716 case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1717 count = 2;
1718 goto QS4;
1719
1720 case OP_VSPACE_EXTRA + OP_TYPESTAR:
1721 case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1722 case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1723 count = 0;
1724
1725 QS4:
1726 ADD_ACTIVE(state_offset + 2, 0);
1727 if (clen > 0)
1728 {
1729 BOOL OK;
1730 switch (c)
1731 {
1732 case CHAR_LF:
1733 case CHAR_VT:
1734 case CHAR_FF:
1735 case CHAR_CR:
1736 case CHAR_NEL:
1737 #ifndef EBCDIC
1738 case 0x2028:
1739 case 0x2029:
1740 #endif /* Not EBCDIC */
1741 OK = TRUE;
1742 break;
1743
1744 default:
1745 OK = FALSE;
1746 break;
1747 }
1748 if (OK == (d == OP_VSPACE))
1749 {
1750 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1751 codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1752 {
1753 active_count--; /* Remove non-match possibility */
1754 next_active_state--;
1755 }
1756 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1757 }
1758 }
1759 break;
1760
1761 /*-----------------------------------------------------------------*/
1762 case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1763 case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1764 case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1765 count = 2;
1766 goto QS5;
1767
1768 case OP_HSPACE_EXTRA + OP_TYPESTAR:
1769 case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1770 case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1771 count = 0;
1772
1773 QS5:
1774 ADD_ACTIVE(state_offset + 2, 0);
1775 if (clen > 0)
1776 {
1777 BOOL OK;
1778 switch (c)
1779 {
1780 case CHAR_HT:
1781 case CHAR_SPACE:
1782 #ifndef EBCDIC
1783 case 0xa0: /* NBSP */
1784 case 0x1680: /* OGHAM SPACE MARK */
1785 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1786 case 0x2000: /* EN QUAD */
1787 case 0x2001: /* EM QUAD */
1788 case 0x2002: /* EN SPACE */
1789 case 0x2003: /* EM SPACE */
1790 case 0x2004: /* THREE-PER-EM SPACE */
1791 case 0x2005: /* FOUR-PER-EM SPACE */
1792 case 0x2006: /* SIX-PER-EM SPACE */
1793 case 0x2007: /* FIGURE SPACE */
1794 case 0x2008: /* PUNCTUATION SPACE */
1795 case 0x2009: /* THIN SPACE */
1796 case 0x200A: /* HAIR SPACE */
1797 case 0x202f: /* NARROW NO-BREAK SPACE */
1798 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1799 case 0x3000: /* IDEOGRAPHIC SPACE */
1800 #endif /* Not EBCDIC */
1801 OK = TRUE;
1802 break;
1803
1804 default:
1805 OK = FALSE;
1806 break;
1807 }
1808
1809 if (OK == (d == OP_HSPACE))
1810 {
1811 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1812 codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1813 {
1814 active_count--; /* Remove non-match possibility */
1815 next_active_state--;
1816 }
1817 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1818 }
1819 }
1820 break;
1821
1822 /*-----------------------------------------------------------------*/
1823 #ifdef SUPPORT_UCP
1824 case OP_PROP_EXTRA + OP_TYPEEXACT:
1825 case OP_PROP_EXTRA + OP_TYPEUPTO:
1826 case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1827 case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1828 if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1829 { ADD_ACTIVE(state_offset + 1 + IMM2_SIZE + 3, 0); }
1830 count = current_state->count; /* Number already matched */
1831 if (clen > 0)
1832 {
1833 BOOL OK;
1834 const ucd_record * prop = GET_UCD(c);
1835 switch(code[1 + IMM2_SIZE + 1])
1836 {
1837 case PT_ANY:
1838 OK = TRUE;
1839 break;
1840
1841 case PT_LAMP:
1842 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1843 prop->chartype == ucp_Lt;
1844 break;
1845
1846 case PT_GC:
1847 OK = PRIV(ucp_gentype)[prop->chartype] == code[1 + IMM2_SIZE + 2];
1848 break;
1849
1850 case PT_PC:
1851 OK = prop->chartype == code[1 + IMM2_SIZE + 2];
1852 break;
1853
1854 case PT_SC:
1855 OK = prop->script == code[1 + IMM2_SIZE + 2];
1856 break;
1857
1858 /* These are specials for combination cases. */
1859
1860 case PT_ALNUM:
1861 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1862 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1863 break;
1864
1865 case PT_SPACE: /* Perl space */
1866 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1867 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1868 break;
1869
1870 case PT_PXSPACE: /* POSIX space */
1871 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1872 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1873 c == CHAR_FF || c == CHAR_CR;
1874 break;
1875
1876 case PT_WORD:
1877 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1878 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1879 c == CHAR_UNDERSCORE;
1880 break;
1881
1882 /* Should never occur, but keep compilers from grumbling. */
1883
1884 default:
1885 OK = d != OP_PROP;
1886 break;
1887 }
1888
1889 if (OK == (d == OP_PROP))
1890 {
1891 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
1892 {
1893 active_count--; /* Remove non-match possibility */
1894 next_active_state--;
1895 }
1896 if (++count >= GET2(code, 1))
1897 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 3, 0); }
1898 else
1899 { ADD_NEW(state_offset, count); }
1900 }
1901 }
1902 break;
1903
1904 /*-----------------------------------------------------------------*/
1905 case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
1906 case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
1907 case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
1908 case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
1909 if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
1910 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1911 count = current_state->count; /* Number already matched */
1912 if (clen > 0)
1913 {
1914 int lgb, rgb;
1915 const pcre_uchar *nptr = ptr + clen;
1916 int ncount = 0;
1917 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
1918 {
1919 active_count--; /* Remove non-match possibility */
1920 next_active_state--;
1921 }
1922 lgb = UCD_GRAPHBREAK(c);
1923 while (nptr < end_subject)
1924 {
1925 dlen = 1;
1926 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1927 rgb = UCD_GRAPHBREAK(d);
1928 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1929 ncount++;
1930 lgb = rgb;
1931 nptr += dlen;
1932 }
1933 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
1934 reset_could_continue = TRUE;
1935 if (++count >= GET2(code, 1))
1936 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1937 else
1938 { ADD_NEW_DATA(-state_offset, count, ncount); }
1939 }
1940 break;
1941 #endif
1942
1943 /*-----------------------------------------------------------------*/
1944 case OP_ANYNL_EXTRA + OP_TYPEEXACT:
1945 case OP_ANYNL_EXTRA + OP_TYPEUPTO:
1946 case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
1947 case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
1948 if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
1949 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1950 count = current_state->count; /* Number already matched */
1951 if (clen > 0)
1952 {
1953 int ncount = 0;
1954 switch (c)
1955 {
1956 case CHAR_VT:
1957 case CHAR_FF:
1958 case CHAR_NEL:
1959 #ifndef EBCDIC
1960 case 0x2028:
1961 case 0x2029:
1962 #endif /* Not EBCDIC */
1963 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1964 goto ANYNL03;
1965
1966 case CHAR_CR:
1967 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1968 /* Fall through */
1969
1970 ANYNL03:
1971 case CHAR_LF:
1972 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
1973 {
1974 active_count--; /* Remove non-match possibility */
1975 next_active_state--;
1976 }
1977 if (++count >= GET2(code, 1))
1978 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1979 else
1980 { ADD_NEW_DATA(-state_offset, count, ncount); }
1981 break;
1982
1983 default:
1984 break;
1985 }
1986 }
1987 break;
1988
1989 /*-----------------------------------------------------------------*/
1990 case OP_VSPACE_EXTRA + OP_TYPEEXACT:
1991 case OP_VSPACE_EXTRA + OP_TYPEUPTO:
1992 case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
1993 case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
1994 if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
1995 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1996 count = current_state->count; /* Number already matched */
1997 if (clen > 0)
1998 {
1999 BOOL OK;
2000 switch (c)
2001 {
2002 case CHAR_LF:
2003 case CHAR_VT:
2004 case CHAR_FF:
2005 case CHAR_CR:
2006 case CHAR_NEL:
2007 #ifndef EBCDIC
2008 case 0x2028:
2009 case 0x2029:
2010 #endif /* Not EBCDIC */
2011 OK = TRUE;
2012 break;
2013
2014 default:
2015 OK = FALSE;
2016 }
2017
2018 if (OK == (d == OP_VSPACE))
2019 {
2020 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
2021 {
2022 active_count--; /* Remove non-match possibility */
2023 next_active_state--;
2024 }
2025 if (++count >= GET2(code, 1))
2026 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2027 else
2028 { ADD_NEW_DATA(-state_offset, count, 0); }
2029 }
2030 }
2031 break;
2032
2033 /*-----------------------------------------------------------------*/
2034 case OP_HSPACE_EXTRA + OP_TYPEEXACT:
2035 case OP_HSPACE_EXTRA + OP_TYPEUPTO:
2036 case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
2037 case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
2038 if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
2039 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
2040 count = current_state->count; /* Number already matched */
2041 if (clen > 0)
2042 {
2043 BOOL OK;
2044 switch (c)
2045 {
2046 case CHAR_HT:
2047 case CHAR_SPACE:
2048 #ifndef EBCDIC
2049 case 0xa0: /* NBSP */
2050 case 0x1680: /* OGHAM SPACE MARK */
2051 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2052 case 0x2000: /* EN QUAD */
2053 case 0x2001: /* EM QUAD */
2054 case 0x2002: /* EN SPACE */
2055 case 0x2003: /* EM SPACE */
2056 case 0x2004: /* THREE-PER-EM SPACE */
2057 case 0x2005: /* FOUR-PER-EM SPACE */
2058 case 0x2006: /* SIX-PER-EM SPACE */
2059 case 0x2007: /* FIGURE SPACE */
2060 case 0x2008: /* PUNCTUATION SPACE */
2061 case 0x2009: /* THIN SPACE */
2062 case 0x200A: /* HAIR SPACE */
2063 case 0x202f: /* NARROW NO-BREAK SPACE */
2064 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2065 case 0x3000: /* IDEOGRAPHIC SPACE */
2066 #endif /* Not EBCDIC */
2067 OK = TRUE;
2068 break;
2069
2070 default:
2071 OK = FALSE;
2072 break;
2073 }
2074
2075 if (OK == (d == OP_HSPACE))
2076 {
2077 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
2078 {
2079 active_count--; /* Remove non-match possibility */
2080 next_active_state--;
2081 }
2082 if (++count >= GET2(code, 1))
2083 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2084 else
2085 { ADD_NEW_DATA(-state_offset, count, 0); }
2086 }
2087 }
2088 break;
2089
2090 /* ========================================================================== */
2091 /* These opcodes are followed by a character that is usually compared
2092 to the current subject character; it is loaded into d. We still get
2093 here even if there is no subject character, because in some cases zero
2094 repetitions are permitted. */
2095
2096 /*-----------------------------------------------------------------*/
2097 case OP_CHAR:
2098 if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
2099 break;
2100
2101 /*-----------------------------------------------------------------*/
2102 case OP_CHARI:
2103 if (clen == 0) break;
2104
2105 #ifdef SUPPORT_UTF
2106 if (utf)
2107 {
2108 if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
2109 {
2110 unsigned int othercase;
2111 if (c < 128)
2112 othercase = fcc[c];
2113 else
2114 /* If we have Unicode property support, we can use it to test the
2115 other case of the character. */
2116 #ifdef SUPPORT_UCP
2117 othercase = UCD_OTHERCASE(c);
2118 #else
2119 othercase = NOTACHAR;
2120 #endif
2121
2122 if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
2123 }
2124 }
2125 else
2126 #endif /* SUPPORT_UTF */
2127 /* Not UTF mode */
2128 {
2129 if (TABLE_GET(c, lcc, c) == TABLE_GET(d, lcc, d))
2130 { ADD_NEW(state_offset + 2, 0); }
2131 }
2132 break;
2133
2134
2135 #ifdef SUPPORT_UCP
2136 /*-----------------------------------------------------------------*/
2137 /* This is a tricky one because it can match more than one character.
2138 Find out how many characters to skip, and then set up a negative state
2139 to wait for them to pass before continuing. */
2140
2141 case OP_EXTUNI:
2142 if (clen > 0)
2143 {
2144 int lgb, rgb;
2145 const pcre_uchar *nptr = ptr + clen;
2146 int ncount = 0;
2147 lgb = UCD_GRAPHBREAK(c);
2148 while (nptr < end_subject)
2149 {
2150 dlen = 1;
2151 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
2152 rgb = UCD_GRAPHBREAK(d);
2153 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
2154 ncount++;
2155 lgb = rgb;
2156 nptr += dlen;
2157 }
2158 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
2159 reset_could_continue = TRUE;
2160 ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2161 }
2162 break;
2163 #endif
2164
2165 /*-----------------------------------------------------------------*/
2166 /* This is tricky, like EXTUNI, because it too can match more than one
2167 character (when CR is followed by LF). In this case, set up a negative
2168 state to wait for one character to pass before continuing. */
2169
2170 case OP_ANYNL:
2171 if (clen > 0) switch(c)
2172 {
2173 case CHAR_VT:
2174 case CHAR_FF:
2175 case CHAR_NEL:
2176 #ifndef EBCDIC
2177 case 0x2028:
2178 case 0x2029:
2179 #endif /* Not EBCDIC */
2180 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
2181
2182 case CHAR_LF:
2183 ADD_NEW(state_offset + 1, 0);
2184 break;
2185
2186 case CHAR_CR:
2187 if (ptr + 1 >= end_subject)
2188 {
2189 ADD_NEW(state_offset + 1, 0);
2190 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
2191 reset_could_continue = TRUE;
2192 }
2193 else if (ptr[1] == CHAR_LF)
2194 {
2195 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2196 }
2197 else
2198 {
2199 ADD_NEW(state_offset + 1, 0);
2200 }
2201 break;
2202 }
2203 break;
2204
2205 /*-----------------------------------------------------------------*/
2206 case OP_NOT_VSPACE:
2207 if (clen > 0) switch(c)
2208 {
2209 case CHAR_LF:
2210 case CHAR_VT:
2211 case CHAR_FF:
2212 case CHAR_CR:
2213 case CHAR_NEL:
2214 #ifndef EBCDIC
2215 case 0x2028:
2216 case 0x2029:
2217 #endif /* Not EBCDIC */
2218 break;
2219
2220 default:
2221 ADD_NEW(state_offset + 1, 0);
2222 break;
2223 }
2224 break;
2225
2226 /*-----------------------------------------------------------------*/
2227 case OP_VSPACE:
2228 if (clen > 0) switch(c)
2229 {
2230 case CHAR_LF:
2231 case CHAR_VT:
2232 case CHAR_FF:
2233 case CHAR_CR:
2234 case CHAR_NEL:
2235 #ifndef EBCDIC
2236 case 0x2028:
2237 case 0x2029:
2238 #endif /* Not EBCDIC */
2239 ADD_NEW(state_offset + 1, 0);
2240 break;
2241
2242 default: break;
2243 }
2244 break;
2245
2246 /*-----------------------------------------------------------------*/
2247 case OP_NOT_HSPACE:
2248 if (clen > 0) switch(c)
2249 {
2250 case CHAR_HT:
2251 case CHAR_SPACE:
2252 #ifndef EBCDIC
2253 case 0xa0: /* NBSP */
2254 case 0x1680: /* OGHAM SPACE MARK */
2255 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2256 case 0x2000: /* EN QUAD */
2257 case 0x2001: /* EM QUAD */
2258 case 0x2002: /* EN SPACE */
2259 case 0x2003: /* EM SPACE */
2260 case 0x2004: /* THREE-PER-EM SPACE */
2261 case 0x2005: /* FOUR-PER-EM SPACE */
2262 case 0x2006: /* SIX-PER-EM SPACE */
2263 case 0x2007: /* FIGURE SPACE */
2264 case 0x2008: /* PUNCTUATION SPACE */
2265 case 0x2009: /* THIN SPACE */
2266 case 0x200A: /* HAIR SPACE */
2267 case 0x202f: /* NARROW NO-BREAK SPACE */
2268 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2269 case 0x3000: /* IDEOGRAPHIC SPACE */
2270 #endif /* Not EBCDIC */
2271 break;
2272
2273 default:
2274 ADD_NEW(state_offset + 1, 0);
2275 break;
2276 }
2277 break;
2278
2279 /*-----------------------------------------------------------------*/
2280 case OP_HSPACE:
2281 if (clen > 0) switch(c)
2282 {
2283 case CHAR_HT:
2284 case CHAR_SPACE:
2285 #ifndef EBCDIC
2286 case 0xa0: /* NBSP */
2287 case 0x1680: /* OGHAM SPACE MARK */
2288 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2289 case 0x2000: /* EN QUAD */
2290 case 0x2001: /* EM QUAD */
2291 case 0x2002: /* EN SPACE */
2292 case 0x2003: /* EM SPACE */
2293 case 0x2004: /* THREE-PER-EM SPACE */
2294 case 0x2005: /* FOUR-PER-EM SPACE */
2295 case 0x2006: /* SIX-PER-EM SPACE */
2296 case 0x2007: /* FIGURE SPACE */
2297 case 0x2008: /* PUNCTUATION SPACE */
2298 case 0x2009: /* THIN SPACE */
2299 case 0x200A: /* HAIR SPACE */
2300 case 0x202f: /* NARROW NO-BREAK SPACE */
2301 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2302 case 0x3000: /* IDEOGRAPHIC SPACE */
2303 #endif /* Not EBCDIC */
2304 ADD_NEW(state_offset + 1, 0);
2305 break;
2306 }
2307 break;
2308
2309 /*-----------------------------------------------------------------*/
2310 /* Match a negated single character casefully. */
2311
2312 case OP_NOT:
2313 if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2314 break;
2315
2316 /*-----------------------------------------------------------------*/
2317 /* Match a negated single character caselessly. */
2318
2319 case OP_NOTI:
2320 if (clen > 0)
2321 {
2322 unsigned int otherd = NOTACHAR; /* Stays NOTACHAR if UTF is on but UCP is not compiled in */
2323 #ifdef SUPPORT_UTF
2324 if (utf && d >= 128)
2325 {
2326 #ifdef SUPPORT_UCP
2327 otherd = UCD_OTHERCASE(d);
2328 #endif /* SUPPORT_UCP */
2329 }
2330 else
2331 #endif /* SUPPORT_UTF */
2332 otherd = TABLE_GET(d, fcc, d);
2333 if (c != d && c != otherd)
2334 { ADD_NEW(state_offset + dlen + 1, 0); }
2335 }
2336 break;
2337
2338 /*-----------------------------------------------------------------*/
2339 case OP_PLUSI:
2340 case OP_MINPLUSI:
2341 case OP_POSPLUSI:
2342 case OP_NOTPLUSI:
2343 case OP_NOTMINPLUSI:
2344 case OP_NOTPOSPLUSI:
2345 caseless = TRUE;
2346 codevalue -= OP_STARI - OP_STAR;
2347
2348 /* Fall through */
2349 case OP_PLUS:
2350 case OP_MINPLUS:
2351 case OP_POSPLUS:
2352 case OP_NOTPLUS:
2353 case OP_NOTMINPLUS:
2354 case OP_NOTPOSPLUS:
2355 count = current_state->count; /* Already matched */
2356 if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2357 if (clen > 0)
2358 {
2359 unsigned int otherd = NOTACHAR;
2360 if (caseless)
2361 {
2362 #ifdef SUPPORT_UTF
2363 if (utf && d >= 128)
2364 {
2365 #ifdef SUPPORT_UCP
2366 otherd = UCD_OTHERCASE(d);
2367 #endif /* SUPPORT_UCP */
2368 }
2369 else
2370 #endif /* SUPPORT_UTF */
2371 otherd = TABLE_GET(d, fcc, d);
2372 }
2373 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2374 {
2375 if (count > 0 &&
2376 (codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2377 {
2378 active_count--; /* Remove non-match possibility */
2379 next_active_state--;
2380 }
2381 count++;
2382 ADD_NEW(state_offset, count);
2383 }
2384 }
2385 break;
2386
2387 /*-----------------------------------------------------------------*/
2388 case OP_QUERYI:
2389 case OP_MINQUERYI:
2390 case OP_POSQUERYI:
2391 case OP_NOTQUERYI:
2392 case OP_NOTMINQUERYI:
2393 case OP_NOTPOSQUERYI:
2394 caseless = TRUE;
2395 codevalue -= OP_STARI - OP_STAR;
2396 /* Fall through */
2397 case OP_QUERY:
2398 case OP_MINQUERY:
2399 case OP_POSQUERY:
2400 case OP_NOTQUERY:
2401 case OP_NOTMINQUERY:
2402 case OP_NOTPOSQUERY:
2403 ADD_ACTIVE(state_offset + dlen + 1, 0);
2404 if (clen > 0)
2405 {
2406 unsigned int otherd = NOTACHAR;
2407 if (caseless)
2408 {
2409 #ifdef SUPPORT_UTF
2410 if (utf && d >= 128)
2411 {
2412 #ifdef SUPPORT_UCP
2413 otherd = UCD_OTHERCASE(d);
2414 #endif /* SUPPORT_UCP */
2415 }
2416 else
2417 #endif /* SUPPORT_UTF */
2418 otherd = TABLE_GET(d, fcc, d);
2419 }
2420 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2421 {
2422 if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2423 {
2424 active_count--; /* Remove non-match possibility */
2425 next_active_state--;
2426 }
2427 ADD_NEW(state_offset + dlen + 1, 0);
2428 }
2429 }
2430 break;
2431
2432 /*-----------------------------------------------------------------*/
2433 case OP_STARI:
2434 case OP_MINSTARI:
2435 case OP_POSSTARI:
2436 case OP_NOTSTARI:
2437 case OP_NOTMINSTARI:
2438 case OP_NOTPOSSTARI:
2439 caseless = TRUE;
2440 codevalue -= OP_STARI - OP_STAR;
2441 /* Fall through */
2442 case OP_STAR:
2443 case OP_MINSTAR:
2444 case OP_POSSTAR:
2445 case OP_NOTSTAR:
2446 case OP_NOTMINSTAR:
2447 case OP_NOTPOSSTAR:
2448 ADD_ACTIVE(state_offset + dlen + 1, 0);
2449 if (clen > 0)
2450 {
2451 unsigned int otherd = NOTACHAR;
2452 if (caseless)
2453 {
2454 #ifdef SUPPORT_UTF
2455 if (utf && d >= 128)
2456 {
2457 #ifdef SUPPORT_UCP
2458 otherd = UCD_OTHERCASE(d);
2459 #endif /* SUPPORT_UCP */
2460 }
2461 else
2462 #endif /* SUPPORT_UTF */
2463 otherd = TABLE_GET(d, fcc, d);
2464 }
2465 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2466 {
2467 if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2468 {
2469 active_count--; /* Remove non-match possibility */
2470 next_active_state--;
2471 }
2472 ADD_NEW(state_offset, 0);
2473 }
2474 }
2475 break;
2476
2477 /*-----------------------------------------------------------------*/
2478 case OP_EXACTI:
2479 case OP_NOTEXACTI:
2480 caseless = TRUE;
2481 codevalue -= OP_STARI - OP_STAR;
2482 /* Fall through */
2483 case OP_EXACT:
2484 case OP_NOTEXACT:
2485 count = current_state->count; /* Number already matched */
2486 if (clen > 0)
2487 {
2488 unsigned int otherd = NOTACHAR;
2489 if (caseless)
2490 {
2491 #ifdef SUPPORT_UTF
2492 if (utf && d >= 128)
2493 {
2494 #ifdef SUPPORT_UCP
2495 otherd = UCD_OTHERCASE(d);
2496 #endif /* SUPPORT_UCP */
2497 }
2498 else
2499 #endif /* SUPPORT_UTF */
2500 otherd = TABLE_GET(d, fcc, d);
2501 }
2502 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2503 {
2504 if (++count >= GET2(code, 1))
2505 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2506 else
2507 { ADD_NEW(state_offset, count); }
2508 }
2509 }
2510 break;
2511
2512 /*-----------------------------------------------------------------*/
2513 case OP_UPTOI:
2514 case OP_MINUPTOI:
2515 case OP_POSUPTOI:
2516 case OP_NOTUPTOI:
2517 case OP_NOTMINUPTOI:
2518 case OP_NOTPOSUPTOI:
2519 caseless = TRUE;
2520 codevalue -= OP_STARI - OP_STAR;
2521 /* Fall through */
2522 case OP_UPTO:
2523 case OP_MINUPTO:
2524 case OP_POSUPTO:
2525 case OP_NOTUPTO:
2526 case OP_NOTMINUPTO:
2527 case OP_NOTPOSUPTO:
2528 ADD_ACTIVE(state_offset + dlen + 1 + IMM2_SIZE, 0);
2529 count = current_state->count; /* Number already matched */
2530 if (clen > 0)
2531 {
2532 unsigned int otherd = NOTACHAR;
2533 if (caseless)
2534 {
2535 #ifdef SUPPORT_UTF
2536 if (utf && d >= 128)
2537 {
2538 #ifdef SUPPORT_UCP
2539 otherd = UCD_OTHERCASE(d);
2540 #endif /* SUPPORT_UCP */
2541 }
2542 else
2543 #endif /* SUPPORT_UTF */
2544 otherd = TABLE_GET(d, fcc, d);
2545 }
2546 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2547 {
2548 if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2549 {
2550 active_count--; /* Remove non-match possibility */
2551 next_active_state--;
2552 }
2553 if (++count >= GET2(code, 1))
2554 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2555 else
2556 { ADD_NEW(state_offset, count); }
2557 }
2558 }
2559 break;
2560
2561
2562 /* ========================================================================== */
2563 /* These are the class-handling opcodes */
2564
2565 case OP_CLASS:
2566 case OP_NCLASS:
2567 case OP_XCLASS:
2568 {
2569 BOOL isinclass = FALSE;
2570 int next_state_offset;
2571 const pcre_uchar *ecode;
2572
2573 /* For a simple class, there is always just a 32-byte table, and we
2574 can set isinclass from it. */
2575
2576 if (codevalue != OP_XCLASS)
2577 {
2578 ecode = code + 1 + (32 / sizeof(pcre_uchar));
2579 if (clen > 0)
2580 {
2581 isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2582 ((((pcre_uint8 *)(code + 1))[c/8] & (1 << (c&7))) != 0);
2583 }
2584 }
2585
2586 /* An extended class may have a table or a list of single characters,
2587 ranges, or both, and it may be positive or negative. There's a
2588 function that sorts all this out. */
2589
2590 else
2591 {
2592 ecode = code + GET(code, 1);
2593 if (clen > 0) isinclass = PRIV(xclass)(c, code + 1 + LINK_SIZE, utf);
2594 }
2595
2596 /* At this point, isinclass is set for all kinds of class, and ecode
2597 points to the byte after the end of the class. If there is a
2598 quantifier, this is where it will be. */
2599
2600 next_state_offset = (int)(ecode - start_code);
2601
2602 switch (*ecode)
2603 {
2604 case OP_CRSTAR:
2605 case OP_CRMINSTAR:
2606 ADD_ACTIVE(next_state_offset + 1, 0);
2607 if (isinclass) { ADD_NEW(state_offset, 0); }
2608 break;
2609
2610 case OP_CRPLUS:
2611 case OP_CRMINPLUS:
2612 count = current_state->count; /* Already matched */
2613 if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2614 if (isinclass) { count++; ADD_NEW(state_offset, count); }
2615 break;
2616
2617 case OP_CRQUERY:
2618 case OP_CRMINQUERY:
2619 ADD_ACTIVE(next_state_offset + 1, 0);
2620 if (isinclass) { ADD_NEW(next_state_offset + 1, 0); }
2621 break;
2622
2623 case OP_CRRANGE:
2624 case OP_CRMINRANGE:
2625 count = current_state->count; /* Already matched */
2626 if (count >= GET2(ecode, 1))
2627 { ADD_ACTIVE(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2628 if (isinclass)
2629 {
2630 int max = GET2(ecode, 1 + IMM2_SIZE);
2631 if (++count >= max && max != 0) /* Max 0 => no limit */
2632 { ADD_NEW(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2633 else
2634 { ADD_NEW(state_offset, count); }
2635 }
2636 break;
2637
2638 default:
2639 if (isinclass) { ADD_NEW(next_state_offset, 0); }
2640 break;
2641 }
2642 }
2643 break;
2644
2645 /* ========================================================================== */
2646 /* These are the opcodes for fancy brackets of various kinds. We have
2647 to use recursion in order to handle them. The "always failing" assertion
2648 (?!) is optimised to OP_FAIL when compiling, so we have to support that,
2649 though the other "backtracking verbs" are not supported. */
2650
2651 case OP_FAIL:
2652 forced_fail++; /* Count FAILs for multiple states */
2653 break;
2654
2655 case OP_ASSERT:
2656 case OP_ASSERT_NOT:
2657 case OP_ASSERTBACK:
2658 case OP_ASSERTBACK_NOT:
2659 {
2660 int rc;
2661 int local_offsets[2];
2662 int local_workspace[1000];
2663 const pcre_uchar *endasscode = code + GET(code, 1);
2664
2665 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2666
2667 rc = internal_dfa_exec(
2668 md, /* static match data */
2669 code, /* this subexpression's code */
2670 ptr, /* where we currently are */
2671 (int)(ptr - start_subject), /* start offset */
2672 local_offsets, /* offset vector */
2673 sizeof(local_offsets)/sizeof(int), /* size of same */
2674 local_workspace, /* workspace vector */
2675 sizeof(local_workspace)/sizeof(int), /* size of same */
2676 rlevel); /* function recursion level */
2677
2678 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2679 if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2680 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2681 }
2682 break;
2683
2684 /*-----------------------------------------------------------------*/
2685 case OP_COND:
2686 case OP_SCOND:
2687 {
2688 int local_offsets[1000];
2689 int local_workspace[1000];
2690 int codelink = GET(code, 1);
2691 int condcode;
2692
2693 /* Because of the way auto-callout works during compile, a callout item
2694 is inserted between OP_COND and an assertion condition. This does not
2695 happen for the other conditions. */
2696
2697 if (code[LINK_SIZE+1] == OP_CALLOUT)
2698 {
2699 rrc = 0;
2700 if (PUBL(callout) != NULL)
2701 {
2702 PUBL(callout_block) cb;
2703 cb.version = 1; /* Version 1 of the callout block */
2704 cb.callout_number = code[LINK_SIZE+2];
2705 cb.offset_vector = offsets;
2706 #ifdef COMPILE_PCRE8
2707 cb.subject = (PCRE_SPTR)start_subject;
2708 #else
2709 cb.subject = (PCRE_SPTR16)start_subject;
2710 #endif
2711 cb.subject_length = (int)(end_subject - start_subject);
2712 cb.start_match = (int)(current_subject - start_subject);
2713 cb.current_position = (int)(ptr - start_subject);
2714 cb.pattern_position = GET(code, LINK_SIZE + 3);
2715 cb.next_item_length = GET(code, 3 + 2*LINK_SIZE);
2716 cb.capture_top = 1;
2717 cb.capture_last = -1;
2718 cb.callout_data = md->callout_data;
2719 cb.mark = NULL; /* No (*MARK) support */
2720 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2721 }
2722 if (rrc > 0) break; /* Fail this thread */
2723 code += PRIV(OP_lengths)[OP_CALLOUT]; /* Skip callout data */
2724 }
2725
2726 condcode = code[LINK_SIZE+1];
2727
2728 /* Back reference conditions are not supported */
2729
2730 if (condcode == OP_CREF || condcode == OP_NCREF)
2731 return PCRE_ERROR_DFA_UCOND;
2732
2733 /* The DEFINE condition is always false */
2734
2735 if (condcode == OP_DEF)
2736 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2737
2738 /* The only supported version of OP_RREF is for the value RREF_ANY,
2739 which means "test if in any recursion". We can't test for specifically
2740 recursed groups. */
2741
2742 else if (condcode == OP_RREF || condcode == OP_NRREF)
2743 {
2744 int value = GET2(code, LINK_SIZE + 2);
2745 if (value != RREF_ANY) return PCRE_ERROR_DFA_UCOND;
2746 if (md->recursive != NULL)
2747 { ADD_ACTIVE(state_offset + LINK_SIZE + 2 + IMM2_SIZE, 0); }
2748 else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2749 }
2750
2751 /* Otherwise, the condition is an assertion */
2752
2753 else
2754 {
2755 int rc;
2756 const pcre_uchar *asscode = code + LINK_SIZE + 1;
2757 const pcre_uchar *endasscode = asscode + GET(asscode, 1);
2758
2759 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2760
2761 rc = internal_dfa_exec(
2762 md, /* fixed match data */
2763 asscode, /* this subexpression's code */
2764 ptr, /* where we currently are */
2765 (int)(ptr - start_subject), /* start offset */
2766 local_offsets, /* offset vector */
2767 sizeof(local_offsets)/sizeof(int), /* size of same */
2768 local_workspace, /* workspace vector */
2769 sizeof(local_workspace)/sizeof(int), /* size of same */
2770 rlevel); /* function recursion level */
2771
2772 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2773 if ((rc >= 0) ==
2774 (condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2775 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2776 else
2777 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2778 }
2779 }
2780 break;
2781
2782 /*-----------------------------------------------------------------*/
2783 case OP_RECURSE:
2784 {
2785 dfa_recursion_info *ri;
2786 int local_offsets[1000];
2787 int local_workspace[1000];
2788 const pcre_uchar *callpat = start_code + GET(code, 1);
2789 int recno = (callpat == md->start_code)? 0 :
2790 GET2(callpat, 1 + LINK_SIZE);
2791 int rc;
2792
2793 DPRINTF(("%.*sStarting regex recursion\n", rlevel*2-2, SP));
2794
2795 /* Check for repeating a recursion without advancing the subject
2796 pointer. This should catch convoluted mutual recursions. (Some simple
2797 cases are caught at compile time.) */
2798
2799 for (ri = md->recursive; ri != NULL; ri = ri->prevrec)
2800 if (recno == ri->group_num && ptr == ri->subject_position)
2801 return PCRE_ERROR_RECURSELOOP;
2802
2803 /* Remember this recursion and where we started it so as to
2804 catch infinite loops. */
2805
2806 new_recursive.group_num = recno;
2807 new_recursive.subject_position = ptr;
2808 new_recursive.prevrec = md->recursive;
2809 md->recursive = &new_recursive;
2810
2811 rc = internal_dfa_exec(
2812 md, /* fixed match data */
2813 callpat, /* this subexpression's code */
2814 ptr, /* where we currently are */
2815 (int)(ptr - start_subject), /* start offset */
2816 local_offsets, /* offset vector */
2817 sizeof(local_offsets)/sizeof(int), /* size of same */
2818 local_workspace, /* workspace vector */
2819 sizeof(local_workspace)/sizeof(int), /* size of same */
2820 rlevel); /* function recursion level */
2821
2822 md->recursive = new_recursive.prevrec; /* Done this recursion */
2823
2824 DPRINTF(("%.*sReturn from regex recursion: rc=%d\n", rlevel*2-2, SP,
2825 rc));
2826
2827 /* Ran out of internal offsets */
2828
2829 if (rc == 0) return PCRE_ERROR_DFA_RECURSE;
2830
2831 /* For each successful matched substring, set up the next state with a
2832 count of characters to skip before trying it. Note that the count is in
2833 characters, not bytes. */
2834
2835 if (rc > 0)
2836 {
2837 for (rc = rc*2 - 2; rc >= 0; rc -= 2)
2838 {
2839 int charcount = local_offsets[rc+1] - local_offsets[rc];
2840 #ifdef SUPPORT_UTF
2841 if (utf)
2842 {
2843 const pcre_uchar *p = start_subject + local_offsets[rc];
2844 const pcre_uchar *pp = start_subject + local_offsets[rc+1];
2845 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2846 }
2847 #endif
2848 if (charcount > 0)
2849 {
2850 ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0, (charcount - 1));
2851 }
2852 else
2853 {
2854 ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
2855 }
2856 }
2857 }
2858 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2859 }
2860 break;
2861
2862 /*-----------------------------------------------------------------*/
2863 case OP_BRAPOS:
2864 case OP_SBRAPOS:
2865 case OP_CBRAPOS:
2866 case OP_SCBRAPOS:
2867 case OP_BRAPOSZERO:
2868 {
2869 int charcount, matched_count;
2870 const pcre_uchar *local_ptr = ptr;
2871 BOOL allow_zero;
2872
2873 if (codevalue == OP_BRAPOSZERO)
2874 {
2875 allow_zero = TRUE;
2876 codevalue = *(++code); /* Codevalue will be one of above BRAs */
2877 }
2878 else allow_zero = FALSE;
2879
2880 /* Loop to match the subpattern as many times as possible as if it were
2881 a complete pattern. */
2882
2883 for (matched_count = 0;; matched_count++)
2884 {
2885 int local_offsets[2];
2886 int local_workspace[1000];
2887
2888 int rc = internal_dfa_exec(
2889 md, /* fixed match data */
2890 code, /* this subexpression's code */
2891 local_ptr, /* where we currently are */
2892 (int)(ptr - start_subject), /* start offset */
2893 local_offsets, /* offset vector */
2894 sizeof(local_offsets)/sizeof(int), /* size of same */
2895 local_workspace, /* workspace vector */
2896 sizeof(local_workspace)/sizeof(int), /* size of same */
2897 rlevel); /* function recursion level */
2898
2899 /* Failed to match */
2900
2901 if (rc < 0)
2902 {
2903 if (rc != PCRE_ERROR_NOMATCH) return rc;
2904 break;
2905 }
2906
2907 /* Matched: break the loop if zero characters matched. */
2908
2909 charcount = local_offsets[1] - local_offsets[0];
2910 if (charcount == 0) break;
2911 local_ptr += charcount; /* Advance temporary position ptr */
2912 }
2913
2914 /* At this point we have matched the subpattern matched_count
2915 times, and local_ptr is pointing to the character after the end of the
2916 last match. */
2917
2918 if (matched_count > 0 || allow_zero)
2919 {
2920 const pcre_uchar *end_subpattern = code;
2921 int next_state_offset;
2922
2923 do { end_subpattern += GET(end_subpattern, 1); }
2924 while (*end_subpattern == OP_ALT);
2925 next_state_offset =
2926 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2927
2928 /* Optimization: if there are no more active states, and there
2929 are no new states yet set up, then skip over the subject string
2930 right here, to save looping. Otherwise, set up the new state to swing
2931 into action when the end of the matched substring is reached. */
2932
2933 if (i + 1 >= active_count && new_count == 0)
2934 {
2935 ptr = local_ptr;
2936 clen = 0;
2937 ADD_NEW(next_state_offset, 0);
2938 }
2939 else
2940 {
2941 const pcre_uchar *p = ptr;
2942 const pcre_uchar *pp = local_ptr;
2943 charcount = (int)(pp - p);
2944 #ifdef SUPPORT_UTF
2945 if (utf) while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2946 #endif
2947 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2948 }
2949 }
2950 }
2951 break;
2952
2953 /*-----------------------------------------------------------------*/
2954 case OP_ONCE:
2955 case OP_ONCE_NC:
2956 {
2957 int local_offsets[2];
2958 int local_workspace[1000];
2959
2960 int rc = internal_dfa_exec(
2961 md, /* fixed match data */
2962 code, /* this subexpression's code */
2963 ptr, /* where we currently are */
2964 (int)(ptr - start_subject), /* start offset */
2965 local_offsets, /* offset vector */
2966 sizeof(local_offsets)/sizeof(int), /* size of same */
2967 local_workspace, /* workspace vector */
2968 sizeof(local_workspace)/sizeof(int), /* size of same */
2969 rlevel); /* function recursion level */
2970
2971 if (rc >= 0)
2972 {
2973 const pcre_uchar *end_subpattern = code;
2974 int charcount = local_offsets[1] - local_offsets[0];
2975 int next_state_offset, repeat_state_offset;
2976
2977 do { end_subpattern += GET(end_subpattern, 1); }
2978 while (*end_subpattern == OP_ALT);
2979 next_state_offset =
2980 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2981
2982 /* If the end of this subpattern is KETRMAX or KETRMIN, we must
2983 arrange for the repeat state also to be added to the relevant list.
2984 Calculate the offset, or set -1 for no repeat. */
2985
2986 repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
2987 *end_subpattern == OP_KETRMIN)?
2988 (int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
2989
2990 /* If we have matched an empty string, add the next state at the
2991 current character pointer. This is important so that the duplicate
2992 checking kicks in, which is what breaks infinite loops that match an
2993 empty string. */
2994
2995 if (charcount == 0)
2996 {
2997 ADD_ACTIVE(next_state_offset, 0);
2998 }
2999
3000 /* Optimization: if there are no more active states, and there
3001 are no new states yet set up, then skip over the subject string
3002 right here, to save looping. Otherwise, set up the new state to swing
3003 into action when the end of the matched substring is reached. */
3004
3005 else if (i + 1 >= active_count && new_count == 0)
3006 {
3007 ptr += charcount;
3008 clen = 0;
3009 ADD_NEW(next_state_offset, 0);
3010
3011 /* If we are adding a repeat state at the new character position,
3012 we must fudge things so that it is the only current state.
3013 Otherwise, it might be a duplicate of one we processed before, and
3014 that would cause it to be skipped. */
3015
3016 if (repeat_state_offset >= 0)
3017 {
3018 next_active_state = active_states;
3019 active_count = 0;
3020 i = -1;
3021 ADD_ACTIVE(repeat_state_offset, 0);
3022 }
3023 }
3024 else
3025 {
3026 #ifdef SUPPORT_UTF
3027 if (utf)
3028 {
3029 const pcre_uchar *p = start_subject + local_offsets[0];
3030 const pcre_uchar *pp = start_subject + local_offsets[1];
3031 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
3032 }
3033 #endif
3034 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
3035 if (repeat_state_offset >= 0)
3036 { ADD_NEW_DATA(-repeat_state_offset, 0, (charcount - 1)); }
3037 }
3038 }
3039 else if (rc != PCRE_ERROR_NOMATCH) return rc;
3040 }
3041 break;
3042
3043
3044 /* ========================================================================== */
3045 /* Handle callouts */
3046
3047 case OP_CALLOUT:
3048 rrc = 0;
3049 if (PUBL(callout) != NULL)
3050 {
3051 PUBL(callout_block) cb;
3052 cb.version = 1; /* Version 1 of the callout block */
3053 cb.callout_number = code[1];
3054 cb.offset_vector = offsets;
3055 #ifdef COMPILE_PCRE8
3056 cb.subject = (PCRE_SPTR)start_subject;
3057 #else
3058 cb.subject = (PCRE_SPTR16)start_subject;
3059 #endif
3060 cb.subject_length = (int)(end_subject - start_subject);
3061 cb.start_match = (int)(current_subject - start_subject);
3062 cb.current_position = (int)(ptr - start_subject);
3063 cb.pattern_position = GET(code, 2);
3064 cb.next_item_length = GET(code, 2 + LINK_SIZE);
3065 cb.capture_top = 1;
3066 cb.capture_last = -1;
3067 cb.callout_data = md->callout_data;
3068 cb.mark = NULL; /* No (*MARK) support */
3069 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
3070 }
3071 if (rrc == 0)
3072 { ADD_ACTIVE(state_offset + PRIV(OP_lengths)[OP_CALLOUT], 0); }
3073 break;
3074
3075
3076 /* ========================================================================== */
3077 default: /* Unsupported opcode */
3078 return PCRE_ERROR_DFA_UITEM;
3079 }
3080
3081 NEXT_ACTIVE_STATE: continue;
3082
3083 } /* End of loop scanning active states */
3084
3085 /* We have finished the processing at the current subject character. If no
3086 new states have been set for the next character, we have found all the
3087 matches that we are going to find. If we are at the top level and partial
3088 matching has been requested, check for appropriate conditions.
3089
3090 The "forced_fail" variable counts the number of (*F) encountered for the
3091 character. If it is equal to the original active_count (saved in
3092 workspace[1]) it means that (*F) was found on every active state. In this
3093 case we don't want to give a partial match.
3094
3095 The "could_continue" variable is true if a state could have continued but
3096 for the fact that the end of the subject was reached. */
3097
3098 if (new_count <= 0)
3099 {
3100 if (rlevel == 1 && /* Top level, and */
3101 could_continue && /* Some could go on, and */
3102 forced_fail != workspace[1] && /* Not all forced fail & */
3103 ( /* either... */
3104 (md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
3105 || /* or... */
3106 ((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
3107 match_count < 0) /* no matches */
3108 ) && /* And... */
3109 (
3110 partial_newline || /* Either partial NL */
3111 ( /* or ... */
3112 ptr >= end_subject && /* End of subject and */
3113 ptr > md->start_used_ptr) /* Inspected non-empty string */
3114 )
3115 )
3116 {
3117 if (offsetcount >= 2)
3118 {
3119 offsets[0] = (int)(md->start_used_ptr - start_subject);
3120 offsets[1] = (int)(end_subject - start_subject);
3121 }
3122 match_count = PCRE_ERROR_PARTIAL;
3123 }
3124
3125 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
3126 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel, match_count,
3127 rlevel*2-2, SP));
3128 break; /* In effect, "return", but see the comment below */
3129 }
3130
3131 /* One or more states are active for the next character. */
3132
3133 ptr += clen; /* Advance to next subject character */
3134 } /* Loop to move along the subject string */
3135
3136 /* Control gets here from "break" a few lines above. We do it this way because
3137 if we use "return" above, we have compiler trouble. Some compilers warn if
3138 there's nothing here because they think the function doesn't return a value. On
3139 the other hand, if we put a dummy statement here, some more clever compilers
3140 complain that it can't be reached. Sigh. */
3141
3142 return match_count;
3143 }
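
The partial-match decision made just before the "break" above can be read as a single predicate. The following is a minimal sketch of that predicate only, not the library's code: the struct, the `sk_` names and the option bit values are hypothetical stand-ins (the real bits are defined in pcre.h), and subject positions are modelled as offsets rather than pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option bits for this sketch; not the values from pcre.h. */
enum { SK_PARTIAL_SOFT = 0x1, SK_PARTIAL_HARD = 0x2 };

typedef struct {
  int rlevel;            /* recursion level; 1 means top level */
  bool could_continue;   /* a state was cut short by the end of the subject */
  int forced_fail;       /* number of (*F) hits for this character */
  int active_count;      /* original active_count (workspace[1] in the source) */
  unsigned moptions;     /* match-time option bits */
  int match_count;       /* negative when no complete match was found */
  bool partial_newline;  /* a partial multi-character newline was seen */
  size_t ptr;            /* current position, as an offset */
  size_t end_subject;    /* offset of the end of the subject */
  size_t start_used;     /* earliest offset that was inspected */
} sk_state;

/* Returns true when the conditions for reporting a partial match all hold. */
static bool sk_is_partial(const sk_state *s)
{
  assert(s != NULL);
  bool hard = (s->moptions & SK_PARTIAL_HARD) != 0;
  bool soft = (s->moptions & SK_PARTIAL_SOFT) != 0 && s->match_count < 0;
  bool inspected = s->ptr >= s->end_subject && s->ptr > s->start_used;
  return s->rlevel == 1 &&
         s->could_continue &&
         s->forced_fail != s->active_count &&
         (hard || soft) &&
         (s->partial_newline || inspected);
}
```

Under these assumptions, a hard partial request is honoured even when a complete match was already found, whereas a soft partial request is honoured only when match_count is negative.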
3144
3145
3146
3147
3148 /*************************************************
3149 * Execute a Regular Expression - DFA engine *
3150 *************************************************/
3151
3152 /* This external function applies a compiled re to a subject string using a DFA
3153 engine. This function calls the internal function multiple times if the pattern
3154 is not anchored.
3155
3156 Arguments:
3157 argument_re points to the compiled expression
3158 extra_data points to extra data or is NULL
3159 subject points to the subject string
3160 length length of subject string (may contain binary zeros)
3161 start_offset where to start in the subject string
3162 options option bits
3163 offsets vector of match offsets
3164 offsetcount size of same
3165 workspace workspace vector
3166 wscount size of same
3167
3168 Returns: > 0 => number of match offset pairs placed in offsets
3169 = 0 => offsets overflowed; longest matches are present
3170 -1 => failed to match
3171 < -1 => some kind of unexpected problem
3172 */
3173
3174 #ifdef COMPILE_PCRE8
3175 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3176 pcre_dfa_exec(const pcre *argument_re, const pcre_extra *extra_data,
3177 const char *subject, int length, int start_offset, int options, int *offsets,
3178 int offsetcount, int *workspace, int wscount)
3179 #else
3180 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3181 pcre16_dfa_exec(const pcre16 *argument_re, const pcre16_extra *extra_data,
3182 PCRE_SPTR16 subject, int length, int start_offset, int options, int *offsets,
3183 int offsetcount, int *workspace, int wscount)
3184 #endif
3185 {
3186 REAL_PCRE *re = (REAL_PCRE *)argument_re;
3187 dfa_match_data match_block;
3188 dfa_match_data *md = &match_block;
3189 BOOL utf, anchored, startline, firstline;
3190 const pcre_uchar *current_subject, *end_subject;
3191 const pcre_study_data *study = NULL;
3192
3193 const pcre_uchar *req_char_ptr;
3194 const pcre_uint8 *start_bits = NULL;
3195 BOOL has_first_char = FALSE;
3196 BOOL has_req_char = FALSE;
3197 pcre_uchar first_char = 0;
3198 pcre_uchar first_char2 = 0;
3199 pcre_uchar req_char = 0;
3200 pcre_uchar req_char2 = 0;
3201 int newline;
3202
3203 /* Plausibility checks */
3204
3205 if ((options & ~PUBLIC_DFA_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
3206 if (re == NULL || subject == NULL || workspace == NULL ||
3207 (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
3208 if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
3209 if (wscount < 20) return PCRE_ERROR_DFA_WSSIZE;
3210 if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
3211
3212 /* Check that the first field in the block is the magic number. If it is not,
3213 return with PCRE_ERROR_BADMAGIC. However, if the magic number is equal to
3214 REVERSED_MAGIC_NUMBER we return with PCRE_ERROR_BADENDIANNESS, which
3215 means that the pattern is likely compiled with different endianness. */
3216
3217 if (re->magic_number != MAGIC_NUMBER)
3218 return re->magic_number == REVERSED_MAGIC_NUMBER?
3219 PCRE_ERROR_BADENDIANNESS:PCRE_ERROR_BADMAGIC;
3220 if ((re->flags & PCRE_MODE) == 0) return PCRE_ERROR_BADMODE;
3221
3222 /* If restarting after a partial match, do some sanity checks on the contents
3223 of the workspace. */
3224
3225 if ((options & PCRE_DFA_RESTART) != 0)
3226 {
3227 if ((workspace[0] & (-2)) != 0 || workspace[1] < 1 ||
3228 workspace[1] > (wscount - 2)/INTS_PER_STATEBLOCK)
3229 return PCRE_ERROR_DFA_BADRESTART;
3230 }
3231
3232 /* Set up study, callout, and table data */
3233
3234 md->tables = re->tables;
3235 md->callout_data = NULL;
3236
3237 if (extra_data != NULL)
3238 {
3239 unsigned int flags = extra_data->flags;
3240 if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
3241 study = (const pcre_study_data *)extra_data->study_data;
3242 if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0) return PCRE_ERROR_DFA_UMLIMIT;
3243 if ((flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION) != 0)
3244 return PCRE_ERROR_DFA_UMLIMIT;
3245 if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
3246 md->callout_data = extra_data->callout_data;
3247 if ((flags & PCRE_EXTRA_TABLES) != 0)
3248 md->tables = extra_data->tables;
3249 }
3250
3251 /* Set some local values */
3252
3253 current_subject = (const pcre_uchar *)subject + start_offset;
3254 end_subject = (const pcre_uchar *)subject + length;
3255 req_char_ptr = current_subject - 1;
3256
3257 #ifdef SUPPORT_UTF
3258 /* PCRE_UTF16 has the same value as PCRE_UTF8. */
3259 utf = (re->options & PCRE_UTF8) != 0;
3260 #else
3261 utf = FALSE;
3262 #endif
3263
3264 anchored = (options & (PCRE_ANCHORED|PCRE_DFA_RESTART)) != 0 ||
3265 (re->options & PCRE_ANCHORED) != 0;
3266
3267 /* The remaining fixed data for passing around. */
3268
3269 md->start_code = (const pcre_uchar *)argument_re +
3270 re->name_table_offset + re->name_count * re->name_entry_size;
3271 md->start_subject = (const pcre_uchar *)subject;
3272 md->end_subject = end_subject;
3273 md->start_offset = start_offset;
3274 md->moptions = options;
3275 md->poptions = re->options;
3276
3277 /* If the BSR option is not set at match time, copy what was set
3278 at compile time. */
3279
3280 if ((md->moptions & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) == 0)
3281 {
3282 if ((re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) != 0)
3283 md->moptions |= re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE);
3284 #ifdef BSR_ANYCRLF
3285 else md->moptions |= PCRE_BSR_ANYCRLF;
3286 #endif
3287 }
3288
3289 /* Handle different types of newline. The three bits give eight cases. If
3290 nothing is set at run time, whatever was used at compile time applies. */
3291
3292 switch ((((options & PCRE_NEWLINE_BITS) == 0)? re->options : (pcre_uint32)options) &
3293 PCRE_NEWLINE_BITS)
3294 {
3295 case 0: newline = NEWLINE; break; /* Compile-time default */
3296 case PCRE_NEWLINE_CR: newline = CHAR_CR; break;
3297 case PCRE_NEWLINE_LF: newline = CHAR_NL; break;
3298 case PCRE_NEWLINE_CR+
3299 PCRE_NEWLINE_LF: newline = (CHAR_CR << 8) | CHAR_NL; break;
3300 case PCRE_NEWLINE_ANY: newline = -1; break;
3301 case PCRE_NEWLINE_ANYCRLF: newline = -2; break;
3302 default: return PCRE_ERROR_BADNEWLINE;
3303 }
3304
3305 if (newline == -2)
3306 {
3307 md->nltype = NLTYPE_ANYCRLF;
3308 }
3309 else if (newline < 0)
3310 {
3311 md->nltype = NLTYPE_ANY;
3312 }
3313 else
3314 {
3315 md->nltype = NLTYPE_FIXED;
3316 if (newline > 255)
3317 {
3318 md->nllen = 2;
3319 md->nl[0] = (newline >> 8) & 255;
3320 md->nl[1] = newline & 255;
3321 }
3322 else
3323 {
3324 md->nllen = 1;
3325 md->nl[0] = newline;
3326 }
3327 }
3328
3329 /* Check a UTF string if required. If the check fails and there is room in
3330 the offsets vector, pass back the error offset and the reason code. */
3331
3332 #ifdef SUPPORT_UTF
3333 if (utf && (options & PCRE_NO_UTF8_CHECK) == 0)
3334 {
3335 int erroroffset;
3336 int errorcode = PRIV(valid_utf)((pcre_uchar *)subject, length, &erroroffset);
3337 if (errorcode != 0)
3338 {
3339 if (offsetcount >= 2)
3340 {
3341 offsets[0] = erroroffset;
3342 offsets[1] = errorcode;
3343 }
3344 return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0)?
3345 PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
3346 }
3347 if (start_offset > 0 && start_offset < length &&
3348 NOT_FIRSTCHAR(((PCRE_PUCHAR)subject)[start_offset]))
3349 return PCRE_ERROR_BADUTF8_OFFSET;
3350 }
3351 #endif
3352
3353 /* If the exec call supplied NULL for tables, use the inbuilt ones. This
3354 is a feature that makes it possible to save compiled regexes and re-use them
3355 in other programs later. */
3356
3357 if (md->tables == NULL) md->tables = PRIV(default_tables);
3358
3359 /* The "must be at the start of a line" flags are used in a loop when finding
3360 where to start. */
3361
3362 startline = (re->flags & PCRE_STARTLINE) != 0;
3363 firstline = (re->options & PCRE_FIRSTLINE) != 0;
3364
3365 /* Set up the first character to match, if available. The first_char value is
3366 never set for an anchored regular expression, but the anchoring may be forced
3367 at run time, so we have to test for anchoring. The first char may be unset for
3368 an unanchored pattern, of course. If there's no first char and the pattern was
3369 studied, there may be a bitmap of possible first characters. */
3370
3371 if (!anchored)
3372 {
3373 if ((re->flags & PCRE_FIRSTSET) != 0)
3374 {
3375 has_first_char = TRUE;
3376 first_char = first_char2 = (pcre_uchar)(re->first_char);
3377 if ((re->flags & PCRE_FCH_CASELESS) != 0)
3378 {
3379 first_char2 = TABLE_GET(first_char, md->tables + fcc_offset, first_char);
3380 #if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
3381 if (utf && first_char > 127)
3382 first_char2 = UCD_OTHERCASE(first_char);
3383 #endif
3384 }
3385 }
3386 else
3387 {
3388 if (!startline && study != NULL &&
3389 (study->flags & PCRE_STUDY_MAPPED) != 0)
3390 start_bits = study->start_bits;
3391 }
3392 }
3393
3394 /* For anchored or unanchored matches, there may be a "last known required
3395 character" set. */
3396
3397 if ((re->flags & PCRE_REQCHSET) != 0)
3398 {
3399 has_req_char = TRUE;
3400 req_char = req_char2 = (pcre_uchar)(re->req_char);
3401 if ((re->flags & PCRE_RCH_CASELESS) != 0)
3402 {
3403 req_char2 = TABLE_GET(req_char, md->tables + fcc_offset, req_char);
3404 #if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
3405 if (utf && req_char > 127)
3406 req_char2 = UCD_OTHERCASE(req_char);
3407 #endif
3408 }
3409 }
3410
3411 /* Call the main matching function, looping for a non-anchored regex after a
3412 failed match. If not restarting, perform certain optimizations at the start of
3413 a match. */
3414
3415 for (;;)
3416 {
3417 int rc;
3418
3419 if ((options & PCRE_DFA_RESTART) == 0)
3420 {
3421 const pcre_uchar *save_end_subject = end_subject;
3422
3423 /* If firstline is TRUE, the start of the match is constrained to the first
3424 line of a multiline string. Implement this by temporarily adjusting
3425 end_subject so that we stop scanning at a newline. If the match fails at
3426 the newline, later code breaks this loop. */
3427
3428 if (firstline)
3429 {
3430 PCRE_PUCHAR t = current_subject;
3431 #ifdef SUPPORT_UTF
3432 if (utf)
3433 {
3434 while (t < md->end_subject && !IS_NEWLINE(t))
3435 {
3436 t++;
3437 ACROSSCHAR(t < end_subject, *t, t++);
3438 }
3439 }
3440 else
3441 #endif
3442 while (t < md->end_subject && !IS_NEWLINE(t)) t++;
3443 end_subject = t;
3444 }
3445
3446 /* There are some optimizations that avoid running the match if a known
3447 starting point is not found. However, there is an option that disables
3448 these, for testing and for ensuring that all callouts do actually occur.
3449 The option can be set in the regex by (*NO_START_OPT) or passed in
3450 match-time options. */
3451
3452 if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
3453 {
3454 /* Advance to a known first char. */
3455
3456 if (has_first_char)
3457 {
3458 if (first_char != first_char2)
3459 while (current_subject < end_subject &&
3460 *current_subject != first_char && *current_subject != first_char2)
3461 current_subject++;
3462 else
3463 while (current_subject < end_subject &&
3464 *current_subject != first_char)
3465 current_subject++;
3466 }
3467
3468 /* Or to just after a linebreak for a multiline match if possible */
3469
3470 else if (startline)
3471 {
3472 if (current_subject > md->start_subject + start_offset)
3473 {
3474 #ifdef SUPPORT_UTF
3475 if (utf)
3476 {
3477 while (current_subject < end_subject &&
3478 !WAS_NEWLINE(current_subject))
3479 {
3480 current_subject++;
3481 ACROSSCHAR(current_subject < end_subject, *current_subject,
3482 current_subject++);
3483 }
3484 }
3485 else
3486 #endif
3487 while (current_subject < end_subject && !WAS_NEWLINE(current_subject))
3488 current_subject++;
3489
3490 /* If we have just passed a CR and the newline option is ANY or
3491 ANYCRLF, and we are now at a LF, advance the match position by one
3492 more character. */
3493
3494 if (current_subject[-1] == CHAR_CR &&
3495 (md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
3496 current_subject < end_subject &&
3497 *current_subject == CHAR_NL)
3498 current_subject++;
3499 }
3500 }
3501
3502 /* Or to a non-unique first char after study */
3503
3504 else if (start_bits != NULL)
3505 {
3506 while (current_subject < end_subject)
3507 {
3508 register unsigned int c = *current_subject;
3509 #ifndef COMPILE_PCRE8
3510 if (c > 255) c = 255;
3511 #endif
3512 if ((start_bits[c/8] & (1 << (c&7))) == 0)
3513 {
3514 current_subject++;
3515 #if defined SUPPORT_UTF && defined COMPILE_PCRE8
3516 /* In non 8-bit mode, the iteration will stop for
3517 characters > 255 at the beginning or not stop at all. */
3518 if (utf)
3519 ACROSSCHAR(current_subject < end_subject, *current_subject,
3520 current_subject++);
3521 #endif
3522 }
3523 else break;
3524 }
3525 }
3526 }
3527
3528 /* Restore fudged end_subject */
3529
3530 end_subject = save_end_subject;
3531
3532 /* The following two optimizations are disabled for partial matching or if
3533 disabling is explicitly requested (and of course, by the test above, this
3534 code is not obeyed when restarting after a partial match). */
3535
3536 if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0 &&
3537 (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
3538 {
3539 /* If the pattern was studied, a minimum subject length may be set. This
3540 is a lower bound; there may be no string of exactly that length that matches
3541 the pattern. Although the value is, strictly, in characters, we treat it as
3542 bytes to avoid spending too much time in this optimization. */
3543
3544 if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
3545 (pcre_uint32)(end_subject - current_subject) < study->minlength)
3546 return PCRE_ERROR_NOMATCH;
3547
3548 /* If req_char is set, we know that that character must appear in the
3549 subject for the match to succeed. If the first character is set, req_char
3550 must be later in the subject; otherwise the test starts at the match
3551 point. This optimization can save a huge amount of work in patterns with
3552 nested unlimited repeats that aren't going to match. Writing separate
3553 code for cased/caseless versions makes it go faster, as does using an
3554 autoincrement and backing off on a match.
3555
3556 HOWEVER: when the subject string is very, very long, searching to its end
3557 can take a long time, and give bad performance on quite ordinary
3558 patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
3559 string... so we don't do this when the string is sufficiently long. */
3560
3561 if (has_req_char && end_subject - current_subject < REQ_BYTE_MAX)
3562 {
3563 register PCRE_PUCHAR p = current_subject + (has_first_char? 1:0);
3564
3565 /* We don't need to repeat the search if we haven't yet reached the
3566 place we found it at last time. */
3567
3568 if (p > req_char_ptr)
3569 {
3570 if (req_char != req_char2)
3571 {
3572 while (p < end_subject)
3573 {
3574 register int pp = *p++;
3575 if (pp == req_char || pp == req_char2) { p--; break; }
3576 }
3577 }
3578 else
3579 {
3580 while (p < end_subject)
3581 {
3582 if (*p++ == req_char) { p--; break; }
3583 }
3584 }
3585
3586 /* If we can't find the required character, break the matching loop,
3587 which will cause a return of PCRE_ERROR_NOMATCH. */
3588
3589 if (p >= end_subject) break;
3590
3591 /* If we have found the required character, save the point where we
3592 found it, so that we don't search again next time round the loop if
3593 the start hasn't passed this character yet. */
3594
3595 req_char_ptr = p;
3596 }
3597 }
3598 }
3599 } /* End of optimizations that are done when not restarting */
3600
3601 /* OK, now we can do the business */
3602
3603 md->start_used_ptr = current_subject;
3604 md->recursive = NULL;
3605
3606 rc = internal_dfa_exec(
3607 md, /* fixed match data */
3608 md->start_code, /* this subexpression's code */
3609 current_subject, /* where we currently are */
3610 start_offset, /* start offset in subject */
3611 offsets, /* offset vector */
3612 offsetcount, /* size of same */
3613 workspace, /* workspace vector */
3614 wscount, /* size of same */
3615 0); /* function recurse level */
3616
3617 /* Anything other than "no match" means we are done, always; otherwise, carry
3618 on only if not anchored. */
3619
3620 if (rc != PCRE_ERROR_NOMATCH || anchored) return rc;
3621
3622 /* Advance to the next subject character unless we are at the end of a line
3623 and firstline is set. */
3624
3625 if (firstline && IS_NEWLINE(current_subject)) break;
3626 current_subject++;
3627 #ifdef SUPPORT_UTF
3628 if (utf)
3629 {
3630 ACROSSCHAR(current_subject < end_subject, *current_subject,
3631 current_subject++);
3632 }
3633 #endif
3634 if (current_subject > end_subject) break;
3635
3636 /* If we have just passed a CR and we are now at a LF, and the pattern does
3637 not contain any explicit matches for \r or \n, and the newline option is CRLF
3638 or ANY or ANYCRLF, advance the match position by one more character. */
3639
3640 if (current_subject[-1] == CHAR_CR &&
3641 current_subject < end_subject &&
3642 *current_subject == CHAR_NL &&
3643 (re->flags & PCRE_HASCRORLF) == 0 &&
3644 (md->nltype == NLTYPE_ANY ||
3645 md->nltype == NLTYPE_ANYCRLF ||
3646 md->nllen == 2))
3647 current_subject++;
3648
3649 } /* "Bumpalong" loop */
3650
3651 return PCRE_ERROR_NOMATCH;
3652 }
3653
3654 /* End of pcre_dfa_exec.c */
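
The overall shape of the external function (skip ahead to a known first character, try a match at the current position, and "bump along" by one character after a failure unless the match is anchored) can be illustrated in isolation. This is a minimal sketch under stated assumptions: the toy matcher `sk_match_here` only recognises a literal string and stands in for internal_dfa_exec, all `sk_` names are hypothetical, and UTF handling, newline handling and the studied-pattern optimizations are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_NOMATCH (-1)

/* Toy stand-in for the real matcher: succeeds only if the literal "lit"
   occurs exactly at subject[pos]. Returns 1 on success, SK_NOMATCH otherwise. */
static int sk_match_here(const char *subject, size_t len, size_t pos,
                         const char *lit)
{
  size_t n = strlen(lit);
  if (pos + n > len) return SK_NOMATCH;
  return memcmp(subject + pos, lit, n) == 0 ? 1 : SK_NOMATCH;
}

/* Returns the offset at which the first match starts, or SK_NOMATCH. */
static long sk_bumpalong(const char *subject, size_t len, size_t start,
                         const char *lit, int anchored)
{
  size_t cur = start;
  assert(start <= len);
  for (;;)
    {
    /* Start optimization: advance to a known first character. As in the
       source, this is only done when the match is not anchored. */
    if (!anchored && lit[0] != '\0')
      while (cur < len && subject[cur] != lit[0]) cur++;

    /* Anything other than "no match" means we are done. */
    if (sk_match_here(subject, len, cur, lit) != SK_NOMATCH) return (long)cur;

    /* Carry on only if not anchored, one character at a time. */
    if (anchored) break;
    cur++;
    if (cur > len) break;
    }
  return SK_NOMATCH;
}
```

For example, `sk_bumpalong("ababc", 5, 0, "abc", 0)` first fails at offset 0, bumps along, skips to the next 'a' at offset 2 and succeeds there, while the same call with `anchored` set to 1 gives up after the first failure.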
