Contents of /code/trunk/pcre_dfa_exec.c

Revision 1100
Tue Oct 16 15:56:26 2012 UTC by chpe
File MIME type: text/plain
File size: 123945 byte(s)
pcre32: exec: Mask bits > 21 in 32-bit UTF mode

Allow passing characters with high bits set in UTF-32 mode.
1 /*************************************************
2 * Perl-Compatible Regular Expressions *
3 *************************************************/
4
5 /* PCRE is a library of functions to support regular expressions whose syntax
6 and semantics are as close as possible to those of the Perl 5 language (but see
7 below for why this module is different).
8
9 Written by Philip Hazel
10 Copyright (c) 1997-2012 University of Cambridge
11
12 -----------------------------------------------------------------------------
13 Redistribution and use in source and binary forms, with or without
14 modification, are permitted provided that the following conditions are met:
15
16 * Redistributions of source code must retain the above copyright notice,
17 this list of conditions and the following disclaimer.
18
19 * Redistributions in binary form must reproduce the above copyright
20 notice, this list of conditions and the following disclaimer in the
21 documentation and/or other materials provided with the distribution.
22
23 * Neither the name of the University of Cambridge nor the names of its
24 contributors may be used to endorse or promote products derived from
25 this software without specific prior written permission.
26
27 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
28 AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
29 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
30 ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
31 LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
32 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
33 SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
34 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
35 CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
36 ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
37 POSSIBILITY OF SUCH DAMAGE.
38 -----------------------------------------------------------------------------
39 */
40
41 /* This module contains the external function pcre_dfa_exec(), which is an
42 alternative matching function that uses a sort of DFA algorithm (not a true
43 FSM). This is NOT Perl-compatible, but it has advantages in certain
44 applications. */
45
46
47 /* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
48 the performance of his patterns greatly. I could not use it as it stood, as it
49 was not thread safe, and made assumptions about pattern sizes. Also, it caused
50 test 7 to loop, and test 9 to crash with a segfault.
51
52 The issue is the check for duplicate states, which is done by a simple linear
53 search up the state list. (Grep for "duplicate" below to find the code.) For
54 many patterns, there will never be many states active at one time, so a simple
55 linear search is fine. In patterns that have many active states, it might be a
56 bottleneck. The suggested code used an indexing scheme to remember which states
57 had previously been used for each character, and avoided the linear search when
58 it knew there was no chance of a duplicate. This was implemented when adding
59 states to the state lists.
60
61 I wrote some thread-safe, not-limited code to try something similar at the time
62 of checking for duplicates (instead of when adding states), using index vectors
63 on the stack. It did give a 13% improvement with one specially constructed
64 pattern for certain subject strings, but on other strings and on many of the
65 simpler patterns in the test suite it did worse. The major problem, I think,
66 was the extra time to initialize the index. This had to be done for each call
67 of internal_dfa_exec(). (The supplied patch used a static vector, initialized
68 only once - I suspect this was the cause of the problems with the tests.)
69
70 Overall, I concluded that the gains in some cases did not outweigh the losses
71 in others, so I abandoned this code. */
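
/* Illustration only, not part of PCRE: a minimal sketch of the simple linear
duplicate check described in the note above. The names demo_state and
is_duplicate_state() are hypothetical; in this module the check is written
inline inside internal_dfa_exec(). */

```c
#include <assert.h>

/* Hypothetical stand-in for the state record used by the matcher. */
typedef struct demo_state {
  int offset;   /* Offset to opcode */
  int count;    /* Count for repeats */
} demo_state;

/* Return 1 if states[i] repeats an earlier entry (same offset and count),
otherwise 0. The cost is linear in i, which is cheap when few states are
active and can become a bottleneck when many are. */
static int
is_duplicate_state(const demo_state *states, int i)
{
int j;
for (j = 0; j < i; j++)
  {
  if (states[j].offset == states[i].offset &&
      states[j].count == states[i].count)
    return 1;
  }
return 0;
}
```

/* A caller would skip any state for which is_duplicate_state() returns 1. */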
72
73
74
75 #ifdef HAVE_CONFIG_H
76 #include "config.h"
77 #endif
78
79 #define NLBLOCK md /* Block containing newline information */
80 #define PSSTART start_subject /* Field containing processed string start */
81 #define PSEND end_subject /* Field containing processed string end */
82
83 #include "pcre_internal.h"
84
85
86 /* For use to indent debugging output */
87
88 #define SP " "
89
90
91 /*************************************************
92 * Code parameters and static tables *
93 *************************************************/
94
95 /* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
96 into others, under special conditions. A gap of 20 between the blocks should be
97 enough. The resulting opcodes don't have to be less than 256 because they are
98 never stored, so we push them well clear of the normal opcodes. */
99
100 #define OP_PROP_EXTRA 300
101 #define OP_EXTUNI_EXTRA 320
102 #define OP_ANYNL_EXTRA 340
103 #define OP_HSPACE_EXTRA 360
104 #define OP_VSPACE_EXTRA 380
105
106
107 /* This table identifies those opcodes that are followed immediately by a
108 character that is to be tested in some way. This makes it possible to
109 centralize the loading of these characters. In the case of Type * etc, the
110 "character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
111 small value. Non-zero values in the table are the offsets from the opcode where
112 the character is to be found. ***NOTE*** If the start of this table is
113 modified, the three tables that follow must also be modified. */
114
115 static const pcre_uint8 coptable[] = {
116 0, /* End */
117 0, 0, 0, 0, 0, /* \A, \G, \K, \B, \b */
118 0, 0, 0, 0, 0, 0, /* \D, \d, \S, \s, \W, \w */
119 0, 0, 0, /* Any, AllAny, Anybyte */
120 0, 0, /* \P, \p */
121 0, 0, 0, 0, 0, /* \R, \H, \h, \V, \v */
122 0, /* \X */
123 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
124 1, /* Char */
125 1, /* Chari */
126 1, /* not */
127 1, /* noti */
128 /* Positive single-char repeats */
129 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
130 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto, minupto */
131 1+IMM2_SIZE, /* exact */
132 1, 1, 1, 1+IMM2_SIZE, /* *+, ++, ?+, upto+ */
133 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
134 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto I, minupto I */
135 1+IMM2_SIZE, /* exact I */
136 1, 1, 1, 1+IMM2_SIZE, /* *+I, ++I, ?+I, upto+I */
137 /* Negative single-char repeats - only for chars < 256 */
138 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
139 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto, minupto */
140 1+IMM2_SIZE, /* NOT exact */
141 1, 1, 1, 1+IMM2_SIZE, /* NOT *+, ++, ?+, upto+ */
142 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
143 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto I, minupto I */
144 1+IMM2_SIZE, /* NOT exact I */
145 1, 1, 1, 1+IMM2_SIZE, /* NOT *+I, ++I, ?+I, upto+I */
146 /* Positive type repeats */
147 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
148 1+IMM2_SIZE, 1+IMM2_SIZE, /* Type upto, minupto */
149 1+IMM2_SIZE, /* Type exact */
150 1, 1, 1, 1+IMM2_SIZE, /* Type *+, ++, ?+, upto+ */
151 /* Character class & ref repeats */
152 0, 0, 0, 0, 0, 0, /* *, *?, +, +?, ?, ?? */
153 0, 0, /* CRRANGE, CRMINRANGE */
154 0, /* CLASS */
155 0, /* NCLASS */
156 0, /* XCLASS - variable length */
157 0, /* REF */
158 0, /* REFI */
159 0, /* RECURSE */
160 0, /* CALLOUT */
161 0, /* Alt */
162 0, /* Ket */
163 0, /* KetRmax */
164 0, /* KetRmin */
165 0, /* KetRpos */
166 0, /* Reverse */
167 0, /* Assert */
168 0, /* Assert not */
169 0, /* Assert behind */
170 0, /* Assert behind not */
171 0, 0, /* ONCE, ONCE_NC */
172 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
173 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
174 0, 0, /* CREF, NCREF */
175 0, 0, /* RREF, NRREF */
176 0, /* DEF */
177 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
178 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
179 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
180 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
181 0, 0 /* CLOSE, SKIPZERO */
182 };
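
/* Illustration only, not part of PCRE: a sketch of how a table such as
coptable lets the inline character be loaded in one central place. The three
DEMO_ opcodes, DEMO_IMM2_SIZE and demo_inline_char() are hypothetical, and the
sketch handles single-unit characters only (no UTF decoding). */

```c
#include <assert.h>

/* Hypothetical miniature opcode set. */
enum { DEMO_OP_END, DEMO_OP_CHAR, DEMO_OP_UPTO };

#define DEMO_IMM2_SIZE 2   /* Assumed size of an inline repeat count */

/* Offset from the opcode to its inline character; zero means "none". */
static const unsigned char demo_coptable[] = {
  0,                   /* End */
  1,                   /* Char */
  1 + DEMO_IMM2_SIZE   /* upto: the count comes first, then the character */
};

/* Return the inline character for the opcode at code, or -1 if it has none. */
static int
demo_inline_char(const unsigned char *code)
{
int off = demo_coptable[*code];
return (off > 0)? code[off] : -1;
}
```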
183
184 /* This table identifies those opcodes that inspect a character. It is used to
185 remember the fact that a character could have been inspected when the end of
186 the subject is reached. ***NOTE*** If the start of this table is modified, the
187 two tables that follow must also be modified. */
188
189 static const pcre_uint8 poptable[] = {
190 0, /* End */
191 0, 0, 0, 1, 1, /* \A, \G, \K, \B, \b */
192 1, 1, 1, 1, 1, 1, /* \D, \d, \S, \s, \W, \w */
193 1, 1, 1, /* Any, AllAny, Anybyte */
194 1, 1, /* \P, \p */
195 1, 1, 1, 1, 1, /* \R, \H, \h, \V, \v */
196 1, /* \X */
197 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
198 1, /* Char */
199 1, /* Chari */
200 1, /* not */
201 1, /* noti */
202 /* Positive single-char repeats */
203 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
204 1, 1, 1, /* upto, minupto, exact */
205 1, 1, 1, 1, /* *+, ++, ?+, upto+ */
206 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
207 1, 1, 1, /* upto I, minupto I, exact I */
208 1, 1, 1, 1, /* *+I, ++I, ?+I, upto+I */
209 /* Negative single-char repeats - only for chars < 256 */
210 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
211 1, 1, 1, /* NOT upto, minupto, exact */
212 1, 1, 1, 1, /* NOT *+, ++, ?+, upto+ */
213 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
214 1, 1, 1, /* NOT upto I, minupto I, exact I */
215 1, 1, 1, 1, /* NOT *+I, ++I, ?+I, upto+I */
216 /* Positive type repeats */
217 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
218 1, 1, 1, /* Type upto, minupto, exact */
219 1, 1, 1, 1, /* Type *+, ++, ?+, upto+ */
220 /* Character class & ref repeats */
221 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
222 1, 1, /* CRRANGE, CRMINRANGE */
223 1, /* CLASS */
224 1, /* NCLASS */
225 1, /* XCLASS - variable length */
226 0, /* REF */
227 0, /* REFI */
228 0, /* RECURSE */
229 0, /* CALLOUT */
230 0, /* Alt */
231 0, /* Ket */
232 0, /* KetRmax */
233 0, /* KetRmin */
234 0, /* KetRpos */
235 0, /* Reverse */
236 0, /* Assert */
237 0, /* Assert not */
238 0, /* Assert behind */
239 0, /* Assert behind not */
240 0, 0, /* ONCE, ONCE_NC */
241 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
242 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
243 0, 0, /* CREF, NCREF */
244 0, 0, /* RREF, NRREF */
245 0, /* DEF */
246 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
247 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
248 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
249 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
250 0, 0 /* CLOSE, SKIPZERO */
251 };
252
253 /* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W,
254 and \w */
255
256 static const pcre_uint8 toptable1[] = {
257 0, 0, 0, 0, 0, 0,
258 ctype_digit, ctype_digit,
259 ctype_space, ctype_space,
260 ctype_word, ctype_word,
261 0, 0 /* OP_ANY, OP_ALLANY */
262 };
263
264 static const pcre_uint8 toptable2[] = {
265 0, 0, 0, 0, 0, 0,
266 ctype_digit, 0,
267 ctype_space, 0,
268 ctype_word, 0,
269 1, 1 /* OP_ANY, OP_ALLANY */
270 };
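
/* Illustration only, not part of PCRE: a sketch of one way the two tables
above can give a compact test. The code that consults them is not shown in
this part of the file, so the mask-then-XOR expression here is an assumption.
The DEMO_ opcode numbers and ctype bits are hypothetical, and demo_ctype()
stands in for the ctypes table. Newline handling for OP_ANY is ignored. */

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical opcode numbers, laid out to match the table positions. */
enum { DEMO_NOT_DIGIT = 6, DEMO_DIGIT, DEMO_NOT_SPACE, DEMO_SPACE,
       DEMO_NOT_WORD, DEMO_WORD, DEMO_ANY, DEMO_ALLANY };

/* Hypothetical character-type bits. */
#define DEMO_CT_SPACE 0x01
#define DEMO_CT_DIGIT 0x04
#define DEMO_CT_WORD  0x10

static const unsigned char demo_top1[] = {
  0, 0, 0, 0, 0, 0,
  DEMO_CT_DIGIT, DEMO_CT_DIGIT,
  DEMO_CT_SPACE, DEMO_CT_SPACE,
  DEMO_CT_WORD,  DEMO_CT_WORD,
  0, 0 };

static const unsigned char demo_top2[] = {
  0, 0, 0, 0, 0, 0,
  DEMO_CT_DIGIT, 0,
  DEMO_CT_SPACE, 0,
  DEMO_CT_WORD,  0,
  1, 1 };

/* Build the type bits for one character (stand-in for a ctypes lookup). */
static unsigned char
demo_ctype(unsigned char c)
{
unsigned char bits = 0;
if (isspace(c)) bits |= DEMO_CT_SPACE;
if (isdigit(c)) bits |= DEMO_CT_DIGIT;
if (isalnum(c) || c == '_') bits |= DEMO_CT_WORD;
return bits;
}

/* The first table selects the bit to look at; the second flips the result
for the negated types and forces a match for the "any" types. */
static int
demo_type_matches(int op, unsigned char c)
{
return ((demo_ctype(c) & demo_top1[op]) ^ demo_top2[op]) != 0;
}
```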
271
272
273 /* Structure for holding data about a particular state, which is in effect the
274 current data for an active path through the match tree. It must consist
275 entirely of ints because the working vector we are passed, and which we put
276 these structures in, is a vector of ints. */
277
278 typedef struct stateblock {
279 int offset; /* Offset to opcode */
280 int count; /* Count for repeats */
281 int data; /* Some use extra data */
282 } stateblock;
283
284 #define INTS_PER_STATEBLOCK (int)(sizeof(stateblock)/sizeof(int))
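
/* Illustration only, not part of PCRE: a sketch of how an int workspace can
be divided into two equal state vectors, following the arithmetic used at the
start of internal_dfa_exec() below. The first two ints are reserved (one
records which vector is current, the other the active state count). The names
demo_stateblock and demo_states_per_vector() are hypothetical, and the guard
for very small workspaces is an addition made for this sketch. */

```c
#include <assert.h>

typedef struct demo_stateblock {
  int offset;
  int count;
  int data;
} demo_stateblock;

#define DEMO_INTS_PER_STATEBLOCK (int)(sizeof(demo_stateblock)/sizeof(int))

/* Given the workspace size in ints, return how many states fit in each of
the two vectors after the two reserved ints have been set aside. */
static int
demo_states_per_vector(int wscount)
{
wscount -= 2;                     /* Two ints are reserved at the front */
if (wscount < 0) return 0;        /* Guard added for this sketch only */
return (wscount - (wscount % (DEMO_INTS_PER_STATEBLOCK * 2))) /
       (2 * DEMO_INTS_PER_STATEBLOCK);
}
```

/* With three ints per state, a workspace of 20 ints leaves room for three
states in each vector; the active vector would start at workspace + 2 and the
new vector three states after it. */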
285
286
287 #ifdef PCRE_DEBUG
288 /*************************************************
289 * Print character string *
290 *************************************************/
291
292 /* Character string printing function for debugging.
293
294 Arguments:
295 p points to string
296 length number of code units (bytes in the 8-bit library)
297 f where to print
298
299 Returns: nothing
300 */
301
302 static void
303 pchars(const pcre_uchar *p, int length, FILE *f)
304 {
305 pcre_uint32 c;
306 while (length-- > 0)
307 {
308 if (isprint(c = *(p++)))
309 fprintf(f, "%c", c);
310 else
311 fprintf(f, "\\x{%02x}", c);
312 }
313 }
314 #endif
315
316
317
318 /*************************************************
319 * Execute a Regular Expression - DFA engine *
320 *************************************************/
321
322 /* This internal function applies a compiled pattern to a subject string,
323 starting at a given point, using a DFA engine. This function is called from the
324 external one, possibly multiple times if the pattern is not anchored. The
325 function calls itself recursively for some kinds of subpattern.
326
327 Arguments:
328 md the match_data block with fixed information
329 this_start_code the opening bracket of this subexpression's code
330 current_subject where we currently are in the subject string
331 start_offset start offset in the subject string
332 offsets vector to contain the matching string offsets
333 offsetcount size of same
334 workspace vector of workspace
335 wscount size of same
336 rlevel function call recursion level
337
338 Returns: > 0 => number of match offset pairs placed in offsets
339 = 0 => offsets overflowed; longest matches are present
340 -1 => failed to match
341 < -1 => some kind of unexpected problem
342
343 The following macros are used for adding states to the two state vectors (one
344 for the current character, one for the following character). */
345
346 #define ADD_ACTIVE(x,y) \
347 if (active_count++ < wscount) \
348 { \
349 next_active_state->offset = (x); \
350 next_active_state->count = (y); \
351 next_active_state++; \
352 DPRINTF(("%.*sADD_ACTIVE(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
353 } \
354 else return PCRE_ERROR_DFA_WSSIZE
355
356 #define ADD_ACTIVE_DATA(x,y,z) \
357 if (active_count++ < wscount) \
358 { \
359 next_active_state->offset = (x); \
360 next_active_state->count = (y); \
361 next_active_state->data = (z); \
362 next_active_state++; \
363 DPRINTF(("%.*sADD_ACTIVE_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
364 } \
365 else return PCRE_ERROR_DFA_WSSIZE
366
367 #define ADD_NEW(x,y) \
368 if (new_count++ < wscount) \
369 { \
370 next_new_state->offset = (x); \
371 next_new_state->count = (y); \
372 next_new_state++; \
373 DPRINTF(("%.*sADD_NEW(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
374 } \
375 else return PCRE_ERROR_DFA_WSSIZE
376
377 #define ADD_NEW_DATA(x,y,z) \
378 if (new_count++ < wscount) \
379 { \
380 next_new_state->offset = (x); \
381 next_new_state->count = (y); \
382 next_new_state->data = (z); \
383 next_new_state++; \
384 DPRINTF(("%.*sADD_NEW_DATA(%d,%d,%d) line %d\n", rlevel*2-2, SP, \
385 (x), (y), (z), __LINE__)); \
386 } \
387 else return PCRE_ERROR_DFA_WSSIZE
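
/* Illustration only, not part of PCRE: the bounded-append pattern behind the
ADD_ macros above, written as a function so that it can be read and tested in
isolation. The names demo_state, demo_list, demo_add() and DEMO_ERROR_WSSIZE
are hypothetical, and the error value is arbitrary. As in the macros, the
counter is incremented even when the list is already full. */

```c
#include <assert.h>

#define DEMO_ERROR_WSSIZE (-1)   /* Arbitrary value for this sketch */

typedef struct demo_state {
  int offset;
  int count;
  int data;
} demo_state;

typedef struct demo_list {
  demo_state *next;   /* Where the next state will be written */
  int used;           /* Number of additions attempted so far */
  int capacity;       /* Maximum number of states the list can hold */
} demo_list;

/* Append a state, or report that the workspace is too small. */
static int
demo_add(demo_list *list, int offset, int count)
{
if (list->used++ < list->capacity)
  {
  list->next->offset = offset;
  list->next->count = count;
  list->next++;
  return 0;
  }
return DEMO_ERROR_WSSIZE;
}
```

/* The macros return from the enclosing function on overflow; a function-based
version like this one would need the caller to check the result instead. */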
388
389 /* And now, here is the code */
390
391 static int
392 internal_dfa_exec(
393 dfa_match_data *md,
394 const pcre_uchar *this_start_code,
395 const pcre_uchar *current_subject,
396 int start_offset,
397 int *offsets,
398 int offsetcount,
399 int *workspace,
400 int wscount,
401 int rlevel)
402 {
403 stateblock *active_states, *new_states, *temp_states;
404 stateblock *next_active_state, *next_new_state;
405
406 const pcre_uint8 *ctypes, *lcc, *fcc;
407 const pcre_uchar *ptr;
408 const pcre_uchar *end_code, *first_op;
409
410 dfa_recursion_info new_recursive;
411
412 int active_count, new_count, match_count;
413
414 /* Some fields in the md block are frequently referenced, so we load them into
415 independent variables in the hope that this will perform better. */
416
417 const pcre_uchar *start_subject = md->start_subject;
418 const pcre_uchar *end_subject = md->end_subject;
419 const pcre_uchar *start_code = md->start_code;
420
421 #ifdef SUPPORT_UTF
422 BOOL utf = (md->poptions & PCRE_UTF8) != 0;
423 #else
424 BOOL utf = FALSE;
425 #endif
426
427 BOOL reset_could_continue = FALSE;
428
429 rlevel++;
430 offsetcount &= (-2);
431
432 wscount -= 2;
433 wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
434 (2 * INTS_PER_STATEBLOCK);
435
436 DPRINTF(("\n%.*s---------------------\n"
437 "%.*sCall to internal_dfa_exec f=%d\n",
438 rlevel*2-2, SP, rlevel*2-2, SP, rlevel));
439
440 ctypes = md->tables + ctypes_offset;
441 lcc = md->tables + lcc_offset;
442 fcc = md->tables + fcc_offset;
443
444 match_count = PCRE_ERROR_NOMATCH; /* A negative number */
445
446 active_states = (stateblock *)(workspace + 2);
447 next_new_state = new_states = active_states + wscount;
448 new_count = 0;
449
450 first_op = this_start_code + 1 + LINK_SIZE +
451 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
452 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
453 ? IMM2_SIZE:0);
454
455 /* The first thing in any (sub) pattern is a bracket of some sort. Push all
456 the alternative states onto the list, and find out where the end is. This
457 makes it possible to use this function recursively, when we want to stop at a
458 matching internal ket rather than at the end.
459
460 If the first opcode in the first alternative is OP_REVERSE, we are dealing with
461 a backward assertion. In that case, we have to find out the maximum amount to
462 move back, and set up each alternative appropriately. */
463
464 if (*first_op == OP_REVERSE)
465 {
466 int max_back = 0;
467 int gone_back;
468
469 end_code = this_start_code;
470 do
471 {
472 int back = GET(end_code, 2+LINK_SIZE);
473 if (back > max_back) max_back = back;
474 end_code += GET(end_code, 1);
475 }
476 while (*end_code == OP_ALT);
477
478 /* If we can't go back the amount required for the longest lookbehind
479 pattern, go back as far as we can; some alternatives may still be viable. */
480
481 #ifdef SUPPORT_UTF
482 /* In character mode we have to step back character by character */
483
484 if (utf)
485 {
486 for (gone_back = 0; gone_back < max_back; gone_back++)
487 {
488 if (current_subject <= start_subject) break;
489 current_subject--;
490 ACROSSCHAR(current_subject > start_subject, *current_subject, current_subject--);
491 }
492 }
493 else
494 #endif
495
496 /* In byte-mode we can do this quickly. */
497
498 {
499 gone_back = (current_subject - max_back < start_subject)?
500 (int)(current_subject - start_subject) : max_back;
501 current_subject -= gone_back;
502 }
503
504 /* Save the earliest consulted character */
505
506 if (current_subject < md->start_used_ptr)
507 md->start_used_ptr = current_subject;
508
509 /* Now we can process the individual branches. */
510
511 end_code = this_start_code;
512 do
513 {
514 int back = GET(end_code, 2+LINK_SIZE);
515 if (back <= gone_back)
516 {
517 int bstate = (int)(end_code - start_code + 2 + 2*LINK_SIZE);
518 ADD_NEW_DATA(-bstate, 0, gone_back - back);
519 }
520 end_code += GET(end_code, 1);
521 }
522 while (*end_code == OP_ALT);
523 }
524
525 /* This is the code for a "normal" subpattern (not a backward assertion). The
526 start of a whole pattern is always one of these. If we are at the top level,
527 we may be asked to restart matching from the same point that we reached for a
528 previous partial match. We still have to scan through the top-level branches to
529 find the end state. */
530
531 else
532 {
533 end_code = this_start_code;
534
535 /* Restarting */
536
537 if (rlevel == 1 && (md->moptions & PCRE_DFA_RESTART) != 0)
538 {
539 do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
540 new_count = workspace[1];
541 if (!workspace[0])
542 memcpy(new_states, active_states, new_count * sizeof(stateblock));
543 }
544
545 /* Not restarting */
546
547 else
548 {
549 int length = 1 + LINK_SIZE +
550 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
551 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
552 ? IMM2_SIZE:0);
553 do
554 {
555 ADD_NEW((int)(end_code - start_code + length), 0);
556 end_code += GET(end_code, 1);
557 length = 1 + LINK_SIZE;
558 }
559 while (*end_code == OP_ALT);
560 }
561 }
562
563 workspace[0] = 0; /* Bit indicating which vector is current */
564
565 DPRINTF(("%.*sEnd state = %d\n", rlevel*2-2, SP, (int)(end_code - start_code)));
566
567 /* Loop for scanning the subject */
568
569 ptr = current_subject;
570 for (;;)
571 {
572 int i, j;
573 int clen, dlen;
574 pcre_uint32 c, d;
575 int forced_fail = 0;
576 BOOL partial_newline = FALSE;
577 BOOL could_continue = reset_could_continue;
578 reset_could_continue = FALSE;
579
580 /* Make the new state list into the active state list and empty the
581 new state list. */
582
583 temp_states = active_states;
584 active_states = new_states;
585 new_states = temp_states;
586 active_count = new_count;
587 new_count = 0;
588
589 workspace[0] ^= 1; /* Remember for the restarting feature */
590 workspace[1] = active_count;
591
592 #ifdef PCRE_DEBUG
593 printf("%.*sNext character: rest of subject = \"", rlevel*2-2, SP);
594 pchars(ptr, STRLEN_UC(ptr), stdout);
595 printf("\"\n");
596
597 printf("%.*sActive states: ", rlevel*2-2, SP);
598 for (i = 0; i < active_count; i++)
599 printf("%d/%d ", active_states[i].offset, active_states[i].count);
600 printf("\n");
601 #endif
602
603 /* Set the pointers for adding new states */
604
605 next_active_state = active_states + active_count;
606 next_new_state = new_states;
607
608 /* Load the current character from the subject outside the loop, as many
609 different states may want to look at it, and we assume that at least one
610 will. */
611
612 if (ptr < end_subject)
613 {
614 clen = 1; /* Number of data items in the character */
615 #ifdef SUPPORT_UTF
616 GETCHARLENTEST(c, ptr, clen);
617 #else
618 c = *ptr;
619 #endif /* SUPPORT_UTF */
620 }
621 else
622 {
623 clen = 0; /* This indicates the end of the subject */
624 c = NOTACHAR; /* This value should never actually be used */
625 }
626
627 /* Scan up the active states and act on each one. The result of an action
628 may be to add more states to the currently active list (e.g. on hitting a
629 parenthesis) or it may be to put states on the new list, for considering
630 when we move the character pointer on. */
631
632 for (i = 0; i < active_count; i++)
633 {
634 stateblock *current_state = active_states + i;
635 BOOL caseless = FALSE;
636 const pcre_uchar *code;
637 int state_offset = current_state->offset;
638 int count, codevalue, rrc;
639
640 #ifdef PCRE_DEBUG
641 printf ("%.*sProcessing state %d c=", rlevel*2-2, SP, state_offset);
642 if (clen == 0) printf("EOL\n");
643 else if (c > 32 && c < 127) printf("'%c'\n", c);
644 else printf("0x%02x\n", c);
645 #endif
646
647 /* A negative offset is a special case meaning "hold off going to this
648 (negated) state until the number of characters in the data field have
649 been skipped". If the could_continue flag was passed over from a previous
650 state, arrange for it to be passed on. */
651
652 if (state_offset < 0)
653 {
654 if (current_state->data > 0)
655 {
656 DPRINTF(("%.*sSkipping this character\n", rlevel*2-2, SP));
657 ADD_NEW_DATA(state_offset, current_state->count,
658 current_state->data - 1);
659 if (could_continue) reset_could_continue = TRUE;
660 continue;
661 }
662 else
663 {
664 current_state->offset = state_offset = -state_offset;
665 }
666 }
667
668 /* Check for a duplicate state with the same count, and skip if found.
669 See the note at the head of this module about the possibility of improving
670 performance here. */
671
672 for (j = 0; j < i; j++)
673 {
674 if (active_states[j].offset == state_offset &&
675 active_states[j].count == current_state->count)
676 {
677 DPRINTF(("%.*sDuplicate state: skipped\n", rlevel*2-2, SP));
678 goto NEXT_ACTIVE_STATE;
679 }
680 }
681
682 /* The state offset is the offset to the opcode */
683
684 code = start_code + state_offset;
685 codevalue = *code;
686
687 /* If this opcode inspects a character, but we are at the end of the
688 subject, remember the fact for use when testing for a partial match. */
689
690 if (clen == 0 && poptable[codevalue] != 0)
691 could_continue = TRUE;
692
693 /* If this opcode is followed by an inline character, load it. It is
694 tempting to test for the presence of a subject character here, but that
695 is wrong, because sometimes zero repetitions of the subject are
696 permitted.
697
698 We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
699 argument that is not a data character - but is always one byte long because
700 the values are small. We have to take special action to deal with \P, \p,
701 \H, \h, \V, \v and \X in this case. To keep the other cases fast, convert
702 these ones to new opcodes. */
703
704 if (coptable[codevalue] > 0)
705 {
706 dlen = 1;
707 #ifdef SUPPORT_UTF
708 if (utf) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
709 #endif /* SUPPORT_UTF */
710 d = code[coptable[codevalue]];
711 if (codevalue >= OP_TYPESTAR)
712 {
713 switch(d)
714 {
715 case OP_ANYBYTE: return PCRE_ERROR_DFA_UITEM;
716 case OP_NOTPROP:
717 case OP_PROP: codevalue += OP_PROP_EXTRA; break;
718 case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
719 case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
720 case OP_NOT_HSPACE:
721 case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
722 case OP_NOT_VSPACE:
723 case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
724 default: break;
725 }
726 }
727 }
728 else
729 {
730 dlen = 0; /* Not strictly necessary, but compilers moan */
731 d = NOTACHAR; /* if these variables are not set. */
732 }
733
734
735 /* Now process the individual opcodes */
736
737 switch (codevalue)
738 {
739 /* ========================================================================== */
740 /* These cases are never obeyed. This is a fudge that causes a compile-
741 time error if the vectors coptable or poptable, which are indexed by
742 opcode, are not the correct length. It seems to be the only way to do
743 such a check at compile time, as the sizeof() operator does not work
744 in the C preprocessor. */
745
746 case OP_TABLE_LENGTH:
747 case OP_TABLE_LENGTH +
748 ((sizeof(coptable) == OP_TABLE_LENGTH) &&
749 (sizeof(poptable) == OP_TABLE_LENGTH)):
750 break;
751
752 /* ========================================================================== */
753 /* Reached a closing bracket. If not at the end of the pattern, carry
754 on with the next opcode. For repeating opcodes, also add the repeat
755 state. Note that KETRPOS will always be encountered at the end of the
756 subpattern, because the possessive subpattern repeats are always handled
757 using recursive calls. Thus, it never adds any new states.
758
759 At the end of the (sub)pattern, unless we have an empty string and
760 PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
761 start of the subject, save the match data, shifting up all previous
762 matches so we always have the longest first. */
763
764 case OP_KET:
765 case OP_KETRMIN:
766 case OP_KETRMAX:
767 case OP_KETRPOS:
768 if (code != end_code)
769 {
770 ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
771 if (codevalue != OP_KET)
772 {
773 ADD_ACTIVE(state_offset - GET(code, 1), 0);
774 }
775 }
776 else
777 {
778 if (ptr > current_subject ||
779 ((md->moptions & PCRE_NOTEMPTY) == 0 &&
780 ((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
781 current_subject > start_subject + md->start_offset)))
782 {
783 if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
784 else if (match_count > 0 && ++match_count * 2 > offsetcount)
785 match_count = 0;
786 count = ((match_count == 0)? offsetcount : match_count * 2) - 2;
787 if (count > 0) memmove(offsets + 2, offsets, count * sizeof(int));
788 if (offsetcount >= 2)
789 {
790 offsets[0] = (int)(current_subject - start_subject);
791 offsets[1] = (int)(ptr - start_subject);
792 DPRINTF(("%.*sSet matched string = \"%.*s\"\n", rlevel*2-2, SP,
793 offsets[1] - offsets[0], (char *)current_subject));
794 }
795 if ((md->moptions & PCRE_DFA_SHORTEST) != 0)
796 {
797 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
798 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel,
799 match_count, rlevel*2-2, SP));
800 return match_count;
801 }
802 }
803 }
804 break;
805
806 /* ========================================================================== */
807 /* These opcodes add to the current list of states without looking
808 at the current character. */
809
810 /*-----------------------------------------------------------------*/
811 case OP_ALT:
812 do { code += GET(code, 1); } while (*code == OP_ALT);
813 ADD_ACTIVE((int)(code - start_code), 0);
814 break;
815
816 /*-----------------------------------------------------------------*/
817 case OP_BRA:
818 case OP_SBRA:
819 do
820 {
821 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
822 code += GET(code, 1);
823 }
824 while (*code == OP_ALT);
825 break;
826
827 /*-----------------------------------------------------------------*/
828 case OP_CBRA:
829 case OP_SCBRA:
830 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE + IMM2_SIZE), 0);
831 code += GET(code, 1);
832 while (*code == OP_ALT)
833 {
834 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
835 code += GET(code, 1);
836 }
837 break;
838
839 /*-----------------------------------------------------------------*/
840 case OP_BRAZERO:
841 case OP_BRAMINZERO:
842 ADD_ACTIVE(state_offset + 1, 0);
843 code += 1 + GET(code, 2);
844 while (*code == OP_ALT) code += GET(code, 1);
845 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
846 break;
847
848 /*-----------------------------------------------------------------*/
849 case OP_SKIPZERO:
850 code += 1 + GET(code, 2);
851 while (*code == OP_ALT) code += GET(code, 1);
852 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
853 break;
854
855 /*-----------------------------------------------------------------*/
856 case OP_CIRC:
857 if (ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0)
858 { ADD_ACTIVE(state_offset + 1, 0); }
859 break;
860
861 /*-----------------------------------------------------------------*/
862 case OP_CIRCM:
863 if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
864 (ptr != end_subject && WAS_NEWLINE(ptr)))
865 { ADD_ACTIVE(state_offset + 1, 0); }
866 break;
867
868 /*-----------------------------------------------------------------*/
869 case OP_EOD:
870 if (ptr >= end_subject)
871 {
872 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
873 could_continue = TRUE;
874 else { ADD_ACTIVE(state_offset + 1, 0); }
875 }
876 break;
877
878 /*-----------------------------------------------------------------*/
879 case OP_SOD:
880 if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
881 break;
882
883 /*-----------------------------------------------------------------*/
884 case OP_SOM:
885 if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
886 break;
887
888
889 /* ========================================================================== */
890 /* These opcodes inspect the next subject character, and sometimes
891 the previous one as well, but do not have an argument. The variable
892 clen contains the length of the current character and is zero if we are
893 at the end of the subject. */
894
895 /*-----------------------------------------------------------------*/
896 case OP_ANY:
897 if (clen > 0 && !IS_NEWLINE(ptr))
898 {
899 if (ptr + 1 >= md->end_subject &&
900 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
901 NLBLOCK->nltype == NLTYPE_FIXED &&
902 NLBLOCK->nllen == 2 &&
903 c == NLBLOCK->nl[0])
904 {
905 could_continue = partial_newline = TRUE;
906 }
907 else
908 {
909 ADD_NEW(state_offset + 1, 0);
910 }
911 }
912 break;
913
914 /*-----------------------------------------------------------------*/
915 case OP_ALLANY:
916 if (clen > 0)
917 { ADD_NEW(state_offset + 1, 0); }
918 break;
919
920 /*-----------------------------------------------------------------*/
921 case OP_EODN:
922 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
923 could_continue = TRUE;
924 else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
925 { ADD_ACTIVE(state_offset + 1, 0); }
926 break;
927
928 /*-----------------------------------------------------------------*/
929 case OP_DOLL:
930 if ((md->moptions & PCRE_NOTEOL) == 0)
931 {
932 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
933 could_continue = TRUE;
934 else if (clen == 0 ||
935 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
936 (ptr == end_subject - md->nllen)
937 ))
938 { ADD_ACTIVE(state_offset + 1, 0); }
939 else if (ptr + 1 >= md->end_subject &&
940 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
941 NLBLOCK->nltype == NLTYPE_FIXED &&
942 NLBLOCK->nllen == 2 &&
943 c == NLBLOCK->nl[0])
944 {
945 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
946 {
947 reset_could_continue = TRUE;
948 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
949 }
950 else could_continue = partial_newline = TRUE;
951 }
952 }
953 break;
954
955 /*-----------------------------------------------------------------*/
956 case OP_DOLLM:
957 if ((md->moptions & PCRE_NOTEOL) == 0)
958 {
959 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
960 could_continue = TRUE;
961 else if (clen == 0 ||
962 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
963 { ADD_ACTIVE(state_offset + 1, 0); }
964 else if (ptr + 1 >= md->end_subject &&
965 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
966 NLBLOCK->nltype == NLTYPE_FIXED &&
967 NLBLOCK->nllen == 2 &&
968 c == NLBLOCK->nl[0])
969 {
970 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
971 {
972 reset_could_continue = TRUE;
973 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
974 }
975 else could_continue = partial_newline = TRUE;
976 }
977 }
978 else if (IS_NEWLINE(ptr))
979 { ADD_ACTIVE(state_offset + 1, 0); }
980 break;
981
982 /*-----------------------------------------------------------------*/
983
984 case OP_DIGIT:
985 case OP_WHITESPACE:
986 case OP_WORDCHAR:
987 if (clen > 0 && c < 256 &&
988 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
989 { ADD_NEW(state_offset + 1, 0); }
990 break;
991
992 /*-----------------------------------------------------------------*/
993 case OP_NOT_DIGIT:
994 case OP_NOT_WHITESPACE:
995 case OP_NOT_WORDCHAR:
996 if (clen > 0 && (c >= 256 ||
997 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
998 { ADD_NEW(state_offset + 1, 0); }
999 break;
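/* Editorial illustration, not part of PCRE. The character-type tests above
   combine a class-bit mask (toptable1) with an XOR value (toptable2), so that
   a type and its negation can share one expression. The real tables are
   defined elsewhere in this file; what follows is only a minimal standalone
   model of the idea. The names CT_DIGIT, mask_tab, flip_tab, class_bits and
   type_matches, the opcode numbering, and the table values are all
   assumptions made for this sketch.

```c
#include <ctype.h>

// Hypothetical opcode numbering and class bit, for illustration only.
enum { T_NOT_DIGIT, T_DIGIT, T_ANY, T_COUNT };
#define CT_DIGIT 0x04

// mask_tab selects the class bit to look at; flip_tab inverts the result
// for the negated type and forces a nonzero result for "any".
static const unsigned char mask_tab[T_COUNT] = { CT_DIGIT, CT_DIGIT, 0 };
static const unsigned char flip_tab[T_COUNT] = { CT_DIGIT, 0,        1 };

// Stand-in for a ctypes[] lookup; values of 256 and above carry no class bits.
static unsigned char class_bits(unsigned int c)
{
  return (c < 256 && isdigit((int)c)) ? CT_DIGIT : 0;
}

// Same shape as the test in the listing: mask, then XOR, then compare with 0.
static int type_matches(int op, unsigned int c)
{
  return ((class_bits(c) & mask_tab[op]) ^ flip_tab[op]) != 0;
}
```

   Under these assumed tables, type_matches(T_DIGIT, '7') and
   type_matches(T_NOT_DIGIT, 'x') are nonzero, while
   type_matches(T_DIGIT, 'x') and type_matches(T_NOT_DIGIT, '7') are zero. */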
1000
1001 /*-----------------------------------------------------------------*/
1002 case OP_WORD_BOUNDARY:
1003 case OP_NOT_WORD_BOUNDARY:
1004 {
1005 int left_word, right_word;
1006
1007 if (ptr > start_subject)
1008 {
1009 const pcre_uchar *temp = ptr - 1;
1010 if (temp < md->start_used_ptr) md->start_used_ptr = temp;
1011 #if defined SUPPORT_UTF && !defined COMPILE_PCRE32
1012 if (utf) { BACKCHAR(temp); }
1013 #endif
1014 GETCHARTEST(d, temp);
1015 #ifdef SUPPORT_UCP
1016 if ((md->poptions & PCRE_UCP) != 0)
1017 {
1018 if (d == '_') left_word = TRUE; else
1019 {
1020 int cat = UCD_CATEGORY(d);
1021 left_word = (cat == ucp_L || cat == ucp_N);
1022 }
1023 }
1024 else
1025 #endif
1026 left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
1027 }
1028 else left_word = FALSE;
1029
1030 if (clen > 0)
1031 {
1032 #ifdef SUPPORT_UCP
1033 if ((md->poptions & PCRE_UCP) != 0)
1034 {
1035 if (c == '_') right_word = TRUE; else
1036 {
1037 int cat = UCD_CATEGORY(c);
1038 right_word = (cat == ucp_L || cat == ucp_N);
1039 }
1040 }
1041 else
1042 #endif
1043 right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
1044 }
1045 else right_word = FALSE;
1046
1047 if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
1048 { ADD_ACTIVE(state_offset + 1, 0); }
1049 }
1050 break;
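/* Editorial illustration, not part of PCRE. The word-boundary case above
   computes whether the characters on each side of the current position are
   word characters, then compares "both sides agree" with "this is the negated
   opcode". The sketch below models that comparison on a plain byte string.
   The helpers is_word and boundary_holds are hypothetical, and the ASCII-only
   notion of a word character is a simplifying assumption.

```c
#include <ctype.h>
#include <stddef.h>

// Hypothetical helper: ASCII letters, digits and underscore count as "word".
static int is_word(unsigned char ch)
{
  return isalnum(ch) || ch == '_';
}

// Returns 1 if the assertion holds at offset pos in s[0..len).
// negated == 0 models a word boundary, negated != 0 models a non-boundary.
static int boundary_holds(const char *s, size_t len, size_t pos, int negated)
{
  // Outside the subject there is no character, so that side is "not word".
  int left_word  = pos > 0   && is_word((unsigned char)s[pos - 1]);
  int right_word = pos < len && is_word((unsigned char)s[pos]);

  // Equal sides mean "no boundary"; that must agree with the negation flag.
  return (left_word == right_word) == (negated != 0);
}
```

   For the subject "ab cd", boundary_holds(s, 5, 0, 0) and
   boundary_holds(s, 5, 2, 0) are 1, and boundary_holds(s, 5, 1, 1) is 1
   because position 1 lies inside a word. */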
1051
1052
1053 /*-----------------------------------------------------------------*/
1054 /* Check the next character by Unicode property. We will get here only
1055 if the support is in the binary; otherwise the pattern is rejected when it
1056 is compiled. */
1057
1058 #ifdef SUPPORT_UCP
1059 case OP_PROP:
1060 case OP_NOTPROP:
1061 if (clen > 0)
1062 {
1063 BOOL OK;
1064 const pcre_uint32 *cp;
1065 const ucd_record * prop = GET_UCD(c);
1066 switch(code[1])
1067 {
1068 case PT_ANY:
1069 OK = TRUE;
1070 break;
1071
1072 case PT_LAMP:
1073 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1074 prop->chartype == ucp_Lt;
1075 break;
1076
1077 case PT_GC:
1078 OK = PRIV(ucp_gentype)[prop->chartype] == code[2];
1079 break;
1080
1081 case PT_PC:
1082 OK = prop->chartype == code[2];
1083 break;
1084
1085 case PT_SC:
1086 OK = prop->script == code[2];
1087 break;
1088
1089 /* These are specials for combination cases. */
1090
1091 case PT_ALNUM:
1092 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1093 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1094 break;
1095
1096 case PT_SPACE: /* Perl space */
1097 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1098 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1099 break;
1100
1101 case PT_PXSPACE: /* POSIX space */
1102 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1103 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1104 c == CHAR_FF || c == CHAR_CR;
1105 break;
1106
1107 case PT_WORD:
1108 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1109 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1110 c == CHAR_UNDERSCORE;
1111 break;
1112
1113 case PT_CLIST:
1114 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1115 for (;;)
1116 {
1117 if (c < *cp) { OK = FALSE; break; }
1118 if (c == *cp++) { OK = TRUE; break; }
1119 }
1120 break;
1121
1122 /* Should never occur, but keep compilers from grumbling. */
1123
1124 default:
1125 OK = codevalue != OP_PROP;
1126 break;
1127 }
1128
1129 if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1130 }
1131 break;
1132 #endif
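/* Editorial illustration, not part of PCRE. The PT_CLIST case above scans a
   set of code points with an unconditional loop. That loop terminates only if
   the set is sorted in ascending order and ends with a value larger than any
   character. The sketch below shows the same scan in isolation. The name
   in_sorted_set and the sentinel SET_END are assumptions made for this
   example, not identifiers from the library.

```c
#include <stdint.h>

// Assumed terminator: larger than every valid character value.
#define SET_END 0xffffffffu

// Scans an ascending, SET_END-terminated list and reports membership of c.
// Assumes c itself is smaller than SET_END.
static int in_sorted_set(const uint32_t *cp, uint32_t c)
{
  for (;;)
    {
    if (c < *cp) return 0;      // passed the slot where c would have been
    if (c == *cp++) return 1;   // found it
    }
}
```

   With the set { 0x4b, 0x6b, 0x212a, SET_END } (K, k and the Kelvin sign),
   0x6b and 0x212a are reported as members, and 0x4c is rejected as soon as
   the scan reaches 0x6b. */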
1133
1134
1135
1136 /* ========================================================================== */
1137 /* These opcodes likewise inspect the subject character, but have an
1138 argument that is not a data character. It is one of these opcodes:
1139 OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_WHITESPACE,
1140 OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
1141
1142 case OP_TYPEPLUS:
1143 case OP_TYPEMINPLUS:
1144 case OP_TYPEPOSPLUS:
1145 count = current_state->count; /* Already matched */
1146 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1147 if (clen > 0)
1148 {
1149 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1150 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1151 NLBLOCK->nltype == NLTYPE_FIXED &&
1152 NLBLOCK->nllen == 2 &&
1153 c == NLBLOCK->nl[0])
1154 {
1155 could_continue = partial_newline = TRUE;
1156 }
1157 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1158 (c < 256 &&
1159 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1160 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1161 {
1162 if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1163 {
1164 active_count--; /* Remove non-match possibility */
1165 next_active_state--;
1166 }
1167 count++;
1168 ADD_NEW(state_offset, count);
1169 }
1170 }
1171 break;
1172
1173 /*-----------------------------------------------------------------*/
1174 case OP_TYPEQUERY:
1175 case OP_TYPEMINQUERY:
1176 case OP_TYPEPOSQUERY:
1177 ADD_ACTIVE(state_offset + 2, 0);
1178 if (clen > 0)
1179 {
1180 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1181 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1182 NLBLOCK->nltype == NLTYPE_FIXED &&
1183 NLBLOCK->nllen == 2 &&
1184 c == NLBLOCK->nl[0])
1185 {
1186 could_continue = partial_newline = TRUE;
1187 }
1188 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1189 (c < 256 &&
1190 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1191 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1192 {
1193 if (codevalue == OP_TYPEPOSQUERY)
1194 {
1195 active_count--; /* Remove non-match possibility */
1196 next_active_state--;
1197 }
1198 ADD_NEW(state_offset + 2, 0);
1199 }
1200 }
1201 break;
1202
1203 /*-----------------------------------------------------------------*/
1204 case OP_TYPESTAR:
1205 case OP_TYPEMINSTAR:
1206 case OP_TYPEPOSSTAR:
1207 ADD_ACTIVE(state_offset + 2, 0);
1208 if (clen > 0)
1209 {
1210 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1211 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1212 NLBLOCK->nltype == NLTYPE_FIXED &&
1213 NLBLOCK->nllen == 2 &&
1214 c == NLBLOCK->nl[0])
1215 {
1216 could_continue = partial_newline = TRUE;
1217 }
1218 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1219 (c < 256 &&
1220 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1221 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1222 {
1223 if (codevalue == OP_TYPEPOSSTAR)
1224 {
1225 active_count--; /* Remove non-match possibility */
1226 next_active_state--;
1227 }
1228 ADD_NEW(state_offset, 0);
1229 }
1230 }
1231 break;
1232
1233 /*-----------------------------------------------------------------*/
1234 case OP_TYPEEXACT:
1235 count = current_state->count; /* Number already matched */
1236 if (clen > 0)
1237 {
1238 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1239 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1240 NLBLOCK->nltype == NLTYPE_FIXED &&
1241 NLBLOCK->nllen == 2 &&
1242 c == NLBLOCK->nl[0])
1243 {
1244 could_continue = partial_newline = TRUE;
1245 }
1246 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1247 (c < 256 &&
1248 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1249 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1250 {
1251 if (++count >= GET2(code, 1))
1252 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 1, 0); }
1253 else
1254 { ADD_NEW(state_offset, count); }
1255 }
1256 }
1257 break;
1258
1259 /*-----------------------------------------------------------------*/
1260 case OP_TYPEUPTO:
1261 case OP_TYPEMINUPTO:
1262 case OP_TYPEPOSUPTO:
1263 ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0);
1264 count = current_state->count; /* Number already matched */
1265 if (clen > 0)
1266 {
1267 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1268 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1269 NLBLOCK->nltype == NLTYPE_FIXED &&
1270 NLBLOCK->nllen == 2 &&
1271 c == NLBLOCK->nl[0])
1272 {
1273 could_continue = partial_newline = TRUE;
1274 }
1275 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1276 (c < 256 &&
1277 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1278 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1279 {
1280 if (codevalue == OP_TYPEPOSUPTO)
1281 {
1282 active_count--; /* Remove non-match possibility */
1283 next_active_state--;
1284 }
1285 if (++count >= GET2(code, 1))
1286 { ADD_NEW(state_offset + 2 + IMM2_SIZE, 0); }
1287 else
1288 { ADD_NEW(state_offset, count); }
1289 }
1290 }
1291 break;
1292
1293 /* ========================================================================== */
1294 /* These are virtual opcodes that are used when something like
1295 OP_TYPEPLUS has OP_PROP, OP_NOTPROP, OP_ANYNL, OP_EXTUNI, or a horizontal or
1296 vertical space opcode (OP_HSPACE, OP_VSPACE) as its argument. Handling them
1297 here keeps the code above fast for the other cases. The argument is in d. */
1298
1299 #ifdef SUPPORT_UCP
1300 case OP_PROP_EXTRA + OP_TYPEPLUS:
1301 case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1302 case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1303 count = current_state->count; /* Already matched */
1304 if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1305 if (clen > 0)
1306 {
1307 BOOL OK;
1308 const pcre_uint32 *cp;
1309 const ucd_record * prop = GET_UCD(c);
1310 switch(code[2])
1311 {
1312 case PT_ANY:
1313 OK = TRUE;
1314 break;
1315
1316 case PT_LAMP:
1317 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1318 prop->chartype == ucp_Lt;
1319 break;
1320
1321 case PT_GC:
1322 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1323 break;
1324
1325 case PT_PC:
1326 OK = prop->chartype == code[3];
1327 break;
1328
1329 case PT_SC:
1330 OK = prop->script == code[3];
1331 break;
1332
1333 /* These are specials for combination cases. */
1334
1335 case PT_ALNUM:
1336 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1337 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1338 break;
1339
1340 case PT_SPACE: /* Perl space */
1341 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1342 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1343 break;
1344
1345 case PT_PXSPACE: /* POSIX space */
1346 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1347 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1348 c == CHAR_FF || c == CHAR_CR;
1349 break;
1350
1351 case PT_WORD:
1352 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1353 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1354 c == CHAR_UNDERSCORE;
1355 break;
1356
1357 case PT_CLIST:
1358 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1359 for (;;)
1360 {
1361 if (c < *cp) { OK = FALSE; break; }
1362 if (c == *cp++) { OK = TRUE; break; }
1363 }
1364 break;
1365
1366 /* Should never occur, but keep compilers from grumbling. */
1367
1368 default:
1369 OK = codevalue != OP_PROP;
1370 break;
1371 }
1372
1373 if (OK == (d == OP_PROP))
1374 {
1375 if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1376 {
1377 active_count--; /* Remove non-match possibility */
1378 next_active_state--;
1379 }
1380 count++;
1381 ADD_NEW(state_offset, count);
1382 }
1383 }
1384 break;
1385
1386 /*-----------------------------------------------------------------*/
1387 case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1388 case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1389 case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1390 count = current_state->count; /* Already matched */
1391 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1392 if (clen > 0)
1393 {
1394 int lgb, rgb;
1395 const pcre_uchar *nptr = ptr + clen;
1396 int ncount = 0;
1397 if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1398 {
1399 active_count--; /* Remove non-match possibility */
1400 next_active_state--;
1401 }
1402 lgb = UCD_GRAPHBREAK(c);
1403 while (nptr < end_subject)
1404 {
1405 dlen = 1;
1406 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1407 rgb = UCD_GRAPHBREAK(d);
1408 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1409 ncount++;
1410 lgb = rgb;
1411 nptr += dlen;
1412 }
1413 count++;
1414 ADD_NEW_DATA(-state_offset, count, ncount);
1415 }
1416 break;
1417 #endif
1418
1419 /*-----------------------------------------------------------------*/
1420 case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1421 case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1422 case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1423 count = current_state->count; /* Already matched */
1424 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1425 if (clen > 0)
1426 {
1427 int ncount = 0;
1428 switch (c)
1429 {
1430 case CHAR_VT:
1431 case CHAR_FF:
1432 case CHAR_NEL:
1433 #ifndef EBCDIC
1434 case 0x2028:
1435 case 0x2029:
1436 #endif /* Not EBCDIC */
1437 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1438 goto ANYNL01;
1439
1440 case CHAR_CR:
1441 if (ptr + 1 < end_subject && RAWUCHARTEST(ptr + 1) == CHAR_LF) ncount = 1;
1442 /* Fall through */
1443
1444 ANYNL01:
1445 case CHAR_LF:
1446 if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1447 {
1448 active_count--; /* Remove non-match possibility */
1449 next_active_state--;
1450 }
1451 count++;
1452 ADD_NEW_DATA(-state_offset, count, ncount);
1453 break;
1454
1455 default:
1456 break;
1457 }
1458 }
1459 break;
1460
1461 /*-----------------------------------------------------------------*/
1462 case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1463 case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1464 case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1465 count = current_state->count; /* Already matched */
1466 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1467 if (clen > 0)
1468 {
1469 BOOL OK;
1470 switch (c)
1471 {
1472 VSPACE_CASES:
1473 OK = TRUE;
1474 break;
1475
1476 default:
1477 OK = FALSE;
1478 break;
1479 }
1480
1481 if (OK == (d == OP_VSPACE))
1482 {
1483 if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1484 {
1485 active_count--; /* Remove non-match possibility */
1486 next_active_state--;
1487 }
1488 count++;
1489 ADD_NEW_DATA(-state_offset, count, 0);
1490 }
1491 }
1492 break;
1493
1494 /*-----------------------------------------------------------------*/
1495 case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1496 case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1497 case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1498 count = current_state->count; /* Already matched */
1499 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1500 if (clen > 0)
1501 {
1502 BOOL OK;
1503 switch (c)
1504 {
1505 HSPACE_CASES:
1506 OK = TRUE;
1507 break;
1508
1509 default:
1510 OK = FALSE;
1511 break;
1512 }
1513
1514 if (OK == (d == OP_HSPACE))
1515 {
1516 if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1517 {
1518 active_count--; /* Remove non-match possibility */
1519 next_active_state--;
1520 }
1521 count++;
1522 ADD_NEW_DATA(-state_offset, count, 0);
1523 }
1524 }
1525 break;
1526
1527 /*-----------------------------------------------------------------*/
1528 #ifdef SUPPORT_UCP
1529 case OP_PROP_EXTRA + OP_TYPEQUERY:
1530 case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1531 case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
1532 count = 4;
1533 goto QS1;
1534
1535 case OP_PROP_EXTRA + OP_TYPESTAR:
1536 case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1537 case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1538 count = 0;
1539
1540 QS1:
1541
1542 ADD_ACTIVE(state_offset + 4, 0);
1543 if (clen > 0)
1544 {
1545 BOOL OK;
1546 const pcre_uint32 *cp;
1547 const ucd_record * prop = GET_UCD(c);
1548 switch(code[2])
1549 {
1550 case PT_ANY:
1551 OK = TRUE;
1552 break;
1553
1554 case PT_LAMP:
1555 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1556 prop->chartype == ucp_Lt;
1557 break;
1558
1559 case PT_GC:
1560 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1561 break;
1562
1563 case PT_PC:
1564 OK = prop->chartype == code[3];
1565 break;
1566
1567 case PT_SC:
1568 OK = prop->script == code[3];
1569 break;
1570
1571 /* These are specials for combination cases. */
1572
1573 case PT_ALNUM:
1574 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1575 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1576 break;
1577
1578 case PT_SPACE: /* Perl space */
1579 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1580 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1581 break;
1582
1583 case PT_PXSPACE: /* POSIX space */
1584 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1585 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1586 c == CHAR_FF || c == CHAR_CR;
1587 break;
1588
1589 case PT_WORD:
1590 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1591 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1592 c == CHAR_UNDERSCORE;
1593 break;
1594
1595 case PT_CLIST:
1596 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1597 for (;;)
1598 {
1599 if (c < *cp) { OK = FALSE; break; }
1600 if (c == *cp++) { OK = TRUE; break; }
1601 }
1602 break;
1603
1604 /* Should never occur, but keep compilers from grumbling. */
1605
1606 default:
1607 OK = codevalue != OP_PROP;
1608 break;
1609 }
1610
1611 if (OK == (d == OP_PROP))
1612 {
1613 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1614 codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1615 {
1616 active_count--; /* Remove non-match possibility */
1617 next_active_state--;
1618 }
1619 ADD_NEW(state_offset + count, 0);
1620 }
1621 }
1622 break;
1623
1624 /*-----------------------------------------------------------------*/
1625 case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1626 case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1627 case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1628 count = 2;
1629 goto QS2;
1630
1631 case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1632 case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1633 case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1634 count = 0;
1635
1636 QS2:
1637
1638 ADD_ACTIVE(state_offset + 2, 0);
1639 if (clen > 0)
1640 {
1641 int lgb, rgb;
1642 const pcre_uchar *nptr = ptr + clen;
1643 int ncount = 0;
1644 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1645 codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1646 {
1647 active_count--; /* Remove non-match possibility */
1648 next_active_state--;
1649 }
1650 lgb = UCD_GRAPHBREAK(c);
1651 while (nptr < end_subject)
1652 {
1653 dlen = 1;
1654 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1655 rgb = UCD_GRAPHBREAK(d);
1656 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1657 ncount++;
1658 lgb = rgb;
1659 nptr += dlen;
1660 }
1661 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1662 }
1663 break;
1664 #endif
1665
1666 /*-----------------------------------------------------------------*/
1667 case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1668 case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1669 case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1670 count = 2;
1671 goto QS3;
1672
1673 case OP_ANYNL_EXTRA + OP_TYPESTAR:
1674 case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1675 case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1676 count = 0;
1677
1678 QS3:
1679 ADD_ACTIVE(state_offset + 2, 0);
1680 if (clen > 0)
1681 {
1682 int ncount = 0;
1683 switch (c)
1684 {
1685 case CHAR_VT:
1686 case CHAR_FF:
1687 case CHAR_NEL:
1688 #ifndef EBCDIC
1689 case 0x2028:
1690 case 0x2029:
1691 #endif /* Not EBCDIC */
1692 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1693 goto ANYNL02;
1694
1695 case CHAR_CR:
1696 if (ptr + 1 < end_subject && RAWUCHARTEST(ptr + 1) == CHAR_LF) ncount = 1;
1697 /* Fall through */
1698
1699 ANYNL02:
1700 case CHAR_LF:
1701 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1702 codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1703 {
1704 active_count--; /* Remove non-match possibility */
1705 next_active_state--;
1706 }
1707 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1708 break;
1709
1710 default:
1711 break;
1712 }
1713 }
1714 break;
1715
1716 /*-----------------------------------------------------------------*/
1717 case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1718 case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1719 case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1720 count = 2;
1721 goto QS4;
1722
1723 case OP_VSPACE_EXTRA + OP_TYPESTAR:
1724 case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1725 case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1726 count = 0;
1727
1728 QS4:
1729 ADD_ACTIVE(state_offset + 2, 0);
1730 if (clen > 0)
1731 {
1732 BOOL OK;
1733 switch (c)
1734 {
1735 VSPACE_CASES:
1736 OK = TRUE;
1737 break;
1738
1739 default:
1740 OK = FALSE;
1741 break;
1742 }
1743 if (OK == (d == OP_VSPACE))
1744 {
1745 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1746 codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1747 {
1748 active_count--; /* Remove non-match possibility */
1749 next_active_state--;
1750 }
1751 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1752 }
1753 }
1754 break;
1755
1756 /*-----------------------------------------------------------------*/
1757 case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1758 case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1759 case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1760 count = 2;
1761 goto QS5;
1762
1763 case OP_HSPACE_EXTRA + OP_TYPESTAR:
1764 case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1765 case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1766 count = 0;
1767
1768 QS5:
1769 ADD_ACTIVE(state_offset + 2, 0);
1770 if (clen > 0)
1771 {
1772 BOOL OK;
1773 switch (c)
1774 {
1775 HSPACE_CASES:
1776 OK = TRUE;
1777 break;
1778
1779 default:
1780 OK = FALSE;
1781 break;
1782 }
1783
1784 if (OK == (d == OP_HSPACE))
1785 {
1786 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1787 codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1788 {
1789 active_count--; /* Remove non-match possibility */
1790 next_active_state--;
1791 }
1792 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1793 }
1794 }
1795 break;
1796
1797 /*-----------------------------------------------------------------*/
1798 #ifdef SUPPORT_UCP
1799 case OP_PROP_EXTRA + OP_TYPEEXACT:
1800 case OP_PROP_EXTRA + OP_TYPEUPTO:
1801 case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1802 case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1803 if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1804 { ADD_ACTIVE(state_offset + 1 + IMM2_SIZE + 3, 0); }
1805 count = current_state->count; /* Number already matched */
1806 if (clen > 0)
1807 {
1808 BOOL OK;
1809 const pcre_uint32 *cp;
1810 const ucd_record * prop = GET_UCD(c);
1811 switch(code[1 + IMM2_SIZE + 1])
1812 {
1813 case PT_ANY:
1814 OK = TRUE;
1815 break;
1816
1817 case PT_LAMP:
1818 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1819 prop->chartype == ucp_Lt;
1820 break;
1821
1822 case PT_GC:
1823 OK = PRIV(ucp_gentype)[prop->chartype] == code[1 + IMM2_SIZE + 2];
1824 break;
1825
1826 case PT_PC:
1827 OK = prop->chartype == code[1 + IMM2_SIZE + 2];
1828 break;
1829
1830 case PT_SC:
1831 OK = prop->script == code[1 + IMM2_SIZE + 2];
1832 break;
1833
1834 /* These are specials for combination cases. */
1835
1836 case PT_ALNUM:
1837 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1838 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1839 break;
1840
1841 case PT_SPACE: /* Perl space */
1842 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1843 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1844 break;
1845
1846 case PT_PXSPACE: /* POSIX space */
1847 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1848 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1849 c == CHAR_FF || c == CHAR_CR;
1850 break;
1851
1852 case PT_WORD:
1853 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1854 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1855 c == CHAR_UNDERSCORE;
1856 break;
1857
1858 case PT_CLIST:
1859 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1860 for (;;)
1861 {
1862 if (c < *cp) { OK = FALSE; break; }
1863 if (c == *cp++) { OK = TRUE; break; }
1864 }
1865 break;
1866
1867 /* Should never occur, but keep compilers from grumbling. */
1868
1869 default:
1870 OK = codevalue != OP_PROP;
1871 break;
1872 }
1873
1874 if (OK == (d == OP_PROP))
1875 {
1876 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
1877 {
1878 active_count--; /* Remove non-match possibility */
1879 next_active_state--;
1880 }
1881 if (++count >= GET2(code, 1))
1882 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 3, 0); }
1883 else
1884 { ADD_NEW(state_offset, count); }
1885 }
1886 }
1887 break;
1888
1889 /*-----------------------------------------------------------------*/
1890 case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
1891 case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
1892 case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
1893 case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
1894 if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
1895 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1896 count = current_state->count; /* Number already matched */
1897 if (clen > 0)
1898 {
1899 int lgb, rgb;
1900 const pcre_uchar *nptr = ptr + clen;
1901 int ncount = 0;
1902 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
1903 {
1904 active_count--; /* Remove non-match possibility */
1905 next_active_state--;
1906 }
1907 lgb = UCD_GRAPHBREAK(c);
1908 while (nptr < end_subject)
1909 {
1910 dlen = 1;
1911 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1912 rgb = UCD_GRAPHBREAK(d);
1913 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1914 ncount++;
1915 lgb = rgb;
1916 nptr += dlen;
1917 }
1918 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
1919 reset_could_continue = TRUE;
1920 if (++count >= GET2(code, 1))
1921 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1922 else
1923 { ADD_NEW_DATA(-state_offset, count, ncount); }
1924 }
1925 break;
1926 #endif
1927
1928 /*-----------------------------------------------------------------*/
1929 case OP_ANYNL_EXTRA + OP_TYPEEXACT:
1930 case OP_ANYNL_EXTRA + OP_TYPEUPTO:
1931 case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
1932 case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
1933 if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
1934 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1935 count = current_state->count; /* Number already matched */
1936 if (clen > 0)
1937 {
1938 int ncount = 0;
1939 switch (c)
1940 {
1941 case CHAR_VT:
1942 case CHAR_FF:
1943 case CHAR_NEL:
1944 #ifndef EBCDIC
1945 case 0x2028:
1946 case 0x2029:
1947 #endif /* Not EBCDIC */
1948 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1949 goto ANYNL03;
1950
1951 case CHAR_CR:
1952 if (ptr + 1 < end_subject && RAWUCHARTEST(ptr + 1) == CHAR_LF) ncount = 1;
1953 /* Fall through */
1954
1955 ANYNL03:
1956 case CHAR_LF:
1957 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
1958 {
1959 active_count--; /* Remove non-match possibility */
1960 next_active_state--;
1961 }
1962 if (++count >= GET2(code, 1))
1963 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1964 else
1965 { ADD_NEW_DATA(-state_offset, count, ncount); }
1966 break;
1967
1968 default:
1969 break;
1970 }
1971 }
1972 break;
1973
1974 /*-----------------------------------------------------------------*/
1975 case OP_VSPACE_EXTRA + OP_TYPEEXACT:
1976 case OP_VSPACE_EXTRA + OP_TYPEUPTO:
1977 case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
1978 case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
1979 if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
1980 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1981 count = current_state->count; /* Number already matched */
1982 if (clen > 0)
1983 {
1984 BOOL OK;
1985 switch (c)
1986 {
1987 VSPACE_CASES:
1988 OK = TRUE;
1989 break;
1990
1991 default:
1992 OK = FALSE;
1993 }
1994
1995 if (OK == (d == OP_VSPACE))
1996 {
1997 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
1998 {
1999 active_count--; /* Remove non-match possibility */
2000 next_active_state--;
2001 }
2002 if (++count >= GET2(code, 1))
2003 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2004 else
2005 { ADD_NEW_DATA(-state_offset, count, 0); }
2006 }
2007 }
2008 break;
2009
2010 /*-----------------------------------------------------------------*/
2011 case OP_HSPACE_EXTRA + OP_TYPEEXACT:
2012 case OP_HSPACE_EXTRA + OP_TYPEUPTO:
2013 case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
2014 case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
2015 if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
2016 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
2017 count = current_state->count; /* Number already matched */
2018 if (clen > 0)
2019 {
2020 BOOL OK;
2021 switch (c)
2022 {
2023 HSPACE_CASES:
2024 OK = TRUE;
2025 break;
2026
2027 default:
2028 OK = FALSE;
2029 break;
2030 }
2031
2032 if (OK == (d == OP_HSPACE))
2033 {
2034 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
2035 {
2036 active_count--; /* Remove non-match possibility */
2037 next_active_state--;
2038 }
2039 if (++count >= GET2(code, 1))
2040 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2041 else
2042 { ADD_NEW_DATA(-state_offset, count, 0); }
2043 }
2044 }
2045 break;
2046
2047 /* ========================================================================== */
2048 /* These opcodes are followed by a character that is usually compared
2049 to the current subject character; it is loaded into d. We still get
2050 here even if there is no subject character, because in some cases zero
2051 repetitions are permitted. */
2052
2053 /*-----------------------------------------------------------------*/
2054 case OP_CHAR:
2055 if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
2056 break;
2057
2058 /*-----------------------------------------------------------------*/
2059 case OP_CHARI:
2060 if (clen == 0) break;
2061
2062 #ifdef SUPPORT_UTF
2063 if (utf)
2064 {
2065 if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
2066 {
2067 unsigned int othercase;
2068 if (c < 128)
2069 othercase = fcc[c];
2070 else
2071 /* If we have Unicode property support, we can use it to test the
2072 other case of the character. */
2073 #ifdef SUPPORT_UCP
2074 othercase = UCD_OTHERCASE(c);
2075 #else
2076 othercase = NOTACHAR;
2077 #endif
2078
2079 if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
2080 }
2081 }
2082 else
2083 #endif /* SUPPORT_UTF */
2084 /* Not UTF mode */
2085 {
2086 if (TABLE_GET(c, lcc, c) == TABLE_GET(d, lcc, d))
2087 { ADD_NEW(state_offset + 2, 0); }
2088 }
2089 break;
2090
2091
2092 #ifdef SUPPORT_UCP
2093 /*-----------------------------------------------------------------*/
2094 /* This is a tricky one because it can match more than one character.
2095 Find out how many characters to skip, and then set up a negative state
2096 to wait for them to pass before continuing. */
2097
2098 case OP_EXTUNI:
2099 if (clen > 0)
2100 {
2101 int lgb, rgb;
2102 const pcre_uchar *nptr = ptr + clen;
2103 int ncount = 0;
2104 lgb = UCD_GRAPHBREAK(c);
2105 while (nptr < end_subject)
2106 {
2107 dlen = 1;
2108 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
2109 rgb = UCD_GRAPHBREAK(d);
2110 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
2111 ncount++;
2112 lgb = rgb;
2113 nptr += dlen;
2114 }
2115 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
2116 reset_could_continue = TRUE;
2117 ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2118 }
2119 break;
2120 #endif
2121
2122 /*-----------------------------------------------------------------*/
2123 /* This is tricky like EXTUNI because it too can match more than one
2124 character (when CR is followed by LF). In this case, set up a negative
2125 state to wait for one character to pass before continuing. */
2126
2127 case OP_ANYNL:
2128 if (clen > 0) switch(c)
2129 {
2130 case CHAR_VT:
2131 case CHAR_FF:
2132 case CHAR_NEL:
2133 #ifndef EBCDIC
2134 case 0x2028:
2135 case 0x2029:
2136 #endif /* Not EBCDIC */
2137 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
2138 /* Fall through */
2139 case CHAR_LF:
2140 ADD_NEW(state_offset + 1, 0);
2141 break;
2142
2143 case CHAR_CR:
2144 if (ptr + 1 >= end_subject)
2145 {
2146 ADD_NEW(state_offset + 1, 0);
2147 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
2148 reset_could_continue = TRUE;
2149 }
2150 else if (RAWUCHARTEST(ptr + 1) == CHAR_LF)
2151 {
2152 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2153 }
2154 else
2155 {
2156 ADD_NEW(state_offset + 1, 0);
2157 }
2158 break;
2159 }
2160 break;
2161
2162 /*-----------------------------------------------------------------*/
2163 case OP_NOT_VSPACE:
2164 if (clen > 0) switch(c)
2165 {
2166 VSPACE_CASES:
2167 break;
2168
2169 default:
2170 ADD_NEW(state_offset + 1, 0);
2171 break;
2172 }
2173 break;
2174
2175 /*-----------------------------------------------------------------*/
2176 case OP_VSPACE:
2177 if (clen > 0) switch(c)
2178 {
2179 VSPACE_CASES:
2180 ADD_NEW(state_offset + 1, 0);
2181 break;
2182
2183 default:
2184 break;
2185 }
2186 break;
2187
2188 /*-----------------------------------------------------------------*/
2189 case OP_NOT_HSPACE:
2190 if (clen > 0) switch(c)
2191 {
2192 HSPACE_CASES:
2193 break;
2194
2195 default:
2196 ADD_NEW(state_offset + 1, 0);
2197 break;
2198 }
2199 break;
2200
2201 /*-----------------------------------------------------------------*/
2202 case OP_HSPACE:
2203 if (clen > 0) switch(c)
2204 {
2205 HSPACE_CASES:
2206 ADD_NEW(state_offset + 1, 0);
2207 break;
2208
2209 default:
2210 break;
2211 }
2212 break;
2213
2214 /*-----------------------------------------------------------------*/
2215 /* Match a negated single character casefully. */
2216
2217 case OP_NOT:
2218 if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2219 break;
2220
2221 /*-----------------------------------------------------------------*/
2222 /* Match a negated single character caselessly. */
2223
2224 case OP_NOTI:
2225 if (clen > 0)
2226 {
2227 unsigned int otherd = NOTACHAR; /* Stays NOTACHAR without UCP support */
2228 #ifdef SUPPORT_UTF
2229 if (utf && d >= 128)
2230 {
2231 #ifdef SUPPORT_UCP
2232 otherd = UCD_OTHERCASE(d);
2233 #endif /* SUPPORT_UCP */
2234 }
2235 else
2236 #endif /* SUPPORT_UTF */
2237 otherd = TABLE_GET(d, fcc, d);
2238 if (c != d && c != otherd)
2239 { ADD_NEW(state_offset + dlen + 1, 0); }
2240 }
2241 break;
2242
2243 /*-----------------------------------------------------------------*/
2244 case OP_PLUSI:
2245 case OP_MINPLUSI:
2246 case OP_POSPLUSI:
2247 case OP_NOTPLUSI:
2248 case OP_NOTMINPLUSI:
2249 case OP_NOTPOSPLUSI:
2250 caseless = TRUE;
2251 codevalue -= OP_STARI - OP_STAR;
2252
2253 /* Fall through */
2254 case OP_PLUS:
2255 case OP_MINPLUS:
2256 case OP_POSPLUS:
2257 case OP_NOTPLUS:
2258 case OP_NOTMINPLUS:
2259 case OP_NOTPOSPLUS:
2260 count = current_state->count; /* Already matched */
2261 if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2262 if (clen > 0)
2263 {
2264 pcre_uint32 otherd = NOTACHAR;
2265 if (caseless)
2266 {
2267 #ifdef SUPPORT_UTF
2268 if (utf && d >= 128)
2269 {
2270 #ifdef SUPPORT_UCP
2271 otherd = UCD_OTHERCASE(d);
2272 #endif /* SUPPORT_UCP */
2273 }
2274 else
2275 #endif /* SUPPORT_UTF */
2276 otherd = TABLE_GET(d, fcc, d);
2277 }
2278 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2279 {
2280 if (count > 0 &&
2281 (codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2282 {
2283 active_count--; /* Remove non-match possibility */
2284 next_active_state--;
2285 }
2286 count++;
2287 ADD_NEW(state_offset, count);
2288 }
2289 }
2290 break;
2291
2292 /*-----------------------------------------------------------------*/
2293 case OP_QUERYI:
2294 case OP_MINQUERYI:
2295 case OP_POSQUERYI:
2296 case OP_NOTQUERYI:
2297 case OP_NOTMINQUERYI:
2298 case OP_NOTPOSQUERYI:
2299 caseless = TRUE;
2300 codevalue -= OP_STARI - OP_STAR;
2301 /* Fall through */
2302 case OP_QUERY:
2303 case OP_MINQUERY:
2304 case OP_POSQUERY:
2305 case OP_NOTQUERY:
2306 case OP_NOTMINQUERY:
2307 case OP_NOTPOSQUERY:
2308 ADD_ACTIVE(state_offset + dlen + 1, 0);
2309 if (clen > 0)
2310 {
2311 pcre_uint32 otherd = NOTACHAR;
2312 if (caseless)
2313 {
2314 #ifdef SUPPORT_UTF
2315 if (utf && d >= 128)
2316 {
2317 #ifdef SUPPORT_UCP
2318 otherd = UCD_OTHERCASE(d);
2319 #endif /* SUPPORT_UCP */
2320 }
2321 else
2322 #endif /* SUPPORT_UTF */
2323 otherd = TABLE_GET(d, fcc, d);
2324 }
2325 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2326 {
2327 if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2328 {
2329 active_count--; /* Remove non-match possibility */
2330 next_active_state--;
2331 }
2332 ADD_NEW(state_offset + dlen + 1, 0);
2333 }
2334 }
2335 break;
2336
2337 /*-----------------------------------------------------------------*/
2338 case OP_STARI:
2339 case OP_MINSTARI:
2340 case OP_POSSTARI:
2341 case OP_NOTSTARI:
2342 case OP_NOTMINSTARI:
2343 case OP_NOTPOSSTARI:
2344 caseless = TRUE;
2345 codevalue -= OP_STARI - OP_STAR;
2346 /* Fall through */
2347 case OP_STAR:
2348 case OP_MINSTAR:
2349 case OP_POSSTAR:
2350 case OP_NOTSTAR:
2351 case OP_NOTMINSTAR:
2352 case OP_NOTPOSSTAR:
2353 ADD_ACTIVE(state_offset + dlen + 1, 0);
2354 if (clen > 0)
2355 {
2356 pcre_uint32 otherd = NOTACHAR;
2357 if (caseless)
2358 {
2359 #ifdef SUPPORT_UTF
2360 if (utf && d >= 128)
2361 {
2362 #ifdef SUPPORT_UCP
2363 otherd = UCD_OTHERCASE(d);
2364 #endif /* SUPPORT_UCP */
2365 }
2366 else
2367 #endif /* SUPPORT_UTF */
2368 otherd = TABLE_GET(d, fcc, d);
2369 }
2370 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2371 {
2372 if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2373 {
2374 active_count--; /* Remove non-match possibility */
2375 next_active_state--;
2376 }
2377 ADD_NEW(state_offset, 0);
2378 }
2379 }
2380 break;
2381
2382 /*-----------------------------------------------------------------*/
2383 case OP_EXACTI:
2384 case OP_NOTEXACTI:
2385 caseless = TRUE;
2386 codevalue -= OP_STARI - OP_STAR;
2387 /* Fall through */
2388 case OP_EXACT:
2389 case OP_NOTEXACT:
2390 count = current_state->count; /* Number already matched */
2391 if (clen > 0)
2392 {
2393 pcre_uint32 otherd = NOTACHAR;
2394 if (caseless)
2395 {
2396 #ifdef SUPPORT_UTF
2397 if (utf && d >= 128)
2398 {
2399 #ifdef SUPPORT_UCP
2400 otherd = UCD_OTHERCASE(d);
2401 #endif /* SUPPORT_UCP */
2402 }
2403 else
2404 #endif /* SUPPORT_UTF */
2405 otherd = TABLE_GET(d, fcc, d);
2406 }
2407 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2408 {
2409 if (++count >= GET2(code, 1))
2410 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2411 else
2412 { ADD_NEW(state_offset, count); }
2413 }
2414 }
2415 break;
2416
2417 /*-----------------------------------------------------------------*/
2418 case OP_UPTOI:
2419 case OP_MINUPTOI:
2420 case OP_POSUPTOI:
2421 case OP_NOTUPTOI:
2422 case OP_NOTMINUPTOI:
2423 case OP_NOTPOSUPTOI:
2424 caseless = TRUE;
2425 codevalue -= OP_STARI - OP_STAR;
2426 /* Fall through */
2427 case OP_UPTO:
2428 case OP_MINUPTO:
2429 case OP_POSUPTO:
2430 case OP_NOTUPTO:
2431 case OP_NOTMINUPTO:
2432 case OP_NOTPOSUPTO:
2433 ADD_ACTIVE(state_offset + dlen + 1 + IMM2_SIZE, 0);
2434 count = current_state->count; /* Number already matched */
2435 if (clen > 0)
2436 {
2437 pcre_uint32 otherd = NOTACHAR;
2438 if (caseless)
2439 {
2440 #ifdef SUPPORT_UTF
2441 if (utf && d >= 128)
2442 {
2443 #ifdef SUPPORT_UCP
2444 otherd = UCD_OTHERCASE(d);
2445 #endif /* SUPPORT_UCP */
2446 }
2447 else
2448 #endif /* SUPPORT_UTF */
2449 otherd = TABLE_GET(d, fcc, d);
2450 }
2451 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2452 {
2453 if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2454 {
2455 active_count--; /* Remove non-match possibility */
2456 next_active_state--;
2457 }
2458 if (++count >= GET2(code, 1))
2459 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2460 else
2461 { ADD_NEW(state_offset, count); }
2462 }
2463 }
2464 break;
2465
2466
2467 /* ========================================================================== */
2468 /* These are the class-handling opcodes */
2469
2470 case OP_CLASS:
2471 case OP_NCLASS:
2472 case OP_XCLASS:
2473 {
2474 BOOL isinclass = FALSE;
2475 int next_state_offset;
2476 const pcre_uchar *ecode;
2477
2478 /* For a simple class, there is always just a 32-byte table, and we
2479 can set isinclass from it. */
2480
2481 if (codevalue != OP_XCLASS)
2482 {
2483 ecode = code + 1 + (32 / sizeof(pcre_uchar));
2484 if (clen > 0)
2485 {
2486 isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2487 ((((pcre_uint8 *)(code + 1))[c/8] & (1 << (c&7))) != 0);
2488 }
2489 }
2490
2491 /* An extended class may have a table or a list of single characters,
2492 ranges, or both, and it may be positive or negative. There's a
2493 function that sorts all this out. */
2494
2495 else
2496 {
2497 ecode = code + GET(code, 1);
2498 if (clen > 0) isinclass = PRIV(xclass)(c, code + 1 + LINK_SIZE, utf);
2499 }
2500
2501 /* At this point, isinclass is set for all kinds of class, and ecode
2502 points to the byte after the end of the class. If there is a
2503 quantifier, this is where it will be. */
2504
2505 next_state_offset = (int)(ecode - start_code);
2506
2507 switch (*ecode)
2508 {
2509 case OP_CRSTAR:
2510 case OP_CRMINSTAR:
2511 ADD_ACTIVE(next_state_offset + 1, 0);
2512 if (isinclass) { ADD_NEW(state_offset, 0); }
2513 break;
2514
2515 case OP_CRPLUS:
2516 case OP_CRMINPLUS:
2517 count = current_state->count; /* Already matched */
2518 if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2519 if (isinclass) { count++; ADD_NEW(state_offset, count); }
2520 break;
2521
2522 case OP_CRQUERY:
2523 case OP_CRMINQUERY:
2524 ADD_ACTIVE(next_state_offset + 1, 0);
2525 if (isinclass) { ADD_NEW(next_state_offset + 1, 0); }
2526 break;
2527
2528 case OP_CRRANGE:
2529 case OP_CRMINRANGE:
2530 count = current_state->count; /* Already matched */
2531 if (count >= GET2(ecode, 1))
2532 { ADD_ACTIVE(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2533 if (isinclass)
2534 {
2535 int max = GET2(ecode, 1 + IMM2_SIZE);
2536 if (++count >= max && max != 0) /* Max 0 => no limit */
2537 { ADD_NEW(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2538 else
2539 { ADD_NEW(state_offset, count); }
2540 }
2541 break;
2542
2543 default:
2544 if (isinclass) { ADD_NEW(next_state_offset, 0); }
2545 break;
2546 }
2547 }
2548 break;
2549
2550 /* ========================================================================== */
2551 /* These are the opcodes for fancy brackets of various kinds. We have
2552 to use recursion in order to handle them. The "always failing" assertion
2553 (?!) is optimised to OP_FAIL when compiling, so we have to support that,
2554 though the other "backtracking verbs" are not supported. */
2555
2556 case OP_FAIL:
2557 forced_fail++; /* Count FAILs for multiple states */
2558 break;
2559
2560 case OP_ASSERT:
2561 case OP_ASSERT_NOT:
2562 case OP_ASSERTBACK:
2563 case OP_ASSERTBACK_NOT:
2564 {
2565 int rc;
2566 int local_offsets[2];
2567 int local_workspace[1000];
2568 const pcre_uchar *endasscode = code + GET(code, 1);
2569
2570 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2571
2572 rc = internal_dfa_exec(
2573 md, /* static match data */
2574 code, /* this subexpression's code */
2575 ptr, /* where we currently are */
2576 (int)(ptr - start_subject), /* start offset */
2577 local_offsets, /* offset vector */
2578 sizeof(local_offsets)/sizeof(int), /* size of same */
2579 local_workspace, /* workspace vector */
2580 sizeof(local_workspace)/sizeof(int), /* size of same */
2581 rlevel); /* function recursion level */
2582
2583 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2584 if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2585 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2586 }
2587 break;
2588
2589 /*-----------------------------------------------------------------*/
2590 case OP_COND:
2591 case OP_SCOND:
2592 {
2593 int local_offsets[1000];
2594 int local_workspace[1000];
2595 int codelink = GET(code, 1);
2596 int condcode;
2597
2598 /* Because of the way auto-callout works during compile, a callout item
2599 is inserted between OP_COND and an assertion condition. This does not
2600 happen for the other conditions. */
2601
2602 if (code[LINK_SIZE+1] == OP_CALLOUT)
2603 {
2604 rrc = 0;
2605 if (PUBL(callout) != NULL)
2606 {
2607 PUBL(callout_block) cb;
2608 cb.version = 1; /* Version 1 of the callout block */
2609 cb.callout_number = code[LINK_SIZE+2];
2610 cb.offset_vector = offsets;
2611 #if defined COMPILE_PCRE8
2612 cb.subject = (PCRE_SPTR)start_subject;
2613 #elif defined COMPILE_PCRE16
2614 cb.subject = (PCRE_SPTR16)start_subject;
2615 #elif defined COMPILE_PCRE32
2616 cb.subject = (PCRE_SPTR32)start_subject;
2617 #endif
2618 cb.subject_length = (int)(end_subject - start_subject);
2619 cb.start_match = (int)(current_subject - start_subject);
2620 cb.current_position = (int)(ptr - start_subject);
2621 cb.pattern_position = GET(code, LINK_SIZE + 3);
2622 cb.next_item_length = GET(code, 3 + 2*LINK_SIZE);
2623 cb.capture_top = 1;
2624 cb.capture_last = -1;
2625 cb.callout_data = md->callout_data;
2626 cb.mark = NULL; /* No (*MARK) support */
2627 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2628 }
2629 if (rrc > 0) break; /* Fail this thread */
2630 code += PRIV(OP_lengths)[OP_CALLOUT]; /* Skip callout data */
2631 }
2632
2633 condcode = code[LINK_SIZE+1];
2634
2635 /* Back reference conditions are not supported */
2636
2637 if (condcode == OP_CREF || condcode == OP_NCREF)
2638 return PCRE_ERROR_DFA_UCOND;
2639
2640 /* The DEFINE condition is always false */
2641
2642 if (condcode == OP_DEF)
2643 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2644
2645 /* The only supported version of OP_RREF is for the value RREF_ANY,
2646 which means "test if in any recursion". We can't test for specifically
2647 recursed groups. */
2648
2649 else if (condcode == OP_RREF || condcode == OP_NRREF)
2650 {
2651 int value = GET2(code, LINK_SIZE + 2);
2652 if (value != RREF_ANY) return PCRE_ERROR_DFA_UCOND;
2653 if (md->recursive != NULL)
2654 { ADD_ACTIVE(state_offset + LINK_SIZE + 2 + IMM2_SIZE, 0); }
2655 else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2656 }
2657
2658 /* Otherwise, the condition is an assertion */
2659
2660 else
2661 {
2662 int rc;
2663 const pcre_uchar *asscode = code + LINK_SIZE + 1;
2664 const pcre_uchar *endasscode = asscode + GET(asscode, 1);
2665
2666 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2667
2668 rc = internal_dfa_exec(
2669 md, /* fixed match data */
2670 asscode, /* this subexpression's code */
2671 ptr, /* where we currently are */
2672 (int)(ptr - start_subject), /* start offset */
2673 local_offsets, /* offset vector */
2674 sizeof(local_offsets)/sizeof(int), /* size of same */
2675 local_workspace, /* workspace vector */
2676 sizeof(local_workspace)/sizeof(int), /* size of same */
2677 rlevel); /* function recursion level */
2678
2679 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2680 if ((rc >= 0) ==
2681 (condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2682 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2683 else
2684 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2685 }
2686 }
2687 break;
2688
2689 /*-----------------------------------------------------------------*/
2690 case OP_RECURSE:
2691 {
2692 dfa_recursion_info *ri;
2693 int local_offsets[1000];
2694 int local_workspace[1000];
2695 const pcre_uchar *callpat = start_code + GET(code, 1);
2696 int recno = (callpat == md->start_code)? 0 :
2697 GET2(callpat, 1 + LINK_SIZE);
2698 int rc;
2699
2700 DPRINTF(("%.*sStarting regex recursion\n", rlevel*2-2, SP));
2701
2702 /* Check for repeating a recursion without advancing the subject
2703 pointer. This should catch convoluted mutual recursions. (Some simple
2704 cases are caught at compile time.) */
2705
2706 for (ri = md->recursive; ri != NULL; ri = ri->prevrec)
2707 if (recno == ri->group_num && ptr == ri->subject_position)
2708 return PCRE_ERROR_RECURSELOOP;
2709
2710 /* Remember this recursion and where we started it so as to
2711 catch infinite loops. */
2712
2713 new_recursive.group_num = recno;
2714 new_recursive.subject_position = ptr;
2715 new_recursive.prevrec = md->recursive;
2716 md->recursive = &new_recursive;
2717
2718 rc = internal_dfa_exec(
2719 md, /* fixed match data */
2720 callpat, /* this subexpression's code */
2721 ptr, /* where we currently are */
2722 (int)(ptr - start_subject), /* start offset */
2723 local_offsets, /* offset vector */
2724 sizeof(local_offsets)/sizeof(int), /* size of same */
2725 local_workspace, /* workspace vector */
2726 sizeof(local_workspace)/sizeof(int), /* size of same */
2727 rlevel); /* function recursion level */
2728
2729 md->recursive = new_recursive.prevrec; /* Done this recursion */
2730
2731 DPRINTF(("%.*sReturn from regex recursion: rc=%d\n", rlevel*2-2, SP,
2732 rc));
2733
2734 /* Ran out of internal offsets */
2735
2736 if (rc == 0) return PCRE_ERROR_DFA_RECURSE;
2737
2738 /* For each successfully matched substring, set up the next state with a
2739 count of characters to skip before trying it. Note that the count is in
2740 characters, not bytes. */
2741
2742 if (rc > 0)
2743 {
2744 for (rc = rc*2 - 2; rc >= 0; rc -= 2)
2745 {
2746 int charcount = local_offsets[rc+1] - local_offsets[rc];
2747 #if defined SUPPORT_UTF && !defined COMPILE_PCRE32
2748 if (utf)
2749 {
2750 const pcre_uchar *p = start_subject + local_offsets[rc];
2751 const pcre_uchar *pp = start_subject + local_offsets[rc+1];
2752 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2753 }
2754 #endif
2755 if (charcount > 0)
2756 {
2757 ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0, (charcount - 1));
2758 }
2759 else
2760 {
2761 ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
2762 }
2763 }
2764 }
2765 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2766 }
2767 break;
2768
2769 /*-----------------------------------------------------------------*/
2770 case OP_BRAPOS:
2771 case OP_SBRAPOS:
2772 case OP_CBRAPOS:
2773 case OP_SCBRAPOS:
2774 case OP_BRAPOSZERO:
2775 {
2776 int charcount, matched_count;
2777 const pcre_uchar *local_ptr = ptr;
2778 BOOL allow_zero;
2779
2780 if (codevalue == OP_BRAPOSZERO)
2781 {
2782 allow_zero = TRUE;
2783 codevalue = *(++code); /* Codevalue will be one of above BRAs */
2784 }
2785 else allow_zero = FALSE;
2786
2787 /* Loop to match the subpattern as many times as possible as if it were
2788 a complete pattern. */
2789
2790 for (matched_count = 0;; matched_count++)
2791 {
2792 int local_offsets[2];
2793 int local_workspace[1000];
2794
2795 int rc = internal_dfa_exec(
2796 md, /* fixed match data */
2797 code, /* this subexpression's code */
2798 local_ptr, /* where we currently are */
2799 (int)(ptr - start_subject), /* start offset */
2800 local_offsets, /* offset vector */
2801 sizeof(local_offsets)/sizeof(int), /* size of same */
2802 local_workspace, /* workspace vector */
2803 sizeof(local_workspace)/sizeof(int), /* size of same */
2804 rlevel); /* function recursion level */
2805
2806 /* Failed to match */
2807
2808 if (rc < 0)
2809 {
2810 if (rc != PCRE_ERROR_NOMATCH) return rc;
2811 break;
2812 }
2813
2814 /* Matched: break the loop if zero characters matched. */
2815
2816 charcount = local_offsets[1] - local_offsets[0];
2817 if (charcount == 0) break;
2818 local_ptr += charcount; /* Advance temporary position ptr */
2819 }
2820
2821 /* At this point we have matched the subpattern matched_count
2822 times, and local_ptr is pointing to the character after the end of the
2823 last match. */
2824
2825 if (matched_count > 0 || allow_zero)
2826 {
2827 const pcre_uchar *end_subpattern = code;
2828 int next_state_offset;
2829
2830 do { end_subpattern += GET(end_subpattern, 1); }
2831 while (*end_subpattern == OP_ALT);
2832 next_state_offset =
2833 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2834
2835 /* Optimization: if there are no more active states, and there
2836 are no new states yet set up, then skip over the subject string
2837 right here, to save looping. Otherwise, set up the new state to swing
2838 into action when the end of the matched substring is reached. */
2839
2840 if (i + 1 >= active_count && new_count == 0)
2841 {
2842 ptr = local_ptr;
2843 clen = 0;
2844 ADD_NEW(next_state_offset, 0);
2845 }
2846 else
2847 {
2848 const pcre_uchar *p = ptr;
2849 const pcre_uchar *pp = local_ptr;
2850 charcount = (int)(pp - p);
2851 #if defined SUPPORT_UTF && !defined COMPILE_PCRE32
2852 if (utf) while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2853 #endif
2854 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2855 }
2856 }
2857 }
2858 break;
2859
2860 /*-----------------------------------------------------------------*/
2861 case OP_ONCE:
2862 case OP_ONCE_NC:
2863 {
2864 int local_offsets[2];
2865 int local_workspace[1000];
2866
2867 int rc = internal_dfa_exec(
2868 md, /* fixed match data */
2869 code, /* this subexpression's code */
2870 ptr, /* where we currently are */
2871 (int)(ptr - start_subject), /* start offset */
2872 local_offsets, /* offset vector */
2873 sizeof(local_offsets)/sizeof(int), /* size of same */
2874 local_workspace, /* workspace vector */
2875 sizeof(local_workspace)/sizeof(int), /* size of same */
2876 rlevel); /* function recursion level */
2877
2878 if (rc >= 0)
2879 {
2880 const pcre_uchar *end_subpattern = code;
2881 int charcount = local_offsets[1] - local_offsets[0];
2882 int next_state_offset, repeat_state_offset;
2883
2884 do { end_subpattern += GET(end_subpattern, 1); }
2885 while (*end_subpattern == OP_ALT);
2886 next_state_offset =
2887 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2888
2889 /* If the end of this subpattern is KETRMAX or KETRMIN, we must
2890 arrange for the repeat state also to be added to the relevant list.
2891 Calculate the offset, or set -1 for no repeat. */
2892
2893 repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
2894 *end_subpattern == OP_KETRMIN)?
2895 (int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
2896
2897 /* If we have matched an empty string, add the next state at the
2898 current character pointer. This is important so that the duplicate
2899 checking kicks in, which is what breaks infinite loops that match an
2900 empty string. */
2901
2902 if (charcount == 0)
2903 {
2904 ADD_ACTIVE(next_state_offset, 0);
2905 }
2906
2907 /* Optimization: if there are no more active states, and there
2908 are no new states yet set up, then skip over the subject string
2909 right here, to save looping. Otherwise, set up the new state to swing
2910 into action when the end of the matched substring is reached. */
2911
2912 else if (i + 1 >= active_count && new_count == 0)
2913 {
2914 ptr += charcount;
2915 clen = 0;
2916 ADD_NEW(next_state_offset, 0);
2917
2918 /* If we are adding a repeat state at the new character position,
2919 we must fudge things so that it is the only current state.
2920 Otherwise, it might be a duplicate of one we processed before, and
2921 that would cause it to be skipped. */
2922
2923 if (repeat_state_offset >= 0)
2924 {
2925 next_active_state = active_states;
2926 active_count = 0;
2927 i = -1;
2928 ADD_ACTIVE(repeat_state_offset, 0);
2929 }
2930 }
2931 else
2932 {
2933 #if defined SUPPORT_UTF && !defined COMPILE_PCRE32
2934 if (utf)
2935 {
2936 const pcre_uchar *p = start_subject + local_offsets[0];
2937 const pcre_uchar *pp = start_subject + local_offsets[1];
2938 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2939 }
2940 #endif
2941 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2942 if (repeat_state_offset >= 0)
2943 { ADD_NEW_DATA(-repeat_state_offset, 0, (charcount - 1)); }
2944 }
2945 }
2946 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2947 }
2948 break;
2949
2950
2951 /* ========================================================================== */
2952 /* Handle callouts */
2953
2954 case OP_CALLOUT:
2955 rrc = 0;
2956 if (PUBL(callout) != NULL)
2957 {
2958 PUBL(callout_block) cb;
2959 cb.version = 1; /* Version 1 of the callout block */
2960 cb.callout_number = code[1];
2961 cb.offset_vector = offsets;
2962 #if defined COMPILE_PCRE8
2963 cb.subject = (PCRE_SPTR)start_subject;
2964 #elif defined COMPILE_PCRE16
2965 cb.subject = (PCRE_SPTR16)start_subject;
2966 #elif defined COMPILE_PCRE32
2967 cb.subject = (PCRE_SPTR32)start_subject;
2968 #endif
2969 cb.subject_length = (int)(end_subject - start_subject);
2970 cb.start_match = (int)(current_subject - start_subject);
2971 cb.current_position = (int)(ptr - start_subject);
2972 cb.pattern_position = GET(code, 2);
2973 cb.next_item_length = GET(code, 2 + LINK_SIZE);
2974 cb.capture_top = 1;
2975 cb.capture_last = -1;
2976 cb.callout_data = md->callout_data;
2977 cb.mark = NULL; /* No (*MARK) support */
2978 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2979 }
2980 if (rrc == 0)
2981 { ADD_ACTIVE(state_offset + PRIV(OP_lengths)[OP_CALLOUT], 0); }
2982 break;
2983
2984
2985 /* ========================================================================== */
2986 default: /* Unsupported opcode */
2987 return PCRE_ERROR_DFA_UITEM;
2988 }
2989
2990 NEXT_ACTIVE_STATE: continue;
2991
2992 } /* End of loop scanning active states */
2993
2994 /* We have finished the processing at the current subject character. If no
2995 new states have been set for the next character, we have found all the
2996 matches that we are going to find. If we are at the top level and partial
2997 matching has been requested, check for appropriate conditions.
2998
2999 The "forced_fail" variable counts the number of (*F) encountered for the
3000 character. If it is equal to the original active_count (saved in
3001 workspace[1]) it means that (*F) was found on every active state. In this
3002 case we don't want to give a partial match.
3003
3004 The "could_continue" variable is true if a state could have continued but
3005 for the fact that the end of the subject was reached. */
3006
3007 if (new_count <= 0)
3008 {
3009 if (rlevel == 1 && /* Top level, and */
3010 could_continue && /* Some could go on, and */
3011 forced_fail != workspace[1] && /* Not all forced fail & */
3012 ( /* either... */
3013 (md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
3014 || /* or... */
3015 ((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
3016 match_count < 0) /* no matches */
3017 ) && /* And... */
3018 (
3019 partial_newline || /* Either partial NL */
3020 ( /* or ... */
3021 ptr >= end_subject && /* End of subject and */
3022 ptr > md->start_used_ptr) /* Inspected non-empty string */
3023 )
3024 )
3025 {
3026 if (offsetcount >= 2)
3027 {
3028 offsets[0] = (int)(md->start_used_ptr - start_subject);
3029 offsets[1] = (int)(end_subject - start_subject);
3030 }
3031 match_count = PCRE_ERROR_PARTIAL;
3032 }
3033
3034 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
3035 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel, match_count,
3036 rlevel*2-2, SP));
3037 break; /* In effect, "return", but see the comment below */
3038 }
3039
3040 /* One or more states are active for the next character. */
3041
3042 ptr += clen; /* Advance to next subject character */
3043 } /* Loop to move along the subject string */
3044
3045 /* Control gets here from "break" a few lines above. We do it this way because
3046 if we use "return" above, we have compiler trouble. Some compilers warn if
3047 there's nothing here because they think the function doesn't return a value. On
3048 the other hand, if we put a dummy statement here, some more clever compilers
3049 complain that it can't be reached. Sigh. */
3050
3051 return match_count;
3052 }
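/* Illustration only, not part of PCRE. Several branches of the function above
(OP_RECURSE, OP_BRAPOS, OP_ONCE) turn a matched length in code units into a
count of characters by subtracting every unit that is not the first unit of a
character. The sketch below shows that idea for 8-bit UTF-8 only. The names
is_continuation() and count_utf8_chars() are hypothetical, and
is_continuation() is an assumed stand-in for the 8-bit form of NOT_FIRSTCHAR. */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the 8-bit NOT_FIRSTCHAR test: a UTF-8
continuation byte has the bit pattern 10xxxxxx. */
int is_continuation(unsigned char b)
{
return (b & 0xc0) == 0x80;
}

/* Count characters, not bytes, in the half-open range [p, pp). Start from
the byte count and subtract one for every continuation byte. */
int count_utf8_chars(const unsigned char *p, const unsigned char *pp)
{
int charcount;
assert(p != NULL && pp != NULL && p <= pp);
charcount = (int)(pp - p);
while (p < pp) if (is_continuation(*p++)) charcount--;
return charcount;
}
```

/* For the four bytes 0x61 0xc3 0xa9 0x62 ("a", U+00E9, "b") the count is 3,
because only 0xa9 is a continuation byte. */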
3053
3054
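/* Illustration only, not part of PCRE. The class-handling opcodes in the
function above test membership with a 32-byte table: bit (c & 7) of byte c/8
is set when character c is in the class, and characters above 255 match only
for OP_NCLASS. The sketch below restates that test in isolation. The type
class_map and the functions class_clear(), class_add() and class_contains()
are hypothetical names, and the map is assumed to hold the final bits already
(inverted at compile time for a negated class). */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 256 bits, one per character value below 256: 32 bytes in all. */
typedef struct { unsigned char bits[32]; } class_map;

void class_clear(class_map *m)
{
assert(m != NULL);
memset(m->bits, 0, sizeof m->bits);
}

void class_add(class_map *m, unsigned int c)
{
assert(m != NULL && c < 256);
m->bits[c/8] |= (unsigned char)(1u << (c & 7));
}

/* "negated" plays the role of (codevalue == OP_NCLASS): it decides the
answer for characters that are too large for the table. */
int class_contains(const class_map *m, unsigned int c, int negated)
{
assert(m != NULL);
if (c > 255) return negated != 0;
return (m->bits[c/8] & (1u << (c & 7))) != 0;
}
```

/* Keeping the table at a fixed 32 bytes means the test is a single index and
mask, whatever the contents of the class. */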
3055
3056
3057 /*************************************************
3058 * Execute a Regular Expression - DFA engine *
3059 *************************************************/
3060
3061 /* This external function applies a compiled re to a subject string using a DFA
3062 engine. This function calls the internal function multiple times if the pattern
3063 is not anchored.
3064
3065 Arguments:
3066 argument_re points to the compiled expression
3067 extra_data points to extra data or is NULL
3068 subject points to the subject string
3069 length length of subject string (may contain binary zeros)
3070 start_offset where to start in the subject string
3071 options option bits
3072 offsets vector of match offsets
3073 offsetcount size of same
3074 workspace workspace vector
3075 wscount size of same
3076
3077 Returns: > 0 => number of match offset pairs placed in offsets
3078 = 0 => offsets overflowed; longest matches are present
3079 -1 => failed to match
3080 < -1 => some kind of unexpected problem
3081 */
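/* Illustration only, not part of PCRE. A caller might turn the return value
described above into the number of offset pairs it can safely read. The
helper dfa_pairs_available() is hypothetical. Treating a zero return as
offsetcount/2 filled pairs is an assumption based on the "offsets overflowed"
case in the list above, and every negative value is treated as "no pairs". */

```c
#include <assert.h>

/* Map a DFA-style return code to the number of (start, end) pairs that the
caller may read from the offsets vector. */
int dfa_pairs_available(int rc, int offsetcount)
{
assert(offsetcount >= 0);
if (rc > 0) return rc;               /* That many pairs were stored */
if (rc == 0) return offsetcount / 2; /* Vector overflowed: assume it is full */
return 0;                            /* No match, or an error */
}
```

/* A real caller would still inspect negative values individually, since they
distinguish a failed match from the various error conditions. */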
3082
3083 #if defined COMPILE_PCRE8
3084 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3085 pcre_dfa_exec(const pcre *argument_re, const pcre_extra *extra_data,
3086 const char *subject, int length, int start_offset, int options, int *offsets,
3087 int offsetcount, int *workspace, int wscount)
3088 #elif defined COMPILE_PCRE16
3089 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3090 pcre16_dfa_exec(const pcre16 *argument_re, const pcre16_extra *extra_data,
3091 PCRE_SPTR16 subject, int length, int start_offset, int options, int *offsets,
3092 int offsetcount, int *workspace, int wscount)
3093 #elif defined COMPILE_PCRE32
3094 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3095 pcre32_dfa_exec(const pcre32 *argument_re, const pcre32_extra *extra_data,
3096 PCRE_SPTR32 subject, int length, int start_offset, int options, int *offsets,
3097 int offsetcount, int *workspace, int wscount)
3098 #endif
3099 {
3100 REAL_PCRE *re = (REAL_PCRE *)argument_re;
3101 dfa_match_data match_block;
3102 dfa_match_data *md = &match_block;
3103 BOOL utf, anchored, startline, firstline;
3104 const pcre_uchar *current_subject, *end_subject;
3105 const pcre_study_data *study = NULL;
3106
3107 const pcre_uchar *req_char_ptr;
3108 const pcre_uint8 *start_bits = NULL;
3109 BOOL has_first_char = FALSE;
3110 BOOL has_req_char = FALSE;
3111 pcre_uchar first_char = 0;
3112 pcre_uchar first_char2 = 0;
3113 pcre_uchar req_char = 0;
3114 pcre_uchar req_char2 = 0;
3115 int newline;
3116
/* Plausibility checks */

if ((options & ~PUBLIC_DFA_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
if (re == NULL || subject == NULL || workspace == NULL ||
   (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
if (wscount < 20) return PCRE_ERROR_DFA_WSSIZE;
if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;

/* Check that the first field in the block is the magic number. If it is not,
return with PCRE_ERROR_BADMAGIC. However, if the magic number is equal to
REVERSED_MAGIC_NUMBER we return with PCRE_ERROR_BADENDIANNESS, which
means that the pattern was probably compiled with a different endianness. */

if (re->magic_number != MAGIC_NUMBER)
  return re->magic_number == REVERSED_MAGIC_NUMBER?
    PCRE_ERROR_BADENDIANNESS:PCRE_ERROR_BADMAGIC;
if ((re->flags & PCRE_MODE) == 0) return PCRE_ERROR_BADMODE;

/* If restarting after a partial match, do some sanity checks on the contents
of the workspace. */

if ((options & PCRE_DFA_RESTART) != 0)
  {
  if ((workspace[0] & (-2)) != 0 || workspace[1] < 1 ||
    workspace[1] > (wscount - 2)/INTS_PER_STATEBLOCK)
      return PCRE_ERROR_DFA_BADRESTART;
  }

/* Set up study, callout, and table data */

md->tables = re->tables;
md->callout_data = NULL;

if (extra_data != NULL)
  {
  unsigned int flags = extra_data->flags;
  if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
    study = (const pcre_study_data *)extra_data->study_data;
  if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0) return PCRE_ERROR_DFA_UMLIMIT;
  if ((flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION) != 0)
    return PCRE_ERROR_DFA_UMLIMIT;
  if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
    md->callout_data = extra_data->callout_data;
  if ((flags & PCRE_EXTRA_TABLES) != 0)
    md->tables = extra_data->tables;
  }

/* Set some local values */

current_subject = (const pcre_uchar *)subject + start_offset;
end_subject = (const pcre_uchar *)subject + length;
req_char_ptr = current_subject - 1;

#ifdef SUPPORT_UTF
/* PCRE_UTF(16|32) have the same value as PCRE_UTF8. */
utf = (re->options & PCRE_UTF8) != 0;
#else
utf = FALSE;
#endif

anchored = (options & (PCRE_ANCHORED|PCRE_DFA_RESTART)) != 0 ||
  (re->options & PCRE_ANCHORED) != 0;

/* The remaining fixed data for passing around. */

md->start_code = (const pcre_uchar *)argument_re +
    re->name_table_offset + re->name_count * re->name_entry_size;
md->start_subject = (const pcre_uchar *)subject;
md->end_subject = end_subject;
md->start_offset = start_offset;
md->moptions = options;
md->poptions = re->options;

/* If the BSR option is not set at match time, copy what was set
at compile time. */

if ((md->moptions & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) == 0)
  {
  if ((re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) != 0)
    md->moptions |= re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE);
#ifdef BSR_ANYCRLF
  else md->moptions |= PCRE_BSR_ANYCRLF;
#endif
  }

/* Handle different types of newline. The three bits give eight cases. If
nothing is set at run time, whatever was used at compile time applies. */

switch ((((options & PCRE_NEWLINE_BITS) == 0)? re->options : (pcre_uint32)options) &
         PCRE_NEWLINE_BITS)
  {
  case 0: newline = NEWLINE; break;              /* Compile-time default */
  case PCRE_NEWLINE_CR: newline = CHAR_CR; break;
  case PCRE_NEWLINE_LF: newline = CHAR_NL; break;
  case PCRE_NEWLINE_CR+
       PCRE_NEWLINE_LF: newline = (CHAR_CR << 8) | CHAR_NL; break;
  case PCRE_NEWLINE_ANY: newline = -1; break;
  case PCRE_NEWLINE_ANYCRLF: newline = -2; break;
  default: return PCRE_ERROR_BADNEWLINE;
  }

if (newline == -2)
  {
  md->nltype = NLTYPE_ANYCRLF;
  }
else if (newline < 0)
  {
  md->nltype = NLTYPE_ANY;
  }
else
  {
  md->nltype = NLTYPE_FIXED;
  if (newline > 255)
    {
    md->nllen = 2;
    md->nl[0] = (newline >> 8) & 255;
    md->nl[1] = newline & 255;
    }
  else
    {
    md->nllen = 1;
    md->nl[0] = newline;
    }
  }

/* Check a UTF string if required. If the check fails and the offsets vector
has room for at least two values, the error offset is passed back in offsets[0]
and the detailed error code in offsets[1]. */

#ifdef SUPPORT_UTF
if (utf && (options & PCRE_NO_UTF8_CHECK) == 0)
  {
  int erroroffset;
  int errorcode = PRIV(valid_utf)((pcre_uchar *)subject, length, &erroroffset);
  if (errorcode != 0)
    {
    if (offsetcount >= 2)
      {
      offsets[0] = erroroffset;
      offsets[1] = errorcode;
      }
#if defined COMPILE_PCRE8
    return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0) ?
      PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
#elif defined COMPILE_PCRE16
    return (errorcode <= PCRE_UTF16_ERR1 && (options & PCRE_PARTIAL_HARD) != 0) ?
      PCRE_ERROR_SHORTUTF16 : PCRE_ERROR_BADUTF16;
#elif defined COMPILE_PCRE32
    return PCRE_ERROR_BADUTF32;
#endif
    }
#if defined COMPILE_PCRE8 || defined COMPILE_PCRE16
  if (start_offset > 0 && start_offset < length &&
        NOT_FIRSTCHAR(((PCRE_PUCHAR)subject)[start_offset]))
    return PCRE_ERROR_BADUTF8_OFFSET;
#endif
  }
#endif

/* If the exec call supplied NULL for tables, use the inbuilt ones. This
is a feature that makes it possible to save compiled regexes and re-use them
in other programs later. */

if (md->tables == NULL) md->tables = PRIV(default_tables);

/* The "must be at the start of a line" flags are used in a loop when finding
where to start. */

startline = (re->flags & PCRE_STARTLINE) != 0;
firstline = (re->options & PCRE_FIRSTLINE) != 0;

/* Set up the first character to match, if available. The first_char value is
never set for an anchored regular expression, but the anchoring may be forced
at run time, so we have to test for anchoring. The first char may be unset for
an unanchored pattern, of course. If there's no first char and the pattern was
studied, there may be a bitmap of possible first characters. */

if (!anchored)
  {
  if ((re->flags & PCRE_FIRSTSET) != 0)
    {
    has_first_char = TRUE;
    first_char = first_char2 = (pcre_uchar)(re->first_char);
    if ((re->flags & PCRE_FCH_CASELESS) != 0)
      {
      first_char2 = TABLE_GET(first_char, md->tables + fcc_offset, first_char);
#if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
      if (utf && first_char > 127)
        first_char2 = UCD_OTHERCASE(first_char);
#endif
      }
    }
  else
    {
    if (!startline && study != NULL &&
         (study->flags & PCRE_STUDY_MAPPED) != 0)
      start_bits = study->start_bits;
    }
  }

/* For anchored or unanchored matches, there may be a "last known required
character" set. */

if ((re->flags & PCRE_REQCHSET) != 0)
  {
  has_req_char = TRUE;
  req_char = req_char2 = (pcre_uchar)(re->req_char);
  if ((re->flags & PCRE_RCH_CASELESS) != 0)
    {
    req_char2 = TABLE_GET(req_char, md->tables + fcc_offset, req_char);
#if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
    if (utf && req_char > 127)
      req_char2 = UCD_OTHERCASE(req_char);
#endif
    }
  }

/* Call the main matching function, looping for a non-anchored regex after a
failed match. If not restarting, perform certain optimizations at the start of
a match. */

for (;;)
  {
  int rc;

  if ((options & PCRE_DFA_RESTART) == 0)
    {
    const pcre_uchar *save_end_subject = end_subject;

    /* If firstline is TRUE, the start of the match is constrained to the first
    line of a multiline string. Implement this by temporarily adjusting
    end_subject so that we stop scanning at a newline. If the match fails at
    the newline, later code breaks this loop. */

    if (firstline)
      {
      PCRE_PUCHAR t = current_subject;
#ifdef SUPPORT_UTF
      if (utf)
        {
        while (t < md->end_subject && !IS_NEWLINE(t))
          {
          t++;
          ACROSSCHAR(t < end_subject, *t, t++);
          }
        }
      else
#endif
      while (t < md->end_subject && !IS_NEWLINE(t)) t++;
      end_subject = t;
      }

    /* There are some optimizations that avoid running the match if a known
    starting point is not found. However, there is an option that disables
    these, for testing and for ensuring that all callouts do actually occur.
    The option can be set in the regex by (*NO_START_OPT) or passed in
    match-time options. */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
      {
      /* Advance to a known first char. */

      if (has_first_char)
        {
        if (first_char != first_char2)
          {
          pcre_uchar csc;
          while (current_subject < end_subject &&
                 (csc = RAWUCHARTEST(current_subject)) != first_char && csc != first_char2)
            current_subject++;
          }
        else
          while (current_subject < end_subject &&
                 RAWUCHARTEST(current_subject) != first_char)
            current_subject++;
        }

      /* Or to just after a linebreak for a multiline match if possible */

      else if (startline)
        {
        if (current_subject > md->start_subject + start_offset)
          {
#ifdef SUPPORT_UTF
          if (utf)
            {
            while (current_subject < end_subject &&
                   !WAS_NEWLINE(current_subject))
              {
              current_subject++;
              ACROSSCHAR(current_subject < end_subject, *current_subject,
                current_subject++);
              }
            }
          else
#endif
          while (current_subject < end_subject && !WAS_NEWLINE(current_subject))
            current_subject++;

          /* If we have just passed a CR and the newline option is ANY or
          ANYCRLF, and we are now at a LF, advance the match position by one
          more character. */

          if (RAWUCHARTEST(current_subject - 1) == CHAR_CR &&
               (md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
               current_subject < end_subject &&
               RAWUCHARTEST(current_subject) == CHAR_NL)
            current_subject++;
          }
        }

      /* Or to a non-unique first char after study */

      else if (start_bits != NULL)
        {
        while (current_subject < end_subject)
          {
          register pcre_uint32 c = RAWUCHARTEST(current_subject);
#ifndef COMPILE_PCRE8
          if (c > 255) c = 255;
#endif
          if ((start_bits[c/8] & (1 << (c&7))) == 0)
            {
            current_subject++;
#if defined SUPPORT_UTF && defined COMPILE_PCRE8
            /* In non 8-bit mode, the iteration will stop for
            characters > 255 at the beginning or not stop at all. */
            if (utf)
              ACROSSCHAR(current_subject < end_subject, *current_subject,
                current_subject++);
#endif
            }
          else break;
          }
        }
      }

    /* Restore fudged end_subject */

    end_subject = save_end_subject;

    /* The following two optimizations are disabled for partial matching or if
    disabling is explicitly requested (and of course, by the test above, this
    code is not obeyed when restarting after a partial match). */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0 &&
        (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
      {
      /* If the pattern was studied, a minimum subject length may be set. This
      is a lower bound; there may be no string of exactly that length that
      matches the pattern. Although the value is, strictly, in characters, we
      compare it against a count of code units to avoid spending too much time
      in this optimization. */

      if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
          (pcre_uint32)(end_subject - current_subject) < study->minlength)
        return PCRE_ERROR_NOMATCH;

      /* If req_char is set, we know that that character must appear in the
      subject for the match to succeed. If the first character is set, req_char
      must be later in the subject; otherwise the test starts at the match
      point. This optimization can save a huge amount of work in patterns with
      nested unlimited repeats that aren't going to match. Writing separate
      code for cased/caseless versions makes it go faster, as does using an
      autoincrement and backing off on a match.

      HOWEVER: when the subject string is very, very long, searching to its end
      can take a long time, and give bad performance on quite ordinary
      patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
      string... so we don't do this when the string is sufficiently long. */

      if (has_req_char && end_subject - current_subject < REQ_BYTE_MAX)
        {
        register PCRE_PUCHAR p = current_subject + (has_first_char? 1:0);

        /* We don't need to repeat the search if we haven't yet reached the
        place we found it at last time. */

        if (p > req_char_ptr)
          {
          if (req_char != req_char2)
            {
            while (p < end_subject)
              {
              register pcre_uint32 pp = RAWUCHARINCTEST(p);
              if (pp == req_char || pp == req_char2) { p--; break; }
              }
            }
          else
            {
            while (p < end_subject)
              {
              if (RAWUCHARINCTEST(p) == req_char) { p--; break; }
              }
            }

          /* If we can't find the required character, break the matching loop,
          which will cause a return of PCRE_ERROR_NOMATCH. */

          if (p >= end_subject) break;

          /* If we have found the required character, save the point where we
          found it, so that we don't search again next time round the loop if
          the start hasn't passed this character yet. */

          req_char_ptr = p;
          }
        }
      }
    }   /* End of optimizations that are done when not restarting */

  /* OK, now we can do the business */

  md->start_used_ptr = current_subject;
  md->recursive = NULL;

  rc = internal_dfa_exec(
    md,                                /* fixed match data */
    md->start_code,                    /* this subexpression's code */
    current_subject,                   /* where we currently are */
    start_offset,                      /* start offset in subject */
    offsets,                           /* offset vector */
    offsetcount,                       /* size of same */
    workspace,                         /* workspace vector */
    wscount,                           /* size of same */
    0);                                /* function recurse level */

  /* Anything other than "no match" means we are done, always; otherwise, carry
  on only if not anchored. */

  if (rc != PCRE_ERROR_NOMATCH || anchored) return rc;

  /* Advance to the next subject character unless we are at the end of a line
  and firstline is set. */

  if (firstline && IS_NEWLINE(current_subject)) break;
  current_subject++;
#ifdef SUPPORT_UTF
  if (utf)
    {
    ACROSSCHAR(current_subject < end_subject, *current_subject,
      current_subject++);
    }
#endif
  if (current_subject > end_subject) break;

  /* If we have just passed a CR and we are now at a LF, and the pattern does
  not contain any explicit matches for \r or \n, and the newline option is CRLF
  or ANY or ANYCRLF, advance the match position by one more character. */

  if (RAWUCHARTEST(current_subject - 1) == CHAR_CR &&
      current_subject < end_subject &&
      RAWUCHARTEST(current_subject) == CHAR_NL &&
      (re->flags & PCRE_HASCRORLF) == 0 &&
        (md->nltype == NLTYPE_ANY ||
         md->nltype == NLTYPE_ANYCRLF ||
         md->nllen == 2))
    current_subject++;

  }   /* "Bumpalong" loop */

return PCRE_ERROR_NOMATCH;
}

/* End of pcre_dfa_exec.c */
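
The "non-unique first char after study" step in the listing tests each subject character against a 256-bit map using `start_bits[c/8] & (1 << (c&7))` and skips forward until a set bit is found. The following is a minimal sketch of that bit test for 8-bit, non-UTF subjects only. The names `set_start_bit` and `skip_to_start` are hypothetical helpers invented for this illustration and are not part of PCRE; in the library the bitmap comes from studying the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): mark character c as a
   possible first character in a 32-byte (256-bit) map. */
static void set_start_bit(unsigned char *bits, unsigned int c)
{
  assert(bits != NULL && c < 256);
  bits[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Hypothetical helper (not a PCRE function): return the offset of the
   first character whose bit is set in the map, or length if there is
   none. Byte c / 8 selects the map byte, bit c & 7 selects the bit. */
static size_t skip_to_start(const unsigned char *bits,
                            const unsigned char *subject, size_t length)
{
  size_t i = 0;
  assert(bits != NULL && subject != NULL);
  while (i < length)
    {
    unsigned int c = subject[i];
    if ((bits[c / 8] & (1u << (c & 7))) != 0) break;  /* possible start */
    i++;                                              /* cannot start here */
    }
  return i;
}
```

With a map that has only the bits for `a` and `b` set, scanning the subject `xxxyzb` stops at offset 5, and scanning `xyz` runs to the end and returns the length, which corresponds to the loop in the listing leaving `current_subject` at `end_subject`.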

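
The required-character optimization in the listing remembers where `req_char` was last found (`req_char_ptr`) so that later iterations of the bumpalong loop do not rescan the subject until the start point has moved past that position. The sketch below illustrates only that caching idea, for a single case-sensitive byte. The type `req_cache` and the function `has_required_char` are hypothetical names, and the sketch uses a NULL pointer for "nothing found yet" where the listing initialises `req_char_ptr` to `current_subject - 1`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state for this sketch (not a PCRE type). */
typedef struct {
  const unsigned char *last_found;  /* where req was last seen, or NULL */
  int scans;                        /* number of real searches performed */
} req_cache;

/* Hypothetical helper (not a PCRE function): return 1 if req occurs in
   [start, end), else 0. A search is run only when start has moved past
   the cached position; otherwise the cached hit is still ahead of, or
   at, the start point and the answer is already known. */
static int has_required_char(req_cache *cache, const unsigned char *start,
                             const unsigned char *end, unsigned char req)
{
  const unsigned char *p = start;
  assert(cache != NULL && start != NULL && start <= end);

  if (cache->last_found != NULL && p <= cache->last_found) return 1;

  cache->scans++;
  while (p < end && *p != req) p++;
  if (p >= end) return 0;           /* caller can give up: no match possible */

  cache->last_found = p;            /* remember the hit for later calls */
  return 1;
}
```

For the subject `aaaaZbbb` and required character `Z`, calls with start offsets 0, 2 and 4 perform one search in total; a call with start offset 5 searches again, finds nothing, and returns 0, which is the point at which the listing breaks out of the loop and returns PCRE_ERROR_NOMATCH.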