Contents of /code/trunk/pcre_dfa_exec.c

Revision 1084
Tue Oct 16 15:55:28 2012 UTC by chpe
File MIME type: text/plain
File size: 123720 byte(s)
pcre32: More 32-bit cleanliness fixes
/*************************************************
*      Perl-Compatible Regular Expressions       *
*************************************************/

/* PCRE is a library of functions to support regular expressions whose syntax
and semantics are as close as possible to those of the Perl 5 language (but see
below for why this module is different).

                       Written by Philip Hazel
           Copyright (c) 1997-2012 University of Cambridge

-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
      this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    * Neither the name of the University of Cambridge nor the names of its
      contributors may be used to endorse or promote products derived from
      this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
-----------------------------------------------------------------------------
*/

/* This module contains the external function pcre_dfa_exec(), which is an
alternative matching function that uses a sort of DFA algorithm (not a true
FSM). This is NOT Perl-compatible, but it has advantages in certain
applications. */


/* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
the performance of his patterns greatly. I could not use it as it stood, as it
was not thread safe, and made assumptions about pattern sizes. Also, it caused
test 7 to loop, and test 9 to crash with a segfault.

The issue is the check for duplicate states, which is done by a simple linear
search up the state list. (Grep for "duplicate" below to find the code.) For
many patterns, there will never be many states active at one time, so a simple
linear search is fine. In patterns that have many active states, it might be a
bottleneck. The suggested code used an indexing scheme to remember which states
had previously been used for each character, and avoided the linear search when
it knew there was no chance of a duplicate. This was implemented when adding
states to the state lists.

I wrote some thread-safe, not-limited code to try something similar at the time
of checking for duplicates (instead of when adding states), using index vectors
on the stack. It did give a 13% improvement with one specially constructed
pattern for certain subject strings, but on other strings and on many of the
simpler patterns in the test suite it did worse. The major problem, I think,
was the extra time to initialize the index. This had to be done for each call
of internal_dfa_exec(). (The supplied patch used a static vector, initialized
only once - I suspect this was the cause of the problems with the tests.)

Overall, I concluded that the gains in some cases did not outweigh the losses
in others, so I abandoned this code. */
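
The linear duplicate check that the performance note refers to can be pictured with a small standalone sketch. It is an illustration only, not code taken from this module: `demo_state` and `is_duplicate` are hypothetical names, and the state record is cut down to the two fields that the comparison uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified state record: the offset of the opcode in the
compiled pattern and the repeat count reached so far. */

typedef struct demo_state {
  int offset;
  int count;
} demo_state;

/* Return 1 if states[i] repeats an earlier entry (same offset and same
count), scanning linearly up the list from the start; return 0 otherwise.
The cost grows with i, which is why long state lists may become slow. */

static int
is_duplicate(const demo_state *states, size_t i)
{
size_t j;
for (j = 0; j < i; j++)
  {
  if (states[j].offset == states[i].offset &&
      states[j].count == states[i].count)
    return 1;
  }
return 0;
}
```

With only a handful of active states the scan is cheap, which is consistent with the conclusion above that an index did not pay for its own initialization on most patterns.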



#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#define NLBLOCK md             /* Block containing newline information */
#define PSSTART start_subject  /* Field containing processed string start */
#define PSEND   end_subject    /* Field containing processed string end */

#include "pcre_internal.h"


/* For use to indent debugging output */

#define SP " "


/*************************************************
*      Code parameters and static tables         *
*************************************************/

/* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
into others, under special conditions. A gap of 20 between the blocks should be
enough. The resulting opcodes don't have to be less than 256 because they are
never stored, so we push them well clear of the normal opcodes. */

#define OP_PROP_EXTRA       300
#define OP_EXTUNI_EXTRA     320
#define OP_ANYNL_EXTRA      340
#define OP_HSPACE_EXTRA     360
#define OP_VSPACE_EXTRA     380


/* This table identifies those opcodes that are followed immediately by a
character that is to be tested in some way. This makes it possible to
centralize the loading of these characters. In the case of Type * etc, the
"character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
small value. Non-zero values in the table are the offsets from the opcode where
the character is to be found. ***NOTE*** If the start of this table is
modified, the three tables that follow must also be modified. */

static const pcre_uint8 coptable[] = {
  0,                             /* End                                    */
  0, 0, 0, 0, 0,                 /* \A, \G, \K, \B, \b                     */
  0, 0, 0, 0, 0, 0,              /* \D, \d, \S, \s, \W, \w                 */
  0, 0, 0,                       /* Any, AllAny, Anybyte                   */
  0, 0,                          /* \P, \p                                 */
  0, 0, 0, 0, 0,                 /* \R, \H, \h, \V, \v                     */
  0,                             /* \X                                     */
  0, 0, 0, 0, 0, 0,              /* \Z, \z, ^, ^M, $, $M                   */
  1,                             /* Char                                   */
  1,                             /* Chari                                  */
  1,                             /* not                                    */
  1,                             /* noti                                   */
  /* Positive single-char repeats                                          */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ??                    */
  1+IMM2_SIZE, 1+IMM2_SIZE,      /* upto, minupto                          */
  1+IMM2_SIZE,                   /* exact                                  */
  1, 1, 1, 1+IMM2_SIZE,          /* *+, ++, ?+, upto+                      */
  1, 1, 1, 1, 1, 1,              /* *I, *?I, +I, +?I, ?I, ??I              */
  1+IMM2_SIZE, 1+IMM2_SIZE,      /* upto I, minupto I                      */
  1+IMM2_SIZE,                   /* exact I                                */
  1, 1, 1, 1+IMM2_SIZE,          /* *+I, ++I, ?+I, upto+I                  */
  /* Negative single-char repeats - only for chars < 256                   */
  1, 1, 1, 1, 1, 1,              /* NOT *, *?, +, +?, ?, ??                */
  1+IMM2_SIZE, 1+IMM2_SIZE,      /* NOT upto, minupto                      */
  1+IMM2_SIZE,                   /* NOT exact                              */
  1, 1, 1, 1+IMM2_SIZE,          /* NOT *+, ++, ?+, upto+                  */
  1, 1, 1, 1, 1, 1,              /* NOT *I, *?I, +I, +?I, ?I, ??I          */
  1+IMM2_SIZE, 1+IMM2_SIZE,      /* NOT upto I, minupto I                  */
  1+IMM2_SIZE,                   /* NOT exact I                            */
  1, 1, 1, 1+IMM2_SIZE,          /* NOT *+I, ++I, ?+I, upto+I              */
  /* Positive type repeats                                                 */
  1, 1, 1, 1, 1, 1,              /* Type *, *?, +, +?, ?, ??               */
  1+IMM2_SIZE, 1+IMM2_SIZE,      /* Type upto, minupto                     */
  1+IMM2_SIZE,                   /* Type exact                             */
  1, 1, 1, 1+IMM2_SIZE,          /* Type *+, ++, ?+, upto+                 */
  /* Character class & ref repeats                                         */
  0, 0, 0, 0, 0, 0,              /* *, *?, +, +?, ?, ??                    */
  0, 0,                          /* CRRANGE, CRMINRANGE                    */
  0,                             /* CLASS                                  */
  0,                             /* NCLASS                                 */
  0,                             /* XCLASS - variable length               */
  0,                             /* REF                                    */
  0,                             /* REFI                                   */
  0,                             /* RECURSE                                */
  0,                             /* CALLOUT                                */
  0,                             /* Alt                                    */
  0,                             /* Ket                                    */
  0,                             /* KetRmax                                */
  0,                             /* KetRmin                                */
  0,                             /* KetRpos                                */
  0,                             /* Reverse                                */
  0,                             /* Assert                                 */
  0,                             /* Assert not                             */
  0,                             /* Assert behind                          */
  0,                             /* Assert behind not                      */
  0, 0,                          /* ONCE, ONCE_NC                          */
  0, 0, 0, 0, 0,                 /* BRA, BRAPOS, CBRA, CBRAPOS, COND       */
  0, 0, 0, 0, 0,                 /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND  */
  0, 0,                          /* CREF, NCREF                            */
  0, 0,                          /* RREF, NRREF                            */
  0,                             /* DEF                                    */
  0, 0, 0,                       /* BRAZERO, BRAMINZERO, BRAPOSZERO        */
  0, 0, 0,                       /* MARK, PRUNE, PRUNE_ARG                 */
  0, 0, 0, 0,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG         */
  0, 0, 0, 0,                    /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT    */
  0, 0                           /* CLOSE, SKIPZERO                        */
};
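
As a rough illustration of how an offset table of this kind centralizes operand loading, here is a miniature standalone version. Everything in it is hypothetical: the `DEMO_` opcodes, `DEMO_IMM2_SIZE` and `demo_load_operand()` are invented for the sketch and stand in for the real opcode set and `IMM2_SIZE` from pcre_internal.h.

```c
#include <assert.h>

/* Hypothetical miniature opcode set, for illustration only. */

enum { DEMO_END, DEMO_ANY, DEMO_CHAR, DEMO_UPTO, DEMO_TABLE_LENGTH };

#define DEMO_IMM2_SIZE 2   /* assumed size of a 16-bit immediate, in bytes */

/* Non-zero entries give the offset from the opcode to its character
operand; zero means the opcode carries no such operand. */

static const unsigned char demo_coptable[] = {
  0,                     /* End:  no operand                               */
  0,                     /* Any:  no operand                               */
  1,                     /* Char: operand follows the opcode directly      */
  1 + DEMO_IMM2_SIZE     /* upto: operand follows a two-byte repeat count  */
};

/* Return the character operand of the opcode at code[0], or -1 if the
opcode has none. One lookup replaces a per-opcode case analysis. */

static int
demo_load_operand(const unsigned char *code)
{
int off = demo_coptable[code[0]];
return (off > 0)? code[off] : -1;
}
```

The sketch ignores UTF handling; it only shows why one table lookup is enough to find the operand regardless of which opcode precedes it.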

/* This table identifies those opcodes that inspect a character. It is used to
remember the fact that a character could have been inspected when the end of
the subject is reached. ***NOTE*** If the start of this table is modified, the
two tables that follow must also be modified. */

static const pcre_uint8 poptable[] = {
  0,                             /* End                                    */
  0, 0, 0, 1, 1,                 /* \A, \G, \K, \B, \b                     */
  1, 1, 1, 1, 1, 1,              /* \D, \d, \S, \s, \W, \w                 */
  1, 1, 1,                       /* Any, AllAny, Anybyte                   */
  1, 1,                          /* \P, \p                                 */
  1, 1, 1, 1, 1,                 /* \R, \H, \h, \V, \v                     */
  1,                             /* \X                                     */
  0, 0, 0, 0, 0, 0,              /* \Z, \z, ^, ^M, $, $M                   */
  1,                             /* Char                                   */
  1,                             /* Chari                                  */
  1,                             /* not                                    */
  1,                             /* noti                                   */
  /* Positive single-char repeats                                          */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ??                    */
  1, 1, 1,                       /* upto, minupto, exact                   */
  1, 1, 1, 1,                    /* *+, ++, ?+, upto+                      */
  1, 1, 1, 1, 1, 1,              /* *I, *?I, +I, +?I, ?I, ??I              */
  1, 1, 1,                       /* upto I, minupto I, exact I             */
  1, 1, 1, 1,                    /* *+I, ++I, ?+I, upto+I                  */
  /* Negative single-char repeats - only for chars < 256                   */
  1, 1, 1, 1, 1, 1,              /* NOT *, *?, +, +?, ?, ??                */
  1, 1, 1,                       /* NOT upto, minupto, exact               */
  1, 1, 1, 1,                    /* NOT *+, ++, ?+, upto+                  */
  1, 1, 1, 1, 1, 1,              /* NOT *I, *?I, +I, +?I, ?I, ??I          */
  1, 1, 1,                       /* NOT upto I, minupto I, exact I         */
  1, 1, 1, 1,                    /* NOT *+I, ++I, ?+I, upto+I              */
  /* Positive type repeats                                                 */
  1, 1, 1, 1, 1, 1,              /* Type *, *?, +, +?, ?, ??               */
  1, 1, 1,                       /* Type upto, minupto, exact              */
  1, 1, 1, 1,                    /* Type *+, ++, ?+, upto+                 */
  /* Character class & ref repeats                                         */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ??                    */
  1, 1,                          /* CRRANGE, CRMINRANGE                    */
  1,                             /* CLASS                                  */
  1,                             /* NCLASS                                 */
  1,                             /* XCLASS - variable length               */
  0,                             /* REF                                    */
  0,                             /* REFI                                   */
  0,                             /* RECURSE                                */
  0,                             /* CALLOUT                                */
  0,                             /* Alt                                    */
  0,                             /* Ket                                    */
  0,                             /* KetRmax                                */
  0,                             /* KetRmin                                */
  0,                             /* KetRpos                                */
  0,                             /* Reverse                                */
  0,                             /* Assert                                 */
  0,                             /* Assert not                             */
  0,                             /* Assert behind                          */
  0,                             /* Assert behind not                      */
  0, 0,                          /* ONCE, ONCE_NC                          */
  0, 0, 0, 0, 0,                 /* BRA, BRAPOS, CBRA, CBRAPOS, COND       */
  0, 0, 0, 0, 0,                 /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND  */
  0, 0,                          /* CREF, NCREF                            */
  0, 0,                          /* RREF, NRREF                            */
  0,                             /* DEF                                    */
  0, 0, 0,                       /* BRAZERO, BRAMINZERO, BRAPOSZERO        */
  0, 0, 0,                       /* MARK, PRUNE, PRUNE_ARG                 */
  0, 0, 0, 0,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG         */
  0, 0, 0, 0,                    /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT    */
  0, 0                           /* CLOSE, SKIPZERO                        */
};

/* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W,
and \w */

static const pcre_uint8 toptable1[] = {
  0, 0, 0, 0, 0, 0,
  ctype_digit, ctype_digit,
  ctype_space, ctype_space,
  ctype_word,  ctype_word,
  0, 0                            /* OP_ANY, OP_ALLANY */
};

static const pcre_uint8 toptable2[] = {
  0, 0, 0, 0, 0, 0,
  ctype_digit, 0,
  ctype_space, 0,
  ctype_word,  0,
  1, 1                            /* OP_ANY, OP_ALLANY */
};
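
One way to read the two tables is as a mask followed by an exclusive-or: the first table selects the relevant character-type bit, and the second flips the result for the negated types. The sketch below works under that assumption. The `DEMO_` bit values, the `T_` type indices, `demo_ctype()` and `demo_type_matches()` are all hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <ctype.h>

/* Bit values assumed for this sketch only. */

#define DEMO_SPACE 0x01
#define DEMO_DIGIT 0x04
#define DEMO_WORD  0x10

/* Hypothetical type indices, in the same negated/positive pairing as
\D, \d, \S, \s, \W, \w, followed by "any character". */

enum { T_NOT_DIGIT, T_DIGIT, T_NOT_SPACE, T_SPACE, T_NOT_WORD, T_WORD, T_ANY };

static const unsigned char demo_mask[] = {
  DEMO_DIGIT, DEMO_DIGIT, DEMO_SPACE, DEMO_SPACE, DEMO_WORD, DEMO_WORD, 0 };

static const unsigned char demo_flip[] = {
  DEMO_DIGIT, 0, DEMO_SPACE, 0, DEMO_WORD, 0, 1 };

/* Build the type bits for one character from the C library classifiers. */

static unsigned char
demo_ctype(unsigned char c)
{
unsigned char bits = 0;
if (isdigit(c)) bits |= DEMO_DIGIT;
if (isspace(c)) bits |= DEMO_SPACE;
if (isalnum(c) || c == '_') bits |= DEMO_WORD;
return bits;
}

/* A single expression covers all seven types: mask out the bit of
interest, then XOR so that negated types succeed when the bit is clear. */

static int
demo_type_matches(int type, unsigned char c)
{
return ((demo_ctype(c) & demo_mask[type]) ^ demo_flip[type]) != 0;
}
```

For the "any" entry the mask is zero and the flip is one, so the expression is always non-zero; any newline restriction would have to be tested separately.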


/* Structure for holding data about a particular state, which is in effect the
current data for an active path through the match tree. It must consist
entirely of ints because the working vector we are passed, and which we put
these structures in, is a vector of ints. */

typedef struct stateblock {
  int offset;                     /* Offset to opcode */
  int count;                      /* Count for repeats */
  int data;                       /* Some use extra data */
} stateblock;

#define INTS_PER_STATEBLOCK  (int)(sizeof(stateblock)/sizeof(int))
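
Because the structure consists only of ints, an int workspace can be divided directly into state blocks. The sketch below reproduces the arithmetic that internal_dfa_exec() applies to wscount further down: two ints are reserved at the front, and the remainder is split evenly between two state lists. The names `demo_stateblock`, `DEMO_INTS_PER_STATEBLOCK` and `demo_states_per_list()` are hypothetical, and the sketch assumes the usual case in which the structure has no padding.

```c
#include <assert.h>

/* Hypothetical copy of the three-int state record. */

typedef struct demo_stateblock {
  int offset;
  int count;
  int data;
} demo_stateblock;

#define DEMO_INTS_PER_STATEBLOCK (int)(sizeof(demo_stateblock)/sizeof(int))

/* Given a workspace of wscount ints, return how many state blocks each of
the two lists can hold. The first two ints are reserved (they record the
current list and its length for restarting), and any ints left over after
an even split are simply unused. */

static int
demo_states_per_list(int wscount)
{
wscount -= 2;
return (wscount - (wscount % (DEMO_INTS_PER_STATEBLOCK * 2))) /
       (2 * DEMO_INTS_PER_STATEBLOCK);
}
```

A workspace of 20 ints therefore gives three blocks per list, and anything below 8 ints gives none, in which case the first attempt to add a state would fail.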


#ifdef PCRE_DEBUG
/*************************************************
*             Print character string             *
*************************************************/

/* Character string printing function for debugging.

Arguments:
  p            points to string
  length       number of bytes
  f            where to print

Returns:       nothing
*/

static void
pchars(const pcre_uchar *p, int length, FILE *f)
{
int c;
while (length-- > 0)
  {
  if (isprint(c = *(p++)))
    fprintf(f, "%c", c);
  else
    fprintf(f, "\\x%02x", c);
  }
}
#endif



/*************************************************
*    Execute a Regular Expression - DFA engine   *
*************************************************/

/* This internal function applies a compiled pattern to a subject string,
starting at a given point, using a DFA engine. This function is called from the
external one, possibly multiple times if the pattern is not anchored. The
function calls itself recursively for some kinds of subpattern.

Arguments:
  md                the match_data block with fixed information
  this_start_code   the opening bracket of this subexpression's code
  current_subject   where we currently are in the subject string
  start_offset      start offset in the subject string
  offsets           vector to contain the matching string offsets
  offsetcount       size of same
  workspace         vector of workspace
  wscount           size of same
  rlevel            function call recursion level

Returns:            > 0 => number of match offset pairs placed in offsets
                    = 0 => offsets overflowed; longest matches are present
                     -1 => failed to match
                   < -1 => some kind of unexpected problem

The following macros are used for adding states to the two state vectors (one
for the current character, one for the following character). */

#define ADD_ACTIVE(x,y) \
  if (active_count++ < wscount) \
    { \
    next_active_state->offset = (x); \
    next_active_state->count  = (y); \
    next_active_state++; \
    DPRINTF(("%.*sADD_ACTIVE(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_ACTIVE_DATA(x,y,z) \
  if (active_count++ < wscount) \
    { \
    next_active_state->offset = (x); \
    next_active_state->count  = (y); \
    next_active_state->data   = (z); \
    next_active_state++; \
    DPRINTF(("%.*sADD_ACTIVE_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_NEW(x,y) \
  if (new_count++ < wscount) \
    { \
    next_new_state->offset = (x); \
    next_new_state->count  = (y); \
    next_new_state++; \
    DPRINTF(("%.*sADD_NEW(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_NEW_DATA(x,y,z) \
  if (new_count++ < wscount) \
    { \
    next_new_state->offset = (x); \
    next_new_state->count  = (y); \
    next_new_state->data   = (z); \
    next_new_state++; \
    DPRINTF(("%.*sADD_NEW_DATA(%d,%d,%d) line %d\n", rlevel*2-2, SP, \
      (x), (y), (z), __LINE__)); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE
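
The bounded-append pattern used by these macros can also be written as a plain function, which may be easier to follow than the macro form. This is a sketch under stated assumptions, not a replacement for the macros: `demo_state`, `demo_add_state()` and `DEMO_ERROR_WSSIZE` are hypothetical, and the function reports an error code where the macros return directly from the enclosing function.

```c
#include <assert.h>

/* Hypothetical error value standing in for PCRE_ERROR_DFA_WSSIZE. */

#define DEMO_ERROR_WSSIZE (-1)

/* Hypothetical two-field state record. */

typedef struct demo_state {
  int offset;
  int count;
} demo_state;

/* Append one state at *next if the list still has room for it.
The counter is incremented even when the append fails, as in the macros.
Returns 0 on success or DEMO_ERROR_WSSIZE when the list is full. */

static int
demo_add_state(demo_state **next, int *used, int limit, int offset, int count)
{
if ((*used)++ < limit)
  {
  (*next)->offset = offset;
  (*next)->count = count;
  (*next)++;
  return 0;
  }
return DEMO_ERROR_WSSIZE;
}
```

The macro form saves a function call and lets the failure propagate as an immediate return, at the cost of needing braces around any use that sits inside an if/else.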

/* And now, here is the code */

static int
internal_dfa_exec(
  dfa_match_data *md,
  const pcre_uchar *this_start_code,
  const pcre_uchar *current_subject,
  int start_offset,
  int *offsets,
  int offsetcount,
  int *workspace,
  int wscount,
  int rlevel)
{
stateblock *active_states, *new_states, *temp_states;
stateblock *next_active_state, *next_new_state;

const pcre_uint8 *ctypes, *lcc, *fcc;
const pcre_uchar *ptr;
const pcre_uchar *end_code, *first_op;

dfa_recursion_info new_recursive;

int active_count, new_count, match_count;

/* Some fields in the md block are frequently referenced, so we load them into
independent variables in the hope that this will perform better. */

const pcre_uchar *start_subject = md->start_subject;
const pcre_uchar *end_subject = md->end_subject;
const pcre_uchar *start_code = md->start_code;

#ifdef SUPPORT_UTF
BOOL utf = (md->poptions & PCRE_UTF8) != 0;
#else
BOOL utf = FALSE;
#endif

BOOL reset_could_continue = FALSE;

rlevel++;
offsetcount &= (-2);

wscount -= 2;
wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
          (2 * INTS_PER_STATEBLOCK);

DPRINTF(("\n%.*s---------------------\n"
  "%.*sCall to internal_dfa_exec f=%d\n",
  rlevel*2-2, SP, rlevel*2-2, SP, rlevel));

ctypes = md->tables + ctypes_offset;
lcc = md->tables + lcc_offset;
fcc = md->tables + fcc_offset;

match_count = PCRE_ERROR_NOMATCH;   /* A negative number */

active_states = (stateblock *)(workspace + 2);
next_new_state = new_states = active_states + wscount;
new_count = 0;

first_op = this_start_code + 1 + LINK_SIZE +
  ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
    *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
    ? IMM2_SIZE:0);

/* The first thing in any (sub) pattern is a bracket of some sort. Push all
the alternative states onto the list, and find out where the end is. This
makes it possible to use this function recursively, when we want to stop at a
matching internal ket rather than at the end.

If the first opcode in the first alternative is OP_REVERSE, we are dealing with
a backward assertion. In that case, we have to find out the maximum amount to
move back, and set up each alternative appropriately. */

if (*first_op == OP_REVERSE)
  {
  int max_back = 0;
  int gone_back;

  end_code = this_start_code;
  do
    {
    int back = GET(end_code, 2+LINK_SIZE);
    if (back > max_back) max_back = back;
    end_code += GET(end_code, 1);
    }
  while (*end_code == OP_ALT);

  /* If we can't go back the amount required for the longest lookbehind
  pattern, go back as far as we can; some alternatives may still be viable. */

#ifdef SUPPORT_UTF
  /* In character mode we have to step back character by character */

  if (utf)
    {
    for (gone_back = 0; gone_back < max_back; gone_back++)
      {
      if (current_subject <= start_subject) break;
      current_subject--;
      ACROSSCHAR(current_subject > start_subject, *current_subject, current_subject--);
      }
    }
  else
#endif

  /* In byte-mode we can do this quickly. */

    {
    gone_back = (current_subject - max_back < start_subject)?
      (int)(current_subject - start_subject) : max_back;
    current_subject -= gone_back;
    }

  /* Save the earliest consulted character */

  if (current_subject < md->start_used_ptr)
    md->start_used_ptr = current_subject;

  /* Now we can process the individual branches. */

  end_code = this_start_code;
  do
    {
    int back = GET(end_code, 2+LINK_SIZE);
    if (back <= gone_back)
      {
      int bstate = (int)(end_code - start_code + 2 + 2*LINK_SIZE);
      ADD_NEW_DATA(-bstate, 0, gone_back - back);
      }
    end_code += GET(end_code, 1);
    }
  while (*end_code == OP_ALT);
  }

/* This is the code for a "normal" subpattern (not a backward assertion). The
start of a whole pattern is always one of these. If we are at the top level,
we may be asked to restart matching from the same point that we reached for a
previous partial match. We still have to scan through the top-level branches to
find the end state. */

else
  {
  end_code = this_start_code;

  /* Restarting */

  if (rlevel == 1 && (md->moptions & PCRE_DFA_RESTART) != 0)
    {
    do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
    new_count = workspace[1];
    if (!workspace[0])
      memcpy(new_states, active_states, new_count * sizeof(stateblock));
    }

  /* Not restarting */

  else
    {
    int length = 1 + LINK_SIZE +
      ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
        *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
        ? IMM2_SIZE:0);
    do
      {
      ADD_NEW((int)(end_code - start_code + length), 0);
      end_code += GET(end_code, 1);
      length = 1 + LINK_SIZE;
      }
    while (*end_code == OP_ALT);
    }
  }

workspace[0] = 0;    /* Bit indicating which vector is current */

DPRINTF(("%.*sEnd state = %d\n", rlevel*2-2, SP, (int)(end_code - start_code)));

/* Loop for scanning the subject */

ptr = current_subject;
for (;;)
  {
  int i, j;
  int clen, dlen;
  pcre_uint32 c, d;
  int forced_fail = 0;
  BOOL partial_newline = FALSE;
  BOOL could_continue = reset_could_continue;
  reset_could_continue = FALSE;

  /* Make the new state list into the active state list and empty the
  new state list. */

  temp_states = active_states;
  active_states = new_states;
  new_states = temp_states;
  active_count = new_count;
  new_count = 0;

  workspace[0] ^= 1;              /* Remember for the restarting feature */
  workspace[1] = active_count;

#ifdef PCRE_DEBUG
  printf("%.*sNext character: rest of subject = \"", rlevel*2-2, SP);
  pchars(ptr, STRLEN_UC(ptr), stdout);
  printf("\"\n");

  printf("%.*sActive states: ", rlevel*2-2, SP);
  for (i = 0; i < active_count; i++)
    printf("%d/%d ", active_states[i].offset, active_states[i].count);
  printf("\n");
#endif

  /* Set the pointers for adding new states */

  next_active_state = active_states + active_count;
  next_new_state = new_states;

  /* Load the current character from the subject outside the loop, as many
  different states may want to look at it, and we assume that at least one
  will. */

  if (ptr < end_subject)
    {
    clen = 1;        /* Number of data items in the character */
#ifdef SUPPORT_UTF
    if (utf) { GETCHARLEN(c, ptr, clen); } else
#endif  /* SUPPORT_UTF */
    c = *ptr;
    }
  else
    {
    clen = 0;        /* This indicates the end of the subject */
    c = NOTACHAR;    /* This value should never actually be used */
    }

  /* Scan up the active states and act on each one. The result of an action
  may be to add more states to the currently active list (e.g. on hitting a
  parenthesis) or it may be to put states on the new list, for considering
  when we move the character pointer on. */

  for (i = 0; i < active_count; i++)
    {
    stateblock *current_state = active_states + i;
    BOOL caseless = FALSE;
    const pcre_uchar *code;
    int state_offset = current_state->offset;
    int count, codevalue, rrc;

#ifdef PCRE_DEBUG
    printf ("%.*sProcessing state %d c=", rlevel*2-2, SP, state_offset);
    if (clen == 0) printf("EOL\n");
      else if (c > 32 && c < 127) printf("'%c'\n", c);
        else printf("0x%02x\n", c);
#endif

    /* A negative offset is a special case meaning "hold off going to this
    (negated) state until the number of characters in the data field have
    been skipped". If the could_continue flag was passed over from a previous
    state, arrange for it to be passed on. */

    if (state_offset < 0)
      {
      if (current_state->data > 0)
        {
        DPRINTF(("%.*sSkipping this character\n", rlevel*2-2, SP));
        ADD_NEW_DATA(state_offset, current_state->count,
          current_state->data - 1);
        if (could_continue) reset_could_continue = TRUE;
        continue;
        }
      else
        {
        current_state->offset = state_offset = -state_offset;
        }
      }

    /* Check for a duplicate state with the same count, and skip if found.
    See the note at the head of this module about the possibility of improving
    performance here. */

    for (j = 0; j < i; j++)
      {
      if (active_states[j].offset == state_offset &&
          active_states[j].count == current_state->count)
        {
        DPRINTF(("%.*sDuplicate state: skipped\n", rlevel*2-2, SP));
        goto NEXT_ACTIVE_STATE;
        }
      }

    /* The state offset is the offset to the opcode */

    code = start_code + state_offset;
    codevalue = *code;

    /* If this opcode inspects a character, but we are at the end of the
    subject, remember the fact for use when testing for a partial match. */

    if (clen == 0 && poptable[codevalue] != 0)
      could_continue = TRUE;

    /* If this opcode is followed by an inline character, load it. It is
    tempting to test for the presence of a subject character here, but that
    is wrong, because sometimes zero repetitions of the subject are
    permitted.

    We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
    argument that is not a data character - but is always one byte long because
    the values are small. We have to take special action to deal with \P, \p,
    \H, \h, \V, \v and \X in this case. To keep the other cases fast, convert
    these ones to new opcodes. */

    if (coptable[codevalue] > 0)
      {
      dlen = 1;
#ifdef SUPPORT_UTF
      if (utf) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
#endif  /* SUPPORT_UTF */
      d = code[coptable[codevalue]];
      if (codevalue >= OP_TYPESTAR)
        {
        switch(d)
          {
          case OP_ANYBYTE: return PCRE_ERROR_DFA_UITEM;
          case OP_NOTPROP:
          case OP_PROP: codevalue += OP_PROP_EXTRA; break;
          case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
          case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
          case OP_NOT_HSPACE:
          case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
          case OP_NOT_VSPACE:
          case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
          default: break;
          }
        }
      }
    else
      {
      dlen = 0;         /* Not strictly necessary, but compilers moan */
      d = NOTACHAR;     /* if these variables are not set. */
      }


    /* Now process the individual opcodes */

    switch (codevalue)
      {
/* ========================================================================== */
      /* These cases are never obeyed. This is a fudge that causes a compile-
      time error if the vectors coptable or poptable, which are indexed by
      opcode, are not the correct length. It seems to be the only way to do
      such a check at compile time, as the sizeof() operator does not work
      in the C preprocessor. */

      case OP_TABLE_LENGTH:
      case OP_TABLE_LENGTH +
        ((sizeof(coptable) == OP_TABLE_LENGTH) &&
         (sizeof(poptable) == OP_TABLE_LENGTH)):
      break;

/* ========================================================================== */
      /* Reached a closing bracket. If not at the end of the pattern, carry
      on with the next opcode. For repeating opcodes, also add the repeat
      state. Note that KETRPOS will always be encountered at the end of the
      subpattern, because the possessive subpattern repeats are always handled
      using recursive calls. Thus, it never adds any new states.

      At the end of the (sub)pattern, unless we have an empty string and
      PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
      start of the subject, save the match data, shifting up all previous
      matches so we always have the longest first. */

      case OP_KET:
      case OP_KETRMIN:
      case OP_KETRMAX:
      case OP_KETRPOS:
      if (code != end_code)
        {
        ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
        if (codevalue != OP_KET)
          {
          ADD_ACTIVE(state_offset - GET(code, 1), 0);
          }
        }
      else
        {
        if (ptr > current_subject ||
            ((md->moptions & PCRE_NOTEMPTY) == 0 &&
              ((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
                current_subject > start_subject + md->start_offset)))
          {
          if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
            else if (match_count > 0 && ++match_count * 2 > offsetcount)
              match_count = 0;
          count = ((match_count == 0)? offsetcount : match_count * 2) - 2;
          if (count > 0) memmove(offsets + 2, offsets, count * sizeof(int));
          if (offsetcount >= 2)
            {
            offsets[0] = (int)(current_subject - start_subject);
            offsets[1] = (int)(ptr - start_subject);
            DPRINTF(("%.*sSet matched string = \"%.*s\"\n", rlevel*2-2, SP,
              offsets[1] - offsets[0], (char *)current_subject));
            }
          if ((md->moptions & PCRE_DFA_SHORTEST) != 0)
            {
            DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
              "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel,
              match_count, rlevel*2-2, SP));
            return match_count;
            }
          }
        }
      break;

/* ========================================================================== */
      /* These opcodes add to the current list of states without looking
      at the current character. */

      /*-----------------------------------------------------------------*/
      case OP_ALT:
      do { code += GET(code, 1); } while (*code == OP_ALT);
      ADD_ACTIVE((int)(code - start_code), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_BRA:
      case OP_SBRA:
      do
        {
        ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
        code += GET(code, 1);
        }
      while (*code == OP_ALT);
      break;

      /*-----------------------------------------------------------------*/
      case OP_CBRA:
      case OP_SCBRA:
      ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE + IMM2_SIZE), 0);
      code += GET(code, 1);
      while (*code == OP_ALT)
        {
        ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
        code += GET(code, 1);
        }
      break;

      /*-----------------------------------------------------------------*/
      case OP_BRAZERO:
      case OP_BRAMINZERO:
      ADD_ACTIVE(state_offset + 1, 0);
      code += 1 + GET(code, 2);
      while (*code == OP_ALT) code += GET(code, 1);
      ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_SKIPZERO:
      code += 1 + GET(code, 2);
      while (*code == OP_ALT) code += GET(code, 1);
      ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_CIRC:
      if (ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0)
        { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_CIRCM:
      if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
          (ptr != end_subject && WAS_NEWLINE(ptr)))
        { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_EOD:
      if (ptr >= end_subject)
        {
        if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
          could_continue = TRUE;
        else { ADD_ACTIVE(state_offset + 1, 0); }
        }
      break;

      /*-----------------------------------------------------------------*/
      case OP_SOD:
      if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_SOM:
      if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
      break;


/* ========================================================================== */
      /* These opcodes inspect the next subject character, and sometimes
      the previous one as well, but do not have an argument. The variable
      clen contains the length of the current character and is zero if we are
      at the end of the subject. */

      /*-----------------------------------------------------------------*/
      case OP_ANY:
      if (clen > 0 && !IS_NEWLINE(ptr))
        {
        if (ptr + 1 >= md->end_subject &&
            (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
            NLBLOCK->nltype == NLTYPE_FIXED &&
            NLBLOCK->nllen == 2 &&
            c == NLBLOCK->nl[0])
          {
          could_continue = partial_newline = TRUE;
          }
        else
          {
          ADD_NEW(state_offset + 1, 0);
          }
        }
      break;

      /*-----------------------------------------------------------------*/
      case OP_ALLANY:
      if (clen > 0)
        { ADD_NEW(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_EODN:
      if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
        could_continue = TRUE;
      else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
        { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_DOLL:
      if ((md->moptions & PCRE_NOTEOL) == 0)
        {
        if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
          could_continue = TRUE;
        else if (clen == 0 ||
            ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
               (ptr == end_subject - md->nllen)
            ))
          { ADD_ACTIVE(state_offset + 1, 0); }
        else if (ptr + 1 >= md->end_subject &&
                 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
                 NLBLOCK->nltype == NLTYPE_FIXED &&
                 NLBLOCK->nllen == 2 &&
                 c == NLBLOCK->nl[0])
          {
          if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
            {
            reset_could_continue = TRUE;
            ADD_NEW_DATA(-(state_offset + 1), 0, 1);
            }
          else could_continue = partial_newline = TRUE;
          }
        }
      break;

      /*-----------------------------------------------------------------*/
      case OP_DOLLM:
      if ((md->moptions & PCRE_NOTEOL) == 0)
        {
        if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
          could_continue = TRUE;
        else if (clen == 0 ||
            ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
          { ADD_ACTIVE(state_offset + 1, 0); }
        else if (ptr + 1 >= md->end_subject &&
                 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
                 NLBLOCK->nltype == NLTYPE_FIXED &&
                 NLBLOCK->nllen == 2 &&
                 c == NLBLOCK->nl[0])
          {
          if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
            {
            reset_could_continue = TRUE;
            ADD_NEW_DATA(-(state_offset + 1), 0, 1);
            }
          else could_continue = partial_newline = TRUE;
975 }
976 }
977 else if (IS_NEWLINE(ptr))
978 { ADD_ACTIVE(state_offset + 1, 0); }
979 break;
980
981 /*-----------------------------------------------------------------*/
982
983 case OP_DIGIT:
984 case OP_WHITESPACE:
985 case OP_WORDCHAR:
986 if (clen > 0 && c < 256 &&
987 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
988 { ADD_NEW(state_offset + 1, 0); }
989 break;
990
991 /*-----------------------------------------------------------------*/
992 case OP_NOT_DIGIT:
993 case OP_NOT_WHITESPACE:
994 case OP_NOT_WORDCHAR:
995 if (clen > 0 && (c >= 256 ||
996 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
997 { ADD_NEW(state_offset + 1, 0); }
998 break;
999
1000 /*-----------------------------------------------------------------*/
1001 case OP_WORD_BOUNDARY:
1002 case OP_NOT_WORD_BOUNDARY:
1003 {
1004 int left_word, right_word;
1005
1006 if (ptr > start_subject)
1007 {
1008 const pcre_uchar *temp = ptr - 1;
1009 if (temp < md->start_used_ptr) md->start_used_ptr = temp;
1010 #if defined SUPPORT_UTF && !defined COMPILE_PCRE32
1011 if (utf) { BACKCHAR(temp); }
1012 #endif
1013 GETCHARTEST(d, temp);
1014 #ifdef SUPPORT_UCP
1015 if ((md->poptions & PCRE_UCP) != 0)
1016 {
1017 if (d == '_') left_word = TRUE; else
1018 {
1019 int cat = UCD_CATEGORY(d);
1020 left_word = (cat == ucp_L || cat == ucp_N);
1021 }
1022 }
1023 else
1024 #endif
1025 left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
1026 }
1027 else left_word = FALSE;
1028
1029 if (clen > 0)
1030 {
1031 #ifdef SUPPORT_UCP
1032 if ((md->poptions & PCRE_UCP) != 0)
1033 {
1034 if (c == '_') right_word = TRUE; else
1035 {
1036 int cat = UCD_CATEGORY(c);
1037 right_word = (cat == ucp_L || cat == ucp_N);
1038 }
1039 }
1040 else
1041 #endif
1042 right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
1043 }
1044 else right_word = FALSE;
1045
1046 if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
1047 { ADD_ACTIVE(state_offset + 1, 0); }
1048 }
1049 break;
1050
1051
1052 /*-----------------------------------------------------------------*/
1053 /* Check the next character by Unicode property. We will get here only
1054 if Unicode property support is in the binary; otherwise an error occurs
1055 when the pattern is compiled. */
1056
1057 #ifdef SUPPORT_UCP
1058 case OP_PROP:
1059 case OP_NOTPROP:
1060 if (clen > 0)
1061 {
1062 BOOL OK;
1063 const pcre_uint32 *cp;
1064 const ucd_record * prop = GET_UCD(c);
1065 switch(code[1])
1066 {
1067 case PT_ANY:
1068 OK = TRUE;
1069 break;
1070
1071 case PT_LAMP:
1072 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1073 prop->chartype == ucp_Lt;
1074 break;
1075
1076 case PT_GC:
1077 OK = PRIV(ucp_gentype)[prop->chartype] == code[2];
1078 break;
1079
1080 case PT_PC:
1081 OK = prop->chartype == code[2];
1082 break;
1083
1084 case PT_SC:
1085 OK = prop->script == code[2];
1086 break;
1087
1088 /* These are specials for combination cases. */
1089
1090 case PT_ALNUM:
1091 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1092 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1093 break;
1094
1095 case PT_SPACE: /* Perl space */
1096 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1097 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1098 break;
1099
1100 case PT_PXSPACE: /* POSIX space */
1101 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1102 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1103 c == CHAR_FF || c == CHAR_CR;
1104 break;
1105
1106 case PT_WORD:
1107 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1108 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1109 c == CHAR_UNDERSCORE;
1110 break;
1111
1112 case PT_CLIST:
1113 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1114 for (;;)
1115 {
1116 if (c < *cp) { OK = FALSE; break; }
1117 if (c == *cp++) { OK = TRUE; break; }
1118 }
1119 break;
1120
1121 /* Should never occur, but keep compilers from grumbling. */
1122
1123 default:
1124 OK = codevalue != OP_PROP;
1125 break;
1126 }
1127
1128 if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1129 }
1130 break;
1131 #endif
1132
1133
1134
1135 /* ========================================================================== */
1136 /* These opcodes likewise inspect the subject character, but have an
1137 argument that is not a data character. It is one of these opcodes:
1138 OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_WHITESPACE,
1139 OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
1140
1141 case OP_TYPEPLUS:
1142 case OP_TYPEMINPLUS:
1143 case OP_TYPEPOSPLUS:
1144 count = current_state->count; /* Already matched */
1145 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1146 if (clen > 0)
1147 {
1148 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1149 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1150 NLBLOCK->nltype == NLTYPE_FIXED &&
1151 NLBLOCK->nllen == 2 &&
1152 c == NLBLOCK->nl[0])
1153 {
1154 could_continue = partial_newline = TRUE;
1155 }
1156 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1157 (c < 256 &&
1158 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1159 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1160 {
1161 if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1162 {
1163 active_count--; /* Remove non-match possibility */
1164 next_active_state--;
1165 }
1166 count++;
1167 ADD_NEW(state_offset, count);
1168 }
1169 }
1170 break;
1171
1172 /*-----------------------------------------------------------------*/
1173 case OP_TYPEQUERY:
1174 case OP_TYPEMINQUERY:
1175 case OP_TYPEPOSQUERY:
1176 ADD_ACTIVE(state_offset + 2, 0);
1177 if (clen > 0)
1178 {
1179 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1180 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1181 NLBLOCK->nltype == NLTYPE_FIXED &&
1182 NLBLOCK->nllen == 2 &&
1183 c == NLBLOCK->nl[0])
1184 {
1185 could_continue = partial_newline = TRUE;
1186 }
1187 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1188 (c < 256 &&
1189 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1190 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1191 {
1192 if (codevalue == OP_TYPEPOSQUERY)
1193 {
1194 active_count--; /* Remove non-match possibility */
1195 next_active_state--;
1196 }
1197 ADD_NEW(state_offset + 2, 0);
1198 }
1199 }
1200 break;
1201
1202 /*-----------------------------------------------------------------*/
1203 case OP_TYPESTAR:
1204 case OP_TYPEMINSTAR:
1205 case OP_TYPEPOSSTAR:
1206 ADD_ACTIVE(state_offset + 2, 0);
1207 if (clen > 0)
1208 {
1209 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1210 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1211 NLBLOCK->nltype == NLTYPE_FIXED &&
1212 NLBLOCK->nllen == 2 &&
1213 c == NLBLOCK->nl[0])
1214 {
1215 could_continue = partial_newline = TRUE;
1216 }
1217 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1218 (c < 256 &&
1219 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1220 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1221 {
1222 if (codevalue == OP_TYPEPOSSTAR)
1223 {
1224 active_count--; /* Remove non-match possibility */
1225 next_active_state--;
1226 }
1227 ADD_NEW(state_offset, 0);
1228 }
1229 }
1230 break;
1231
1232 /*-----------------------------------------------------------------*/
1233 case OP_TYPEEXACT:
1234 count = current_state->count; /* Number already matched */
1235 if (clen > 0)
1236 {
1237 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1238 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1239 NLBLOCK->nltype == NLTYPE_FIXED &&
1240 NLBLOCK->nllen == 2 &&
1241 c == NLBLOCK->nl[0])
1242 {
1243 could_continue = partial_newline = TRUE;
1244 }
1245 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1246 (c < 256 &&
1247 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1248 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1249 {
1250 if (++count >= GET2(code, 1))
1251 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 1, 0); }
1252 else
1253 { ADD_NEW(state_offset, count); }
1254 }
1255 }
1256 break;
1257
1258 /*-----------------------------------------------------------------*/
1259 case OP_TYPEUPTO:
1260 case OP_TYPEMINUPTO:
1261 case OP_TYPEPOSUPTO:
1262 ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0);
1263 count = current_state->count; /* Number already matched */
1264 if (clen > 0)
1265 {
1266 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1267 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1268 NLBLOCK->nltype == NLTYPE_FIXED &&
1269 NLBLOCK->nllen == 2 &&
1270 c == NLBLOCK->nl[0])
1271 {
1272 could_continue = partial_newline = TRUE;
1273 }
1274 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1275 (c < 256 &&
1276 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1277 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1278 {
1279 if (codevalue == OP_TYPEPOSUPTO)
1280 {
1281 active_count--; /* Remove non-match possibility */
1282 next_active_state--;
1283 }
1284 if (++count >= GET2(code, 1))
1285 { ADD_NEW(state_offset + 2 + IMM2_SIZE, 0); }
1286 else
1287 { ADD_NEW(state_offset, count); }
1288 }
1289 }
1290 break;
1291
1292 /* ========================================================================== */
1293 /* These are virtual opcodes that are used when something like OP_TYPEPLUS
1294 has OP_PROP, OP_NOTPROP, OP_ANYNL, OP_EXTUNI, or a horizontal or vertical
1295 whitespace type (OP_HSPACE, OP_VSPACE, or a negation) as its argument. It
1296 keeps the code above fast for the other cases. The argument is in d. */
1297
1298 #ifdef SUPPORT_UCP
1299 case OP_PROP_EXTRA + OP_TYPEPLUS:
1300 case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1301 case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1302 count = current_state->count; /* Already matched */
1303 if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1304 if (clen > 0)
1305 {
1306 BOOL OK;
1307 const pcre_uint32 *cp;
1308 const ucd_record * prop = GET_UCD(c);
1309 switch(code[2])
1310 {
1311 case PT_ANY:
1312 OK = TRUE;
1313 break;
1314
1315 case PT_LAMP:
1316 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1317 prop->chartype == ucp_Lt;
1318 break;
1319
1320 case PT_GC:
1321 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1322 break;
1323
1324 case PT_PC:
1325 OK = prop->chartype == code[3];
1326 break;
1327
1328 case PT_SC:
1329 OK = prop->script == code[3];
1330 break;
1331
1332 /* These are specials for combination cases. */
1333
1334 case PT_ALNUM:
1335 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1336 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1337 break;
1338
1339 case PT_SPACE: /* Perl space */
1340 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1341 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1342 break;
1343
1344 case PT_PXSPACE: /* POSIX space */
1345 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1346 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1347 c == CHAR_FF || c == CHAR_CR;
1348 break;
1349
1350 case PT_WORD:
1351 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1352 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1353 c == CHAR_UNDERSCORE;
1354 break;
1355
1356 case PT_CLIST:
1357 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1358 for (;;)
1359 {
1360 if (c < *cp) { OK = FALSE; break; }
1361 if (c == *cp++) { OK = TRUE; break; }
1362 }
1363 break;
1364
1365 /* Should never occur, but keep compilers from grumbling. */
1366
1367 default:
1368 OK = d != OP_PROP;
1369 break;
1370 }
1371
1372 if (OK == (d == OP_PROP))
1373 {
1374 if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1375 {
1376 active_count--; /* Remove non-match possibility */
1377 next_active_state--;
1378 }
1379 count++;
1380 ADD_NEW(state_offset, count);
1381 }
1382 }
1383 break;
1384
1385 /*-----------------------------------------------------------------*/
1386 case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1387 case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1388 case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1389 count = current_state->count; /* Already matched */
1390 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1391 if (clen > 0)
1392 {
1393 int lgb, rgb;
1394 const pcre_uchar *nptr = ptr + clen;
1395 int ncount = 0;
1396 if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1397 {
1398 active_count--; /* Remove non-match possibility */
1399 next_active_state--;
1400 }
1401 lgb = UCD_GRAPHBREAK(c);
1402 while (nptr < end_subject)
1403 {
1404 dlen = 1;
1405 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1406 rgb = UCD_GRAPHBREAK(d);
1407 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1408 ncount++;
1409 lgb = rgb;
1410 nptr += dlen;
1411 }
1412 count++;
1413 ADD_NEW_DATA(-state_offset, count, ncount);
1414 }
1415 break;
1416 #endif
1417
1418 /*-----------------------------------------------------------------*/
1419 case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1420 case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1421 case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1422 count = current_state->count; /* Already matched */
1423 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1424 if (clen > 0)
1425 {
1426 int ncount = 0;
1427 switch (c)
1428 {
1429 case CHAR_VT:
1430 case CHAR_FF:
1431 case CHAR_NEL:
1432 #ifndef EBCDIC
1433 case 0x2028:
1434 case 0x2029:
1435 #endif /* Not EBCDIC */
1436 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1437 goto ANYNL01;
1438
1439 case CHAR_CR:
1440 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1441 /* Fall through */
1442
1443 ANYNL01:
1444 case CHAR_LF:
1445 if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1446 {
1447 active_count--; /* Remove non-match possibility */
1448 next_active_state--;
1449 }
1450 count++;
1451 ADD_NEW_DATA(-state_offset, count, ncount);
1452 break;
1453
1454 default:
1455 break;
1456 }
1457 }
1458 break;
1459
1460 /*-----------------------------------------------------------------*/
1461 case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1462 case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1463 case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1464 count = current_state->count; /* Already matched */
1465 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1466 if (clen > 0)
1467 {
1468 BOOL OK;
1469 switch (c)
1470 {
1471 VSPACE_CASES:
1472 OK = TRUE;
1473 break;
1474
1475 default:
1476 OK = FALSE;
1477 break;
1478 }
1479
1480 if (OK == (d == OP_VSPACE))
1481 {
1482 if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1483 {
1484 active_count--; /* Remove non-match possibility */
1485 next_active_state--;
1486 }
1487 count++;
1488 ADD_NEW_DATA(-state_offset, count, 0);
1489 }
1490 }
1491 break;
1492
1493 /*-----------------------------------------------------------------*/
1494 case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1495 case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1496 case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1497 count = current_state->count; /* Already matched */
1498 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1499 if (clen > 0)
1500 {
1501 BOOL OK;
1502 switch (c)
1503 {
1504 HSPACE_CASES:
1505 OK = TRUE;
1506 break;
1507
1508 default:
1509 OK = FALSE;
1510 break;
1511 }
1512
1513 if (OK == (d == OP_HSPACE))
1514 {
1515 if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1516 {
1517 active_count--; /* Remove non-match possibility */
1518 next_active_state--;
1519 }
1520 count++;
1521 ADD_NEW_DATA(-state_offset, count, 0);
1522 }
1523 }
1524 break;
1525
1526 /*-----------------------------------------------------------------*/
1527 #ifdef SUPPORT_UCP
1528 case OP_PROP_EXTRA + OP_TYPEQUERY:
1529 case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1530 case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
1531 count = 4; /* Query: on a match, move past this 4-unit item */
1532 goto QS1;
1533
1534 case OP_PROP_EXTRA + OP_TYPESTAR:
1535 case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1536 case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1537 count = 0; /* Star: on a match, stay on this item */
1538
1539 QS1:
1540
1541 ADD_ACTIVE(state_offset + 4, 0);
1542 if (clen > 0)
1543 {
1544 BOOL OK;
1545 const pcre_uint32 *cp;
1546 const ucd_record * prop = GET_UCD(c);
1547 switch(code[2])
1548 {
1549 case PT_ANY:
1550 OK = TRUE;
1551 break;
1552
1553 case PT_LAMP:
1554 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1555 prop->chartype == ucp_Lt;
1556 break;
1557
1558 case PT_GC:
1559 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1560 break;
1561
1562 case PT_PC:
1563 OK = prop->chartype == code[3];
1564 break;
1565
1566 case PT_SC:
1567 OK = prop->script == code[3];
1568 break;
1569
1570 /* These are specials for combination cases. */
1571
1572 case PT_ALNUM:
1573 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1574 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1575 break;
1576
1577 case PT_SPACE: /* Perl space */
1578 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1579 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1580 break;
1581
1582 case PT_PXSPACE: /* POSIX space */
1583 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1584 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1585 c == CHAR_FF || c == CHAR_CR;
1586 break;
1587
1588 case PT_WORD:
1589 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1590 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1591 c == CHAR_UNDERSCORE;
1592 break;
1593
1594 case PT_CLIST:
1595 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1596 for (;;)
1597 {
1598 if (c < *cp) { OK = FALSE; break; }
1599 if (c == *cp++) { OK = TRUE; break; }
1600 }
1601 break;
1602
1603 /* Should never occur, but keep compilers from grumbling. */
1604
1605 default:
1606 OK = d != OP_PROP;
1607 break;
1608 }
1609
1610 if (OK == (d == OP_PROP))
1611 {
1612 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1613 codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1614 {
1615 active_count--; /* Remove non-match possibility */
1616 next_active_state--;
1617 }
1618 ADD_NEW(state_offset + count, 0);
1619 }
1620 }
1621 break;
1622
1623 /*-----------------------------------------------------------------*/
1624 case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1625 case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1626 case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1627 count = 2; /* Query: on a match, move past this 2-unit item */
1628 goto QS2;
1629
1630 case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1631 case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1632 case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1633 count = 0; /* Star: on a match, stay on this item */
1634
1635 QS2:
1636
1637 ADD_ACTIVE(state_offset + 2, 0);
1638 if (clen > 0)
1639 {
1640 int lgb, rgb;
1641 const pcre_uchar *nptr = ptr + clen;
1642 int ncount = 0;
1643 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1644 codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1645 {
1646 active_count--; /* Remove non-match possibility */
1647 next_active_state--;
1648 }
1649 lgb = UCD_GRAPHBREAK(c);
1650 while (nptr < end_subject)
1651 {
1652 dlen = 1;
1653 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1654 rgb = UCD_GRAPHBREAK(d);
1655 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1656 ncount++;
1657 lgb = rgb;
1658 nptr += dlen;
1659 }
1660 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1661 }
1662 break;
1663 #endif
1664
1665 /*-----------------------------------------------------------------*/
1666 case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1667 case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1668 case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1669 count = 2; /* Query: on a match, move past this 2-unit item */
1670 goto QS3;
1671
1672 case OP_ANYNL_EXTRA + OP_TYPESTAR:
1673 case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1674 case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1675 count = 0; /* Star: on a match, stay on this item */
1676
1677 QS3:
1678 ADD_ACTIVE(state_offset + 2, 0);
1679 if (clen > 0)
1680 {
1681 int ncount = 0;
1682 switch (c)
1683 {
1684 case CHAR_VT:
1685 case CHAR_FF:
1686 case CHAR_NEL:
1687 #ifndef EBCDIC
1688 case 0x2028:
1689 case 0x2029:
1690 #endif /* Not EBCDIC */
1691 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1692 goto ANYNL02;
1693
1694 case CHAR_CR:
1695 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1696 /* Fall through */
1697
1698 ANYNL02:
1699 case CHAR_LF:
1700 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1701 codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1702 {
1703 active_count--; /* Remove non-match possibility */
1704 next_active_state--;
1705 }
1706 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1707 break;
1708
1709 default:
1710 break;
1711 }
1712 }
1713 break;
1714
1715 /*-----------------------------------------------------------------*/
1716 case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1717 case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1718 case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1719 count = 2; /* Query: on a match, move past this 2-unit item */
1720 goto QS4;
1721
1722 case OP_VSPACE_EXTRA + OP_TYPESTAR:
1723 case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1724 case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1725 count = 0; /* Star: on a match, stay on this item */
1726
1727 QS4:
1728 ADD_ACTIVE(state_offset + 2, 0);
1729 if (clen > 0)
1730 {
1731 BOOL OK;
1732 switch (c)
1733 {
1734 VSPACE_CASES:
1735 OK = TRUE;
1736 break;
1737
1738 default:
1739 OK = FALSE;
1740 break;
1741 }
1742 if (OK == (d == OP_VSPACE))
1743 {
1744 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1745 codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1746 {
1747 active_count--; /* Remove non-match possibility */
1748 next_active_state--;
1749 }
1750 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1751 }
1752 }
1753 break;
1754
1755 /*-----------------------------------------------------------------*/
1756 case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1757 case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1758 case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1759 count = 2; /* Query: on a match, move past this 2-unit item */
1760 goto QS5;
1761
1762 case OP_HSPACE_EXTRA + OP_TYPESTAR:
1763 case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1764 case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1765 count = 0; /* Star: on a match, stay on this item */
1766
1767 QS5:
1768 ADD_ACTIVE(state_offset + 2, 0);
1769 if (clen > 0)
1770 {
1771 BOOL OK;
1772 switch (c)
1773 {
1774 HSPACE_CASES:
1775 OK = TRUE;
1776 break;
1777
1778 default:
1779 OK = FALSE;
1780 break;
1781 }
1782
1783 if (OK == (d == OP_HSPACE))
1784 {
1785 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1786 codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1787 {
1788 active_count--; /* Remove non-match possibility */
1789 next_active_state--;
1790 }
1791 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1792 }
1793 }
1794 break;
1795
1796 /*-----------------------------------------------------------------*/
1797 #ifdef SUPPORT_UCP
1798 case OP_PROP_EXTRA + OP_TYPEEXACT:
1799 case OP_PROP_EXTRA + OP_TYPEUPTO:
1800 case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1801 case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1802 if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1803 { ADD_ACTIVE(state_offset + 1 + IMM2_SIZE + 3, 0); }
1804 count = current_state->count; /* Number already matched */
1805 if (clen > 0)
1806 {
1807 BOOL OK;
1808 const pcre_uint32 *cp;
1809 const ucd_record * prop = GET_UCD(c);
1810 switch(code[1 + IMM2_SIZE + 1])
1811 {
1812 case PT_ANY:
1813 OK = TRUE;
1814 break;
1815
1816 case PT_LAMP:
1817 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1818 prop->chartype == ucp_Lt;
1819 break;
1820
1821 case PT_GC:
1822 OK = PRIV(ucp_gentype)[prop->chartype] == code[1 + IMM2_SIZE + 2];
1823 break;
1824
1825 case PT_PC:
1826 OK = prop->chartype == code[1 + IMM2_SIZE + 2];
1827 break;
1828
1829 case PT_SC:
1830 OK = prop->script == code[1 + IMM2_SIZE + 2];
1831 break;
1832
1833 /* These are specials for combination cases. */
1834
1835 case PT_ALNUM:
1836 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1837 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1838 break;
1839
1840 case PT_SPACE: /* Perl space */
1841 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1842 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1843 break;
1844
1845 case PT_PXSPACE: /* POSIX space */
1846 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1847 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1848 c == CHAR_FF || c == CHAR_CR;
1849 break;
1850
1851 case PT_WORD:
1852 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1853 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1854 c == CHAR_UNDERSCORE;
1855 break;
1856
1857 case PT_CLIST:
1858 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1859 for (;;)
1860 {
1861 if (c < *cp) { OK = FALSE; break; }
1862 if (c == *cp++) { OK = TRUE; break; }
1863 }
1864 break;
1865
1866 /* Should never occur, but keep compilers from grumbling. */
1867
1868 default:
1869 OK = d != OP_PROP;
1870 break;
1871 }
1872
1873 if (OK == (d == OP_PROP))
1874 {
1875 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
1876 {
1877 active_count--; /* Remove non-match possibility */
1878 next_active_state--;
1879 }
1880 if (++count >= GET2(code, 1))
1881 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 3, 0); }
1882 else
1883 { ADD_NEW(state_offset, count); }
1884 }
1885 }
1886 break;
1887
1888 /*-----------------------------------------------------------------*/
1889 case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
1890 case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
1891 case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
1892 case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
1893 if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
1894 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1895 count = current_state->count; /* Number already matched */
1896 if (clen > 0)
1897 {
1898 int lgb, rgb;
1899 const pcre_uchar *nptr = ptr + clen;
1900 int ncount = 0;
1901 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
1902 {
1903 active_count--; /* Remove non-match possibility */
1904 next_active_state--;
1905 }
1906 lgb = UCD_GRAPHBREAK(c);
1907 while (nptr < end_subject)
1908 {
1909 dlen = 1;
1910 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1911 rgb = UCD_GRAPHBREAK(d);
1912 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1913 ncount++;
1914 lgb = rgb;
1915 nptr += dlen;
1916 }
1917 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
1918 reset_could_continue = TRUE;
1919 if (++count >= GET2(code, 1))
1920 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1921 else
1922 { ADD_NEW_DATA(-state_offset, count, ncount); }
1923 }
1924 break;
1925 #endif
1926
1927 /*-----------------------------------------------------------------*/
1928 case OP_ANYNL_EXTRA + OP_TYPEEXACT:
1929 case OP_ANYNL_EXTRA + OP_TYPEUPTO:
1930 case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
1931 case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
1932 if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
1933 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1934 count = current_state->count; /* Number already matched */
1935 if (clen > 0)
1936 {
1937 int ncount = 0;
1938 switch (c)
1939 {
1940 case CHAR_VT:
1941 case CHAR_FF:
1942 case CHAR_NEL:
1943 #ifndef EBCDIC
1944 case 0x2028:
1945 case 0x2029:
1946 #endif /* Not EBCDIC */
1947 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1948 goto ANYNL03;
1949
1950 case CHAR_CR:
1951 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1952 /* Fall through */
1953
1954 ANYNL03:
1955 case CHAR_LF:
1956 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
1957 {
1958 active_count--; /* Remove non-match possibility */
1959 next_active_state--;
1960 }
1961 if (++count >= GET2(code, 1))
1962 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1963 else
1964 { ADD_NEW_DATA(-state_offset, count, ncount); }
1965 break;
1966
1967 default:
1968 break;
1969 }
1970 }
1971 break;
1972
1973 /*-----------------------------------------------------------------*/
1974 case OP_VSPACE_EXTRA + OP_TYPEEXACT:
1975 case OP_VSPACE_EXTRA + OP_TYPEUPTO:
1976 case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
1977 case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
1978 if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
1979 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1980 count = current_state->count; /* Number already matched */
1981 if (clen > 0)
1982 {
1983 BOOL OK;
1984 switch (c)
1985 {
1986 VSPACE_CASES:
1987 OK = TRUE;
1988 break;
1989
1990 default:
1991 OK = FALSE; break;
1992 }
1993
1994 if (OK == (d == OP_VSPACE))
1995 {
1996 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
1997 {
1998 active_count--; /* Remove non-match possibility */
1999 next_active_state--;
2000 }
2001 if (++count >= GET2(code, 1))
2002 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2003 else
2004 { ADD_NEW_DATA(-state_offset, count, 0); }
2005 }
2006 }
2007 break;
2008
2009 /*-----------------------------------------------------------------*/
2010 case OP_HSPACE_EXTRA + OP_TYPEEXACT:
2011 case OP_HSPACE_EXTRA + OP_TYPEUPTO:
2012 case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
2013 case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
2014 if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
2015 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
2016 count = current_state->count; /* Number already matched */
2017 if (clen > 0)
2018 {
2019 BOOL OK;
2020 switch (c)
2021 {
2022 HSPACE_CASES:
2023 OK = TRUE;
2024 break;
2025
2026 default:
2027 OK = FALSE;
2028 break;
2029 }
2030
2031 if (OK == (d == OP_HSPACE))
2032 {
2033 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
2034 {
2035 active_count--; /* Remove non-match possibility */
2036 next_active_state--;
2037 }
2038 if (++count >= GET2(code, 1))
2039 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2040 else
2041 { ADD_NEW_DATA(-state_offset, count, 0); }
2042 }
2043 }
2044 break;
2045
2046 /* ========================================================================== */
2047 /* These opcodes are followed by a character that is usually compared
2048 to the current subject character; it is loaded into d. We still get
2049 here even if there is no subject character, because in some cases zero
2050 repetitions are permitted. */
2051
2052 /*-----------------------------------------------------------------*/
2053 case OP_CHAR:
2054 if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
2055 break;
2056
2057 /*-----------------------------------------------------------------*/
2058 case OP_CHARI:
2059 if (clen == 0) break;
2060
2061 #ifdef SUPPORT_UTF
2062 if (utf)
2063 {
2064 if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
2065 {
2066 unsigned int othercase;
2067 if (c < 128)
2068 othercase = fcc[c];
2069 else
2070 /* If we have Unicode property support, we can use it to test the
2071 other case of the character. */
2072 #ifdef SUPPORT_UCP
2073 othercase = UCD_OTHERCASE(c);
2074 #else
2075 othercase = NOTACHAR;
2076 #endif
2077
2078 if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
2079 }
2080 }
2081 else
2082 #endif /* SUPPORT_UTF */
2083 /* Not UTF mode */
2084 {
2085 if (TABLE_GET(c, lcc, c) == TABLE_GET(d, lcc, d))
2086 { ADD_NEW(state_offset + 2, 0); }
2087 }
2088 break;
2089
2090
2091 #ifdef SUPPORT_UCP
2092 /*-----------------------------------------------------------------*/
2093 /* This is a tricky one because it can match more than one character.
2094 Find out how many characters to skip, and then set up a negative state
2095 to wait for them to pass before continuing. */
2096
2097 case OP_EXTUNI:
2098 if (clen > 0)
2099 {
2100 int lgb, rgb;
2101 const pcre_uchar *nptr = ptr + clen;
2102 int ncount = 0;
2103 lgb = UCD_GRAPHBREAK(c);
2104 while (nptr < end_subject)
2105 {
2106 dlen = 1;
2107 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
2108 rgb = UCD_GRAPHBREAK(d);
2109 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
2110 ncount++;
2111 lgb = rgb;
2112 nptr += dlen;
2113 }
2114 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
2115 reset_could_continue = TRUE;
2116 ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2117 }
2118 break;
2119 #endif
2120
2121 /*-----------------------------------------------------------------*/
2122 /* This is tricky, like EXTUNI, because it too can match more than one
2123 character (when CR is followed by LF). In this case, set up a negative
2124 state to wait for one character to pass before continuing. */
2125
2126 case OP_ANYNL:
2127 if (clen > 0) switch(c)
2128 {
2129 case CHAR_VT:
2130 case CHAR_FF:
2131 case CHAR_NEL:
2132 #ifndef EBCDIC
2133 case 0x2028:
2134 case 0x2029:
2135 #endif /* Not EBCDIC */
2136 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
2137
2138 case CHAR_LF:
2139 ADD_NEW(state_offset + 1, 0);
2140 break;
2141
2142 case CHAR_CR:
2143 if (ptr + 1 >= end_subject)
2144 {
2145 ADD_NEW(state_offset + 1, 0);
2146 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
2147 reset_could_continue = TRUE;
2148 }
2149 else if (ptr[1] == CHAR_LF)
2150 {
2151 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2152 }
2153 else
2154 {
2155 ADD_NEW(state_offset + 1, 0);
2156 }
2157 break;
2158 }
2159 break;
2160
2161 /*-----------------------------------------------------------------*/
2162 case OP_NOT_VSPACE:
2163 if (clen > 0) switch(c)
2164 {
2165 VSPACE_CASES:
2166 break;
2167
2168 default:
2169 ADD_NEW(state_offset + 1, 0);
2170 break;
2171 }
2172 break;
2173
2174 /*-----------------------------------------------------------------*/
2175 case OP_VSPACE:
2176 if (clen > 0) switch(c)
2177 {
2178 VSPACE_CASES:
2179 ADD_NEW(state_offset + 1, 0);
2180 break;
2181
2182 default:
2183 break;
2184 }
2185 break;
2186
2187 /*-----------------------------------------------------------------*/
2188 case OP_NOT_HSPACE:
2189 if (clen > 0) switch(c)
2190 {
2191 HSPACE_CASES:
2192 break;
2193
2194 default:
2195 ADD_NEW(state_offset + 1, 0);
2196 break;
2197 }
2198 break;
2199
2200 /*-----------------------------------------------------------------*/
2201 case OP_HSPACE:
2202 if (clen > 0) switch(c)
2203 {
2204 HSPACE_CASES:
2205 ADD_NEW(state_offset + 1, 0);
2206 break;
2207
2208 default:
2209 break;
2210 }
2211 break;
2212
2213 /*-----------------------------------------------------------------*/
2214 /* Match a negated single character casefully. */
2215
2216 case OP_NOT:
2217 if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2218 break;
2219
2220 /*-----------------------------------------------------------------*/
2221 /* Match a negated single character caselessly. */
2222
2223 case OP_NOTI:
2224 if (clen > 0)
2225 {
2226 unsigned int otherd = NOTACHAR; /* Stays NOTACHAR in UTF mode without UCP */
2227 #ifdef SUPPORT_UTF
2228 if (utf && d >= 128)
2229 {
2230 #ifdef SUPPORT_UCP
2231 otherd = UCD_OTHERCASE(d);
2232 #endif /* SUPPORT_UCP */
2233 }
2234 else
2235 #endif /* SUPPORT_UTF */
2236 otherd = TABLE_GET(d, fcc, d);
2237 if (c != d && c != otherd)
2238 { ADD_NEW(state_offset + dlen + 1, 0); }
2239 }
2240 break;
2241
2242 /*-----------------------------------------------------------------*/
2243 case OP_PLUSI:
2244 case OP_MINPLUSI:
2245 case OP_POSPLUSI:
2246 case OP_NOTPLUSI:
2247 case OP_NOTMINPLUSI:
2248 case OP_NOTPOSPLUSI:
2249 caseless = TRUE;
2250 codevalue -= OP_STARI - OP_STAR;
2251
2252 /* Fall through */
2253 case OP_PLUS:
2254 case OP_MINPLUS:
2255 case OP_POSPLUS:
2256 case OP_NOTPLUS:
2257 case OP_NOTMINPLUS:
2258 case OP_NOTPOSPLUS:
2259 count = current_state->count; /* Already matched */
2260 if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2261 if (clen > 0)
2262 {
2263 unsigned int otherd = NOTACHAR;
2264 if (caseless)
2265 {
2266 #ifdef SUPPORT_UTF
2267 if (utf && d >= 128)
2268 {
2269 #ifdef SUPPORT_UCP
2270 otherd = UCD_OTHERCASE(d);
2271 #endif /* SUPPORT_UCP */
2272 }
2273 else
2274 #endif /* SUPPORT_UTF */
2275 otherd = TABLE_GET(d, fcc, d);
2276 }
2277 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2278 {
2279 if (count > 0 &&
2280 (codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2281 {
2282 active_count--; /* Remove non-match possibility */
2283 next_active_state--;
2284 }
2285 count++;
2286 ADD_NEW(state_offset, count);
2287 }
2288 }
2289 break;
2290
2291 /*-----------------------------------------------------------------*/
2292 case OP_QUERYI:
2293 case OP_MINQUERYI:
2294 case OP_POSQUERYI:
2295 case OP_NOTQUERYI:
2296 case OP_NOTMINQUERYI:
2297 case OP_NOTPOSQUERYI:
2298 caseless = TRUE;
2299 codevalue -= OP_STARI - OP_STAR;
2300 /* Fall through */
2301 case OP_QUERY:
2302 case OP_MINQUERY:
2303 case OP_POSQUERY:
2304 case OP_NOTQUERY:
2305 case OP_NOTMINQUERY:
2306 case OP_NOTPOSQUERY:
2307 ADD_ACTIVE(state_offset + dlen + 1, 0);
2308 if (clen > 0)
2309 {
2310 unsigned int otherd = NOTACHAR;
2311 if (caseless)
2312 {
2313 #ifdef SUPPORT_UTF
2314 if (utf && d >= 128)
2315 {
2316 #ifdef SUPPORT_UCP
2317 otherd = UCD_OTHERCASE(d);
2318 #endif /* SUPPORT_UCP */
2319 }
2320 else
2321 #endif /* SUPPORT_UTF */
2322 otherd = TABLE_GET(d, fcc, d);
2323 }
2324 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2325 {
2326 if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2327 {
2328 active_count--; /* Remove non-match possibility */
2329 next_active_state--;
2330 }
2331 ADD_NEW(state_offset + dlen + 1, 0);
2332 }
2333 }
2334 break;
2335
2336 /*-----------------------------------------------------------------*/
2337 case OP_STARI:
2338 case OP_MINSTARI:
2339 case OP_POSSTARI:
2340 case OP_NOTSTARI:
2341 case OP_NOTMINSTARI:
2342 case OP_NOTPOSSTARI:
2343 caseless = TRUE;
2344 codevalue -= OP_STARI - OP_STAR;
2345 /* Fall through */
2346 case OP_STAR:
2347 case OP_MINSTAR:
2348 case OP_POSSTAR:
2349 case OP_NOTSTAR:
2350 case OP_NOTMINSTAR:
2351 case OP_NOTPOSSTAR:
2352 ADD_ACTIVE(state_offset + dlen + 1, 0);
2353 if (clen > 0)
2354 {
2355 unsigned int otherd = NOTACHAR;
2356 if (caseless)
2357 {
2358 #ifdef SUPPORT_UTF
2359 if (utf && d >= 128)
2360 {
2361 #ifdef SUPPORT_UCP
2362 otherd = UCD_OTHERCASE(d);
2363 #endif /* SUPPORT_UCP */
2364 }
2365 else
2366 #endif /* SUPPORT_UTF */
2367 otherd = TABLE_GET(d, fcc, d);
2368 }
2369 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2370 {
2371 if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2372 {
2373 active_count--; /* Remove non-match possibility */
2374 next_active_state--;
2375 }
2376 ADD_NEW(state_offset, 0);
2377 }
2378 }
2379 break;
2380
2381 /*-----------------------------------------------------------------*/
2382 case OP_EXACTI:
2383 case OP_NOTEXACTI:
2384 caseless = TRUE;
2385 codevalue -= OP_STARI - OP_STAR;
2386 /* Fall through */
2387 case OP_EXACT:
2388 case OP_NOTEXACT:
2389 count = current_state->count; /* Number already matched */
2390 if (clen > 0)
2391 {
2392 unsigned int otherd = NOTACHAR;
2393 if (caseless)
2394 {
2395 #ifdef SUPPORT_UTF
2396 if (utf && d >= 128)
2397 {
2398 #ifdef SUPPORT_UCP
2399 otherd = UCD_OTHERCASE(d);
2400 #endif /* SUPPORT_UCP */
2401 }
2402 else
2403 #endif /* SUPPORT_UTF */
2404 otherd = TABLE_GET(d, fcc, d);
2405 }
2406 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2407 {
2408 if (++count >= GET2(code, 1))
2409 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2410 else
2411 { ADD_NEW(state_offset, count); }
2412 }
2413 }
2414 break;
2415
2416 /*-----------------------------------------------------------------*/
2417 case OP_UPTOI:
2418 case OP_MINUPTOI:
2419 case OP_POSUPTOI:
2420 case OP_NOTUPTOI:
2421 case OP_NOTMINUPTOI:
2422 case OP_NOTPOSUPTOI:
2423 caseless = TRUE;
2424 codevalue -= OP_STARI - OP_STAR;
2425 /* Fall through */
2426 case OP_UPTO:
2427 case OP_MINUPTO:
2428 case OP_POSUPTO:
2429 case OP_NOTUPTO:
2430 case OP_NOTMINUPTO:
2431 case OP_NOTPOSUPTO:
2432 ADD_ACTIVE(state_offset + dlen + 1 + IMM2_SIZE, 0);
2433 count = current_state->count; /* Number already matched */
2434 if (clen > 0)
2435 {
2436 unsigned int otherd = NOTACHAR;
2437 if (caseless)
2438 {
2439 #ifdef SUPPORT_UTF
2440 if (utf && d >= 128)
2441 {
2442 #ifdef SUPPORT_UCP
2443 otherd = UCD_OTHERCASE(d);
2444 #endif /* SUPPORT_UCP */
2445 }
2446 else
2447 #endif /* SUPPORT_UTF */
2448 otherd = TABLE_GET(d, fcc, d);
2449 }
2450 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2451 {
2452 if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2453 {
2454 active_count--; /* Remove non-match possibility */
2455 next_active_state--;
2456 }
2457 if (++count >= GET2(code, 1))
2458 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2459 else
2460 { ADD_NEW(state_offset, count); }
2461 }
2462 }
2463 break;
2464
2465
2466 /* ========================================================================== */
2467 /* These are the class-handling opcodes */
2468
2469 case OP_CLASS:
2470 case OP_NCLASS:
2471 case OP_XCLASS:
2472 {
2473 BOOL isinclass = FALSE;
2474 int next_state_offset;
2475 const pcre_uchar *ecode;
2476
2477 /* For a simple class, there is always just a 32-byte table, and we
2478 can set isinclass from it. */
2479
2480 if (codevalue != OP_XCLASS)
2481 {
2482 ecode = code + 1 + (32 / sizeof(pcre_uchar));
2483 if (clen > 0)
2484 {
2485 isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2486 ((((pcre_uint8 *)(code + 1))[c/8] & (1 << (c&7))) != 0);
2487 }
2488 }
2489
2490 /* An extended class may have a table or a list of single characters,
2491 ranges, or both, and it may be positive or negative. There's a
2492 function that sorts all this out. */
2493
2494 else
2495 {
2496 ecode = code + GET(code, 1);
2497 if (clen > 0) isinclass = PRIV(xclass)(c, code + 1 + LINK_SIZE, utf);
2498 }
2499
2500 /* At this point, isinclass is set for all kinds of class, and ecode
2501 points to the byte after the end of the class. If there is a
2502 quantifier, this is where it will be. */
2503
2504 next_state_offset = (int)(ecode - start_code);
2505
2506 switch (*ecode)
2507 {
2508 case OP_CRSTAR:
2509 case OP_CRMINSTAR:
2510 ADD_ACTIVE(next_state_offset + 1, 0);
2511 if (isinclass) { ADD_NEW(state_offset, 0); }
2512 break;
2513
2514 case OP_CRPLUS:
2515 case OP_CRMINPLUS:
2516 count = current_state->count; /* Already matched */
2517 if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2518 if (isinclass) { count++; ADD_NEW(state_offset, count); }
2519 break;
2520
2521 case OP_CRQUERY:
2522 case OP_CRMINQUERY:
2523 ADD_ACTIVE(next_state_offset + 1, 0);
2524 if (isinclass) { ADD_NEW(next_state_offset + 1, 0); }
2525 break;
2526
2527 case OP_CRRANGE:
2528 case OP_CRMINRANGE:
2529 count = current_state->count; /* Already matched */
2530 if (count >= GET2(ecode, 1))
2531 { ADD_ACTIVE(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2532 if (isinclass)
2533 {
2534 int max = GET2(ecode, 1 + IMM2_SIZE);
2535 if (++count >= max && max != 0) /* Max 0 => no limit */
2536 { ADD_NEW(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2537 else
2538 { ADD_NEW(state_offset, count); }
2539 }
2540 break;
2541
2542 default:
2543 if (isinclass) { ADD_NEW(next_state_offset, 0); }
2544 break;
2545 }
2546 }
2547 break;
2548
2549 /* ========================================================================== */
2550 /* These are the opcodes for fancy brackets of various kinds. We have
2551 to use recursion in order to handle them. The "always failing" assertion
2552 (?!) is optimised to OP_FAIL when compiling, so we have to support that,
2553 though the other "backtracking verbs" are not supported. */
2554
2555 case OP_FAIL:
2556 forced_fail++; /* Count FAILs for multiple states */
2557 break;
2558
2559 case OP_ASSERT:
2560 case OP_ASSERT_NOT:
2561 case OP_ASSERTBACK:
2562 case OP_ASSERTBACK_NOT:
2563 {
2564 int rc;
2565 int local_offsets[2];
2566 int local_workspace[1000];
2567 const pcre_uchar *endasscode = code + GET(code, 1);
2568
2569 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2570
2571 rc = internal_dfa_exec(
2572 md, /* static match data */
2573 code, /* this subexpression's code */
2574 ptr, /* where we currently are */
2575 (int)(ptr - start_subject), /* start offset */
2576 local_offsets, /* offset vector */
2577 sizeof(local_offsets)/sizeof(int), /* size of same */
2578 local_workspace, /* workspace vector */
2579 sizeof(local_workspace)/sizeof(int), /* size of same */
2580 rlevel); /* function recursion level */
2581
2582 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2583 if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2584 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2585 }
2586 break;
2587
2588 /*-----------------------------------------------------------------*/
2589 case OP_COND:
2590 case OP_SCOND:
2591 {
2592 int local_offsets[1000];
2593 int local_workspace[1000];
2594 int codelink = GET(code, 1);
2595 int condcode;
2596
2597 /* Because of the way auto-callout works during compile, a callout item
2598 is inserted between OP_COND and an assertion condition. This does not
2599 happen for the other conditions. */
2600
2601 if (code[LINK_SIZE+1] == OP_CALLOUT)
2602 {
2603 rrc = 0;
2604 if (PUBL(callout) != NULL)
2605 {
2606 PUBL(callout_block) cb;
2607 cb.version = 1; /* Version 1 of the callout block */
2608 cb.callout_number = code[LINK_SIZE+2];
2609 cb.offset_vector = offsets;
2610 #if defined COMPILE_PCRE8
2611 cb.subject = (PCRE_SPTR)start_subject;
2612 #elif defined COMPILE_PCRE16
2613 cb.subject = (PCRE_SPTR16)start_subject;
2614 #elif defined COMPILE_PCRE32
2615 cb.subject = (PCRE_SPTR32)start_subject;
2616 #endif
2617 cb.subject_length = (int)(end_subject - start_subject);
2618 cb.start_match = (int)(current_subject - start_subject);
2619 cb.current_position = (int)(ptr - start_subject);
2620 cb.pattern_position = GET(code, LINK_SIZE + 3);
2621 cb.next_item_length = GET(code, 3 + 2*LINK_SIZE);
2622 cb.capture_top = 1;
2623 cb.capture_last = -1;
2624 cb.callout_data = md->callout_data;
2625 cb.mark = NULL; /* No (*MARK) support */
2626 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2627 }
2628 if (rrc > 0) break; /* Fail this thread */
2629 code += PRIV(OP_lengths)[OP_CALLOUT]; /* Skip callout data */
2630 }
2631
2632 condcode = code[LINK_SIZE+1];
2633
2634 /* Back reference conditions are not supported */
2635
2636 if (condcode == OP_CREF || condcode == OP_NCREF)
2637 return PCRE_ERROR_DFA_UCOND;
2638
2639 /* The DEFINE condition is always false */
2640
2641 if (condcode == OP_DEF)
2642 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2643
2644 /* The only supported version of OP_RREF is for the value RREF_ANY,
2645 which means "test if in any recursion". We can't test for specifically
2646 recursed groups. */
2647
2648 else if (condcode == OP_RREF || condcode == OP_NRREF)
2649 {
2650 int value = GET2(code, LINK_SIZE + 2);
2651 if (value != RREF_ANY) return PCRE_ERROR_DFA_UCOND;
2652 if (md->recursive != NULL)
2653 { ADD_ACTIVE(state_offset + LINK_SIZE + 2 + IMM2_SIZE, 0); }
2654 else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2655 }
2656
2657 /* Otherwise, the condition is an assertion */
2658
2659 else
2660 {
2661 int rc;
2662 const pcre_uchar *asscode = code + LINK_SIZE + 1;
2663 const pcre_uchar *endasscode = asscode + GET(asscode, 1);
2664
2665 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2666
2667 rc = internal_dfa_exec(
2668 md, /* fixed match data */
2669 asscode, /* this subexpression's code */
2670 ptr, /* where we currently are */
2671 (int)(ptr - start_subject), /* start offset */
2672 local_offsets, /* offset vector */
2673 sizeof(local_offsets)/sizeof(int), /* size of same */
2674 local_workspace, /* workspace vector */
2675 sizeof(local_workspace)/sizeof(int), /* size of same */
2676 rlevel); /* function recursion level */
2677
2678 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2679 if ((rc >= 0) ==
2680 (condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2681 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2682 else
2683 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2684 }
2685 }
2686 break;
2687
2688 /*-----------------------------------------------------------------*/
2689 case OP_RECURSE:
2690 {
2691 dfa_recursion_info *ri;
2692 int local_offsets[1000];
2693 int local_workspace[1000];
2694 const pcre_uchar *callpat = start_code + GET(code, 1);
2695 int recno = (callpat == md->start_code)? 0 :
2696 GET2(callpat, 1 + LINK_SIZE);
2697 int rc;
2698
2699 DPRINTF(("%.*sStarting regex recursion\n", rlevel*2-2, SP));
2700
2701 /* Check for repeating a recursion without advancing the subject
2702 pointer. This should catch convoluted mutual recursions. (Some simple
2703 cases are caught at compile time.) */
2704
2705 for (ri = md->recursive; ri != NULL; ri = ri->prevrec)
2706 if (recno == ri->group_num && ptr == ri->subject_position)
2707 return PCRE_ERROR_RECURSELOOP;
2708
2709 /* Remember this recursion and where we started it so as to
2710 catch infinite loops. */
2711
2712 new_recursive.group_num = recno;
2713 new_recursive.subject_position = ptr;
2714 new_recursive.prevrec = md->recursive;
2715 md->recursive = &new_recursive;
2716
2717 rc = internal_dfa_exec(
2718 md, /* fixed match data */
2719 callpat, /* this subexpression's code */
2720 ptr, /* where we currently are */
2721 (int)(ptr - start_subject), /* start offset */
2722 local_offsets, /* offset vector */
2723 sizeof(local_offsets)/sizeof(int), /* size of same */
2724 local_workspace, /* workspace vector */
2725 sizeof(local_workspace)/sizeof(int), /* size of same */
2726 rlevel); /* function recursion level */
2727
2728 md->recursive = new_recursive.prevrec; /* Done this recursion */
2729
2730 DPRINTF(("%.*sReturn from regex recursion: rc=%d\n", rlevel*2-2, SP,
2731 rc));
2732
2733 /* Ran out of internal offsets */
2734
2735 if (rc == 0) return PCRE_ERROR_DFA_RECURSE;
2736
2737 /* For each successfully matched substring, set up the next state with a
2738 count of characters to skip before trying it. Note that the count is in
2739 characters, not bytes. */
2740
2741 if (rc > 0)
2742 {
2743 for (rc = rc*2 - 2; rc >= 0; rc -= 2)
2744 {
2745 int charcount = local_offsets[rc+1] - local_offsets[rc];
2746 #if defined SUPPORT_UTF && !defined COMPILE_PCRE32
2747 if (utf)
2748 {
2749 const pcre_uchar *p = start_subject + local_offsets[rc];
2750 const pcre_uchar *pp = start_subject + local_offsets[rc+1];
2751 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2752 }
2753 #endif
2754 if (charcount > 0)
2755 {
2756 ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0, (charcount - 1));
2757 }
2758 else
2759 {
2760 ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
2761 }
2762 }
2763 }
2764 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2765 }
2766 break;
2767
2768 /*-----------------------------------------------------------------*/
2769 case OP_BRAPOS:
2770 case OP_SBRAPOS:
2771 case OP_CBRAPOS:
2772 case OP_SCBRAPOS:
2773 case OP_BRAPOSZERO:
2774 {
2775 int charcount, matched_count;
2776 const pcre_uchar *local_ptr = ptr;
2777 BOOL allow_zero;
2778
2779 if (codevalue == OP_BRAPOSZERO)
2780 {
2781 allow_zero = TRUE;
2782 codevalue = *(++code); /* Codevalue will be one of above BRAs */
2783 }
2784 else allow_zero = FALSE;
2785
2786 /* Loop to match the subpattern as many times as possible as if it were
2787 a complete pattern. */
2788
2789 for (matched_count = 0;; matched_count++)
2790 {
2791 int local_offsets[2];
2792 int local_workspace[1000];
2793
2794 int rc = internal_dfa_exec(
2795 md, /* fixed match data */
2796 code, /* this subexpression's code */
2797 local_ptr, /* where we currently are */
2798 (int)(ptr - start_subject), /* start offset */
2799 local_offsets, /* offset vector */
2800 sizeof(local_offsets)/sizeof(int), /* size of same */
2801 local_workspace, /* workspace vector */
2802 sizeof(local_workspace)/sizeof(int), /* size of same */
2803 rlevel); /* function recursion level */
2804
2805 /* Failed to match */
2806
2807 if (rc < 0)
2808 {
2809 if (rc != PCRE_ERROR_NOMATCH) return rc;
2810 break;
2811 }
2812
2813 /* Matched: break the loop if zero characters matched. */
2814
2815 charcount = local_offsets[1] - local_offsets[0];
2816 if (charcount == 0) break;
2817 local_ptr += charcount; /* Advance temporary position ptr */
2818 }
2819
2820 /* At this point we have matched the subpattern matched_count
2821 times, and local_ptr is pointing to the character after the end of the
2822 last match. */
2823
2824 if (matched_count > 0 || allow_zero)
2825 {
2826 const pcre_uchar *end_subpattern = code;
2827 int next_state_offset;
2828
2829 do { end_subpattern += GET(end_subpattern, 1); }
2830 while (*end_subpattern == OP_ALT);
2831 next_state_offset =
2832 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2833
2834 /* Optimization: if there are no more active states, and there
2835 are no new states yet set up, then skip over the subject string
2836 right here, to save looping. Otherwise, set up the new state to swing
2837 into action when the end of the matched substring is reached. */
2838
2839 if (i + 1 >= active_count && new_count == 0)
2840 {
2841 ptr = local_ptr;
2842 clen = 0;
2843 ADD_NEW(next_state_offset, 0);
2844 }
2845 else
2846 {
2847 const pcre_uchar *p = ptr;
2848 const pcre_uchar *pp = local_ptr;
2849 charcount = (int)(pp - p);
2850 #if defined SUPPORT_UTF && !defined COMPILE_PCRE32
2851 if (utf) while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2852 #endif
2853 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2854 }
2855 }
2856 }
2857 break;
2858
2859 /*-----------------------------------------------------------------*/
2860 case OP_ONCE:
2861 case OP_ONCE_NC:
2862 {
2863 int local_offsets[2];
2864 int local_workspace[1000];
2865
2866 int rc = internal_dfa_exec(
2867 md, /* fixed match data */
2868 code, /* this subexpression's code */
2869 ptr, /* where we currently are */
2870 (int)(ptr - start_subject), /* start offset */
2871 local_offsets, /* offset vector */
2872 sizeof(local_offsets)/sizeof(int), /* size of same */
2873 local_workspace, /* workspace vector */
2874 sizeof(local_workspace)/sizeof(int), /* size of same */
2875 rlevel); /* function recursion level */
2876
2877 if (rc >= 0)
2878 {
2879 const pcre_uchar *end_subpattern = code;
2880 int charcount = local_offsets[1] - local_offsets[0];
2881 int next_state_offset, repeat_state_offset;
2882
2883 do { end_subpattern += GET(end_subpattern, 1); }
2884 while (*end_subpattern == OP_ALT);
2885 next_state_offset =
2886 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2887
2888 /* If the end of this subpattern is KETRMAX or KETRMIN, we must
2889 arrange for the repeat state also to be added to the relevant list.
2890 Calculate the offset, or set -1 for no repeat. */
2891
2892 repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
2893 *end_subpattern == OP_KETRMIN)?
2894 (int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
2895
2896 /* If we have matched an empty string, add the next state at the
2897 current character pointer. This is important so that the duplicate
2898 checking kicks in, which is what breaks infinite loops that match an
2899 empty string. */
2900
2901 if (charcount == 0)
2902 {
2903 ADD_ACTIVE(next_state_offset, 0);
2904 }
2905
2906 /* Optimization: if there are no more active states, and there
2907 are no new states yet set up, then skip over the subject string
2908 right here, to save looping. Otherwise, set up the new state to swing
2909 into action when the end of the matched substring is reached. */
2910
2911 else if (i + 1 >= active_count && new_count == 0)
2912 {
2913 ptr += charcount;
2914 clen = 0;
2915 ADD_NEW(next_state_offset, 0);
2916
2917 /* If we are adding a repeat state at the new character position,
2918 we must fudge things so that it is the only current state.
2919 Otherwise, it might be a duplicate of one we processed before, and
2920 that would cause it to be skipped. */
2921
2922 if (repeat_state_offset >= 0)
2923 {
2924 next_active_state = active_states;
2925 active_count = 0;
2926 i = -1;
2927 ADD_ACTIVE(repeat_state_offset, 0);
2928 }
2929 }
2930 else
2931 {
2932 #if defined SUPPORT_UTF && !defined COMPILE_PCRE32
2933 if (utf)
2934 {
2935 const pcre_uchar *p = start_subject + local_offsets[0];
2936 const pcre_uchar *pp = start_subject + local_offsets[1];
2937 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2938 }
2939 #endif
2940 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2941 if (repeat_state_offset >= 0)
2942 { ADD_NEW_DATA(-repeat_state_offset, 0, (charcount - 1)); }
2943 }
2944 }
2945 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2946 }
2947 break;
2948
2949
2950 /* ========================================================================== */
2951 /* Handle callouts */
2952
2953 case OP_CALLOUT:
2954 rrc = 0;
2955 if (PUBL(callout) != NULL)
2956 {
2957 PUBL(callout_block) cb;
2958 cb.version = 1; /* Version 1 of the callout block */
2959 cb.callout_number = code[1];
2960 cb.offset_vector = offsets;
2961 #if defined COMPILE_PCRE8
2962 cb.subject = (PCRE_SPTR)start_subject;
2963 #elif defined COMPILE_PCRE16
2964 cb.subject = (PCRE_SPTR16)start_subject;
2965 #elif defined COMPILE_PCRE32
2966 cb.subject = (PCRE_SPTR32)start_subject;
2967 #endif
2968 cb.subject_length = (int)(end_subject - start_subject);
2969 cb.start_match = (int)(current_subject - start_subject);
2970 cb.current_position = (int)(ptr - start_subject);
2971 cb.pattern_position = GET(code, 2);
2972 cb.next_item_length = GET(code, 2 + LINK_SIZE);
2973 cb.capture_top = 1;
2974 cb.capture_last = -1;
2975 cb.callout_data = md->callout_data;
2976 cb.mark = NULL; /* No (*MARK) support */
2977 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2978 }
2979 if (rrc == 0)
2980 { ADD_ACTIVE(state_offset + PRIV(OP_lengths)[OP_CALLOUT], 0); }
2981 break;
2982
2983
2984 /* ========================================================================== */
2985 default: /* Unsupported opcode */
2986 return PCRE_ERROR_DFA_UITEM;
2987 }
2988
2989 NEXT_ACTIVE_STATE: continue;
2990
2991 } /* End of loop scanning active states */
2992
2993 /* We have finished the processing at the current subject character. If no
2994 new states have been set for the next character, we have found all the
2995 matches that we are going to find. If we are at the top level and partial
2996 matching has been requested, check for appropriate conditions.
2997
2998 The "forced_fail" variable counts the number of (*F) encountered for the
2999 character. If it is equal to the original active_count (saved in
3000 workspace[1]) it means that (*F) was found on every active state. In this
3001 case we don't want to give a partial match.
3002
3003 The "could_continue" variable is true if a state could have continued but
3004 for the fact that the end of the subject was reached. */
3005
3006 if (new_count <= 0)
3007 {
3008 if (rlevel == 1 && /* Top level, and */
3009 could_continue && /* Some could go on, and */
3010 forced_fail != workspace[1] && /* Not all forced fail & */
3011 ( /* either... */
3012 (md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
3013 || /* or... */
3014 ((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
3015 match_count < 0) /* no matches */
3016 ) && /* And... */
3017 (
3018 partial_newline || /* Either partial NL */
3019 ( /* or ... */
3020 ptr >= end_subject && /* End of subject and */
3021 ptr > md->start_used_ptr) /* Inspected non-empty string */
3022 )
3023 )
3024 {
3025 if (offsetcount >= 2)
3026 {
3027 offsets[0] = (int)(md->start_used_ptr - start_subject);
3028 offsets[1] = (int)(end_subject - start_subject);
3029 }
3030 match_count = PCRE_ERROR_PARTIAL;
3031 }
3032
3033 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
3034 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel, match_count,
3035 rlevel*2-2, SP));
3036 break; /* In effect, "return", but see the comment below */
3037 }
3038
3039 /* One or more states are active for the next character. */
3040
3041 ptr += clen; /* Advance to next subject character */
3042 } /* Loop to move along the subject string */
3043
3044 /* Control gets here from "break" a few lines above. We do it this way because
3045 if we use "return" above, we have compiler trouble. Some compilers warn if
3046 there's nothing here because they think the function doesn't return a value. On
3047 the other hand, if we put a dummy statement here, some more clever compilers
3048 complain that it can't be reached. Sigh. */
3049
3050 return match_count;
3051 }
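/* Illustration only, not part of PCRE. The OP_RECURSE, OP_BRAPOS and OP_ONCE
cases above turn a matched length in code units into a count of characters by
subtracting one for every UTF-8 continuation byte. The sketch below shows that
idea in isolation for the 8-bit library. The helper name utf8_char_count is
hypothetical, and the test (b & 0xc0) == 0x80 is assumed to be what
NOT_FIRSTCHAR expands to for 8-bit code units. */

```c
#include <stddef.h>

/* Hypothetical helper: count the characters encoded in the UTF-8 code units
   [p, pp). Start from the length in bytes and subtract one for each
   continuation byte (bit pattern 10xxxxxx), so that only the first byte of
   each character is counted. The input is assumed to be valid UTF-8. */
static int utf8_char_count(const unsigned char *p, const unsigned char *pp)
{
int charcount = (int)(pp - p);
while (p < pp)
  if ((*p++ & 0xc0) == 0x80) charcount--;
return charcount;
}
```

/* A negative state is then given charcount - 1 as its data, because one
character is consumed by the step that creates the state. */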
3052
3053
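/* Illustration only, not part of PCRE. The OP_CLASS and OP_NCLASS handling
above tests a character against a 32-byte bit map: byte c/8, bit c&7. A
character above 255 has no bit in the map and is treated as matching only for
OP_NCLASS. The sketch below restates that test on its own. The names
class_set and class_test, and the matches_above_255 parameter standing in for
(codevalue == OP_NCLASS), are hypothetical. */

```c
#include <stdint.h>

/* Hypothetical helper: mark character c (0..255) as a member of the class. */
static void class_set(uint8_t map[32], unsigned int c)
{
map[c/8] |= (uint8_t)(1u << (c & 7));
}

/* Hypothetical helper: return 1 if c is in the class, 0 otherwise. Characters
   above 255 are not represented in the map, so the caller says whether the
   class matches them (this plays the role of the OP_NCLASS check). */
static int class_test(const uint8_t map[32], unsigned int c,
  int matches_above_255)
{
if (c > 255) return matches_above_255 != 0;
return (map[c/8] & (1u << (c & 7))) != 0;
}
```

/* The 256 bits cover every possible 8-bit code unit, which is why a simple
class needs no further data after the table. */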
3054
3055
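/* Illustration only, not part of PCRE. The OP_ANYNL case above matches one
character for most newlines but two when CR is followed by LF, and it rejects
VT, FF and NEL when PCRE_BSR_ANYCRLF is set. The sketch below expresses the
same decision as a function that returns the number of code units consumed.
It assumes 8-bit, non-UTF, non-EBCDIC subjects (so 0x2028 and 0x2029 cannot
occur); the name anynl_length and the anycrlf parameter are hypothetical. */

```c
#include <stddef.h>

/* Hypothetical helper: return how many bytes a \R match consumes at ptr,
   or 0 if there is no newline sequence there. A nonzero anycrlf restricts
   the match to CR, LF and CRLF. */
static int anynl_length(const unsigned char *ptr, const unsigned char *end,
  int anycrlf)
{
if (ptr >= end) return 0;
switch (*ptr)
  {
  case 0x0b:                     /* VT */
  case 0x0c:                     /* FF */
  case 0x85:                     /* NEL */
  return anycrlf ? 0 : 1;

  case 0x0a:                     /* LF */
  return 1;

  case 0x0d:                     /* CR, possibly followed by LF */
  return (ptr + 1 < end && ptr[1] == 0x0a) ? 2 : 1;

  default:
  return 0;
  }
}
```

/* In the DFA code the two-byte case is not returned as a length. It is
recorded as a negative state with a skip count of 1, so that the state waits
for the LF to pass before it continues. */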
3056 /*************************************************
3057 * Execute a Regular Expression - DFA engine *
3058 *************************************************/
3059
3060 /* This external function applies a compiled re to a subject string using a DFA
3061 engine. This function calls the internal function multiple times if the pattern
3062 is not anchored.
3063
3064 Arguments:
3065 argument_re points to the compiled expression
3066 extra_data points to extra data or is NULL
3067 subject points to the subject string
3068 length length of subject string (may contain binary zeros)
3069 start_offset where to start in the subject string
3070 options option bits
3071 offsets vector of match offsets
3072 offsetcount size of same
3073 workspace workspace vector
3074 wscount size of same
3075
3076 Returns: > 0 => number of match offset pairs placed in offsets
3077 = 0 => offsets overflowed; longest matches are present
3078 -1 => failed to match
3079 < -1 => some kind of unexpected problem
3080 */
3081
3082 #if defined COMPILE_PCRE8
3083 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3084 pcre_dfa_exec(const pcre *argument_re, const pcre_extra *extra_data,
3085 const char *subject, int length, int start_offset, int options, int *offsets,
3086 int offsetcount, int *workspace, int wscount)
3087 #elif defined COMPILE_PCRE16
3088 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3089 pcre16_dfa_exec(const pcre16 *argument_re, const pcre16_extra *extra_data,
3090 PCRE_SPTR16 subject, int length, int start_offset, int options, int *offsets,
3091 int offsetcount, int *workspace, int wscount)
3092 #elif defined COMPILE_PCRE32
3093 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3094 pcre32_dfa_exec(const pcre32 *argument_re, const pcre32_extra *extra_data,
3095 PCRE_SPTR32 subject, int length, int start_offset, int options, int *offsets,
3096 int offsetcount, int *workspace, int wscount)
3097 #endif
3098 {
3099 REAL_PCRE *re = (REAL_PCRE *)argument_re;
3100 dfa_match_data match_block;
3101 dfa_match_data *md = &match_block;
3102 BOOL utf, anchored, startline, firstline;
3103 const pcre_uchar *current_subject, *end_subject;
3104 const pcre_study_data *study = NULL;
3105
3106 const pcre_uchar *req_char_ptr;
3107 const pcre_uint8 *start_bits = NULL;
3108 BOOL has_first_char = FALSE;
3109 BOOL has_req_char = FALSE;
3110 pcre_uchar first_char = 0;
3111 pcre_uchar first_char2 = 0;
3112 pcre_uchar req_char = 0;
3113 pcre_uchar req_char2 = 0;
3114 int newline;
3115
/* Plausibility checks */

if ((options & ~PUBLIC_DFA_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
if (re == NULL || subject == NULL || workspace == NULL ||
   (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
if (wscount < 20) return PCRE_ERROR_DFA_WSSIZE;
if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;

/* Check that the first field in the block is the magic number. If it is not,
return with PCRE_ERROR_BADMAGIC. However, if the magic number is equal to
REVERSED_MAGIC_NUMBER we return with PCRE_ERROR_BADENDIANNESS, which
means that the pattern is likely compiled with different endianness. */

if (re->magic_number != MAGIC_NUMBER)
  return re->magic_number == REVERSED_MAGIC_NUMBER?
    PCRE_ERROR_BADENDIANNESS:PCRE_ERROR_BADMAGIC;
if ((re->flags & PCRE_MODE) == 0) return PCRE_ERROR_BADMODE;

/* If restarting after a partial match, do some sanity checks on the contents
of the workspace. */

if ((options & PCRE_DFA_RESTART) != 0)
  {
  if ((workspace[0] & (-2)) != 0 || workspace[1] < 1 ||
    workspace[1] > (wscount - 2)/INTS_PER_STATEBLOCK)
      return PCRE_ERROR_DFA_BADRESTART;
  }

/* Set up study, callout, and table data */

md->tables = re->tables;
md->callout_data = NULL;

if (extra_data != NULL)
  {
  unsigned int flags = extra_data->flags;
  if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
    study = (const pcre_study_data *)extra_data->study_data;
  if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0) return PCRE_ERROR_DFA_UMLIMIT;
  if ((flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION) != 0)
    return PCRE_ERROR_DFA_UMLIMIT;
  if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
    md->callout_data = extra_data->callout_data;
  if ((flags & PCRE_EXTRA_TABLES) != 0)
    md->tables = extra_data->tables;
  }

/* Set some local values */

current_subject = (const pcre_uchar *)subject + start_offset;
end_subject = (const pcre_uchar *)subject + length;
req_char_ptr = current_subject - 1;

#ifdef SUPPORT_UTF
/* PCRE_UTF(16|32) have the same value as PCRE_UTF8. */
utf = (re->options & PCRE_UTF8) != 0;
#else
utf = FALSE;
#endif

anchored = (options & (PCRE_ANCHORED|PCRE_DFA_RESTART)) != 0 ||
  (re->options & PCRE_ANCHORED) != 0;

/* The remaining fixed data for passing around. */

md->start_code = (const pcre_uchar *)argument_re +
    re->name_table_offset + re->name_count * re->name_entry_size;
md->start_subject = (const pcre_uchar *)subject;
md->end_subject = end_subject;
md->start_offset = start_offset;
md->moptions = options;
md->poptions = re->options;

/* If the BSR option is not set at match time, copy what was set
at compile time. */

if ((md->moptions & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) == 0)
  {
  if ((re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) != 0)
    md->moptions |= re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE);
#ifdef BSR_ANYCRLF
  else md->moptions |= PCRE_BSR_ANYCRLF;
#endif
  }

/* Handle different types of newline. The three bits give eight cases. If
nothing is set at run time, whatever was used at compile time applies. */

switch ((((options & PCRE_NEWLINE_BITS) == 0)? re->options : (pcre_uint32)options) &
         PCRE_NEWLINE_BITS)
  {
  case 0: newline = NEWLINE; break;   /* Compile-time default */
  case PCRE_NEWLINE_CR: newline = CHAR_CR; break;
  case PCRE_NEWLINE_LF: newline = CHAR_NL; break;
  case PCRE_NEWLINE_CR+
       PCRE_NEWLINE_LF: newline = (CHAR_CR << 8) | CHAR_NL; break;
  case PCRE_NEWLINE_ANY: newline = -1; break;
  case PCRE_NEWLINE_ANYCRLF: newline = -2; break;
  default: return PCRE_ERROR_BADNEWLINE;
  }

if (newline == -2)
  {
  md->nltype = NLTYPE_ANYCRLF;
  }
else if (newline < 0)
  {
  md->nltype = NLTYPE_ANY;
  }
else
  {
  md->nltype = NLTYPE_FIXED;
  if (newline > 255)
    {
    md->nllen = 2;
    md->nl[0] = (newline >> 8) & 255;
    md->nl[1] = newline & 255;
    }
  else
    {
    md->nllen = 1;
    md->nl[0] = newline;
    }
  }

/* Check a UTF string if required. If the check fails and the offsets vector
has room for at least two values, the offset of the error and the reason code
are passed back in offsets[0] and offsets[1]. */

#ifdef SUPPORT_UTF
if (utf && (options & PCRE_NO_UTF8_CHECK) == 0)
  {
  int erroroffset;
  int errorcode = PRIV(valid_utf)((pcre_uchar *)subject, length, &erroroffset);
  if (errorcode != 0)
    {
    if (offsetcount >= 2)
      {
      offsets[0] = erroroffset;
      offsets[1] = errorcode;
      }
#if defined COMPILE_PCRE8
    return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0) ?
      PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
#elif defined COMPILE_PCRE16
    return (errorcode <= PCRE_UTF16_ERR1 && (options & PCRE_PARTIAL_HARD) != 0) ?
      PCRE_ERROR_SHORTUTF16 : PCRE_ERROR_BADUTF16;
#elif defined COMPILE_PCRE32
    return PCRE_ERROR_BADUTF32;
#endif
    }
#if defined COMPILE_PCRE8 || defined COMPILE_PCRE16
  if (start_offset > 0 && start_offset < length &&
        NOT_FIRSTCHAR(((PCRE_PUCHAR)subject)[start_offset]))
    return PCRE_ERROR_BADUTF8_OFFSET;
#endif
  }
#endif

/* If the exec call supplied NULL for tables, use the inbuilt ones. This
is a feature that makes it possible to save compiled regexes and re-use them
in other programs later. */

if (md->tables == NULL) md->tables = PRIV(default_tables);

/* The "must be at the start of a line" flags are used in a loop when finding
where to start. */

startline = (re->flags & PCRE_STARTLINE) != 0;
firstline = (re->options & PCRE_FIRSTLINE) != 0;

/* Set up the first character to match, if available. The first_char value is
never set for an anchored regular expression, but the anchoring may be forced
at run time, so we have to test for anchoring. The first char may be unset for
an unanchored pattern, of course. If there's no first char and the pattern was
studied, there may be a bitmap of possible first characters. */

if (!anchored)
  {
  if ((re->flags & PCRE_FIRSTSET) != 0)
    {
    has_first_char = TRUE;
    first_char = first_char2 = (pcre_uchar)(re->first_char);
    if ((re->flags & PCRE_FCH_CASELESS) != 0)
      {
      first_char2 = TABLE_GET(first_char, md->tables + fcc_offset, first_char);
#if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
      if (utf && first_char > 127)
        first_char2 = UCD_OTHERCASE(first_char);
#endif
      }
    }
  else
    {
    if (!startline && study != NULL &&
         (study->flags & PCRE_STUDY_MAPPED) != 0)
      start_bits = study->start_bits;
    }
  }

/* For anchored or unanchored matches, there may be a "last known required
character" set. */

if ((re->flags & PCRE_REQCHSET) != 0)
  {
  has_req_char = TRUE;
  req_char = req_char2 = (pcre_uchar)(re->req_char);
  if ((re->flags & PCRE_RCH_CASELESS) != 0)
    {
    req_char2 = TABLE_GET(req_char, md->tables + fcc_offset, req_char);
#if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
    if (utf && req_char > 127)
      req_char2 = UCD_OTHERCASE(req_char);
#endif
    }
  }

/* Call the main matching function, looping for a non-anchored regex after a
failed match. If not restarting, perform certain optimizations at the start of
a match. */

for (;;)
  {
  int rc;

  if ((options & PCRE_DFA_RESTART) == 0)
    {
    const pcre_uchar *save_end_subject = end_subject;

    /* If firstline is TRUE, the start of the match is constrained to the first
    line of a multiline string. Implement this by temporarily adjusting
    end_subject so that we stop scanning at a newline. If the match fails at
    the newline, later code breaks this loop. */

    if (firstline)
      {
      PCRE_PUCHAR t = current_subject;
#ifdef SUPPORT_UTF
      if (utf)
        {
        while (t < md->end_subject && !IS_NEWLINE(t))
          {
          t++;
          ACROSSCHAR(t < end_subject, *t, t++);
          }
        }
      else
#endif
      while (t < md->end_subject && !IS_NEWLINE(t)) t++;
      end_subject = t;
      }

    /* There are some optimizations that avoid running the match if a known
    starting point is not found. However, there is an option that disables
    these, for testing and for ensuring that all callouts do actually occur.
    The option can be set in the regex by (*NO_START_OPT) or passed in
    match-time options. */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
      {
      /* Advance to a known first char. */

      if (has_first_char)
        {
        if (first_char != first_char2)
          while (current_subject < end_subject &&
              *current_subject != first_char && *current_subject != first_char2)
            current_subject++;
        else
          while (current_subject < end_subject &&
                 *current_subject != first_char)
            current_subject++;
        }

      /* Or to just after a linebreak for a multiline match if possible */

      else if (startline)
        {
        if (current_subject > md->start_subject + start_offset)
          {
#ifdef SUPPORT_UTF
          if (utf)
            {
            while (current_subject < end_subject &&
                   !WAS_NEWLINE(current_subject))
              {
              current_subject++;
              ACROSSCHAR(current_subject < end_subject, *current_subject,
                current_subject++);
              }
            }
          else
#endif
          while (current_subject < end_subject && !WAS_NEWLINE(current_subject))
            current_subject++;

          /* If we have just passed a CR and the newline option is ANY or
          ANYCRLF, and we are now at a LF, advance the match position by one
          more character. */

          if (current_subject[-1] == CHAR_CR &&
               (md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
               current_subject < end_subject &&
               *current_subject == CHAR_NL)
            current_subject++;
          }
        }

      /* Or to a non-unique first char after study */

      else if (start_bits != NULL)
        {
        while (current_subject < end_subject)
          {
          register unsigned int c = *current_subject;
#ifndef COMPILE_PCRE8
          if (c > 255) c = 255;
#endif
          if ((start_bits[c/8] & (1 << (c&7))) == 0)
            {
            current_subject++;
#if defined SUPPORT_UTF && defined COMPILE_PCRE8
            /* In non 8-bit mode, the iteration will stop for
            characters > 255 at the beginning or not stop at all. */
            if (utf)
              ACROSSCHAR(current_subject < end_subject, *current_subject,
                current_subject++);
#endif
            }
          else break;
          }
        }
      }

    /* Restore fudged end_subject */

    end_subject = save_end_subject;

    /* The following two optimizations are disabled for partial matching or if
    disabling is explicitly requested (and of course, by the test above, this
    code is not obeyed when restarting after a partial match). */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0 &&
        (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
      {
      /* If the pattern was studied, a minimum subject length may be set. This
      is a lower bound; there may be no string of that length that actually
      matches the pattern. Although the value is, strictly, in characters, we
      treat it as bytes to avoid spending too much time in this optimization. */

      if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
          (pcre_uint32)(end_subject - current_subject) < study->minlength)
        return PCRE_ERROR_NOMATCH;

      /* If req_char is set, we know that that character must appear in the
      subject for the match to succeed. If the first character is set, req_char
      must be later in the subject; otherwise the test starts at the match
      point. This optimization can save a huge amount of work in patterns with
      nested unlimited repeats that aren't going to match. Writing separate
      code for cased/caseless versions makes it go faster, as does using an
      autoincrement and backing off on a match.

      HOWEVER: when the subject string is very, very long, searching to its end
      can take a long time, and give bad performance on quite ordinary
      patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
      string... so we don't do this when the string is sufficiently long. */

      if (has_req_char && end_subject - current_subject < REQ_BYTE_MAX)
        {
        register PCRE_PUCHAR p = current_subject + (has_first_char? 1:0);

        /* We don't need to repeat the search if we haven't yet reached the
        place we found it at last time. */

        if (p > req_char_ptr)
          {
          if (req_char != req_char2)
            {
            while (p < end_subject)
              {
              register pcre_uint32 pp = *p++;
              if (pp == req_char || pp == req_char2) { p--; break; }
              }
            }
          else
            {
            while (p < end_subject)
              {
              if (*p++ == req_char) { p--; break; }
              }
            }

          /* If we can't find the required character, break the matching loop,
          which will cause a return of PCRE_ERROR_NOMATCH. */

          if (p >= end_subject) break;

          /* If we have found the required character, save the point where we
          found it, so that we don't search again next time round the loop if
          the start hasn't passed this character yet. */

          req_char_ptr = p;
          }
        }
      }
    }   /* End of optimizations that are done when not restarting */

  /* OK, now we can do the business */

  md->start_used_ptr = current_subject;
  md->recursive = NULL;

  rc = internal_dfa_exec(
    md,                                /* fixed match data */
    md->start_code,                    /* this subexpression's code */
    current_subject,                   /* where we currently are */
    start_offset,                      /* start offset in subject */
    offsets,                           /* offset vector */
    offsetcount,                       /* size of same */
    workspace,                         /* workspace vector */
    wscount,                           /* size of same */
    0);                                /* function recurse level */

  /* Anything other than "no match" means we are done, always; otherwise, carry
  on only if not anchored. */

  if (rc != PCRE_ERROR_NOMATCH || anchored) return rc;

  /* Advance to the next subject character unless we are at the end of a line
  and firstline is set. */

  if (firstline && IS_NEWLINE(current_subject)) break;
  current_subject++;
#ifdef SUPPORT_UTF
  if (utf)
    {
    ACROSSCHAR(current_subject < end_subject, *current_subject,
      current_subject++);
    }
#endif
  if (current_subject > end_subject) break;

  /* If we have just passed a CR and we are now at a LF, and the pattern does
  not contain any explicit matches for \r or \n, and the newline option is CRLF
  or ANY or ANYCRLF, advance the match position by one more character. */

  if (current_subject[-1] == CHAR_CR &&
      current_subject < end_subject &&
      *current_subject == CHAR_NL &&
      (re->flags & PCRE_HASCRORLF) == 0 &&
        (md->nltype == NLTYPE_ANY ||
         md->nltype == NLTYPE_ANYCRLF ||
         md->nllen == 2))
    current_subject++;

  }   /* "Bumpalong" loop */

return PCRE_ERROR_NOMATCH;
}

/* End of pcre_dfa_exec.c */
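
The newline handling in the function above packs a fixed newline into a single int: a one-character newline is stored as its code, a two-character sequence as (first << 8) | second, and the values -1 and -2 stand for "any" and "any of CR, LF, CRLF". It then unpacks that int into a type, a length, and up to two characters. A minimal standalone sketch of the unpacking step follows; the names nl_kind, nl_info, and decode_newline are hypothetical and are not part of PCRE, and the sketch assumes character codes below 256.

```c
/* Hypothetical stand-ins for NLTYPE_FIXED, NLTYPE_ANY and NLTYPE_ANYCRLF. */
enum nl_kind { NL_FIXED, NL_ANY, NL_ANYCRLF };

/* Hypothetical stand-in for the nltype, nllen and nl[] fields. */
struct nl_info
  {
  enum nl_kind type;
  int len;               /* Only meaningful when type is NL_FIXED */
  unsigned char nl[2];   /* Only meaningful when type is NL_FIXED */
  };

/* Unpack an encoded newline value: -2 means ANYCRLF, any other negative
value means ANY, a value above 255 is a two-character sequence with the
first character in the high byte, and anything else is one character. */

static struct nl_info
decode_newline(int newline)
{
struct nl_info info = { NL_FIXED, 0, { 0, 0 } };

if (newline == -2)
  {
  info.type = NL_ANYCRLF;
  }
else if (newline < 0)
  {
  info.type = NL_ANY;
  }
else if (newline > 255)
  {
  info.len = 2;
  info.nl[0] = (unsigned char)((newline >> 8) & 255);
  info.nl[1] = (unsigned char)(newline & 255);
  }
else
  {
  info.len = 1;
  info.nl[0] = (unsigned char)newline;
  }
return info;
}
```

For example, decode_newline(('\r' << 8) | '\n') yields a fixed newline of length 2 whose characters are CR and LF, in that order.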

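The "non-unique first char after study" optimization above tests each subject character against a 256-bit bitmap (32 bytes, one bit per character value) and skips characters whose bit is clear. A minimal sketch of that bitmap test for plain 8-bit, non-UTF subjects follows. The helper names set_start_bit and skip_to_candidate are hypothetical; in PCRE itself the bitmap is produced by studying the pattern, not built by hand as it would be here.

```c
#include <stddef.h>

/* Mark character c as a possible first character of a match. */

static void
set_start_bit(unsigned char bits[32], unsigned char c)
{
bits[c/8] |= (unsigned char)(1u << (c & 7));
}

/* Return the index of the first character in s[0..len-1] whose bit is set
in the bitmap, or len if there is no such character. This mirrors the
start_bits[c/8] & (1 << (c&7)) test in the loop above, without the UTF
handling and without the clamping of wide characters to 255. */

static size_t
skip_to_candidate(const unsigned char bits[32], const unsigned char *s,
  size_t len)
{
size_t i = 0;
while (i < len && (bits[s[i]/8] & (1u << (s[i] & 7))) == 0) i++;
return i;
}
```

With only 'x' and 'y' marked in the bitmap, scanning the six characters of "abcxyz" stops at index 3, while scanning "abc" runs off the end and returns 3, the subject length, meaning that no match attempt is needed anywhere in the subject.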