Contents of /code/trunk/pcre_dfa_exec.c

Revision 1046 - (show annotations)
Tue Sep 25 16:27:58 2012 UTC by ph10
File MIME type: text/plain
File size: 122730 byte(s)
All the remaining changes for handling characters with more than one other 
case.
1 /*************************************************
2 * Perl-Compatible Regular Expressions *
3 *************************************************/
4
5 /* PCRE is a library of functions to support regular expressions whose syntax
6 and semantics are as close as possible to those of the Perl 5 language (but see
7 below for why this module is different).
8
9 Written by Philip Hazel
10 Copyright (c) 1997-2012 University of Cambridge
11
12 -----------------------------------------------------------------------------
13 Redistribution and use in source and binary forms, with or without
14 modification, are permitted provided that the following conditions are met:
15
16 * Redistributions of source code must retain the above copyright notice,
17 this list of conditions and the following disclaimer.
18
19 * Redistributions in binary form must reproduce the above copyright
20 notice, this list of conditions and the following disclaimer in the
21 documentation and/or other materials provided with the distribution.
22
23 * Neither the name of the University of Cambridge nor the names of its
24 contributors may be used to endorse or promote products derived from
25 this software without specific prior written permission.
26
27 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
28 AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
29 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
30 ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
31 LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
32 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
33 SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
34 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
35 CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
36 ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
37 POSSIBILITY OF SUCH DAMAGE.
38 -----------------------------------------------------------------------------
39 */
40
41 /* This module contains the external function pcre_dfa_exec(), which is an
42 alternative matching function that uses a sort of DFA algorithm (not a true
43 FSM). This is NOT Perl-compatible, but it has advantages in certain
44 applications. */
45
46
47 /* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
48 the performance of his patterns greatly. I could not use it as it stood, as it
49 was not thread safe, and made assumptions about pattern sizes. Also, it caused
50 test 7 to loop, and test 9 to crash with a segfault.
51
52 The issue is the check for duplicate states, which is done by a simple linear
53 search up the state list. (Grep for "duplicate" below to find the code.) For
54 many patterns, there will never be many states active at one time, so a simple
55 linear search is fine. In patterns that have many active states, it might be a
56 bottleneck. The suggested code used an indexing scheme to remember which states
57 had previously been used for each character, and avoided the linear search when
58 it knew there was no chance of a duplicate. This was implemented when adding
59 states to the state lists.
60
61 I wrote some thread-safe, not-limited code to try something similar at the time
62 of checking for duplicates (instead of when adding states), using index vectors
63 on the stack. It did give a 13% improvement with one specially constructed
64 pattern for certain subject strings, but on other strings and on many of the
65 simpler patterns in the test suite it did worse. The major problem, I think,
66 was the extra time to initialize the index. This had to be done for each call
67 of internal_dfa_exec(). (The supplied patch used a static vector, initialized
68 only once - I suspect this was the cause of the problems with the tests.)
69
70 Overall, I concluded that the gains in some cases did not outweigh the losses
71 in others, so I abandoned this code. */
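
The linear duplicate check discussed in the note above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the module's own code: `demo_state` and `demo_is_duplicate` are hypothetical names, and the struct carries only the two fields that the check compares.

```c
#include <assert.h>

/* Hypothetical stand-in for one state entry; only the fields that the
   duplicate check compares are included. */
typedef struct demo_state {
  int offset;                   /* Offset to opcode */
  int count;                    /* Count for repeats */
} demo_state;

/* Return 1 if states[i] repeats an earlier entry (same offset and same
   count), otherwise 0. The cost is linear in i, which is cheap while few
   states are active and can become a bottleneck when many are. */
static int
demo_is_duplicate(const demo_state *states, int i)
{
int j;
assert(states != 0 && i >= 0);
for (j = 0; j < i; j++)
  {
  if (states[j].offset == states[i].offset &&
      states[j].count == states[i].count)
    return 1;
  }
return 0;
}
```

Calling `demo_is_duplicate(states, i)` before processing entry `i` lets a caller skip work that an earlier identical state has already done.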
72
73
74
75 #ifdef HAVE_CONFIG_H
76 #include "config.h"
77 #endif
78
79 #define NLBLOCK md /* Block containing newline information */
80 #define PSSTART start_subject /* Field containing processed string start */
81 #define PSEND end_subject /* Field containing processed string end */
82
83 #include "pcre_internal.h"
84
85
86 /* For use to indent debugging output */
87
88 #define SP " "
89
90
91 /*************************************************
92 * Code parameters and static tables *
93 *************************************************/
94
95 /* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
96 into others, under special conditions. A gap of 20 between the blocks should be
97 enough. The resulting opcodes don't have to be less than 256 because they are
98 never stored, so we push them well clear of the normal opcodes. */
99
100 #define OP_PROP_EXTRA 300
101 #define OP_EXTUNI_EXTRA 320
102 #define OP_ANYNL_EXTRA 340
103 #define OP_HSPACE_EXTRA 360
104 #define OP_VSPACE_EXTRA 380
105
106
107 /* This table identifies those opcodes that are followed immediately by a
108 character that is to be tested in some way. This makes it possible to
109 centralize the loading of these characters. In the case of Type * etc, the
110 "character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
111 small value. Non-zero values in the table are the offsets from the opcode where
112 the character is to be found. ***NOTE*** If the start of this table is
113 modified, the three tables that follow must also be modified. */
114
115 static const pcre_uint8 coptable[] = {
116 0, /* End */
117 0, 0, 0, 0, 0, /* \A, \G, \K, \B, \b */
118 0, 0, 0, 0, 0, 0, /* \D, \d, \S, \s, \W, \w */
119 0, 0, 0, /* Any, AllAny, Anybyte */
120 0, 0, /* \P, \p */
121 0, 0, 0, 0, 0, /* \R, \H, \h, \V, \v */
122 0, /* \X */
123 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
124 1, /* Char */
125 1, /* Chari */
126 1, /* not */
127 1, /* noti */
128 /* Positive single-char repeats */
129 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
130 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto, minupto */
131 1+IMM2_SIZE, /* exact */
132 1, 1, 1, 1+IMM2_SIZE, /* *+, ++, ?+, upto+ */
133 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
134 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto I, minupto I */
135 1+IMM2_SIZE, /* exact I */
136 1, 1, 1, 1+IMM2_SIZE, /* *+I, ++I, ?+I, upto+I */
137 /* Negative single-char repeats - only for chars < 256 */
138 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
139 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto, minupto */
140 1+IMM2_SIZE, /* NOT exact */
141 1, 1, 1, 1+IMM2_SIZE, /* NOT *+, ++, ?+, upto+ */
142 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
143 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto I, minupto I */
144 1+IMM2_SIZE, /* NOT exact I */
145 1, 1, 1, 1+IMM2_SIZE, /* NOT *+I, ++I, ?+I, upto+I */
146 /* Positive type repeats */
147 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
148 1+IMM2_SIZE, 1+IMM2_SIZE, /* Type upto, minupto */
149 1+IMM2_SIZE, /* Type exact */
150 1, 1, 1, 1+IMM2_SIZE, /* Type *+, ++, ?+, upto+ */
151 /* Character class & ref repeats */
152 0, 0, 0, 0, 0, 0, /* *, *?, +, +?, ?, ?? */
153 0, 0, /* CRRANGE, CRMINRANGE */
154 0, /* CLASS */
155 0, /* NCLASS */
156 0, /* XCLASS - variable length */
157 0, /* REF */
158 0, /* REFI */
159 0, /* RECURSE */
160 0, /* CALLOUT */
161 0, /* Alt */
162 0, /* Ket */
163 0, /* KetRmax */
164 0, /* KetRmin */
165 0, /* KetRpos */
166 0, /* Reverse */
167 0, /* Assert */
168 0, /* Assert not */
169 0, /* Assert behind */
170 0, /* Assert behind not */
171 0, 0, /* ONCE, ONCE_NC */
172 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
173 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
174 0, 0, /* CREF, NCREF */
175 0, 0, /* RREF, NRREF */
176 0, /* DEF */
177 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
178 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
179 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
180 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
181 0, 0 /* CLOSE, SKIPZERO */
182 };
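
The idea behind coptable can be shown with a much smaller table. The sketch below assumes a hypothetical three-opcode instruction set (`DEMO_OP_END`, `DEMO_OP_CHAR`, `DEMO_OP_UPTO`) and a hypothetical two-unit count field; none of these names exist in PCRE. A non-zero table entry is the distance from the opcode to its inline character, so the load can be done in one place for every opcode.

```c
#include <assert.h>

/* Hypothetical opcodes, for illustration only. */
enum { DEMO_OP_END, DEMO_OP_CHAR, DEMO_OP_UPTO, DEMO_OP_COUNT };

#define DEMO_IMM2_SIZE 2        /* Assumed size of an inline repeat count */

/* Indexed by opcode: 0 means "no inline character", otherwise the offset
   from the opcode to the character. */
static const unsigned char demo_coptable[] = {
  0,                            /* END */
  1,                            /* CHAR: character follows the opcode */
  1 + DEMO_IMM2_SIZE            /* UPTO: character follows the count */
};

/* Return the inline character for the opcode at code[0], or -1 if the
   opcode does not carry one. */
static int
demo_inline_char(const unsigned char *code)
{
unsigned int off;
assert(code[0] < DEMO_OP_COUNT);
off = demo_coptable[code[0]];
return (off > 0)? (int)code[off] : -1;
}
```

Because the table is indexed by opcode, adding an opcode means adding exactly one entry at the matching position, which is why the tables here have to be kept in step with the opcode list.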
183
184 /* This table identifies those opcodes that inspect a character. It is used to
185 remember the fact that a character could have been inspected when the end of
186 the subject is reached. ***NOTE*** If the start of this table is modified, the
187 two tables that follow must also be modified. */
188
189 static const pcre_uint8 poptable[] = {
190 0, /* End */
191 0, 0, 0, 1, 1, /* \A, \G, \K, \B, \b */
192 1, 1, 1, 1, 1, 1, /* \D, \d, \S, \s, \W, \w */
193 1, 1, 1, /* Any, AllAny, Anybyte */
194 1, 1, /* \P, \p */
195 1, 1, 1, 1, 1, /* \R, \H, \h, \V, \v */
196 1, /* \X */
197 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
198 1, /* Char */
199 1, /* Chari */
200 1, /* not */
201 1, /* noti */
202 /* Positive single-char repeats */
203 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
204 1, 1, 1, /* upto, minupto, exact */
205 1, 1, 1, 1, /* *+, ++, ?+, upto+ */
206 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
207 1, 1, 1, /* upto I, minupto I, exact I */
208 1, 1, 1, 1, /* *+I, ++I, ?+I, upto+I */
209 /* Negative single-char repeats - only for chars < 256 */
210 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
211 1, 1, 1, /* NOT upto, minupto, exact */
212 1, 1, 1, 1, /* NOT *+, ++, ?+, upto+ */
213 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
214 1, 1, 1, /* NOT upto I, minupto I, exact I */
215 1, 1, 1, 1, /* NOT *+I, ++I, ?+I, upto+I */
216 /* Positive type repeats */
217 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
218 1, 1, 1, /* Type upto, minupto, exact */
219 1, 1, 1, 1, /* Type *+, ++, ?+, upto+ */
220 /* Character class & ref repeats */
221 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
222 1, 1, /* CRRANGE, CRMINRANGE */
223 1, /* CLASS */
224 1, /* NCLASS */
225 1, /* XCLASS - variable length */
226 0, /* REF */
227 0, /* REFI */
228 0, /* RECURSE */
229 0, /* CALLOUT */
230 0, /* Alt */
231 0, /* Ket */
232 0, /* KetRmax */
233 0, /* KetRmin */
234 0, /* KetRpos */
235 0, /* Reverse */
236 0, /* Assert */
237 0, /* Assert not */
238 0, /* Assert behind */
239 0, /* Assert behind not */
240 0, 0, /* ONCE, ONCE_NC */
241 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
242 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
243 0, 0, /* CREF, NCREF */
244 0, 0, /* RREF, NRREF */
245 0, /* DEF */
246 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
247 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
248 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
249 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
250 0, 0 /* CLOSE, SKIPZERO */
251 };
252
253 /* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W,
254 and \w */
255
256 static const pcre_uint8 toptable1[] = {
257 0, 0, 0, 0, 0, 0,
258 ctype_digit, ctype_digit,
259 ctype_space, ctype_space,
260 ctype_word, ctype_word,
261 0, 0 /* OP_ANY, OP_ALLANY */
262 };
263
264 static const pcre_uint8 toptable2[] = {
265 0, 0, 0, 0, 0, 0,
266 ctype_digit, 0,
267 ctype_space, 0,
268 ctype_word, 0,
269 1, 1 /* OP_ANY, OP_ALLANY */
270 };
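
One way a mask table and a flip table of this shape can be combined into a single test is sketched below. Everything here is an assumption made for illustration: the `DEMO_*` class bits, the `T_*` type indices, and the use of `<ctype.h>` in place of PCRE's character tables. The mask selects the class bit, and the flip value inverts the result for the negated types.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical class bits, standing in for ctype_digit and friends. */
#define DEMO_DIGIT 0x01
#define DEMO_SPACE 0x02
#define DEMO_WORD  0x04

/* Hypothetical type indices in the order \D, \d, \S, \s, \W, \w. */
enum { T_NOT_DIGIT, T_DIGIT, T_NOT_SPACE, T_SPACE, T_NOT_WORD, T_WORD,
       T_COUNT };

static const unsigned char demo_mask[] = {
  DEMO_DIGIT, DEMO_DIGIT, DEMO_SPACE, DEMO_SPACE, DEMO_WORD, DEMO_WORD };

static const unsigned char demo_flip[] = {
  DEMO_DIGIT, 0, DEMO_SPACE, 0, DEMO_WORD, 0 };

/* Compute the class bits of a character using the C library. */
static unsigned char
demo_ctype(unsigned char c)
{
unsigned char bits = 0;
if (isdigit(c)) bits |= DEMO_DIGIT;
if (isspace(c)) bits |= DEMO_SPACE;
if (isalnum(c) || c == '_') bits |= DEMO_WORD;
return bits;
}

/* Return 1 if character c matches the given type, else 0. For a positive
   type the flip value is 0, so the result is the class bit itself; for a
   negated type the flip value equals the mask, so the bit is inverted. */
static int
demo_type_matches(int type, unsigned char c)
{
assert(type >= 0 && type < T_COUNT);
return ((demo_ctype(c) & demo_mask[type]) ^ demo_flip[type]) != 0;
}
```

The benefit of this arrangement is that all six escapes share one branch-free expression instead of a six-way switch.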
271
272
273 /* Structure for holding data about a particular state, which is in effect the
274 current data for an active path through the match tree. It must consist
275 entirely of ints because the working vector we are passed, and which we put
276 these structures in, is a vector of ints. */
277
278 typedef struct stateblock {
279 int offset; /* Offset to opcode */
280 int count; /* Count for repeats */
281 int data; /* Some use extra data */
282 } stateblock;
283
284 #define INTS_PER_STATEBLOCK (int)(sizeof(stateblock)/sizeof(int))
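
Because the state blocks are laid directly over a caller-supplied vector of ints, the number of states that fit is a matter of simple arithmetic. The sketch below repeats the calculation that internal_dfa_exec() performs on wscount (two ints reserved at the front, the remainder split evenly between two lists), using hypothetical `demo_*` names so that it stands alone.

```c
#include <assert.h>

/* Same shape as stateblock: nothing but ints, so it can overlay an int
   vector without padding or alignment surprises. */
typedef struct demo_stateblock {
  int offset;
  int count;
  int data;
} demo_stateblock;

#define DEMO_INTS_PER_BLOCK (int)(sizeof(demo_stateblock)/sizeof(int))

/* Given the workspace size in ints, return how many state blocks each of
   the two lists can hold. Two ints are reserved as a header, and any
   ints that do not fill a whole pair of blocks are left unused. */
static int
demo_states_per_list(int wscount)
{
assert(wscount >= 2);
wscount -= 2;
return (wscount - (wscount % (DEMO_INTS_PER_BLOCK * 2))) /
  (2 * DEMO_INTS_PER_BLOCK);
}

/* Place the two lists inside the workspace, after the two header ints. */
static void
demo_split(int *workspace, int wscount, demo_stateblock **first,
  demo_stateblock **second)
{
*first = (demo_stateblock *)(workspace + 2);
*second = *first + demo_states_per_list(wscount);
}
```

For example, a workspace of 20 ints leaves 18 after the header, which is room for three blocks in each list; a workspace of 25 ints gives the same answer because the five spare ints cannot hold another pair of blocks.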
285
286
287 #ifdef PCRE_DEBUG
288 /*************************************************
289 * Print character string *
290 *************************************************/
291
292 /* Character string printing function for debugging.
293
294 Arguments:
295 p points to string
296 length number of code units
297 f where to print
298
299 Returns: nothing
300 */
301
302 static void
303 pchars(const pcre_uchar *p, int length, FILE *f)
304 {
305 int c;
306 while (length-- > 0)
307 {
308 if (isprint(c = *(p++)))
309 fprintf(f, "%c", c);
310 else
311 fprintf(f, "\\x%02x", c);
312 }
313 }
314 #endif
315
316
317
318 /*************************************************
319 * Execute a Regular Expression - DFA engine *
320 *************************************************/
321
322 /* This internal function applies a compiled pattern to a subject string,
323 starting at a given point, using a DFA engine. This function is called from the
324 external one, possibly multiple times if the pattern is not anchored. The
325 function calls itself recursively for some kinds of subpattern.
326
327 Arguments:
328 md the match_data block with fixed information
329 this_start_code the opening bracket of this subexpression's code
330 current_subject where we currently are in the subject string
331 start_offset start offset in the subject string
332 offsets vector to contain the matching string offsets
333 offsetcount size of same
334 workspace vector of workspace
335 wscount size of same
336 rlevel function call recursion level
337
338 Returns: > 0 => number of match offset pairs placed in offsets
339 = 0 => offsets overflowed; longest matches are present
340 -1 => failed to match
341 < -1 => some kind of unexpected problem
342
343 The following macros are used for adding states to the two state vectors (one
344 for the current character, one for the following character). */
345
346 #define ADD_ACTIVE(x,y) \
347 if (active_count++ < wscount) \
348 { \
349 next_active_state->offset = (x); \
350 next_active_state->count = (y); \
351 next_active_state++; \
352 DPRINTF(("%.*sADD_ACTIVE(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
353 } \
354 else return PCRE_ERROR_DFA_WSSIZE
355
356 #define ADD_ACTIVE_DATA(x,y,z) \
357 if (active_count++ < wscount) \
358 { \
359 next_active_state->offset = (x); \
360 next_active_state->count = (y); \
361 next_active_state->data = (z); \
362 next_active_state++; \
363 DPRINTF(("%.*sADD_ACTIVE_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
364 } \
365 else return PCRE_ERROR_DFA_WSSIZE
366
367 #define ADD_NEW(x,y) \
368 if (new_count++ < wscount) \
369 { \
370 next_new_state->offset = (x); \
371 next_new_state->count = (y); \
372 next_new_state++; \
373 DPRINTF(("%.*sADD_NEW(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
374 } \
375 else return PCRE_ERROR_DFA_WSSIZE
376
377 #define ADD_NEW_DATA(x,y,z) \
378 if (new_count++ < wscount) \
379 { \
380 next_new_state->offset = (x); \
381 next_new_state->count = (y); \
382 next_new_state->data = (z); \
383 next_new_state++; \
384 DPRINTF(("%.*sADD_NEW_DATA(%d,%d,%d) line %d\n", rlevel*2-2, SP, \
385 (x), (y), (z), __LINE__)); \
386 } \
387 else return PCRE_ERROR_DFA_WSSIZE
388
389 /* And now, here is the code */
390
391 static int
392 internal_dfa_exec(
393 dfa_match_data *md,
394 const pcre_uchar *this_start_code,
395 const pcre_uchar *current_subject,
396 int start_offset,
397 int *offsets,
398 int offsetcount,
399 int *workspace,
400 int wscount,
401 int rlevel)
402 {
403 stateblock *active_states, *new_states, *temp_states;
404 stateblock *next_active_state, *next_new_state;
405
406 const pcre_uint8 *ctypes, *lcc, *fcc;
407 const pcre_uchar *ptr;
408 const pcre_uchar *end_code, *first_op;
409
410 dfa_recursion_info new_recursive;
411
412 int active_count, new_count, match_count;
413
414 /* Some fields in the md block are frequently referenced, so we load them into
415 independent variables in the hope that this will perform better. */
416
417 const pcre_uchar *start_subject = md->start_subject;
418 const pcre_uchar *end_subject = md->end_subject;
419 const pcre_uchar *start_code = md->start_code;
420
421 #ifdef SUPPORT_UTF
422 BOOL utf = (md->poptions & PCRE_UTF8) != 0;
423 #else
424 BOOL utf = FALSE;
425 #endif
426
427 BOOL reset_could_continue = FALSE;
428
429 rlevel++;
430 offsetcount &= (-2);
431
432 wscount -= 2;
433 wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
434 (2 * INTS_PER_STATEBLOCK);
435
436 DPRINTF(("\n%.*s---------------------\n"
437 "%.*sCall to internal_dfa_exec f=%d\n",
438 rlevel*2-2, SP, rlevel*2-2, SP, rlevel));
439
440 ctypes = md->tables + ctypes_offset;
441 lcc = md->tables + lcc_offset;
442 fcc = md->tables + fcc_offset;
443
444 match_count = PCRE_ERROR_NOMATCH; /* A negative number */
445
446 active_states = (stateblock *)(workspace + 2);
447 next_new_state = new_states = active_states + wscount;
448 new_count = 0;
449
450 first_op = this_start_code + 1 + LINK_SIZE +
451 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
452 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
453 ? IMM2_SIZE:0);
454
455 /* The first thing in any (sub) pattern is a bracket of some sort. Push all
456 the alternative states onto the list, and find out where the end is. This
457 makes it possible to use this function recursively, when we want to stop at a
458 matching internal ket rather than at the end.
459
460 If the first opcode in the first alternative is OP_REVERSE, we are dealing with
461 a backward assertion. In that case, we have to find out the maximum amount to
462 move back, and set up each alternative appropriately. */
463
464 if (*first_op == OP_REVERSE)
465 {
466 int max_back = 0;
467 int gone_back;
468
469 end_code = this_start_code;
470 do
471 {
472 int back = GET(end_code, 2+LINK_SIZE);
473 if (back > max_back) max_back = back;
474 end_code += GET(end_code, 1);
475 }
476 while (*end_code == OP_ALT);
477
478 /* If we can't go back the amount required for the longest lookbehind
479 pattern, go back as far as we can; some alternatives may still be viable. */
480
481 #ifdef SUPPORT_UTF
482 /* In character mode we have to step back character by character */
483
484 if (utf)
485 {
486 for (gone_back = 0; gone_back < max_back; gone_back++)
487 {
488 if (current_subject <= start_subject) break;
489 current_subject--;
490 ACROSSCHAR(current_subject > start_subject, *current_subject, current_subject--);
491 }
492 }
493 else
494 #endif
495
496 /* In byte-mode we can do this quickly. */
497
498 {
499 gone_back = (current_subject - max_back < start_subject)?
500 (int)(current_subject - start_subject) : max_back;
501 current_subject -= gone_back;
502 }
503
504 /* Save the earliest consulted character */
505
506 if (current_subject < md->start_used_ptr)
507 md->start_used_ptr = current_subject;
508
509 /* Now we can process the individual branches. */
510
511 end_code = this_start_code;
512 do
513 {
514 int back = GET(end_code, 2+LINK_SIZE);
515 if (back <= gone_back)
516 {
517 int bstate = (int)(end_code - start_code + 2 + 2*LINK_SIZE);
518 ADD_NEW_DATA(-bstate, 0, gone_back - back);
519 }
520 end_code += GET(end_code, 1);
521 }
522 while (*end_code == OP_ALT);
523 }
524
525 /* This is the code for a "normal" subpattern (not a backward assertion). The
526 start of a whole pattern is always one of these. If we are at the top level,
527 we may be asked to restart matching from the same point that we reached for a
528 previous partial match. We still have to scan through the top-level branches to
529 find the end state. */
530
531 else
532 {
533 end_code = this_start_code;
534
535 /* Restarting */
536
537 if (rlevel == 1 && (md->moptions & PCRE_DFA_RESTART) != 0)
538 {
539 do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
540 new_count = workspace[1];
541 if (!workspace[0])
542 memcpy(new_states, active_states, new_count * sizeof(stateblock));
543 }
544
545 /* Not restarting */
546
547 else
548 {
549 int length = 1 + LINK_SIZE +
550 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
551 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
552 ? IMM2_SIZE:0);
553 do
554 {
555 ADD_NEW((int)(end_code - start_code + length), 0);
556 end_code += GET(end_code, 1);
557 length = 1 + LINK_SIZE;
558 }
559 while (*end_code == OP_ALT);
560 }
561 }
562
563 workspace[0] = 0; /* Bit indicating which vector is current */
564
565 DPRINTF(("%.*sEnd state = %d\n", rlevel*2-2, SP, (int)(end_code - start_code)));
566
567 /* Loop for scanning the subject */
568
569 ptr = current_subject;
570 for (;;)
571 {
572 int i, j;
573 int clen, dlen;
574 unsigned int c, d;
575 int forced_fail = 0;
576 BOOL partial_newline = FALSE;
577 BOOL could_continue = reset_could_continue;
578 reset_could_continue = FALSE;
579
580 /* Make the new state list into the active state list and empty the
581 new state list. */
582
583 temp_states = active_states;
584 active_states = new_states;
585 new_states = temp_states;
586 active_count = new_count;
587 new_count = 0;
588
589 workspace[0] ^= 1; /* Remember for the restarting feature */
590 workspace[1] = active_count;
591
592 #ifdef PCRE_DEBUG
593 printf("%.*sNext character: rest of subject = \"", rlevel*2-2, SP);
594 pchars(ptr, STRLEN_UC(ptr), stdout);
595 printf("\"\n");
596
597 printf("%.*sActive states: ", rlevel*2-2, SP);
598 for (i = 0; i < active_count; i++)
599 printf("%d/%d ", active_states[i].offset, active_states[i].count);
600 printf("\n");
601 #endif
602
603 /* Set the pointers for adding new states */
604
605 next_active_state = active_states + active_count;
606 next_new_state = new_states;
607
608 /* Load the current character from the subject outside the loop, as many
609 different states may want to look at it, and we assume that at least one
610 will. */
611
612 if (ptr < end_subject)
613 {
614 clen = 1; /* Number of data items in the character */
615 #ifdef SUPPORT_UTF
616 if (utf) { GETCHARLEN(c, ptr, clen); } else
617 #endif /* SUPPORT_UTF */
618 c = *ptr;
619 }
620 else
621 {
622 clen = 0; /* This indicates the end of the subject */
623 c = NOTACHAR; /* This value should never actually be used */
624 }
625
626 /* Scan up the active states and act on each one. The result of an action
627 may be to add more states to the currently active list (e.g. on hitting a
628 parenthesis) or it may be to put states on the new list, for considering
629 when we move the character pointer on. */
630
631 for (i = 0; i < active_count; i++)
632 {
633 stateblock *current_state = active_states + i;
634 BOOL caseless = FALSE;
635 const pcre_uchar *code;
636 int state_offset = current_state->offset;
637 int count, codevalue, rrc;
638
639 #ifdef PCRE_DEBUG
640 printf ("%.*sProcessing state %d c=", rlevel*2-2, SP, state_offset);
641 if (clen == 0) printf("EOL\n");
642 else if (c > 32 && c < 127) printf("'%c'\n", c);
643 else printf("0x%02x\n", c);
644 #endif
645
646 /* A negative offset is a special case meaning "hold off going to this
647 (negated) state until the number of characters in the data field have
648 been skipped". If the could_continue flag was passed over from a previous
649 state, arrange for it to be passed on. */
650
651 if (state_offset < 0)
652 {
653 if (current_state->data > 0)
654 {
655 DPRINTF(("%.*sSkipping this character\n", rlevel*2-2, SP));
656 ADD_NEW_DATA(state_offset, current_state->count,
657 current_state->data - 1);
658 if (could_continue) reset_could_continue = TRUE;
659 continue;
660 }
661 else
662 {
663 current_state->offset = state_offset = -state_offset;
664 }
665 }
666
667 /* Check for a duplicate state with the same count, and skip if found.
668 See the note at the head of this module about the possibility of improving
669 performance here. */
670
671 for (j = 0; j < i; j++)
672 {
673 if (active_states[j].offset == state_offset &&
674 active_states[j].count == current_state->count)
675 {
676 DPRINTF(("%.*sDuplicate state: skipped\n", rlevel*2-2, SP));
677 goto NEXT_ACTIVE_STATE;
678 }
679 }
680
681 /* The state offset is the offset to the opcode */
682
683 code = start_code + state_offset;
684 codevalue = *code;
685
686 /* If this opcode inspects a character, but we are at the end of the
687 subject, remember the fact for use when testing for a partial match. */
688
689 if (clen == 0 && poptable[codevalue] != 0)
690 could_continue = TRUE;
691
692 /* If this opcode is followed by an inline character, load it. It is
693 tempting to test for the presence of a subject character here, but that
694 is wrong, because sometimes zero repetitions of the subject are
695 permitted.
696
697 We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
698 argument that is not a data character - but is always one byte long because
699 the values are small. We have to take special action to deal with \P, \p,
700 \H, \h, \V, \v and \X in this case. To keep the other cases fast, convert
701 these ones to new opcodes. */
702
703 if (coptable[codevalue] > 0)
704 {
705 dlen = 1;
706 #ifdef SUPPORT_UTF
707 if (utf) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
708 #endif /* SUPPORT_UTF */
709 d = code[coptable[codevalue]];
710 if (codevalue >= OP_TYPESTAR)
711 {
712 switch(d)
713 {
714 case OP_ANYBYTE: return PCRE_ERROR_DFA_UITEM;
715 case OP_NOTPROP:
716 case OP_PROP: codevalue += OP_PROP_EXTRA; break;
717 case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
718 case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
719 case OP_NOT_HSPACE:
720 case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
721 case OP_NOT_VSPACE:
722 case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
723 default: break;
724 }
725 }
726 }
727 else
728 {
729 dlen = 0; /* Not strictly necessary, but compilers moan */
730 d = NOTACHAR; /* if these variables are not set. */
731 }
732
733
734 /* Now process the individual opcodes */
735
736 switch (codevalue)
737 {
738 /* ========================================================================== */
739 /* These cases are never obeyed. This is a fudge that causes a compile-
740 time error if the vectors coptable or poptable, which are indexed by
741 opcode, are not the correct length. It seems to be the only way to do
742 such a check at compile time, as the sizeof() operator does not work
743 in the C preprocessor. */
744
745 case OP_TABLE_LENGTH:
746 case OP_TABLE_LENGTH +
747 ((sizeof(coptable) == OP_TABLE_LENGTH) &&
748 (sizeof(poptable) == OP_TABLE_LENGTH)):
749 break;
750
751 /* ========================================================================== */
752 /* Reached a closing bracket. If not at the end of the pattern, carry
753 on with the next opcode. For repeating opcodes, also add the repeat
754 state. Note that KETRPOS will always be encountered at the end of the
755 subpattern, because the possessive subpattern repeats are always handled
756 using recursive calls. Thus, it never adds any new states.
757
758 At the end of the (sub)pattern, unless we have an empty string and
759 PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
760 start of the subject, save the match data, shifting up all previous
761 matches so we always have the longest first. */
762
763 case OP_KET:
764 case OP_KETRMIN:
765 case OP_KETRMAX:
766 case OP_KETRPOS:
767 if (code != end_code)
768 {
769 ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
770 if (codevalue != OP_KET)
771 {
772 ADD_ACTIVE(state_offset - GET(code, 1), 0);
773 }
774 }
775 else
776 {
777 if (ptr > current_subject ||
778 ((md->moptions & PCRE_NOTEMPTY) == 0 &&
779 ((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
780 current_subject > start_subject + md->start_offset)))
781 {
782 if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
783 else if (match_count > 0 && ++match_count * 2 > offsetcount)
784 match_count = 0;
785 count = ((match_count == 0)? offsetcount : match_count * 2) - 2;
786 if (count > 0) memmove(offsets + 2, offsets, count * sizeof(int));
787 if (offsetcount >= 2)
788 {
789 offsets[0] = (int)(current_subject - start_subject);
790 offsets[1] = (int)(ptr - start_subject);
791 DPRINTF(("%.*sSet matched string = \"%.*s\"\n", rlevel*2-2, SP,
792 offsets[1] - offsets[0], (char *)current_subject));
793 }
794 if ((md->moptions & PCRE_DFA_SHORTEST) != 0)
795 {
796 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
797 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel,
798 match_count, rlevel*2-2, SP));
799 return match_count;
800 }
801 }
802 }
803 break;
804
805 /* ========================================================================== */
806 /* These opcodes add to the current list of states without looking
807 at the current character. */
808
809 /*-----------------------------------------------------------------*/
810 case OP_ALT:
811 do { code += GET(code, 1); } while (*code == OP_ALT);
812 ADD_ACTIVE((int)(code - start_code), 0);
813 break;
814
815 /*-----------------------------------------------------------------*/
816 case OP_BRA:
817 case OP_SBRA:
818 do
819 {
820 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
821 code += GET(code, 1);
822 }
823 while (*code == OP_ALT);
824 break;
825
826 /*-----------------------------------------------------------------*/
827 case OP_CBRA:
828 case OP_SCBRA:
829 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE + IMM2_SIZE), 0);
830 code += GET(code, 1);
831 while (*code == OP_ALT)
832 {
833 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
834 code += GET(code, 1);
835 }
836 break;
837
838 /*-----------------------------------------------------------------*/
839 case OP_BRAZERO:
840 case OP_BRAMINZERO:
841 ADD_ACTIVE(state_offset + 1, 0);
842 code += 1 + GET(code, 2);
843 while (*code == OP_ALT) code += GET(code, 1);
844 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
845 break;
846
847 /*-----------------------------------------------------------------*/
848 case OP_SKIPZERO:
849 code += 1 + GET(code, 2);
850 while (*code == OP_ALT) code += GET(code, 1);
851 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
852 break;
853
854 /*-----------------------------------------------------------------*/
855 case OP_CIRC:
856 if (ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0)
857 { ADD_ACTIVE(state_offset + 1, 0); }
858 break;
859
860 /*-----------------------------------------------------------------*/
861 case OP_CIRCM:
862 if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
863 (ptr != end_subject && WAS_NEWLINE(ptr)))
864 { ADD_ACTIVE(state_offset + 1, 0); }
865 break;
866
867 /*-----------------------------------------------------------------*/
868 case OP_EOD:
869 if (ptr >= end_subject)
870 {
871 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
872 could_continue = TRUE;
873 else { ADD_ACTIVE(state_offset + 1, 0); }
874 }
875 break;
876
877 /*-----------------------------------------------------------------*/
878 case OP_SOD:
879 if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
880 break;
881
882 /*-----------------------------------------------------------------*/
883 case OP_SOM:
884 if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
885 break;
886
887
888 /* ========================================================================== */
889 /* These opcodes inspect the next subject character, and sometimes
890 the previous one as well, but do not have an argument. The variable
891 clen contains the length of the current character and is zero if we are
892 at the end of the subject. */
893
894 /*-----------------------------------------------------------------*/
895 case OP_ANY:
896 if (clen > 0 && !IS_NEWLINE(ptr))
897 {
898 if (ptr + 1 >= md->end_subject &&
899 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
900 NLBLOCK->nltype == NLTYPE_FIXED &&
901 NLBLOCK->nllen == 2 &&
902 c == NLBLOCK->nl[0])
903 {
904 could_continue = partial_newline = TRUE;
905 }
906 else
907 {
908 ADD_NEW(state_offset + 1, 0);
909 }
910 }
911 break;
912
913 /*-----------------------------------------------------------------*/
914 case OP_ALLANY:
915 if (clen > 0)
916 { ADD_NEW(state_offset + 1, 0); }
917 break;
918
919 /*-----------------------------------------------------------------*/
920 case OP_EODN:
921 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
922 could_continue = TRUE;
923 else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
924 { ADD_ACTIVE(state_offset + 1, 0); }
925 break;
926
927 /*-----------------------------------------------------------------*/
928 case OP_DOLL:
929 if ((md->moptions & PCRE_NOTEOL) == 0)
930 {
931 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
932 could_continue = TRUE;
933 else if (clen == 0 ||
934 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
935 (ptr == end_subject - md->nllen)
936 ))
937 { ADD_ACTIVE(state_offset + 1, 0); }
938 else if (ptr + 1 >= md->end_subject &&
939 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
940 NLBLOCK->nltype == NLTYPE_FIXED &&
941 NLBLOCK->nllen == 2 &&
942 c == NLBLOCK->nl[0])
943 {
944 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
945 {
946 reset_could_continue = TRUE;
947 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
948 }
949 else could_continue = partial_newline = TRUE;
950 }
951 }
952 break;
953
954 /*-----------------------------------------------------------------*/
955 case OP_DOLLM:
956 if ((md->moptions & PCRE_NOTEOL) == 0)
957 {
958 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
959 could_continue = TRUE;
960 else if (clen == 0 ||
961 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
962 { ADD_ACTIVE(state_offset + 1, 0); }
963 else if (ptr + 1 >= md->end_subject &&
964 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
965 NLBLOCK->nltype == NLTYPE_FIXED &&
966 NLBLOCK->nllen == 2 &&
967 c == NLBLOCK->nl[0])
968 {
969 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
970 {
971 reset_could_continue = TRUE;
972 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
973 }
974 else could_continue = partial_newline = TRUE;
975 }
976 }
977 else if (IS_NEWLINE(ptr))
978 { ADD_ACTIVE(state_offset + 1, 0); }
979 break;
980
981 /*-----------------------------------------------------------------*/
982
983 case OP_DIGIT:
984 case OP_WHITESPACE:
985 case OP_WORDCHAR:
986 if (clen > 0 && c < 256 &&
987 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
988 { ADD_NEW(state_offset + 1, 0); }
989 break;
990
991 /*-----------------------------------------------------------------*/
992 case OP_NOT_DIGIT:
993 case OP_NOT_WHITESPACE:
994 case OP_NOT_WORDCHAR:
995 if (clen > 0 && (c >= 256 ||
996 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
997 { ADD_NEW(state_offset + 1, 0); }
998 break;
999
1000 /*-----------------------------------------------------------------*/
1001 case OP_WORD_BOUNDARY:
1002 case OP_NOT_WORD_BOUNDARY:
1003 {
1004 int left_word, right_word;
1005
1006 if (ptr > start_subject)
1007 {
1008 const pcre_uchar *temp = ptr - 1;
1009 if (temp < md->start_used_ptr) md->start_used_ptr = temp;
1010 #ifdef SUPPORT_UTF
1011 if (utf) { BACKCHAR(temp); }
1012 #endif
1013 GETCHARTEST(d, temp);
1014 #ifdef SUPPORT_UCP
1015 if ((md->poptions & PCRE_UCP) != 0)
1016 {
1017 if (d == '_') left_word = TRUE; else
1018 {
1019 int cat = UCD_CATEGORY(d);
1020 left_word = (cat == ucp_L || cat == ucp_N);
1021 }
1022 }
1023 else
1024 #endif
1025 left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
1026 }
1027 else left_word = FALSE;
1028
1029 if (clen > 0)
1030 {
1031 #ifdef SUPPORT_UCP
1032 if ((md->poptions & PCRE_UCP) != 0)
1033 {
1034 if (c == '_') right_word = TRUE; else
1035 {
1036 int cat = UCD_CATEGORY(c);
1037 right_word = (cat == ucp_L || cat == ucp_N);
1038 }
1039 }
1040 else
1041 #endif
1042 right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
1043 }
1044 else right_word = FALSE;
1045
1046 if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
1047 { ADD_ACTIVE(state_offset + 1, 0); }
1048 }
1049 break;
1050
1051
1052 /*-----------------------------------------------------------------*/
1053 /* Check the next character by Unicode property. We get here only if
1054 Unicode property support is in the binary; otherwise the pattern is rejected
1055 when it is compiled. */
1056
1057 #ifdef SUPPORT_UCP
1058 case OP_PROP:
1059 case OP_NOTPROP:
1060 if (clen > 0)
1061 {
1062 BOOL OK;
1063 const pcre_uint32 *cp;
1064 const ucd_record * prop = GET_UCD(c);
1065 switch(code[1])
1066 {
1067 case PT_ANY:
1068 OK = TRUE;
1069 break;
1070
1071 case PT_LAMP:
1072 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1073 prop->chartype == ucp_Lt;
1074 break;
1075
1076 case PT_GC:
1077 OK = PRIV(ucp_gentype)[prop->chartype] == code[2];
1078 break;
1079
1080 case PT_PC:
1081 OK = prop->chartype == code[2];
1082 break;
1083
1084 case PT_SC:
1085 OK = prop->script == code[2];
1086 break;
1087
1088 /* These are specials for combination cases. */
1089
1090 case PT_ALNUM:
1091 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1092 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1093 break;
1094
1095 case PT_SPACE: /* Perl space */
1096 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1097 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1098 break;
1099
1100 case PT_PXSPACE: /* POSIX space */
1101 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1102 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1103 c == CHAR_FF || c == CHAR_CR;
1104 break;
1105
1106 case PT_WORD:
1107 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1108 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1109 c == CHAR_UNDERSCORE;
1110 break;
1111
1112 case PT_CLIST:
1113 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1114 for (;;)
1115 {
1116 if (c < *cp) { OK = FALSE; break; }
1117 if (c == *cp++) { OK = TRUE; break; }
1118 }
1119 break;
1120
1121 /* Should never occur, but keep compilers from grumbling. */
1122
1123 default:
1124 OK = codevalue != OP_PROP;
1125 break;
1126 }
1127
1128 if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1129 }
1130 break;
1131 #endif
1132
1133
1134
1135 /* ========================================================================== */
1136 /* These opcodes likewise inspect the subject character, but have an
1137 argument that is not a data character. It is one of these opcodes:
1138 OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_WHITESPACE,
1139 OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
1140
1141 case OP_TYPEPLUS:
1142 case OP_TYPEMINPLUS:
1143 case OP_TYPEPOSPLUS:
1144 count = current_state->count; /* Already matched */
1145 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1146 if (clen > 0)
1147 {
1148 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1149 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1150 NLBLOCK->nltype == NLTYPE_FIXED &&
1151 NLBLOCK->nllen == 2 &&
1152 c == NLBLOCK->nl[0])
1153 {
1154 could_continue = partial_newline = TRUE;
1155 }
1156 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1157 (c < 256 &&
1158 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1159 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1160 {
1161 if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1162 {
1163 active_count--; /* Remove non-match possibility */
1164 next_active_state--;
1165 }
1166 count++;
1167 ADD_NEW(state_offset, count);
1168 }
1169 }
1170 break;
1171
1172 /*-----------------------------------------------------------------*/
1173 case OP_TYPEQUERY:
1174 case OP_TYPEMINQUERY:
1175 case OP_TYPEPOSQUERY:
1176 ADD_ACTIVE(state_offset + 2, 0);
1177 if (clen > 0)
1178 {
1179 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1180 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1181 NLBLOCK->nltype == NLTYPE_FIXED &&
1182 NLBLOCK->nllen == 2 &&
1183 c == NLBLOCK->nl[0])
1184 {
1185 could_continue = partial_newline = TRUE;
1186 }
1187 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1188 (c < 256 &&
1189 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1190 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1191 {
1192 if (codevalue == OP_TYPEPOSQUERY)
1193 {
1194 active_count--; /* Remove non-match possibility */
1195 next_active_state--;
1196 }
1197 ADD_NEW(state_offset + 2, 0);
1198 }
1199 }
1200 break;
1201
1202 /*-----------------------------------------------------------------*/
1203 case OP_TYPESTAR:
1204 case OP_TYPEMINSTAR:
1205 case OP_TYPEPOSSTAR:
1206 ADD_ACTIVE(state_offset + 2, 0);
1207 if (clen > 0)
1208 {
1209 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1210 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1211 NLBLOCK->nltype == NLTYPE_FIXED &&
1212 NLBLOCK->nllen == 2 &&
1213 c == NLBLOCK->nl[0])
1214 {
1215 could_continue = partial_newline = TRUE;
1216 }
1217 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1218 (c < 256 &&
1219 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1220 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1221 {
1222 if (codevalue == OP_TYPEPOSSTAR)
1223 {
1224 active_count--; /* Remove non-match possibility */
1225 next_active_state--;
1226 }
1227 ADD_NEW(state_offset, 0);
1228 }
1229 }
1230 break;
1231
1232 /*-----------------------------------------------------------------*/
1233 case OP_TYPEEXACT:
1234 count = current_state->count; /* Number already matched */
1235 if (clen > 0)
1236 {
1237 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1238 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1239 NLBLOCK->nltype == NLTYPE_FIXED &&
1240 NLBLOCK->nllen == 2 &&
1241 c == NLBLOCK->nl[0])
1242 {
1243 could_continue = partial_newline = TRUE;
1244 }
1245 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1246 (c < 256 &&
1247 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1248 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1249 {
1250 if (++count >= GET2(code, 1))
1251 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 1, 0); }
1252 else
1253 { ADD_NEW(state_offset, count); }
1254 }
1255 }
1256 break;
1257
1258 /*-----------------------------------------------------------------*/
1259 case OP_TYPEUPTO:
1260 case OP_TYPEMINUPTO:
1261 case OP_TYPEPOSUPTO:
1262 ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0);
1263 count = current_state->count; /* Number already matched */
1264 if (clen > 0)
1265 {
1266 if (d == OP_ANY && ptr + 1 >= md->end_subject &&
1267 (md->moptions & (PCRE_PARTIAL_HARD)) != 0 &&
1268 NLBLOCK->nltype == NLTYPE_FIXED &&
1269 NLBLOCK->nllen == 2 &&
1270 c == NLBLOCK->nl[0])
1271 {
1272 could_continue = partial_newline = TRUE;
1273 }
1274 else if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1275 (c < 256 &&
1276 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1277 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1278 {
1279 if (codevalue == OP_TYPEPOSUPTO)
1280 {
1281 active_count--; /* Remove non-match possibility */
1282 next_active_state--;
1283 }
1284 if (++count >= GET2(code, 1))
1285 { ADD_NEW(state_offset + 2 + IMM2_SIZE, 0); }
1286 else
1287 { ADD_NEW(state_offset, count); }
1288 }
1289 }
1290 break;
1291
1292 /* ========================================================================== */
1293 /* These are virtual opcodes that are used when something like
1294 OP_TYPEPLUS has OP_PROP, OP_NOTPROP, OP_ANYNL, OP_EXTUNI, OP_VSPACE,
1295 OP_NOT_VSPACE, OP_HSPACE, or OP_NOT_HSPACE as its argument. This keeps the
1296 code above fast for the other cases. The argument is in the d variable. */
1297
1298 #ifdef SUPPORT_UCP
1299 case OP_PROP_EXTRA + OP_TYPEPLUS:
1300 case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1301 case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1302 count = current_state->count; /* Already matched */
1303 if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1304 if (clen > 0)
1305 {
1306 BOOL OK;
1307 const pcre_uint32 *cp;
1308 const ucd_record * prop = GET_UCD(c);
1309 switch(code[2])
1310 {
1311 case PT_ANY:
1312 OK = TRUE;
1313 break;
1314
1315 case PT_LAMP:
1316 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1317 prop->chartype == ucp_Lt;
1318 break;
1319
1320 case PT_GC:
1321 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1322 break;
1323
1324 case PT_PC:
1325 OK = prop->chartype == code[3];
1326 break;
1327
1328 case PT_SC:
1329 OK = prop->script == code[3];
1330 break;
1331
1332 /* These are specials for combination cases. */
1333
1334 case PT_ALNUM:
1335 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1336 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1337 break;
1338
1339 case PT_SPACE: /* Perl space */
1340 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1341 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1342 break;
1343
1344 case PT_PXSPACE: /* POSIX space */
1345 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1346 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1347 c == CHAR_FF || c == CHAR_CR;
1348 break;
1349
1350 case PT_WORD:
1351 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1352 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1353 c == CHAR_UNDERSCORE;
1354 break;
1355
1356 case PT_CLIST:
1357 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1358 for (;;)
1359 {
1360 if (c < *cp) { OK = FALSE; break; }
1361 if (c == *cp++) { OK = TRUE; break; }
1362 }
1363 break;
1364
1365 /* Should never occur, but keep compilers from grumbling. */
1366
1367 default:
1368 OK = codevalue != OP_PROP;
1369 break;
1370 }
1371
1372 if (OK == (d == OP_PROP))
1373 {
1374 if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1375 {
1376 active_count--; /* Remove non-match possibility */
1377 next_active_state--;
1378 }
1379 count++;
1380 ADD_NEW(state_offset, count);
1381 }
1382 }
1383 break;
1384
1385 /*-----------------------------------------------------------------*/
1386 case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1387 case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1388 case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1389 count = current_state->count; /* Already matched */
1390 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1391 if (clen > 0)
1392 {
1393 int lgb, rgb;
1394 const pcre_uchar *nptr = ptr + clen;
1395 int ncount = 0;
1396 if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1397 {
1398 active_count--; /* Remove non-match possibility */
1399 next_active_state--;
1400 }
1401 lgb = UCD_GRAPHBREAK(c);
1402 while (nptr < end_subject)
1403 {
1404 dlen = 1;
1405 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1406 rgb = UCD_GRAPHBREAK(d);
1407 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1408 ncount++;
1409 lgb = rgb;
1410 nptr += dlen;
1411 }
1412 count++;
1413 ADD_NEW_DATA(-state_offset, count, ncount);
1414 }
1415 break;
1416 #endif
1417
1418 /*-----------------------------------------------------------------*/
1419 case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1420 case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1421 case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1422 count = current_state->count; /* Already matched */
1423 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1424 if (clen > 0)
1425 {
1426 int ncount = 0;
1427 switch (c)
1428 {
1429 case CHAR_VT:
1430 case CHAR_FF:
1431 case CHAR_NEL:
1432 #ifndef EBCDIC
1433 case 0x2028:
1434 case 0x2029:
1435 #endif /* Not EBCDIC */
1436 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1437 goto ANYNL01;
1438
1439 case CHAR_CR:
1440 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1441 /* Fall through */
1442
1443 ANYNL01:
1444 case CHAR_LF:
1445 if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1446 {
1447 active_count--; /* Remove non-match possibility */
1448 next_active_state--;
1449 }
1450 count++;
1451 ADD_NEW_DATA(-state_offset, count, ncount);
1452 break;
1453
1454 default:
1455 break;
1456 }
1457 }
1458 break;
1459
1460 /*-----------------------------------------------------------------*/
1461 case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1462 case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1463 case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1464 count = current_state->count; /* Already matched */
1465 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1466 if (clen > 0)
1467 {
1468 BOOL OK;
1469 switch (c)
1470 {
1471 VSPACE_CASES:
1472 OK = TRUE;
1473 break;
1474
1475 default:
1476 OK = FALSE;
1477 break;
1478 }
1479
1480 if (OK == (d == OP_VSPACE))
1481 {
1482 if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1483 {
1484 active_count--; /* Remove non-match possibility */
1485 next_active_state--;
1486 }
1487 count++;
1488 ADD_NEW_DATA(-state_offset, count, 0);
1489 }
1490 }
1491 break;
1492
1493 /*-----------------------------------------------------------------*/
1494 case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1495 case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1496 case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1497 count = current_state->count; /* Already matched */
1498 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1499 if (clen > 0)
1500 {
1501 BOOL OK;
1502 switch (c)
1503 {
1504 HSPACE_CASES:
1505 OK = TRUE;
1506 break;
1507
1508 default:
1509 OK = FALSE;
1510 break;
1511 }
1512
1513 if (OK == (d == OP_HSPACE))
1514 {
1515 if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1516 {
1517 active_count--; /* Remove non-match possibility */
1518 next_active_state--;
1519 }
1520 count++;
1521 ADD_NEW_DATA(-state_offset, count, 0);
1522 }
1523 }
1524 break;
1525
1526 /*-----------------------------------------------------------------*/
1527 #ifdef SUPPORT_UCP
1528 case OP_PROP_EXTRA + OP_TYPEQUERY:
1529 case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1530 case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
1531 count = 4;
1532 goto QS1;
1533
1534 case OP_PROP_EXTRA + OP_TYPESTAR:
1535 case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1536 case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1537 count = 0;
1538
1539 QS1:
1540
1541 ADD_ACTIVE(state_offset + 4, 0);
1542 if (clen > 0)
1543 {
1544 BOOL OK;
1545 const pcre_uint32 *cp;
1546 const ucd_record * prop = GET_UCD(c);
1547 switch(code[2])
1548 {
1549 case PT_ANY:
1550 OK = TRUE;
1551 break;
1552
1553 case PT_LAMP:
1554 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1555 prop->chartype == ucp_Lt;
1556 break;
1557
1558 case PT_GC:
1559 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1560 break;
1561
1562 case PT_PC:
1563 OK = prop->chartype == code[3];
1564 break;
1565
1566 case PT_SC:
1567 OK = prop->script == code[3];
1568 break;
1569
1570 /* These are specials for combination cases. */
1571
1572 case PT_ALNUM:
1573 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1574 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1575 break;
1576
1577 case PT_SPACE: /* Perl space */
1578 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1579 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1580 break;
1581
1582 case PT_PXSPACE: /* POSIX space */
1583 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1584 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1585 c == CHAR_FF || c == CHAR_CR;
1586 break;
1587
1588 case PT_WORD:
1589 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1590 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1591 c == CHAR_UNDERSCORE;
1592 break;
1593
1594 case PT_CLIST:
1595 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1596 for (;;)
1597 {
1598 if (c < *cp) { OK = FALSE; break; }
1599 if (c == *cp++) { OK = TRUE; break; }
1600 }
1601 break;
1602
1603 /* Should never occur, but keep compilers from grumbling. */
1604
1605 default:
1606 OK = codevalue != OP_PROP;
1607 break;
1608 }
1609
1610 if (OK == (d == OP_PROP))
1611 {
1612 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1613 codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1614 {
1615 active_count--; /* Remove non-match possibility */
1616 next_active_state--;
1617 }
1618 ADD_NEW(state_offset + count, 0);
1619 }
1620 }
1621 break;
1622
1623 /*-----------------------------------------------------------------*/
1624 case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1625 case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1626 case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1627 count = 2;
1628 goto QS2;
1629
1630 case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1631 case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1632 case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1633 count = 0;
1634
1635 QS2:
1636
1637 ADD_ACTIVE(state_offset + 2, 0);
1638 if (clen > 0)
1639 {
1640 int lgb, rgb;
1641 const pcre_uchar *nptr = ptr + clen;
1642 int ncount = 0;
1643 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1644 codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1645 {
1646 active_count--; /* Remove non-match possibility */
1647 next_active_state--;
1648 }
1649 lgb = UCD_GRAPHBREAK(c);
1650 while (nptr < end_subject)
1651 {
1652 dlen = 1;
1653 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1654 rgb = UCD_GRAPHBREAK(d);
1655 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1656 ncount++;
1657 lgb = rgb;
1658 nptr += dlen;
1659 }
1660 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1661 }
1662 break;
1663 #endif
1664
1665 /*-----------------------------------------------------------------*/
1666 case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1667 case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1668 case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1669 count = 2;
1670 goto QS3;
1671
1672 case OP_ANYNL_EXTRA + OP_TYPESTAR:
1673 case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1674 case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1675 count = 0;
1676
1677 QS3:
1678 ADD_ACTIVE(state_offset + 2, 0);
1679 if (clen > 0)
1680 {
1681 int ncount = 0;
1682 switch (c)
1683 {
1684 case CHAR_VT:
1685 case CHAR_FF:
1686 case CHAR_NEL:
1687 #ifndef EBCDIC
1688 case 0x2028:
1689 case 0x2029:
1690 #endif /* Not EBCDIC */
1691 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1692 goto ANYNL02;
1693
1694 case CHAR_CR:
1695 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1696 /* Fall through */
1697
1698 ANYNL02:
1699 case CHAR_LF:
1700 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1701 codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1702 {
1703 active_count--; /* Remove non-match possibility */
1704 next_active_state--;
1705 }
1706 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1707 break;
1708
1709 default:
1710 break;
1711 }
1712 }
1713 break;
1714
1715 /*-----------------------------------------------------------------*/
1716 case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1717 case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1718 case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1719 count = 2;
1720 goto QS4;
1721
1722 case OP_VSPACE_EXTRA + OP_TYPESTAR:
1723 case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1724 case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1725 count = 0;
1726
1727 QS4:
1728 ADD_ACTIVE(state_offset + 2, 0);
1729 if (clen > 0)
1730 {
1731 BOOL OK;
1732 switch (c)
1733 {
1734 VSPACE_CASES:
1735 OK = TRUE;
1736 break;
1737
1738 default:
1739 OK = FALSE;
1740 break;
1741 }
1742 if (OK == (d == OP_VSPACE))
1743 {
1744 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1745 codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1746 {
1747 active_count--; /* Remove non-match possibility */
1748 next_active_state--;
1749 }
1750 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1751 }
1752 }
1753 break;
1754
1755 /*-----------------------------------------------------------------*/
1756 case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1757 case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1758 case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1759 count = 2;
1760 goto QS5;
1761
1762 case OP_HSPACE_EXTRA + OP_TYPESTAR:
1763 case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1764 case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1765 count = 0;
1766
1767 QS5:
1768 ADD_ACTIVE(state_offset + 2, 0);
1769 if (clen > 0)
1770 {
1771 BOOL OK;
1772 switch (c)
1773 {
1774 HSPACE_CASES:
1775 OK = TRUE;
1776 break;
1777
1778 default:
1779 OK = FALSE;
1780 break;
1781 }
1782
1783 if (OK == (d == OP_HSPACE))
1784 {
1785 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1786 codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1787 {
1788 active_count--; /* Remove non-match possibility */
1789 next_active_state--;
1790 }
1791 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1792 }
1793 }
1794 break;
1795
1796 /*-----------------------------------------------------------------*/
1797 #ifdef SUPPORT_UCP
1798 case OP_PROP_EXTRA + OP_TYPEEXACT:
1799 case OP_PROP_EXTRA + OP_TYPEUPTO:
1800 case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1801 case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1802 if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1803 { ADD_ACTIVE(state_offset + 1 + IMM2_SIZE + 3, 0); }
1804 count = current_state->count; /* Number already matched */
1805 if (clen > 0)
1806 {
1807 BOOL OK;
1808 const pcre_uint32 *cp;
1809 const ucd_record * prop = GET_UCD(c);
1810 switch(code[1 + IMM2_SIZE + 1])
1811 {
1812 case PT_ANY:
1813 OK = TRUE;
1814 break;
1815
1816 case PT_LAMP:
1817 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1818 prop->chartype == ucp_Lt;
1819 break;
1820
1821 case PT_GC:
1822 OK = PRIV(ucp_gentype)[prop->chartype] == code[1 + IMM2_SIZE + 2];
1823 break;
1824
1825 case PT_PC:
1826 OK = prop->chartype == code[1 + IMM2_SIZE + 2];
1827 break;
1828
1829 case PT_SC:
1830 OK = prop->script == code[1 + IMM2_SIZE + 2];
1831 break;
1832
1833 /* These are specials for combination cases. */
1834
1835 case PT_ALNUM:
1836 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1837 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1838 break;
1839
1840 case PT_SPACE: /* Perl space */
1841 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1842 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1843 break;
1844
1845 case PT_PXSPACE: /* POSIX space */
1846 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1847 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1848 c == CHAR_FF || c == CHAR_CR;
1849 break;
1850
1851 case PT_WORD:
1852 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1853 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1854 c == CHAR_UNDERSCORE;
1855 break;
1856
1857 case PT_CLIST:
1858 cp = PRIV(ucd_caseless_sets) + prop->caseset;
1859 for (;;)
1860 {
1861 if (c < *cp) { OK = FALSE; break; }
1862 if (c == *cp++) { OK = TRUE; break; }
1863 }
1864 break;
1865
1866 /* Should never occur, but keep compilers from grumbling. */
1867
1868 default:
1869 OK = codevalue != OP_PROP;
1870 break;
1871 }
1872
1873 if (OK == (d == OP_PROP))
1874 {
1875 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
1876 {
1877 active_count--; /* Remove non-match possibility */
1878 next_active_state--;
1879 }
1880 if (++count >= GET2(code, 1))
1881 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 3, 0); }
1882 else
1883 { ADD_NEW(state_offset, count); }
1884 }
1885 }
1886 break;
1887
1888 /*-----------------------------------------------------------------*/
1889 case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
1890 case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
1891 case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
1892 case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
1893 if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
1894 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1895 count = current_state->count; /* Number already matched */
1896 if (clen > 0)
1897 {
1898 int lgb, rgb;
1899 const pcre_uchar *nptr = ptr + clen;
1900 int ncount = 0;
1901 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
1902 {
1903 active_count--; /* Remove non-match possibility */
1904 next_active_state--;
1905 }
1906 lgb = UCD_GRAPHBREAK(c);
1907 while (nptr < end_subject)
1908 {
1909 dlen = 1;
1910 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
1911 rgb = UCD_GRAPHBREAK(d);
1912 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
1913 ncount++;
1914 lgb = rgb;
1915 nptr += dlen;
1916 }
1917 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
1918 reset_could_continue = TRUE;
1919 if (++count >= GET2(code, 1))
1920 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1921 else
1922 { ADD_NEW_DATA(-state_offset, count, ncount); }
1923 }
1924 break;
1925 #endif
1926
1927 /*-----------------------------------------------------------------*/
1928 case OP_ANYNL_EXTRA + OP_TYPEEXACT:
1929 case OP_ANYNL_EXTRA + OP_TYPEUPTO:
1930 case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
1931 case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
1932 if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
1933 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1934 count = current_state->count; /* Number already matched */
1935 if (clen > 0)
1936 {
1937 int ncount = 0;
1938 switch (c)
1939 {
1940 case CHAR_VT:
1941 case CHAR_FF:
1942 case CHAR_NEL:
1943 #ifndef EBCDIC
1944 case 0x2028:
1945 case 0x2029:
1946 #endif /* Not EBCDIC */
1947 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1948 goto ANYNL03;
1949
1950 case CHAR_CR:
1951 if (ptr + 1 < end_subject && ptr[1] == CHAR_LF) ncount = 1;
1952 /* Fall through */
1953
1954 ANYNL03:
1955 case CHAR_LF:
1956 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
1957 {
1958 active_count--; /* Remove non-match possibility */
1959 next_active_state--;
1960 }
1961 if (++count >= GET2(code, 1))
1962 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1963 else
1964 { ADD_NEW_DATA(-state_offset, count, ncount); }
1965 break;
1966
1967 default:
1968 break;
1969 }
1970 }
1971 break;
1972
1973 /*-----------------------------------------------------------------*/
1974 case OP_VSPACE_EXTRA + OP_TYPEEXACT:
1975 case OP_VSPACE_EXTRA + OP_TYPEUPTO:
1976 case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
1977 case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
1978 if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
1979 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1980 count = current_state->count; /* Number already matched */
1981 if (clen > 0)
1982 {
1983 BOOL OK;
1984 switch (c)
1985 {
1986 VSPACE_CASES:
1987 OK = TRUE;
1988 break;
1989
1990 default:
1991 OK = FALSE;
1992 }
1993
1994 if (OK == (d == OP_VSPACE))
1995 {
1996 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
1997 {
1998 active_count--; /* Remove non-match possibility */
1999 next_active_state--;
2000 }
2001 if (++count >= GET2(code, 1))
2002 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2003 else
2004 { ADD_NEW_DATA(-state_offset, count, 0); }
2005 }
2006 }
2007 break;
2008
2009 /*-----------------------------------------------------------------*/
2010 case OP_HSPACE_EXTRA + OP_TYPEEXACT:
2011 case OP_HSPACE_EXTRA + OP_TYPEUPTO:
2012 case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
2013 case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
2014 if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
2015 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
2016 count = current_state->count; /* Number already matched */
2017 if (clen > 0)
2018 {
2019 BOOL OK;
2020 switch (c)
2021 {
2022 HSPACE_CASES:
2023 OK = TRUE;
2024 break;
2025
2026 default:
2027 OK = FALSE;
2028 break;
2029 }
2030
2031 if (OK == (d == OP_HSPACE))
2032 {
2033 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
2034 {
2035 active_count--; /* Remove non-match possibility */
2036 next_active_state--;
2037 }
2038 if (++count >= GET2(code, 1))
2039 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2040 else
2041 { ADD_NEW_DATA(-state_offset, count, 0); }
2042 }
2043 }
2044 break;
2045
2046 /* ========================================================================== */
2047 /* These opcodes are followed by a character that is usually compared
2048 to the current subject character; it is loaded into d. We still get
2049 here even if there is no subject character, because in some cases zero
2050 repetitions are permitted. */
2051
2052 /*-----------------------------------------------------------------*/
2053 case OP_CHAR:
2054 if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
2055 break;
2056
2057 /*-----------------------------------------------------------------*/
2058 case OP_CHARI:
2059 if (clen == 0) break;
2060
2061 #ifdef SUPPORT_UTF
2062 if (utf)
2063 {
2064 if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
2065 {
2066 unsigned int othercase;
2067 if (c < 128)
2068 othercase = fcc[c];
2069 else
2070 /* If we have Unicode property support, we can use it to test the
2071 other case of the character. */
2072 #ifdef SUPPORT_UCP
2073 othercase = UCD_OTHERCASE(c);
2074 #else
2075 othercase = NOTACHAR;
2076 #endif
2077
2078 if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
2079 }
2080 }
2081 else
2082 #endif /* SUPPORT_UTF */
2083 /* Not UTF mode */
2084 {
2085 if (TABLE_GET(c, lcc, c) == TABLE_GET(d, lcc, d))
2086 { ADD_NEW(state_offset + 2, 0); }
2087 }
2088 break;
2089
2090
2091 #ifdef SUPPORT_UCP
2092 /*-----------------------------------------------------------------*/
2093 /* This is a tricky one because it can match more than one character.
2094 Find out how many characters to skip, and then set up a negative state
2095 to wait for them to pass before continuing. */
2096
2097 case OP_EXTUNI:
2098 if (clen > 0)
2099 {
2100 int lgb, rgb;
2101 const pcre_uchar *nptr = ptr + clen;
2102 int ncount = 0;
2103 lgb = UCD_GRAPHBREAK(c);
2104 while (nptr < end_subject)
2105 {
2106 dlen = 1;
2107 if (!utf) d = *nptr; else { GETCHARLEN(d, nptr, dlen); }
2108 rgb = UCD_GRAPHBREAK(d);
2109 if ((PRIV(ucp_gbtable)[lgb] & (1 << rgb)) == 0) break;
2110 ncount++;
2111 lgb = rgb;
2112 nptr += dlen;
2113 }
2114 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
2115 reset_could_continue = TRUE;
2116 ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2117 }
2118 break;
2119 #endif
2120
2121 /*-----------------------------------------------------------------*/
2122 /* This is tricky, like EXTUNI, because it too can match more than one
2123 character (when CR is followed by LF). In this case, set up a negative
2124 state to wait for one character to pass before continuing. */
2125
2126 case OP_ANYNL:
2127 if (clen > 0) switch(c)
2128 {
2129 case CHAR_VT:
2130 case CHAR_FF:
2131 case CHAR_NEL:
2132 #ifndef EBCDIC
2133 case 0x2028:
2134 case 0x2029:
2135 #endif /* Not EBCDIC */
2136 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
2137
2138 case CHAR_LF:
2139 ADD_NEW(state_offset + 1, 0);
2140 break;
2141
2142 case CHAR_CR:
2143 if (ptr + 1 >= end_subject)
2144 {
2145 ADD_NEW(state_offset + 1, 0);
2146 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
2147 reset_could_continue = TRUE;
2148 }
2149 else if (ptr[1] == CHAR_LF)
2150 {
2151 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2152 }
2153 else
2154 {
2155 ADD_NEW(state_offset + 1, 0);
2156 }
2157 break;
2158 }
2159 break;
2160
2161 /*-----------------------------------------------------------------*/
2162 case OP_NOT_VSPACE:
2163 if (clen > 0) switch(c)
2164 {
2165 VSPACE_CASES:
2166 break;
2167
2168 default:
2169 ADD_NEW(state_offset + 1, 0);
2170 break;
2171 }
2172 break;
2173
2174 /*-----------------------------------------------------------------*/
2175 case OP_VSPACE:
2176 if (clen > 0) switch(c)
2177 {
2178 VSPACE_CASES:
2179 ADD_NEW(state_offset + 1, 0);
2180 break;
2181
2182 default:
2183 break;
2184 }
2185 break;
2186
2187 /*-----------------------------------------------------------------*/
2188 case OP_NOT_HSPACE:
2189 if (clen > 0) switch(c)
2190 {
2191 HSPACE_CASES:
2192 break;
2193
2194 default:
2195 ADD_NEW(state_offset + 1, 0);
2196 break;
2197 }
2198 break;
2199
2200 /*-----------------------------------------------------------------*/
2201 case OP_HSPACE:
2202 if (clen > 0) switch(c)
2203 {
2204 HSPACE_CASES:
2205 ADD_NEW(state_offset + 1, 0);
2206 break;
2207
2208 default:
2209 break;
2210 }
2211 break;
2212
2213 /*-----------------------------------------------------------------*/
2214 /* Match a negated single character casefully. */
2215
2216 case OP_NOT:
2217 if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2218 break;
2219
2220 /*-----------------------------------------------------------------*/
2221 /* Match a negated single character caselessly. */
2222
2223 case OP_NOTI:
2224 if (clen > 0)
2225 {
2226 unsigned int otherd = NOTACHAR; /* Stays unset in UTF mode without UCP */
2227 #ifdef SUPPORT_UTF
2228 if (utf && d >= 128)
2229 {
2230 #ifdef SUPPORT_UCP
2231 otherd = UCD_OTHERCASE(d);
2232 #endif /* SUPPORT_UCP */
2233 }
2234 else
2235 #endif /* SUPPORT_UTF */
2236 otherd = TABLE_GET(d, fcc, d);
2237 if (c != d && c != otherd)
2238 { ADD_NEW(state_offset + dlen + 1, 0); }
2239 }
2240 break;
2241
2242 /*-----------------------------------------------------------------*/
2243 case OP_PLUSI:
2244 case OP_MINPLUSI:
2245 case OP_POSPLUSI:
2246 case OP_NOTPLUSI:
2247 case OP_NOTMINPLUSI:
2248 case OP_NOTPOSPLUSI:
2249 caseless = TRUE;
2250 codevalue -= OP_STARI - OP_STAR;
2251
2252 /* Fall through */
2253 case OP_PLUS:
2254 case OP_MINPLUS:
2255 case OP_POSPLUS:
2256 case OP_NOTPLUS:
2257 case OP_NOTMINPLUS:
2258 case OP_NOTPOSPLUS:
2259 count = current_state->count; /* Already matched */
2260 if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2261 if (clen > 0)
2262 {
2263 unsigned int otherd = NOTACHAR;
2264 if (caseless)
2265 {
2266 #ifdef SUPPORT_UTF
2267 if (utf && d >= 128)
2268 {
2269 #ifdef SUPPORT_UCP
2270 otherd = UCD_OTHERCASE(d);
2271 #endif /* SUPPORT_UCP */
2272 }
2273 else
2274 #endif /* SUPPORT_UTF */
2275 otherd = TABLE_GET(d, fcc, d);
2276 }
2277 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2278 {
2279 if (count > 0 &&
2280 (codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2281 {
2282 active_count--; /* Remove non-match possibility */
2283 next_active_state--;
2284 }
2285 count++;
2286 ADD_NEW(state_offset, count);
2287 }
2288 }
2289 break;
2290
2291 /*-----------------------------------------------------------------*/
2292 case OP_QUERYI:
2293 case OP_MINQUERYI:
2294 case OP_POSQUERYI:
2295 case OP_NOTQUERYI:
2296 case OP_NOTMINQUERYI:
2297 case OP_NOTPOSQUERYI:
2298 caseless = TRUE;
2299 codevalue -= OP_STARI - OP_STAR;
2300 /* Fall through */
2301 case OP_QUERY:
2302 case OP_MINQUERY:
2303 case OP_POSQUERY:
2304 case OP_NOTQUERY:
2305 case OP_NOTMINQUERY:
2306 case OP_NOTPOSQUERY:
2307 ADD_ACTIVE(state_offset + dlen + 1, 0);
2308 if (clen > 0)
2309 {
2310 unsigned int otherd = NOTACHAR;
2311 if (caseless)
2312 {
2313 #ifdef SUPPORT_UTF
2314 if (utf && d >= 128)
2315 {
2316 #ifdef SUPPORT_UCP
2317 otherd = UCD_OTHERCASE(d);
2318 #endif /* SUPPORT_UCP */
2319 }
2320 else
2321 #endif /* SUPPORT_UTF */
2322 otherd = TABLE_GET(d, fcc, d);
2323 }
2324 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2325 {
2326 if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2327 {
2328 active_count--; /* Remove non-match possibility */
2329 next_active_state--;
2330 }
2331 ADD_NEW(state_offset + dlen + 1, 0);
2332 }
2333 }
2334 break;
2335
2336 /*-----------------------------------------------------------------*/
2337 case OP_STARI:
2338 case OP_MINSTARI:
2339 case OP_POSSTARI:
2340 case OP_NOTSTARI:
2341 case OP_NOTMINSTARI:
2342 case OP_NOTPOSSTARI:
2343 caseless = TRUE;
2344 codevalue -= OP_STARI - OP_STAR;
2345 /* Fall through */
2346 case OP_STAR:
2347 case OP_MINSTAR:
2348 case OP_POSSTAR:
2349 case OP_NOTSTAR:
2350 case OP_NOTMINSTAR:
2351 case OP_NOTPOSSTAR:
2352 ADD_ACTIVE(state_offset + dlen + 1, 0);
2353 if (clen > 0)
2354 {
2355 unsigned int otherd = NOTACHAR;
2356 if (caseless)
2357 {
2358 #ifdef SUPPORT_UTF
2359 if (utf && d >= 128)
2360 {
2361 #ifdef SUPPORT_UCP
2362 otherd = UCD_OTHERCASE(d);
2363 #endif /* SUPPORT_UCP */
2364 }
2365 else
2366 #endif /* SUPPORT_UTF */
2367 otherd = TABLE_GET(d, fcc, d);
2368 }
2369 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2370 {
2371 if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2372 {
2373 active_count--; /* Remove non-match possibility */
2374 next_active_state--;
2375 }
2376 ADD_NEW(state_offset, 0);
2377 }
2378 }
2379 break;
2380
2381 /*-----------------------------------------------------------------*/
2382 case OP_EXACTI:
2383 case OP_NOTEXACTI:
2384 caseless = TRUE;
2385 codevalue -= OP_STARI - OP_STAR;
2386 /* Fall through */
2387 case OP_EXACT:
2388 case OP_NOTEXACT:
2389 count = current_state->count; /* Number already matched */
2390 if (clen > 0)
2391 {
2392 unsigned int otherd = NOTACHAR;
2393 if (caseless)
2394 {
2395 #ifdef SUPPORT_UTF
2396 if (utf && d >= 128)
2397 {
2398 #ifdef SUPPORT_UCP
2399 otherd = UCD_OTHERCASE(d);
2400 #endif /* SUPPORT_UCP */
2401 }
2402 else
2403 #endif /* SUPPORT_UTF */
2404 otherd = TABLE_GET(d, fcc, d);
2405 }
2406 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2407 {
2408 if (++count >= GET2(code, 1))
2409 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2410 else
2411 { ADD_NEW(state_offset, count); }
2412 }
2413 }
2414 break;
2415
2416 /*-----------------------------------------------------------------*/
2417 case OP_UPTOI:
2418 case OP_MINUPTOI:
2419 case OP_POSUPTOI:
2420 case OP_NOTUPTOI:
2421 case OP_NOTMINUPTOI:
2422 case OP_NOTPOSUPTOI:
2423 caseless = TRUE;
2424 codevalue -= OP_STARI - OP_STAR;
2425 /* Fall through */
2426 case OP_UPTO:
2427 case OP_MINUPTO:
2428 case OP_POSUPTO:
2429 case OP_NOTUPTO:
2430 case OP_NOTMINUPTO:
2431 case OP_NOTPOSUPTO:
2432 ADD_ACTIVE(state_offset + dlen + 1 + IMM2_SIZE, 0);
2433 count = current_state->count; /* Number already matched */
2434 if (clen > 0)
2435 {
2436 unsigned int otherd = NOTACHAR;
2437 if (caseless)
2438 {
2439 #ifdef SUPPORT_UTF
2440 if (utf && d >= 128)
2441 {
2442 #ifdef SUPPORT_UCP
2443 otherd = UCD_OTHERCASE(d);
2444 #endif /* SUPPORT_UCP */
2445 }
2446 else
2447 #endif /* SUPPORT_UTF */
2448 otherd = TABLE_GET(d, fcc, d);
2449 }
2450 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2451 {
2452 if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2453 {
2454 active_count--; /* Remove non-match possibility */
2455 next_active_state--;
2456 }
2457 if (++count >= GET2(code, 1))
2458 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2459 else
2460 { ADD_NEW(state_offset, count); }
2461 }
2462 }
2463 break;
2464
2465
2466 /* ========================================================================== */
2467 /* These are the class-handling opcodes */
2468
2469 case OP_CLASS:
2470 case OP_NCLASS:
2471 case OP_XCLASS:
2472 {
2473 BOOL isinclass = FALSE;
2474 int next_state_offset;
2475 const pcre_uchar *ecode;
2476
2477 /* For a simple class, there is always just a 32-byte table, and we
2478 can set isinclass from it. */
2479
2480 if (codevalue != OP_XCLASS)
2481 {
2482 ecode = code + 1 + (32 / sizeof(pcre_uchar));
2483 if (clen > 0)
2484 {
2485 isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2486 ((((pcre_uint8 *)(code + 1))[c/8] & (1 << (c&7))) != 0);
2487 }
2488 }
2489
2490 /* An extended class may have a table or a list of single characters,
2491 ranges, or both, and it may be positive or negative. There's a
2492 function that sorts all this out. */
2493
2494 else
2495 {
2496 ecode = code + GET(code, 1);
2497 if (clen > 0) isinclass = PRIV(xclass)(c, code + 1 + LINK_SIZE, utf);
2498 }
2499
2500 /* At this point, isinclass is set for all kinds of class, and ecode
2501 points to the byte after the end of the class. If there is a
2502 quantifier, this is where it will be. */
2503
2504 next_state_offset = (int)(ecode - start_code);
2505
2506 switch (*ecode)
2507 {
2508 case OP_CRSTAR:
2509 case OP_CRMINSTAR:
2510 ADD_ACTIVE(next_state_offset + 1, 0);
2511 if (isinclass) { ADD_NEW(state_offset, 0); }
2512 break;
2513
2514 case OP_CRPLUS:
2515 case OP_CRMINPLUS:
2516 count = current_state->count; /* Already matched */
2517 if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2518 if (isinclass) { count++; ADD_NEW(state_offset, count); }
2519 break;
2520
2521 case OP_CRQUERY:
2522 case OP_CRMINQUERY:
2523 ADD_ACTIVE(next_state_offset + 1, 0);
2524 if (isinclass) { ADD_NEW(next_state_offset + 1, 0); }
2525 break;
2526
2527 case OP_CRRANGE:
2528 case OP_CRMINRANGE:
2529 count = current_state->count; /* Already matched */
2530 if (count >= GET2(ecode, 1))
2531 { ADD_ACTIVE(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2532 if (isinclass)
2533 {
2534 int max = GET2(ecode, 1 + IMM2_SIZE);
2535 if (++count >= max && max != 0) /* Max 0 => no limit */
2536 { ADD_NEW(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2537 else
2538 { ADD_NEW(state_offset, count); }
2539 }
2540 break;
2541
2542 default:
2543 if (isinclass) { ADD_NEW(next_state_offset, 0); }
2544 break;
2545 }
2546 }
2547 break;
2548
2549 /* ========================================================================== */
2550 /* These are the opcodes for fancy brackets of various kinds. We have
2551 to use recursion in order to handle them. The "always failing" assertion
2552 (?!) is optimised to OP_FAIL when compiling, so we have to support that,
2553 though the other "backtracking verbs" are not supported. */
2554
2555 case OP_FAIL:
2556 forced_fail++; /* Count FAILs for multiple states */
2557 break;
2558
2559 case OP_ASSERT:
2560 case OP_ASSERT_NOT:
2561 case OP_ASSERTBACK:
2562 case OP_ASSERTBACK_NOT:
2563 {
2564 int rc;
2565 int local_offsets[2];
2566 int local_workspace[1000];
2567 const pcre_uchar *endasscode = code + GET(code, 1);
2568
2569 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2570
2571 rc = internal_dfa_exec(
2572 md, /* static match data */
2573 code, /* this subexpression's code */
2574 ptr, /* where we currently are */
2575 (int)(ptr - start_subject), /* start offset */
2576 local_offsets, /* offset vector */
2577 sizeof(local_offsets)/sizeof(int), /* size of same */
2578 local_workspace, /* workspace vector */
2579 sizeof(local_workspace)/sizeof(int), /* size of same */
2580 rlevel); /* function recursion level */
2581
2582 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2583 if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2584 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2585 }
2586 break;
2587
2588 /*-----------------------------------------------------------------*/
2589 case OP_COND:
2590 case OP_SCOND:
2591 {
2592 int local_offsets[1000];
2593 int local_workspace[1000];
2594 int codelink = GET(code, 1);
2595 int condcode;
2596
2597 /* Because of the way auto-callout works during compile, a callout item
2598 is inserted between OP_COND and an assertion condition. This does not
2599 happen for the other conditions. */
2600
2601 if (code[LINK_SIZE+1] == OP_CALLOUT)
2602 {
2603 rrc = 0;
2604 if (PUBL(callout) != NULL)
2605 {
2606 PUBL(callout_block) cb;
2607 cb.version = 1; /* Version 1 of the callout block */
2608 cb.callout_number = code[LINK_SIZE+2];
2609 cb.offset_vector = offsets;
2610 #ifdef COMPILE_PCRE8
2611 cb.subject = (PCRE_SPTR)start_subject;
2612 #else
2613 cb.subject = (PCRE_SPTR16)start_subject;
2614 #endif
2615 cb.subject_length = (int)(end_subject - start_subject);
2616 cb.start_match = (int)(current_subject - start_subject);
2617 cb.current_position = (int)(ptr - start_subject);
2618 cb.pattern_position = GET(code, LINK_SIZE + 3);
2619 cb.next_item_length = GET(code, 3 + 2*LINK_SIZE);
2620 cb.capture_top = 1;
2621 cb.capture_last = -1;
2622 cb.callout_data = md->callout_data;
2623 cb.mark = NULL; /* No (*MARK) support */
2624 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2625 }
2626 if (rrc > 0) break; /* Fail this thread */
2627 code += PRIV(OP_lengths)[OP_CALLOUT]; /* Skip callout data */
2628 }
2629
2630 condcode = code[LINK_SIZE+1];
2631
2632 /* Back reference conditions are not supported */
2633
2634 if (condcode == OP_CREF || condcode == OP_NCREF)
2635 return PCRE_ERROR_DFA_UCOND;
2636
2637 /* The DEFINE condition is always false */
2638
2639 if (condcode == OP_DEF)
2640 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2641
2642 /* The only supported version of OP_RREF is for the value RREF_ANY,
2643 which means "test if in any recursion". We can't test for specifically
2644 recursed groups. */
2645
2646 else if (condcode == OP_RREF || condcode == OP_NRREF)
2647 {
2648 int value = GET2(code, LINK_SIZE + 2);
2649 if (value != RREF_ANY) return PCRE_ERROR_DFA_UCOND;
2650 if (md->recursive != NULL)
2651 { ADD_ACTIVE(state_offset + LINK_SIZE + 2 + IMM2_SIZE, 0); }
2652 else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2653 }
2654
2655 /* Otherwise, the condition is an assertion */
2656
2657 else
2658 {
2659 int rc;
2660 const pcre_uchar *asscode = code + LINK_SIZE + 1;
2661 const pcre_uchar *endasscode = asscode + GET(asscode, 1);
2662
2663 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2664
2665 rc = internal_dfa_exec(
2666 md, /* fixed match data */
2667 asscode, /* this subexpression's code */
2668 ptr, /* where we currently are */
2669 (int)(ptr - start_subject), /* start offset */
2670 local_offsets, /* offset vector */
2671 sizeof(local_offsets)/sizeof(int), /* size of same */
2672 local_workspace, /* workspace vector */
2673 sizeof(local_workspace)/sizeof(int), /* size of same */
2674 rlevel); /* function recursion level */
2675
2676 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2677 if ((rc >= 0) ==
2678 (condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2679 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2680 else
2681 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2682 }
2683 }
2684 break;
2685
2686 /*-----------------------------------------------------------------*/
2687 case OP_RECURSE:
2688 {
2689 dfa_recursion_info *ri;
2690 int local_offsets[1000];
2691 int local_workspace[1000];
2692 const pcre_uchar *callpat = start_code + GET(code, 1);
2693 int recno = (callpat == md->start_code)? 0 :
2694 GET2(callpat, 1 + LINK_SIZE);
2695 int rc;
2696
2697 DPRINTF(("%.*sStarting regex recursion\n", rlevel*2-2, SP));
2698
2699 /* Check for repeating a recursion without advancing the subject
2700 pointer. This should catch convoluted mutual recursions. (Some simple
2701 cases are caught at compile time.) */
2702
2703 for (ri = md->recursive; ri != NULL; ri = ri->prevrec)
2704 if (recno == ri->group_num && ptr == ri->subject_position)
2705 return PCRE_ERROR_RECURSELOOP;
2706
2707 /* Remember this recursion and where we started it so as to
2708 catch infinite loops. */
2709
2710 new_recursive.group_num = recno;
2711 new_recursive.subject_position = ptr;
2712 new_recursive.prevrec = md->recursive;
2713 md->recursive = &new_recursive;
2714
2715 rc = internal_dfa_exec(
2716 md, /* fixed match data */
2717 callpat, /* this subexpression's code */
2718 ptr, /* where we currently are */
2719 (int)(ptr - start_subject), /* start offset */
2720 local_offsets, /* offset vector */
2721 sizeof(local_offsets)/sizeof(int), /* size of same */
2722 local_workspace, /* workspace vector */
2723 sizeof(local_workspace)/sizeof(int), /* size of same */
2724 rlevel); /* function recursion level */
2725
2726 md->recursive = new_recursive.prevrec; /* Done this recursion */
2727
2728 DPRINTF(("%.*sReturn from regex recursion: rc=%d\n", rlevel*2-2, SP,
2729 rc));
2730
2731 /* Ran out of internal offsets */
2732
2733 if (rc == 0) return PCRE_ERROR_DFA_RECURSE;
2734
2735 /* For each successful matched substring, set up the next state with a
2736 count of characters to skip before trying it. Note that the count is in
2737 characters, not bytes. */
2738
2739 if (rc > 0)
2740 {
2741 for (rc = rc*2 - 2; rc >= 0; rc -= 2)
2742 {
2743 int charcount = local_offsets[rc+1] - local_offsets[rc];
2744 #ifdef SUPPORT_UTF
2745 if (utf)
2746 {
2747 const pcre_uchar *p = start_subject + local_offsets[rc];
2748 const pcre_uchar *pp = start_subject + local_offsets[rc+1];
2749 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2750 }
2751 #endif
2752 if (charcount > 0)
2753 {
2754 ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0, (charcount - 1));
2755 }
2756 else
2757 {
2758 ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
2759 }
2760 }
2761 }
2762 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2763 }
2764 break;
2765
2766 /*-----------------------------------------------------------------*/
2767 case OP_BRAPOS:
2768 case OP_SBRAPOS:
2769 case OP_CBRAPOS:
2770 case OP_SCBRAPOS:
2771 case OP_BRAPOSZERO:
2772 {
2773 int charcount, matched_count;
2774 const pcre_uchar *local_ptr = ptr;
2775 BOOL allow_zero;
2776
2777 if (codevalue == OP_BRAPOSZERO)
2778 {
2779 allow_zero = TRUE;
2780 codevalue = *(++code); /* Codevalue will be one of above BRAs */
2781 }
2782 else allow_zero = FALSE;
2783
2784 /* Loop to match the subpattern as many times as possible as if it were
2785 a complete pattern. */
2786
2787 for (matched_count = 0;; matched_count++)
2788 {
2789 int local_offsets[2];
2790 int local_workspace[1000];
2791
2792 int rc = internal_dfa_exec(
2793 md, /* fixed match data */
2794 code, /* this subexpression's code */
2795 local_ptr, /* where we currently are */
2796 (int)(ptr - start_subject), /* start offset */
2797 local_offsets, /* offset vector */
2798 sizeof(local_offsets)/sizeof(int), /* size of same */
2799 local_workspace, /* workspace vector */
2800 sizeof(local_workspace)/sizeof(int), /* size of same */
2801 rlevel); /* function recursion level */
2802
2803 /* Failed to match */
2804
2805 if (rc < 0)
2806 {
2807 if (rc != PCRE_ERROR_NOMATCH) return rc;
2808 break;
2809 }
2810
2811 /* Matched: break the loop if zero characters matched. */
2812
2813 charcount = local_offsets[1] - local_offsets[0];
2814 if (charcount == 0) break;
2815 local_ptr += charcount; /* Advance temporary position ptr */
2816 }
2817
2818 /* At this point we have matched the subpattern matched_count
2819 times, and local_ptr is pointing to the character after the end of the
2820 last match. */
2821
2822 if (matched_count > 0 || allow_zero)
2823 {
2824 const pcre_uchar *end_subpattern = code;
2825 int next_state_offset;
2826
2827 do { end_subpattern += GET(end_subpattern, 1); }
2828 while (*end_subpattern == OP_ALT);
2829 next_state_offset =
2830 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2831
2832 /* Optimization: if there are no more active states, and there
2833 are no new states yet set up, then skip over the subject string
2834 right here, to save looping. Otherwise, set up the new state to swing
2835 into action when the end of the matched substring is reached. */
2836
2837 if (i + 1 >= active_count && new_count == 0)
2838 {
2839 ptr = local_ptr;
2840 clen = 0;
2841 ADD_NEW(next_state_offset, 0);
2842 }
2843 else
2844 {
2845 const pcre_uchar *p = ptr;
2846 const pcre_uchar *pp = local_ptr;
2847 charcount = (int)(pp - p);
2848 #ifdef SUPPORT_UTF
2849 if (utf) while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2850 #endif
2851 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2852 }
2853 }
2854 }
2855 break;
2856
2857 /*-----------------------------------------------------------------*/
2858 case OP_ONCE:
2859 case OP_ONCE_NC:
2860 {
2861 int local_offsets[2];
2862 int local_workspace[1000];
2863
2864 int rc = internal_dfa_exec(
2865 md, /* fixed match data */
2866 code, /* this subexpression's code */
2867 ptr, /* where we currently are */
2868 (int)(ptr - start_subject), /* start offset */
2869 local_offsets, /* offset vector */
2870 sizeof(local_offsets)/sizeof(int), /* size of same */
2871 local_workspace, /* workspace vector */
2872 sizeof(local_workspace)/sizeof(int), /* size of same */
2873 rlevel); /* function recursion level */
2874
2875 if (rc >= 0)
2876 {
2877 const pcre_uchar *end_subpattern = code;
2878 int charcount = local_offsets[1] - local_offsets[0];
2879 int next_state_offset, repeat_state_offset;
2880
2881 do { end_subpattern += GET(end_subpattern, 1); }
2882 while (*end_subpattern == OP_ALT);
2883 next_state_offset =
2884 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2885
2886 /* If the end of this subpattern is KETRMAX or KETRMIN, we must
2887 arrange for the repeat state also to be added to the relevant list.
2888 Calculate the offset, or set -1 for no repeat. */
2889
2890 repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
2891 *end_subpattern == OP_KETRMIN)?
2892 (int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
2893
2894 /* If we have matched an empty string, add the next state at the
2895 current character pointer. This is important so that the duplicate
2896 checking kicks in, which is what breaks infinite loops that match an
2897 empty string. */
2898
2899 if (charcount == 0)
2900 {
2901 ADD_ACTIVE(next_state_offset, 0);
2902 }
2903
2904 /* Optimization: if there are no more active states, and there
2905 are no new states yet set up, then skip over the subject string
2906 right here, to save looping. Otherwise, set up the new state to swing
2907 into action when the end of the matched substring is reached. */
2908
2909 else if (i + 1 >= active_count && new_count == 0)
2910 {
2911 ptr += charcount;
2912 clen = 0;
2913 ADD_NEW(next_state_offset, 0);
2914
2915 /* If we are adding a repeat state at the new character position,
2916 we must fudge things so that it is the only current state.
2917 Otherwise, it might be a duplicate of one we processed before, and
2918 that would cause it to be skipped. */
2919
2920 if (repeat_state_offset >= 0)
2921 {
2922 next_active_state = active_states;
2923 active_count = 0;
2924 i = -1;
2925 ADD_ACTIVE(repeat_state_offset, 0);
2926 }
2927 }
2928 else
2929 {
2930 #ifdef SUPPORT_UTF
2931 if (utf)
2932 {
2933 const pcre_uchar *p = start_subject + local_offsets[0];
2934 const pcre_uchar *pp = start_subject + local_offsets[1];
2935 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2936 }
2937 #endif
2938 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2939 if (repeat_state_offset >= 0)
2940 { ADD_NEW_DATA(-repeat_state_offset, 0, (charcount - 1)); }
2941 }
2942 }
2943 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2944 }
2945 break;
2946
2947
2948 /* ========================================================================== */
2949 /* Handle callouts */
2950
2951 case OP_CALLOUT:
2952 rrc = 0;
2953 if (PUBL(callout) != NULL)
2954 {
2955 PUBL(callout_block) cb;
2956 cb.version = 1; /* Version 1 of the callout block */
2957 cb.callout_number = code[1];
2958 cb.offset_vector = offsets;
2959 #ifdef COMPILE_PCRE8
2960 cb.subject = (PCRE_SPTR)start_subject;
2961 #else
2962 cb.subject = (PCRE_SPTR16)start_subject;
2963 #endif
2964 cb.subject_length = (int)(end_subject - start_subject);
2965 cb.start_match = (int)(current_subject - start_subject);
2966 cb.current_position = (int)(ptr - start_subject);
2967 cb.pattern_position = GET(code, 2);
2968 cb.next_item_length = GET(code, 2 + LINK_SIZE);
2969 cb.capture_top = 1;
2970 cb.capture_last = -1;
2971 cb.callout_data = md->callout_data;
2972 cb.mark = NULL; /* No (*MARK) support */
2973 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2974 }
2975 if (rrc == 0)
2976 { ADD_ACTIVE(state_offset + PRIV(OP_lengths)[OP_CALLOUT], 0); }
2977 break;
2978
2979
2980 /* ========================================================================== */
2981 default: /* Unsupported opcode */
2982 return PCRE_ERROR_DFA_UITEM;
2983 }
2984
2985 NEXT_ACTIVE_STATE: continue;
2986
2987 } /* End of loop scanning active states */
2988
2989 /* We have finished the processing at the current subject character. If no
2990 new states have been set for the next character, we have found all the
2991 matches that we are going to find. If we are at the top level and partial
2992 matching has been requested, check for appropriate conditions.
2993
2994 The "forced_fail" variable counts the number of (*F) encountered for the
2995 character. If it is equal to the original active_count (saved in
2996 workspace[1]) it means that (*F) was found on every active state. In this
2997 case we don't want to give a partial match.
2998
2999 The "could_continue" variable is true if a state could have continued but
3000 for the fact that the end of the subject was reached. */
3001
3002 if (new_count <= 0)
3003 {
3004 if (rlevel == 1 && /* Top level, and */
3005 could_continue && /* Some could go on, and */
3006 forced_fail != workspace[1] && /* Not all forced fail & */
3007 ( /* either... */
3008 (md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
3009 || /* or... */
3010 ((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
3011 match_count < 0) /* no matches */
3012 ) && /* And... */
3013 (
3014 partial_newline || /* Either partial NL */
3015 ( /* or ... */
3016 ptr >= end_subject && /* End of subject and */
3017 ptr > md->start_used_ptr) /* Inspected non-empty string */
3018 )
3019 )
3020 {
3021 if (offsetcount >= 2)
3022 {
3023 offsets[0] = (int)(md->start_used_ptr - start_subject);
3024 offsets[1] = (int)(end_subject - start_subject);
3025 }
3026 match_count = PCRE_ERROR_PARTIAL;
3027 }
3028
3029 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
3030 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel, match_count,
3031 rlevel*2-2, SP));
3032 break; /* In effect, "return", but see the comment below */
3033 }
3034
3035 /* One or more states are active for the next character. */
3036
3037 ptr += clen; /* Advance to next subject character */
3038 } /* Loop to move along the subject string */
3039
3040 /* Control gets here from "break" a few lines above. We do it this way because
3041 if we use "return" above, we have compiler trouble. Some compilers warn if
3042 there's nothing here because they think the function doesn't return a value. On
3043 the other hand, if we put a dummy statement here, some more clever compilers
3044 complain that it can't be reached. Sigh. */
3045
3046 return match_count;
3047 }
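
The OP_RECURSE, OP_BRAPOS and OP_ONCE cases above convert a matched length in code units into a count of characters before queuing a negative ("wait") state. The following is a minimal standalone sketch of that conversion, assuming the 8-bit library, where a UTF-8 continuation byte has the form 10xxxxxx. The names is_continuation_byte and utf8_char_count are hypothetical and are not part of PCRE; the real code uses the NOT_FIRSTCHAR macro inline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: nonzero when the byte is a UTF-8
   continuation byte (10xxxxxx), that is, not the first byte of a character. */
static int is_continuation_byte(unsigned char b)
{
return (b & 0xc0) == 0x80;
}

/* Start from the length in bytes and subtract one for every continuation
   byte, so that each multi-byte character is counted exactly once. */
static int utf8_char_count(const unsigned char *p, const unsigned char *pp)
{
int charcount;
assert(p <= pp);                 /* The range must not be reversed */
charcount = (int)(pp - p);       /* Length in bytes */
while (p < pp) if (is_continuation_byte(*p++)) charcount--;
return charcount;
}
```

For the four bytes 'a', 0xc3, 0xa9, 'z' the sketch yields three characters, because 0xa9 is the only continuation byte. The engine then passes charcount - 1 as the data value of the negative state, since the current character is consumed when the state is created.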
3048
3049
3050
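
The OP_CLASS and OP_NCLASS handling in the function above tests membership with a 32-byte (256-bit) table indexed by c/8 and masked by 1 << (c&7), and treats any character above 255 as a member only for OP_NCLASS. The following is a small standalone sketch of that test. The class_map type and the class_map_add and class_map_test functions are hypothetical names introduced for illustration; PCRE reads the table directly from the compiled pattern.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for a 256-bit class table: one bit per character
   value in the range 0 to 255. */
typedef struct { unsigned char bits[32]; } class_map;

/* Set the bit for character c. Only values below 256 fit in the table. */
static void class_map_add(class_map *m, unsigned int c)
{
assert(c < 256);
m->bits[c/8] |= (unsigned char)(1u << (c & 7));
}

/* Mirror the expression used for OP_CLASS and OP_NCLASS: characters above 255
   are outside the table, so the result depends only on the kind of class;
   otherwise the answer is the bit for c. */
static int class_map_test(const class_map *m, unsigned int c, int negated_class)
{
if (c > 255) return negated_class != 0;
return (m->bits[c/8] & (1u << (c & 7))) != 0;
}
```

A caller would zero the table with memset, add the wanted characters, and then query it; this one-lookup test is what lets the simple class opcodes avoid the PRIV(xclass) function call that OP_XCLASS needs.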
3051
3052 /*************************************************
3053 * Execute a Regular Expression - DFA engine *
3054 *************************************************/
3055
3056 /* This external function applies a compiled re to a subject string using a DFA
3057 engine. This function calls the internal function multiple times if the pattern
3058 is not anchored.
3059
3060 Arguments:
3061 argument_re points to the compiled expression
3062 extra_data points to extra data or is NULL
3063 subject points to the subject string
3064 length length of subject string (may contain binary zeros)
3065 start_offset where to start in the subject string
3066 options option bits
3067 offsets vector of match offsets
3068 offsetcount size of same
3069 workspace workspace vector
3070 wscount size of same
3071
3072 Returns: > 0 => number of match offset pairs placed in offsets
3073 = 0 => offsets overflowed; longest matches are present
3074 -1 => failed to match
3075 < -1 => some kind of unexpected problem
3076 */
3077
3078 #ifdef COMPILE_PCRE8
3079 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3080 pcre_dfa_exec(const pcre *argument_re, const pcre_extra *extra_data,
3081 const char *subject, int length, int start_offset, int options, int *offsets,
3082 int offsetcount, int *workspace, int wscount)
3083 #else
3084 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3085 pcre16_dfa_exec(const pcre16 *argument_re, const pcre16_extra *extra_data,
3086 PCRE_SPTR16 subject, int length, int start_offset, int options, int *offsets,
3087 int offsetcount, int *workspace, int wscount)
3088 #endif
3089 {
3090 REAL_PCRE *re = (REAL_PCRE *)argument_re;
3091 dfa_match_data match_block;
3092 dfa_match_data *md = &match_block;
3093 BOOL utf, anchored, startline, firstline;
3094 const pcre_uchar *current_subject, *end_subject;
3095 const pcre_study_data *study = NULL;
3096
3097 const pcre_uchar *req_char_ptr;
3098 const pcre_uint8 *start_bits = NULL;
3099 BOOL has_first_char = FALSE;
3100 BOOL has_req_char = FALSE;
3101 pcre_uchar first_char = 0;
3102 pcre_uchar first_char2 = 0;
3103 pcre_uchar req_char = 0;
3104 pcre_uchar req_char2 = 0;
3105 int newline;
3106
3107 /* Plausibility checks */
3108
3109 if ((options & ~PUBLIC_DFA_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
3110 if (re == NULL || subject == NULL || workspace == NULL ||
3111 (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
3112 if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
3113 if (wscount < 20) return PCRE_ERROR_DFA_WSSIZE;
3114 if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
3115
3116 /* Check that the first field in the block is the magic number. If it is not,
3117 return with PCRE_ERROR_BADMAGIC. However, if the magic number is equal to
3118 REVERSED_MAGIC_NUMBER we return with PCRE_ERROR_BADENDIANNESS, which
3119 means that the pattern is likely compiled with different endianness. */
3120
3121 if (re->magic_number != MAGIC_NUMBER)
3122 return re->magic_number == REVERSED_MAGIC_NUMBER?
3123 PCRE_ERROR_BADENDIANNESS:PCRE_ERROR_BADMAGIC;
3124 if ((re->flags & PCRE_MODE) == 0) return PCRE_ERROR_BADMODE;
3125
3126 /* If restarting after a partial match, do some sanity checks on the contents
3127 of the workspace. */
3128
3129 if ((options & PCRE_DFA_RESTART) != 0)
3130 {
3131 if ((workspace[0] & (-2)) != 0 || workspace[1] < 1 ||
3132 workspace[1] > (wscount - 2)/INTS_PER_STATEBLOCK)
3133 return PCRE_ERROR_DFA_BADRESTART;
3134 }
3135
3136 /* Set up study, callout, and table data */
3137
3138 md->tables = re->tables;
3139 md->callout_data = NULL;
3140
3141 if (extra_data != NULL)
3142 {
3143 unsigned int flags = extra_data->flags;
3144 if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
3145 study = (const pcre_study_data *)extra_data->study_data;
3146 if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0) return PCRE_ERROR_DFA_UMLIMIT;
3147 if ((flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION) != 0)
3148 return PCRE_ERROR_DFA_UMLIMIT;
3149 if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
3150 md->callout_data = extra_data->callout_data;
3151 if ((flags & PCRE_EXTRA_TABLES) != 0)
3152 md->tables = extra_data->tables;
3153 }
3154
3155 /* Set some local values */
3156
3157 current_subject = (const pcre_uchar *)subject + start_offset;
3158 end_subject = (const pcre_uchar *)subject + length;
3159 req_char_ptr = current_subject - 1;
3160
3161 #ifdef SUPPORT_UTF
3162 /* PCRE_UTF16 has the same value as PCRE_UTF8. */
3163 utf = (re->options & PCRE_UTF8) != 0;
3164 #else
3165 utf = FALSE;
3166 #endif
3167
3168 anchored = (options & (PCRE_ANCHORED|PCRE_DFA_RESTART)) != 0 ||
3169 (re->options & PCRE_ANCHORED) != 0;
3170
3171 /* The remaining fixed data for passing around. */
3172
3173 md->start_code = (const pcre_uchar *)argument_re +
3174 re->name_table_offset + re->name_count * re->name_entry_size;
3175 md->start_subject = (const pcre_uchar *)subject;
3176 md->end_subject = end_subject;
3177 md->start_offset = start_offset;
3178 md->moptions = options;
3179 md->poptions = re->options;
3180
3181 /* If the BSR option is not set at match time, copy what was set
3182 at compile time. */
3183
3184 if ((md->moptions & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) == 0)
3185 {
3186 if ((re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) != 0)
3187 md->moptions |= re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE);
3188 #ifdef BSR_ANYCRLF
3189 else md->moptions |= PCRE_BSR_ANYCRLF;
3190 #endif
3191 }
3192
3193 /* Handle different types of newline. The three bits give eight cases. If
3194 nothing is set at run time, whatever was used at compile time applies. */
3195
3196 switch ((((options & PCRE_NEWLINE_BITS) == 0)? re->options : (pcre_uint32)options) &
3197 PCRE_NEWLINE_BITS)
3198 {
3199 case 0: newline = NEWLINE; break; /* Compile-time default */
3200 case PCRE_NEWLINE_CR: newline = CHAR_CR; break;
3201 case PCRE_NEWLINE_LF: newline = CHAR_NL; break;
3202 case PCRE_NEWLINE_CR+
3203 PCRE_NEWLINE_LF: newline = (CHAR_CR << 8) | CHAR_NL; break;
3204 case PCRE_NEWLINE_ANY: newline = -1; break;
3205 case PCRE_NEWLINE_ANYCRLF: newline = -2; break;
3206 default: return PCRE_ERROR_BADNEWLINE;
3207 }
3208
3209 if (newline == -2)
3210 {
3211 md->nltype = NLTYPE_ANYCRLF;
3212 }
3213 else if (newline < 0)
3214 {
3215 md->nltype = NLTYPE_ANY;
3216 }
3217 else
3218 {
3219 md->nltype = NLTYPE_FIXED;
3220 if (newline > 255)
3221 {
3222 md->nllen = 2;
3223 md->nl[0] = (newline >> 8) & 255;
3224 md->nl[1] = newline & 255;
3225 }
3226 else
3227 {
3228 md->nllen = 1;
3229 md->nl[0] = newline;
3230 }
3231 }
3232
3233 /* Check a UTF string if required. On failure, the error offset and the error
3234 code are passed back in offsets[0] and offsets[1] if there is room for them. */
3235
3236 #ifdef SUPPORT_UTF
3237 if (utf && (options & PCRE_NO_UTF8_CHECK) == 0)
3238 {
3239 int erroroffset;
3240 int errorcode = PRIV(valid_utf)((pcre_uchar *)subject, length, &erroroffset);
3241 if (errorcode != 0)
3242 {
3243 if (offsetcount >= 2)
3244 {
3245 offsets[0] = erroroffset;
3246 offsets[1] = errorcode;
3247 }
3248 return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0)?
3249 PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
3250 }
3251 if (start_offset > 0 && start_offset < length &&
3252 NOT_FIRSTCHAR(((PCRE_PUCHAR)subject)[start_offset]))
3253 return PCRE_ERROR_BADUTF8_OFFSET;
3254 }
3255 #endif
3256
3257 /* If the exec call supplied NULL for tables, use the inbuilt ones. This
3258 is a feature that makes it possible to save compiled regexes and re-use them
3259 in other programs later. */
3260
3261 if (md->tables == NULL) md->tables = PRIV(default_tables);
3262
3263 /* The "must be at the start of a line" flags are used in a loop when finding
3264 where to start. */
3265
3266 startline = (re->flags & PCRE_STARTLINE) != 0;
3267 firstline = (re->options & PCRE_FIRSTLINE) != 0;
3268
3269 /* Set up the first character to match, if available. The first_char value is
3270 never set for an anchored regular expression, but the anchoring may be forced
3271 at run time, so we have to test for anchoring. The first char may be unset for
3272 an unanchored pattern, of course. If there's no first char and the pattern was
3273 studied, there may be a bitmap of possible first characters. */
3274
3275 if (!anchored)
3276 {
3277 if ((re->flags & PCRE_FIRSTSET) != 0)
3278 {
3279 has_first_char = TRUE;
3280 first_char = first_char2 = (pcre_uchar)(re->first_char);
3281 if ((re->flags & PCRE_FCH_CASELESS) != 0)
3282 {
3283 first_char2 = TABLE_GET(first_char, md->tables + fcc_offset, first_char);
3284 #if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
3285 if (utf && first_char > 127)
3286 first_char2 = UCD_OTHERCASE(first_char);
3287 #endif
3288 }
3289 }
3290 else
3291 {
3292 if (!startline && study != NULL &&
3293 (study->flags & PCRE_STUDY_MAPPED) != 0)
3294 start_bits = study->start_bits;
3295 }
3296 }
3297
3298 /* For anchored or unanchored matches, there may be a "last known required
3299 character" set. */
3300
3301 if ((re->flags & PCRE_REQCHSET) != 0)
3302 {
3303 has_req_char = TRUE;
3304 req_char = req_char2 = (pcre_uchar)(re->req_char);
3305 if ((re->flags & PCRE_RCH_CASELESS) != 0)
3306 {
3307 req_char2 = TABLE_GET(req_char, md->tables + fcc_offset, req_char);
3308 #if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
3309 if (utf && req_char > 127)
3310 req_char2 = UCD_OTHERCASE(req_char);
3311 #endif
3312 }
3313 }
3314
3315 /* Call the main matching function, looping for a non-anchored regex after a
3316 failed match. If not restarting, perform certain optimizations at the start of
3317 a match. */
3318
3319 for (;;)
3320 {
3321 int rc;
3322
3323 if ((options & PCRE_DFA_RESTART) == 0)
3324 {
3325 const pcre_uchar *save_end_subject = end_subject;
3326
3327 /* If firstline is TRUE, the start of the match is constrained to the first
3328 line of a multiline string. Implement this by temporarily adjusting
3329 end_subject so that we stop scanning at a newline. If the match fails at
3330 the newline, later code breaks this loop. */
3331
3332 if (firstline)
3333 {
3334 PCRE_PUCHAR t = current_subject;
3335 #ifdef SUPPORT_UTF
3336 if (utf)
3337 {
3338 while (t < md->end_subject && !IS_NEWLINE(t))
3339 {
3340 t++;
3341 ACROSSCHAR(t < end_subject, *t, t++);
3342 }
3343 }
3344 else
3345 #endif
3346 while (t < md->end_subject && !IS_NEWLINE(t)) t++;
3347 end_subject = t;
3348 }
3349
3350 /* There are some optimizations that avoid running the match if a known
3351 starting point is not found. However, there is an option that disables
3352 these, for testing and for ensuring that all callouts do actually occur.
3353 The option can be set in the regex by (*NO_START_OPT) or passed in
3354 match-time options. */
3355
3356 if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
3357 {
3358 /* Advance to a known first char. */
3359
3360 if (has_first_char)
3361 {
3362 if (first_char != first_char2)
3363 while (current_subject < end_subject &&
3364 *current_subject != first_char && *current_subject != first_char2)
3365 current_subject++;
3366 else
3367 while (current_subject < end_subject &&
3368 *current_subject != first_char)
3369 current_subject++;
3370 }
3371
3372 /* Or to just after a linebreak for a multiline match if possible */
3373
3374 else if (startline)
3375 {
3376 if (current_subject > md->start_subject + start_offset)
3377 {
3378 #ifdef SUPPORT_UTF
3379 if (utf)
3380 {
3381 while (current_subject < end_subject &&
3382 !WAS_NEWLINE(current_subject))
3383 {
3384 current_subject++;
3385 ACROSSCHAR(current_subject < end_subject, *current_subject,
3386 current_subject++);
3387 }
3388 }
3389 else
3390 #endif
3391 while (current_subject < end_subject && !WAS_NEWLINE(current_subject))
3392 current_subject++;
3393
3394 /* If we have just passed a CR and the newline option is ANY or
3395 ANYCRLF, and we are now at a LF, advance the match position by one
3396 more character. */
3397
3398 if (current_subject[-1] == CHAR_CR &&
3399 (md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
3400 current_subject < end_subject &&
3401 *current_subject == CHAR_NL)
3402 current_subject++;
3403 }
3404 }
3405
3406 /* Or to a non-unique first char after study */
3407
3408 else if (start_bits != NULL)
3409 {
3410 while (current_subject < end_subject)
3411 {
3412 register unsigned int c = *current_subject;
3413 #ifndef COMPILE_PCRE8
3414 if (c > 255) c = 255;
3415 #endif
3416 if ((start_bits[c/8] & (1 << (c&7))) == 0)
3417 {
3418 current_subject++;
3419 #if defined SUPPORT_UTF && defined COMPILE_PCRE8
3420 /* In non 8-bit mode, the iteration will stop for
3421 characters > 255 at the beginning or not stop at all. */
3422 if (utf)
3423 ACROSSCHAR(current_subject < end_subject, *current_subject,
3424 current_subject++);
3425 #endif
3426 }
3427 else break;
3428 }
3429 }
3430 }
3431
3432 /* Restore fudged end_subject */
3433
3434 end_subject = save_end_subject;
3435
3436 /* The following two optimizations are disabled for partial matching or if
3437 disabling is explicitly requested (and of course, by the test above, this
3438 code is not obeyed when restarting after a partial match). */
3439
3440 if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0 &&
3441 (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
3442 {
3443 /* If the pattern was studied, a minimum subject length may be set. This
3444 is a lower bound; there may be no string of exactly that length that matches
3445 the pattern. Although the value is, strictly, in characters, we treat it as
3446 bytes to avoid spending too much time in this optimization. */
3447
3448 if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
3449 (pcre_uint32)(end_subject - current_subject) < study->minlength)
3450 return PCRE_ERROR_NOMATCH;
3451
3452 /* If req_char is set, we know that that character must appear in the
3453 subject for the match to succeed. If the first character is set, req_char
3454 must be later in the subject; otherwise the test starts at the match
3455 point. This optimization can save a huge amount of work in patterns with
3456 nested unlimited repeats that aren't going to match. Writing separate
3457 code for cased/caseless versions makes it go faster, as does using an
3458 autoincrement and backing off on a match.
3459
3460 HOWEVER: when the subject string is very, very long, searching to its end
3461 can take a long time, and give bad performance on quite ordinary
3462 patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
3463 string... so we don't do this when the string is sufficiently long. */
3464
3465 if (has_req_char && end_subject - current_subject < REQ_BYTE_MAX)
3466 {
3467 register PCRE_PUCHAR p = current_subject + (has_first_char? 1:0);
3468
3469 /* We don't need to repeat the search if we haven't yet reached the
3470 place we found it at last time. */
3471
3472 if (p > req_char_ptr)
3473 {
3474 if (req_char != req_char2)
3475 {
3476 while (p < end_subject)
3477 {
3478 register int pp = *p++;
3479 if (pp == req_char || pp == req_char2) { p--; break; }
3480 }
3481 }
3482 else
3483 {
3484 while (p < end_subject)
3485 {
3486 if (*p++ == req_char) { p--; break; }
3487 }
3488 }
3489
3490 /* If we can't find the required character, break the matching loop,
3491 which will cause a return of PCRE_ERROR_NOMATCH. */
3492
3493 if (p >= end_subject) break;
3494
3495 /* If we have found the required character, save the point where we
3496 found it, so that we don't search again next time round the loop if
3497 the start hasn't passed this character yet. */
3498
3499 req_char_ptr = p;
3500 }
3501 }
3502 }
3503 } /* End of optimizations that are done when not restarting */
3504
3505 /* OK, now we can do the business */
3506
3507 md->start_used_ptr = current_subject;
3508 md->recursive = NULL;
3509
3510 rc = internal_dfa_exec(
3511 md, /* fixed match data */
3512 md->start_code, /* this subexpression's code */
3513 current_subject, /* where we currently are */
3514 start_offset, /* start offset in subject */
3515 offsets, /* offset vector */
3516 offsetcount, /* size of same */
3517 workspace, /* workspace vector */
3518 wscount, /* size of same */
3519 0); /* function recurse level */
3520
3521 /* Anything other than "no match" means we are done, always; otherwise, carry
3522 on only if not anchored. */
3523
3524 if (rc != PCRE_ERROR_NOMATCH || anchored) return rc;
3525
3526 /* Advance to the next subject character unless we are at the end of a line
3527 and firstline is set. */
3528
3529 if (firstline && IS_NEWLINE(current_subject)) break;
3530 current_subject++;
3531 #ifdef SUPPORT_UTF
3532 if (utf)
3533 {
3534 ACROSSCHAR(current_subject < end_subject, *current_subject,
3535 current_subject++);
3536 }
3537 #endif
3538 if (current_subject > end_subject) break;
3539
3540 /* If we have just passed a CR and we are now at a LF, and the pattern does
3541 not contain any explicit matches for \r or \n, and the newline option is CRLF
3542 or ANY or ANYCRLF, advance the match position by one more character. */
3543
3544 if (current_subject[-1] == CHAR_CR &&
3545 current_subject < end_subject &&
3546 *current_subject == CHAR_NL &&
3547 (re->flags & PCRE_HASCRORLF) == 0 &&
3548 (md->nltype == NLTYPE_ANY ||
3549 md->nltype == NLTYPE_ANYCRLF ||
3550 md->nllen == 2))
3551 current_subject++;
3552
3553 } /* "Bumpalong" loop */
3554
3555 return PCRE_ERROR_NOMATCH;
3556 }
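/* Illustrative sketch, not part of PCRE. The newline setup in pcre_dfa_exec()
packs the newline convention into a single int: -2 selects ANYCRLF, any other
negative value selects ANY, a value up to 255 is one fixed character, and a
larger value holds two fixed characters as (first << 8) | second. The
standalone code below mirrors that decoding so it can be tried in isolation.
Every name prefixed sketch_ or SK_ is hypothetical and merely stands in for
the NLTYPE_* values and the nltype, nllen and nl fields of dfa_match_data.

```c
#include <assert.h>

// Hypothetical stand-ins for NLTYPE_FIXED, NLTYPE_ANY and NLTYPE_ANYCRLF.
enum sketch_nltype { SK_NLTYPE_FIXED, SK_NLTYPE_ANY, SK_NLTYPE_ANYCRLF };

// Hypothetical stand-in for the newline fields of dfa_match_data.
struct sketch_newline {
  enum sketch_nltype type;
  int len;               // number of characters in a fixed newline, else 0
  unsigned char nl[2];   // the fixed newline characters
};

// Decode a packed newline value in the same way as the if/else chain that
// follows the switch on the newline option bits.
static struct sketch_newline sketch_decode_newline(int newline)
{
  struct sketch_newline r = { SK_NLTYPE_FIXED, 0, { 0, 0 } };
  assert(newline <= 0xffff);          // at most two 8-bit characters
  if (newline == -2) r.type = SK_NLTYPE_ANYCRLF;
  else if (newline < 0) r.type = SK_NLTYPE_ANY;
  else if (newline > 255)
    {
    r.len = 2;
    r.nl[0] = (unsigned char)((newline >> 8) & 255);
    r.nl[1] = (unsigned char)(newline & 255);
    }
  else
    {
    r.len = 1;
    r.nl[0] = (unsigned char)newline;
    }
  return r;
}
```

For instance, sketch_decode_newline((13 << 8) | 10) describes a fixed
two-character newline of CR followed by LF, which is the case the bumpalong
loop detects with md->nllen == 2. */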
3557
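/* Illustrative sketch, not part of PCRE. When a studied pattern has no single
first character, the start-of-match optimization in pcre_dfa_exec() consults
study->start_bits, testing bit (c & 7) of byte (c / 8) for each subject
character c and skipping characters whose bit is clear. The code below shows
that scan for the plain 8-bit, non-UTF case only. The sketch_ names are
hypothetical, and the 32-byte map size is an assumption that follows from the
index c/8 with c no greater than 255.

```c
#include <assert.h>
#include <stddef.h>

// Mark character c as a possible first character in a 32-byte bitmap.
static void sketch_set_start_bit(unsigned char *start_bits, unsigned int c)
{
  assert(start_bits != NULL && c < 256);
  start_bits[c/8] |= (unsigned char)(1u << (c & 7));
}

// Return the offset of the first character in subject[0..length) whose bit is
// set in start_bits, or length if there is no such character.
static size_t sketch_skip_to_start(const unsigned char *subject, size_t length,
  const unsigned char *start_bits)
{
  size_t i = 0;
  assert(subject != NULL && start_bits != NULL);
  while (i < length)
    {
    unsigned int c = subject[i];
    if ((start_bits[c/8] & (1u << (c & 7))) != 0) break;
    i++;
    }
  return i;
}
```

With only 'x' and 'y' marked in the map, scanning "abcxyz" stops at offset 3.
A result equal to length means that no possible starting character was found;
the real function still calls internal_dfa_exec() at that position rather
than returning early. */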
3558 /* End of pcre_dfa_exec.c */
