Contents of /code/trunk/pcre_dfa_exec.c

Revision 916
Wed Feb 15 09:50:53 2012 UTC by ph10
File MIME type: text/plain
File size: 123536 byte(s)
Fix several partial matching bugs for backrefs, \R, \X, and CRLF line endings. 
1 /*************************************************
2 * Perl-Compatible Regular Expressions *
3 *************************************************/
4
5 /* PCRE is a library of functions to support regular expressions whose syntax
6 and semantics are as close as possible to those of the Perl 5 language (but see
7 below for why this module is different).
8
9 Written by Philip Hazel
10 Copyright (c) 1997-2012 University of Cambridge
11
12 -----------------------------------------------------------------------------
13 Redistribution and use in source and binary forms, with or without
14 modification, are permitted provided that the following conditions are met:
15
16 * Redistributions of source code must retain the above copyright notice,
17 this list of conditions and the following disclaimer.
18
19 * Redistributions in binary form must reproduce the above copyright
20 notice, this list of conditions and the following disclaimer in the
21 documentation and/or other materials provided with the distribution.
22
23 * Neither the name of the University of Cambridge nor the names of its
24 contributors may be used to endorse or promote products derived from
25 this software without specific prior written permission.
26
27 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
28 AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
29 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
30 ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
31 LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
32 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
33 SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
34 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
35 CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
36 ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
37 POSSIBILITY OF SUCH DAMAGE.
38 -----------------------------------------------------------------------------
39 */
40
41
42 /* This module contains the external function pcre_dfa_exec(), which is an
43 alternative matching function that uses a sort of DFA algorithm (not a true
44 FSM). This is NOT Perl-compatible, but it has advantages in certain
45 applications. */
46
47
48 /* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
49 the performance of his patterns greatly. I could not use it as it stood, as it
50 was not thread safe, and made assumptions about pattern sizes. Also, it caused
51 test 7 to loop, and test 9 to crash with a segfault.
52
53 The issue is the check for duplicate states, which is done by a simple linear
54 search up the state list. (Grep for "duplicate" below to find the code.) For
55 many patterns, there will never be many states active at one time, so a simple
56 linear search is fine. In patterns that have many active states, it might be a
57 bottleneck. The suggested code used an indexing scheme to remember which states
58 had previously been used for each character, and avoided the linear search when
59 it knew there was no chance of a duplicate. This was implemented when adding
60 states to the state lists.
61
62 I wrote some thread-safe, not-limited code to try something similar at the time
63 of checking for duplicates (instead of when adding states), using index vectors
64 on the stack. It did give a 13% improvement with one specially constructed
65 pattern for certain subject strings, but on other strings and on many of the
66 simpler patterns in the test suite it did worse. The major problem, I think,
67 was the extra time to initialize the index. This had to be done for each call
68 of internal_dfa_exec(). (The supplied patch used a static vector, initialized
69 only once - I suspect this was the cause of the problems with the tests.)
70
71 Overall, I concluded that the gains in some cases did not outweigh the losses
72 in others, so I abandoned this code. */
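
/* Illustration only, not part of the module: a minimal sketch of the simple
linear duplicate search that the note above describes. The names dup_state and
dup_seen_before are hypothetical; the real check is inline in
internal_dfa_exec() and works on stateblock entries. */

```c
/* Hypothetical stand-in for a state entry: only the two fields that the
   duplicate check compares are included. */
typedef struct dup_state {
  int offset;   /* Offset to opcode */
  int count;    /* Count for repeats */
} dup_state;

/* Return 1 if states[i] has the same offset and count as any earlier
   entry in the list, otherwise 0. The cost is linear in i, which is fine
   while few states are active and a possible bottleneck when many are. */
static int
dup_seen_before(const dup_state *states, int i)
{
  int j;
  for (j = 0; j < i; j++)
    {
    if (states[j].offset == states[i].offset &&
        states[j].count == states[i].count)
      return 1;
    }
  return 0;
}
```

/* A caller would skip state i whenever dup_seen_before(states, i) returns 1. */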
73
74
75
76 #ifdef HAVE_CONFIG_H
77 #include "config.h"
78 #endif
79
80 #define NLBLOCK md /* Block containing newline information */
81 #define PSSTART start_subject /* Field containing processed string start */
82 #define PSEND end_subject /* Field containing processed string end */
83
84 #include "pcre_internal.h"
85
86
87 /* For use to indent debugging output */
88
89 #define SP " "
90
91
92 /*************************************************
93 * Code parameters and static tables *
94 *************************************************/
95
96 /* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
97 into others, under special conditions. A gap of 20 between the blocks should be
98 enough. The resulting opcodes don't have to be less than 256 because they are
99 never stored, so we push them well clear of the normal opcodes. */
100
101 #define OP_PROP_EXTRA 300
102 #define OP_EXTUNI_EXTRA 320
103 #define OP_ANYNL_EXTRA 340
104 #define OP_HSPACE_EXTRA 360
105 #define OP_VSPACE_EXTRA 380
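
/* Illustration only, not part of the module: a minimal sketch of how an
offset of this kind can turn a repeat opcode into a distinct switch value that
depends on its argument. All XOP_ and XARG_ names and their numeric values are
made up for the example; they are not the real PCRE opcode numbers. */

```c
/* Made-up opcode and argument values, all small enough to fit in a byte. */
enum { XOP_TYPESTAR = 85, XOP_TYPEPLUS = 87 };
enum { XARG_DIGIT = 7, XARG_PROP = 16, XARG_ANYNL = 17 };

/* Offsets in the same style as the defines above; the blocks are 20 apart. */
#define XOP_PROP_EXTRA  300
#define XOP_ANYNL_EXTRA 340

/* Map (opcode, argument) to the value used for dispatch. The result may be
   greater than 255 because it is only ever held in an int, never stored
   back into the compiled pattern. */
static int
xop_dispatch_value(int codevalue, int arg)
{
  switch (arg)
    {
    case XARG_PROP:  return codevalue + XOP_PROP_EXTRA;
    case XARG_ANYNL: return codevalue + XOP_ANYNL_EXTRA;
    default:         return codevalue;   /* ordinary argument: unchanged */
    }
}
```

/* With these values, a "type star" whose argument is a property test is
dispatched as 385, well clear of any one-byte opcode. */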
106
107
108 /* This table identifies those opcodes that are followed immediately by a
109 character that is to be tested in some way. This makes it possible to
110 centralize the loading of these characters. In the case of Type * etc, the
111 "character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
112 small value. Non-zero values in the table are the offsets from the opcode where
113 the character is to be found. ***NOTE*** If the start of this table is
114 modified, the three tables that follow must also be modified. */
115
116 static const pcre_uint8 coptable[] = {
117 0, /* End */
118 0, 0, 0, 0, 0, /* \A, \G, \K, \B, \b */
119 0, 0, 0, 0, 0, 0, /* \D, \d, \S, \s, \W, \w */
120 0, 0, 0, /* Any, AllAny, Anybyte */
121 0, 0, /* \P, \p */
122 0, 0, 0, 0, 0, /* \R, \H, \h, \V, \v */
123 0, /* \X */
124 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
125 1, /* Char */
126 1, /* Chari */
127 1, /* not */
128 1, /* noti */
129 /* Positive single-char repeats */
130 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
131 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto, minupto */
132 1+IMM2_SIZE, /* exact */
133 1, 1, 1, 1+IMM2_SIZE, /* *+, ++, ?+, upto+ */
134 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
135 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto I, minupto I */
136 1+IMM2_SIZE, /* exact I */
137 1, 1, 1, 1+IMM2_SIZE, /* *+I, ++I, ?+I, upto+I */
138 /* Negative single-char repeats - only for chars < 256 */
139 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
140 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto, minupto */
141 1+IMM2_SIZE, /* NOT exact */
142 1, 1, 1, 1+IMM2_SIZE, /* NOT *+, ++, ?+, upto+ */
143 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
144 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto I, minupto I */
145 1+IMM2_SIZE, /* NOT exact I */
146 1, 1, 1, 1+IMM2_SIZE, /* NOT *+I, ++I, ?+I, upto+I */
147 /* Positive type repeats */
148 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
149 1+IMM2_SIZE, 1+IMM2_SIZE, /* Type upto, minupto */
150 1+IMM2_SIZE, /* Type exact */
151 1, 1, 1, 1+IMM2_SIZE, /* Type *+, ++, ?+, upto+ */
152 /* Character class & ref repeats */
153 0, 0, 0, 0, 0, 0, /* *, *?, +, +?, ?, ?? */
154 0, 0, /* CRRANGE, CRMINRANGE */
155 0, /* CLASS */
156 0, /* NCLASS */
157 0, /* XCLASS - variable length */
158 0, /* REF */
159 0, /* REFI */
160 0, /* RECURSE */
161 0, /* CALLOUT */
162 0, /* Alt */
163 0, /* Ket */
164 0, /* KetRmax */
165 0, /* KetRmin */
166 0, /* KetRpos */
167 0, /* Reverse */
168 0, /* Assert */
169 0, /* Assert not */
170 0, /* Assert behind */
171 0, /* Assert behind not */
172 0, 0, /* ONCE, ONCE_NC */
173 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
174 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
175 0, 0, /* CREF, NCREF */
176 0, 0, /* RREF, NRREF */
177 0, /* DEF */
178 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
179 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
180 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
181 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
182 0, 0 /* CLOSE, SKIPZERO */
183 };
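
/* Illustration only, not part of the module: a minimal sketch of the idea
behind coptable, using a tiny made-up opcode set. The names cop_* and COP_* are
hypothetical, and the sketch assumes one-byte code units and a two-unit repeat
count. */

```c
typedef unsigned char cop_u8;

/* Made-up opcodes for the sketch. */
enum { COP_END = 0, COP_ANY = 1, COP_CHAR = 2, COP_UPTO = 3 };

#define COP_IMM2_SIZE 2   /* assumed size of an inline repeat count */

/* Non-zero entries give the offset from the opcode to its character. */
static const cop_u8 cop_table[] = {
  0,                   /* End  */
  0,                   /* Any  */
  1,                   /* Char */
  1 + COP_IMM2_SIZE    /* upto */
};

/* Load the character that follows the opcode at code, or return -1 if the
   opcode has none. One table lookup replaces a per-opcode decision. */
static int
cop_inline_char(const cop_u8 *code)
{
  cop_u8 off = cop_table[*code];
  return (off > 0)? code[off] : -1;
}
```

/* For the code units { COP_UPTO, 0, 3, 'b' } the lookup skips the two count
units and yields 'b'. */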
184
185 /* This table identifies those opcodes that inspect a character. It is used to
186 remember the fact that a character could have been inspected when the end of
187 the subject is reached. ***NOTE*** If the start of this table is modified, the
188 two tables that follow must also be modified. */
189
190 static const pcre_uint8 poptable[] = {
191 0, /* End */
192 0, 0, 0, 1, 1, /* \A, \G, \K, \B, \b */
193 1, 1, 1, 1, 1, 1, /* \D, \d, \S, \s, \W, \w */
194 1, 1, 1, /* Any, AllAny, Anybyte */
195 1, 1, /* \P, \p */
196 1, 1, 1, 1, 1, /* \R, \H, \h, \V, \v */
197 1, /* \X */
198 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
199 1, /* Char */
200 1, /* Chari */
201 1, /* not */
202 1, /* noti */
203 /* Positive single-char repeats */
204 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
205 1, 1, 1, /* upto, minupto, exact */
206 1, 1, 1, 1, /* *+, ++, ?+, upto+ */
207 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
208 1, 1, 1, /* upto I, minupto I, exact I */
209 1, 1, 1, 1, /* *+I, ++I, ?+I, upto+I */
210 /* Negative single-char repeats - only for chars < 256 */
211 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
212 1, 1, 1, /* NOT upto, minupto, exact */
213 1, 1, 1, 1, /* NOT *+, ++, ?+, upto+ */
214 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
215 1, 1, 1, /* NOT upto I, minupto I, exact I */
216 1, 1, 1, 1, /* NOT *+I, ++I, ?+I, upto+I */
217 /* Positive type repeats */
218 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
219 1, 1, 1, /* Type upto, minupto, exact */
220 1, 1, 1, 1, /* Type *+, ++, ?+, upto+ */
221 /* Character class & ref repeats */
222 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
223 1, 1, /* CRRANGE, CRMINRANGE */
224 1, /* CLASS */
225 1, /* NCLASS */
226 1, /* XCLASS - variable length */
227 0, /* REF */
228 0, /* REFI */
229 0, /* RECURSE */
230 0, /* CALLOUT */
231 0, /* Alt */
232 0, /* Ket */
233 0, /* KetRmax */
234 0, /* KetRmin */
235 0, /* KetRpos */
236 0, /* Reverse */
237 0, /* Assert */
238 0, /* Assert not */
239 0, /* Assert behind */
240 0, /* Assert behind not */
241 0, 0, /* ONCE, ONCE_NC */
242 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
243 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
244 0, 0, /* CREF, NCREF */
245 0, 0, /* RREF, NRREF */
246 0, /* DEF */
247 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
248 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
249 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
250 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
251 0, 0 /* CLOSE, SKIPZERO */
252 };
253
254 /* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W,
255 and \w */
256
257 static const pcre_uint8 toptable1[] = {
258 0, 0, 0, 0, 0, 0,
259 ctype_digit, ctype_digit,
260 ctype_space, ctype_space,
261 ctype_word, ctype_word,
262 0, 0 /* OP_ANY, OP_ALLANY */
263 };
264
265 static const pcre_uint8 toptable2[] = {
266 0, 0, 0, 0, 0, 0,
267 ctype_digit, 0,
268 ctype_space, 0,
269 ctype_word, 0,
270 1, 1 /* OP_ANY, OP_ALLANY */
271 };
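
/* Illustration only, not part of the module: a minimal sketch of how a pair
of tables like toptable1 and toptable2 lets one expression handle both a
positive and a negated type test. It is reduced to \D and \d; the TT_ names,
the bit value, and tt_ctype() are assumptions standing in for the real
character tables. */

```c
#define TT_CTYPE_DIGIT 0x04   /* assumed bit value for "is a digit" */

/* Same order as in the tables above: the negated test comes first. */
enum { TT_NOT_DIGIT = 0, TT_DIGIT = 1 };

static const unsigned char tt_mask[] = { TT_CTYPE_DIGIT, TT_CTYPE_DIGIT };
static const unsigned char tt_flip[] = { TT_CTYPE_DIGIT, 0 };

/* Stand-in for a ctypes[] lookup: returns the type bits for c. */
static unsigned char
tt_ctype(unsigned char c)
{
  return (c >= '0' && c <= '9')? TT_CTYPE_DIGIT : 0;
}

/* The mask selects the relevant type bit; the XOR leaves it alone for the
   positive test and inverts it for the negated one. */
static int
tt_matches(int op, unsigned char c)
{
  return ((tt_ctype(c) & tt_mask[op]) ^ tt_flip[op]) != 0;
}
```

/* The two entries that are 0 in the first table and 1 in the second make the
same expression always true, which suits opcodes that accept any character. */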
272
273
274 /* Structure for holding data about a particular state, which is in effect the
275 current data for an active path through the match tree. It must consist
276 entirely of ints because the working vector we are passed, and which we put
277 these structures in, is a vector of ints. */
278
279 typedef struct stateblock {
280 int offset; /* Offset to opcode */
281 int count; /* Count for repeats */
282 int data; /* Some use extra data */
283 } stateblock;
284
285 #define INTS_PER_STATEBLOCK (sizeof(stateblock)/sizeof(int))
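
/* Illustration only, not part of the module: a minimal sketch of why the
state structure must consist entirely of ints. It shows an int workspace being
carved into two equal state vectors, following the arithmetic used at the
start of internal_dfa_exec(). The ws_ names are hypothetical, and the sketch
assumes the caller passes at least two ints. */

```c
typedef struct ws_stateblock {
  int offset;
  int count;
  int data;
} ws_stateblock;

#define WS_INTS_PER_STATEBLOCK ((int)(sizeof(ws_stateblock)/sizeof(int)))

/* Split workspace[2..wscount-1] into two vectors of equal capacity. The
   first two ints are left alone because they are reserved for restart
   information. Returns the number of states each vector can hold. */
static int
ws_split(int *workspace, int wscount,
  ws_stateblock **active, ws_stateblock **fresh)
{
  int capacity;
  wscount -= 2;
  capacity = (wscount - (wscount % (WS_INTS_PER_STATEBLOCK * 2))) /
    (2 * WS_INTS_PER_STATEBLOCK);
  *active = (ws_stateblock *)(workspace + 2);
  *fresh = *active + capacity;
  return capacity;
}
```

/* With a 20-int workspace and three ints per state, each vector has room for
three states. */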
286
287
288 #ifdef PCRE_DEBUG
289 /*************************************************
290 * Print character string *
291 *************************************************/
292
293 /* Character string printing function for debugging.
294
295 Arguments:
296 p points to string
297 length number of bytes
298 f where to print
299
300 Returns: nothing
301 */
302
303 static void
304 pchars(const pcre_uchar *p, int length, FILE *f)
305 {
306 int c;
307 while (length-- > 0)
308 {
309 if (isprint(c = *(p++)))
310 fprintf(f, "%c", c);
311 else
312 fprintf(f, "\\x%02x", c);
313 }
314 }
315 #endif
316
317
318
319 /*************************************************
320 * Execute a Regular Expression - DFA engine *
321 *************************************************/
322
323 /* This internal function applies a compiled pattern to a subject string,
324 starting at a given point, using a DFA engine. This function is called from the
325 external one, possibly multiple times if the pattern is not anchored. The
326 function calls itself recursively for some kinds of subpattern.
327
328 Arguments:
329 md the match_data block with fixed information
330 this_start_code the opening bracket of this subexpression's code
331 current_subject where we currently are in the subject string
332 start_offset start offset in the subject string
333 offsets vector to contain the matching string offsets
334 offsetcount size of same
335 workspace vector of workspace
336 wscount size of same
337 rlevel function call recursion level
338
339 Returns: > 0 => number of match offset pairs placed in offsets
340 = 0 => offsets overflowed; longest matches are present
341 -1 => failed to match
342 < -1 => some kind of unexpected problem
343
344 The following macros are used for adding states to the two state vectors (one
345 for the current character, one for the following character). */
346
347 #define ADD_ACTIVE(x,y) \
348 if (active_count++ < wscount) \
349 { \
350 next_active_state->offset = (x); \
351 next_active_state->count = (y); \
352 next_active_state++; \
353 DPRINTF(("%.*sADD_ACTIVE(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
354 } \
355 else return PCRE_ERROR_DFA_WSSIZE
356
357 #define ADD_ACTIVE_DATA(x,y,z) \
358 if (active_count++ < wscount) \
359 { \
360 next_active_state->offset = (x); \
361 next_active_state->count = (y); \
362 next_active_state->data = (z); \
363 next_active_state++; \
364 DPRINTF(("%.*sADD_ACTIVE_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
365 } \
366 else return PCRE_ERROR_DFA_WSSIZE
367
368 #define ADD_NEW(x,y) \
369 if (new_count++ < wscount) \
370 { \
371 next_new_state->offset = (x); \
372 next_new_state->count = (y); \
373 next_new_state++; \
374 DPRINTF(("%.*sADD_NEW(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
375 } \
376 else return PCRE_ERROR_DFA_WSSIZE
377
378 #define ADD_NEW_DATA(x,y,z) \
379 if (new_count++ < wscount) \
380 { \
381 next_new_state->offset = (x); \
382 next_new_state->count = (y); \
383 next_new_state->data = (z); \
384 next_new_state++; \
385 DPRINTF(("%.*sADD_NEW_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
386 } \
387 else return PCRE_ERROR_DFA_WSSIZE
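
/* Illustration only, not part of the module: a minimal sketch of the bounded
append that the macros above perform, written as a function so that it can be
read and tested on its own. The sl_ names and the SL_ERR_FULL value are
hypothetical; the macros instead return PCRE_ERROR_DFA_WSSIZE directly from
the enclosing function. */

```c
#define SL_ERR_FULL (-1)   /* made-up error value for the sketch */

typedef struct sl_state {
  int offset;
  int count;
  int data;
} sl_state;

typedef struct sl_list {
  sl_state *next;   /* where the next state will be written */
  int used;         /* number of append attempts so far */
  int capacity;     /* number of states the vector can hold */
} sl_list;

/* Append a state if there is room. As in the macros, the counter is
   incremented by the test itself, whether or not the append succeeds. */
static int
sl_add(sl_list *list, int offset, int count)
{
  if (list->used++ < list->capacity)
    {
    list->next->offset = offset;
    list->next->count = count;
    list->next++;
    return 0;
    }
  return SL_ERR_FULL;
}
```

/* A caller would propagate a non-zero result immediately, which is the effect
the macros get from their embedded return statement. */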
388
389 /* And now, here is the code */
390
391 static int
392 internal_dfa_exec(
393 dfa_match_data *md,
394 const pcre_uchar *this_start_code,
395 const pcre_uchar *current_subject,
396 int start_offset,
397 int *offsets,
398 int offsetcount,
399 int *workspace,
400 int wscount,
401 int rlevel)
402 {
403 stateblock *active_states, *new_states, *temp_states;
404 stateblock *next_active_state, *next_new_state;
405
406 const pcre_uint8 *ctypes, *lcc, *fcc;
407 const pcre_uchar *ptr;
408 const pcre_uchar *end_code, *first_op;
409
410 dfa_recursion_info new_recursive;
411
412 int active_count, new_count, match_count;
413
414 /* Some fields in the md block are frequently referenced, so we load them into
415 independent variables in the hope that this will perform better. */
416
417 const pcre_uchar *start_subject = md->start_subject;
418 const pcre_uchar *end_subject = md->end_subject;
419 const pcre_uchar *start_code = md->start_code;
420
421 #ifdef SUPPORT_UTF
422 BOOL utf = (md->poptions & PCRE_UTF8) != 0;
423 #else
424 BOOL utf = FALSE;
425 #endif
426
427 BOOL reset_could_continue = FALSE;
428
429 rlevel++;
430 offsetcount &= (-2);
431
432 wscount -= 2;
433 wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
434 (2 * INTS_PER_STATEBLOCK);
435
436 DPRINTF(("\n%.*s---------------------\n"
437 "%.*sCall to internal_dfa_exec f=%d\n",
438 rlevel*2-2, SP, rlevel*2-2, SP, rlevel));
439
440 ctypes = md->tables + ctypes_offset;
441 lcc = md->tables + lcc_offset;
442 fcc = md->tables + fcc_offset;
443
444 match_count = PCRE_ERROR_NOMATCH; /* A negative number */
445
446 active_states = (stateblock *)(workspace + 2);
447 next_new_state = new_states = active_states + wscount;
448 new_count = 0;
449
450 first_op = this_start_code + 1 + LINK_SIZE +
451 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
452 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
453 ? IMM2_SIZE:0);
454
455 /* The first thing in any (sub) pattern is a bracket of some sort. Push all
456 the alternative states onto the list, and find out where the end is. This
457 makes it possible to use this function recursively, when we want to stop at a
458 matching internal ket rather than at the end.
459
460 If the first opcode in the first alternative is OP_REVERSE, we are dealing with
461 a backward assertion. In that case, we have to find out the maximum amount to
462 move back, and set up each alternative appropriately. */
463
464 if (*first_op == OP_REVERSE)
465 {
466 int max_back = 0;
467 int gone_back;
468
469 end_code = this_start_code;
470 do
471 {
472 int back = GET(end_code, 2+LINK_SIZE);
473 if (back > max_back) max_back = back;
474 end_code += GET(end_code, 1);
475 }
476 while (*end_code == OP_ALT);
477
478 /* If we can't go back the amount required for the longest lookbehind
479 pattern, go back as far as we can; some alternatives may still be viable. */
480
481 #ifdef SUPPORT_UTF
482 /* In character mode we have to step back character by character */
483
484 if (utf)
485 {
486 for (gone_back = 0; gone_back < max_back; gone_back++)
487 {
488 if (current_subject <= start_subject) break;
489 current_subject--;
490 ACROSSCHAR(current_subject > start_subject, *current_subject, current_subject--);
491 }
492 }
493 else
494 #endif
495
496 /* In byte-mode we can do this quickly. */
497
498 {
499 gone_back = (current_subject - max_back < start_subject)?
500 (int)(current_subject - start_subject) : max_back;
501 current_subject -= gone_back;
502 }
503
504 /* Save the earliest consulted character */
505
506 if (current_subject < md->start_used_ptr)
507 md->start_used_ptr = current_subject;
508
509 /* Now we can process the individual branches. */
510
511 end_code = this_start_code;
512 do
513 {
514 int back = GET(end_code, 2+LINK_SIZE);
515 if (back <= gone_back)
516 {
517 int bstate = (int)(end_code - start_code + 2 + 2*LINK_SIZE);
518 ADD_NEW_DATA(-bstate, 0, gone_back - back);
519 }
520 end_code += GET(end_code, 1);
521 }
522 while (*end_code == OP_ALT);
523 }
524
525 /* This is the code for a "normal" subpattern (not a backward assertion). The
526 start of a whole pattern is always one of these. If we are at the top level,
527 we may be asked to restart matching from the same point that we reached for a
528 previous partial match. We still have to scan through the top-level branches to
529 find the end state. */
530
531 else
532 {
533 end_code = this_start_code;
534
535 /* Restarting */
536
537 if (rlevel == 1 && (md->moptions & PCRE_DFA_RESTART) != 0)
538 {
539 do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
540 new_count = workspace[1];
541 if (!workspace[0])
542 memcpy(new_states, active_states, new_count * sizeof(stateblock));
543 }
544
545 /* Not restarting */
546
547 else
548 {
549 int length = 1 + LINK_SIZE +
550 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
551 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
552 ? IMM2_SIZE:0);
553 do
554 {
555 ADD_NEW((int)(end_code - start_code + length), 0);
556 end_code += GET(end_code, 1);
557 length = 1 + LINK_SIZE;
558 }
559 while (*end_code == OP_ALT);
560 }
561 }
562
563 workspace[0] = 0; /* Bit indicating which vector is current */
564
565 DPRINTF(("%.*sEnd state = %d\n", rlevel*2-2, SP, (int)(end_code - start_code)));
566
567 /* Loop for scanning the subject */
568
569 ptr = current_subject;
570 for (;;)
571 {
572 int i, j;
573 int clen, dlen;
574 unsigned int c, d;
575 int forced_fail = 0;
576 BOOL partial_newline = FALSE;
577 BOOL could_continue = reset_could_continue;
578 reset_could_continue = FALSE;
579
580 /* Make the new state list into the active state list and empty the
581 new state list. */
582
583 temp_states = active_states;
584 active_states = new_states;
585 new_states = temp_states;
586 active_count = new_count;
587 new_count = 0;
588
589 workspace[0] ^= 1; /* Remember for the restarting feature */
590 workspace[1] = active_count;
591
592 #ifdef PCRE_DEBUG
593 printf("%.*sNext character: rest of subject = \"", rlevel*2-2, SP);
594 pchars(ptr, STRLEN_UC(ptr), stdout);
595 printf("\"\n");
596
597 printf("%.*sActive states: ", rlevel*2-2, SP);
598 for (i = 0; i < active_count; i++)
599 printf("%d/%d ", active_states[i].offset, active_states[i].count);
600 printf("\n");
601 #endif
602
603 /* Set the pointers for adding new states */
604
605 next_active_state = active_states + active_count;
606 next_new_state = new_states;
607
608 /* Load the current character from the subject outside the loop, as many
609 different states may want to look at it, and we assume that at least one
610 will. */
611
612 if (ptr < end_subject)
613 {
614 clen = 1; /* Number of bytes in the character */
615 #ifdef SUPPORT_UTF
616 if (utf) { GETCHARLEN(c, ptr, clen); } else
617 #endif /* SUPPORT_UTF */
618 c = *ptr;
619 }
620 else
621 {
622 clen = 0; /* This indicates the end of the subject */
623 c = NOTACHAR; /* This value should never actually be used */
624 }
625
626 /* Scan up the active states and act on each one. The result of an action
627 may be to add more states to the currently active list (e.g. on hitting a
628 parenthesis) or it may be to put states on the new list, for considering
629 when we move the character pointer on. */
630
631 for (i = 0; i < active_count; i++)
632 {
633 stateblock *current_state = active_states + i;
634 BOOL caseless = FALSE;
635 const pcre_uchar *code;
636 int state_offset = current_state->offset;
637 int count, codevalue, rrc;
638
639 #ifdef PCRE_DEBUG
640 printf ("%.*sProcessing state %d c=", rlevel*2-2, SP, state_offset);
641 if (clen == 0) printf("EOL\n");
642 else if (c > 32 && c < 127) printf("'%c'\n", c);
643 else printf("0x%02x\n", c);
644 #endif
645
646 /* A negative offset is a special case meaning "hold off going to this
647 (negated) state until the number of characters in the data field have
648 been skipped". If the could_continue flag was passed over from a previous
649 state, arrange for it to be passed on. */
650
651 if (state_offset < 0)
652 {
653 if (current_state->data > 0)
654 {
655 DPRINTF(("%.*sSkipping this character\n", rlevel*2-2, SP));
656 ADD_NEW_DATA(state_offset, current_state->count,
657 current_state->data - 1);
658 if (could_continue) reset_could_continue = TRUE;
659 continue;
660 }
661 else
662 {
663 current_state->offset = state_offset = -state_offset;
664 }
665 }
666
667 /* Check for a duplicate state with the same count, and skip if found.
668 See the note at the head of this module about the possibility of improving
669 performance here. */
670
671 for (j = 0; j < i; j++)
672 {
673 if (active_states[j].offset == state_offset &&
674 active_states[j].count == current_state->count)
675 {
676 DPRINTF(("%.*sDuplicate state: skipped\n", rlevel*2-2, SP));
677 goto NEXT_ACTIVE_STATE;
678 }
679 }
680
681 /* The state offset is the offset to the opcode */
682
683 code = start_code + state_offset;
684 codevalue = *code;
685
686 /* If this opcode inspects a character, but we are at the end of the
687 subject, remember the fact for use when testing for a partial match. */
688
689 if (clen == 0 && poptable[codevalue] != 0)
690 could_continue = TRUE;
691
692 /* If this opcode is followed by an inline character, load it. It is
693 tempting to test for the presence of a subject character here, but that
694 is wrong, because sometimes zero repetitions of the subject are
695 permitted.
696
697 We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
698 argument that is not a data character - but is always one byte long. We
699 have to take special action to deal with \P, \p, \H, \h, \V, \v and \X in
700 this case. To keep the other cases fast, convert these ones to new opcodes.
701 */
702
703 if (coptable[codevalue] > 0)
704 {
705 dlen = 1;
706 #ifdef SUPPORT_UTF
707 if (utf) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
708 #endif /* SUPPORT_UTF */
709 d = code[coptable[codevalue]];
710 if (codevalue >= OP_TYPESTAR)
711 {
712 switch(d)
713 {
714 case OP_ANYBYTE: return PCRE_ERROR_DFA_UITEM;
715 case OP_NOTPROP:
716 case OP_PROP: codevalue += OP_PROP_EXTRA; break;
717 case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
718 case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
719 case OP_NOT_HSPACE:
720 case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
721 case OP_NOT_VSPACE:
722 case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
723 default: break;
724 }
725 }
726 }
727 else
728 {
729 dlen = 0; /* Not strictly necessary, but compilers moan */
730 d = NOTACHAR; /* if these variables are not set. */
731 }
732
733
734 /* Now process the individual opcodes */
735
736 switch (codevalue)
737 {
738 /* ========================================================================== */
739 /* These cases are never obeyed. This is a fudge that causes a compile-
740 time error if the vectors coptable or poptable, which are indexed by
741 opcode, are not the correct length. It seems to be the only way to do
742 such a check at compile time, as the sizeof() operator does not work
743 in the C preprocessor. */
744
745 case OP_TABLE_LENGTH:
746 case OP_TABLE_LENGTH +
747 ((sizeof(coptable) == OP_TABLE_LENGTH) &&
748 (sizeof(poptable) == OP_TABLE_LENGTH)):
749 break;
750
751 /* ========================================================================== */
752 /* Reached a closing bracket. If not at the end of the pattern, carry
753 on with the next opcode. For repeating opcodes, also add the repeat
754 state. Note that KETRPOS will always be encountered at the end of the
755 subpattern, because the possessive subpattern repeats are always handled
756 using recursive calls. Thus, it never adds any new states.
757
758 At the end of the (sub)pattern, unless we have an empty string and
759 PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
760 start of the subject, save the match data, shifting up all previous
761 matches so we always have the longest first. */
762
763 case OP_KET:
764 case OP_KETRMIN:
765 case OP_KETRMAX:
766 case OP_KETRPOS:
767 if (code != end_code)
768 {
769 ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
770 if (codevalue != OP_KET)
771 {
772 ADD_ACTIVE(state_offset - GET(code, 1), 0);
773 }
774 }
775 else
776 {
777 if (ptr > current_subject ||
778 ((md->moptions & PCRE_NOTEMPTY) == 0 &&
779 ((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
780 current_subject > start_subject + md->start_offset)))
781 {
782 if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
783 else if (match_count > 0 && ++match_count * 2 > offsetcount)
784 match_count = 0;
785 count = ((match_count == 0)? offsetcount : match_count * 2) - 2;
786 if (count > 0) memmove(offsets + 2, offsets, count * sizeof(int));
787 if (offsetcount >= 2)
788 {
789 offsets[0] = (int)(current_subject - start_subject);
790 offsets[1] = (int)(ptr - start_subject);
791 DPRINTF(("%.*sSet matched string = \"%.*s\"\n", rlevel*2-2, SP,
792 offsets[1] - offsets[0], current_subject));
793 }
794 if ((md->moptions & PCRE_DFA_SHORTEST) != 0)
795 {
796 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
797 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel,
798 match_count, rlevel*2-2, SP));
799 return match_count;
800 }
801 }
802 }
803 break;
804
805 /* ========================================================================== */
806 /* These opcodes add to the current list of states without looking
807 at the current character. */
808
809 /*-----------------------------------------------------------------*/
810 case OP_ALT:
811 do { code += GET(code, 1); } while (*code == OP_ALT);
812 ADD_ACTIVE((int)(code - start_code), 0);
813 break;
814
815 /*-----------------------------------------------------------------*/
816 case OP_BRA:
817 case OP_SBRA:
818 do
819 {
820 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
821 code += GET(code, 1);
822 }
823 while (*code == OP_ALT);
824 break;
825
826 /*-----------------------------------------------------------------*/
827 case OP_CBRA:
828 case OP_SCBRA:
829 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE + IMM2_SIZE), 0);
830 code += GET(code, 1);
831 while (*code == OP_ALT)
832 {
833 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
834 code += GET(code, 1);
835 }
836 break;
837
838 /*-----------------------------------------------------------------*/
839 case OP_BRAZERO:
840 case OP_BRAMINZERO:
841 ADD_ACTIVE(state_offset + 1, 0);
842 code += 1 + GET(code, 2);
843 while (*code == OP_ALT) code += GET(code, 1);
844 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
845 break;
846
847 /*-----------------------------------------------------------------*/
848 case OP_SKIPZERO:
849 code += 1 + GET(code, 2);
850 while (*code == OP_ALT) code += GET(code, 1);
851 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
852 break;
853
854 /*-----------------------------------------------------------------*/
855 case OP_CIRC:
856 if (ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0)
857 { ADD_ACTIVE(state_offset + 1, 0); }
858 break;
859
860 /*-----------------------------------------------------------------*/
861 case OP_CIRCM:
862 if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
863 (ptr != end_subject && WAS_NEWLINE(ptr)))
864 { ADD_ACTIVE(state_offset + 1, 0); }
865 break;
866
867 /*-----------------------------------------------------------------*/
868 case OP_EOD:
869 if (ptr >= end_subject)
870 {
871 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
872 could_continue = TRUE;
873 else { ADD_ACTIVE(state_offset + 1, 0); }
874 }
875 break;
876
877 /*-----------------------------------------------------------------*/
878 case OP_SOD:
879 if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
880 break;
881
882 /*-----------------------------------------------------------------*/
883 case OP_SOM:
884 if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
885 break;
886
887
888 /* ========================================================================== */
889 /* These opcodes inspect the next subject character, and sometimes
890 the previous one as well, but do not have an argument. The variable
891 clen contains the length of the current character and is zero if we are
892 at the end of the subject. */
893
894 /*-----------------------------------------------------------------*/
895 case OP_ANY:
896 if (clen > 0 && !IS_NEWLINE(ptr))
897 { ADD_NEW(state_offset + 1, 0); }
898 break;
899
900 /*-----------------------------------------------------------------*/
901 case OP_ALLANY:
902 if (clen > 0)
903 { ADD_NEW(state_offset + 1, 0); }
904 break;
905
906 /*-----------------------------------------------------------------*/
907 case OP_EODN:
908 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
909 could_continue = TRUE;
910 else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
911 { ADD_ACTIVE(state_offset + 1, 0); }
912 break;
913
914 /*-----------------------------------------------------------------*/
915 case OP_DOLL:
916 if ((md->moptions & PCRE_NOTEOL) == 0)
917 {
918 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
919 could_continue = TRUE;
920 else if (clen == 0 ||
921 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
922 (ptr == end_subject - md->nllen)
923 ))
924 { ADD_ACTIVE(state_offset + 1, 0); }
925 else if (ptr + 1 >= md->end_subject &&
926 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
927 NLBLOCK->nltype == NLTYPE_FIXED &&
928 NLBLOCK->nllen == 2 &&
929 c == NLBLOCK->nl[0])
930 {
931 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
932 {
933 reset_could_continue = TRUE;
934 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
935 }
936 else could_continue = partial_newline = TRUE;
937 }
938 }
939 break;
940
941 /*-----------------------------------------------------------------*/
942 case OP_DOLLM:
943 if ((md->moptions & PCRE_NOTEOL) == 0)
944 {
945 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
946 could_continue = TRUE;
947 else if (clen == 0 ||
948 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
949 { ADD_ACTIVE(state_offset + 1, 0); }
950 else if (ptr + 1 >= md->end_subject &&
951 (md->moptions & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) != 0 &&
952 NLBLOCK->nltype == NLTYPE_FIXED &&
953 NLBLOCK->nllen == 2 &&
954 c == NLBLOCK->nl[0])
955 {
956 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
957 {
958 reset_could_continue = TRUE;
959 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
960 }
961 else could_continue = partial_newline = TRUE;
962 }
963 }
964 else if (IS_NEWLINE(ptr))
965 { ADD_ACTIVE(state_offset + 1, 0); }
966 break;
967
968 /*-----------------------------------------------------------------*/
969
970 case OP_DIGIT:
971 case OP_WHITESPACE:
972 case OP_WORDCHAR:
973 if (clen > 0 && c < 256 &&
974 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
975 { ADD_NEW(state_offset + 1, 0); }
976 break;
977
978 /*-----------------------------------------------------------------*/
979 case OP_NOT_DIGIT:
980 case OP_NOT_WHITESPACE:
981 case OP_NOT_WORDCHAR:
982 if (clen > 0 && (c >= 256 ||
983 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
984 { ADD_NEW(state_offset + 1, 0); }
985 break;
986
987 /*-----------------------------------------------------------------*/
988 case OP_WORD_BOUNDARY:
989 case OP_NOT_WORD_BOUNDARY:
990 {
991 int left_word, right_word;
992
993 if (ptr > start_subject)
994 {
995 const pcre_uchar *temp = ptr - 1;
996 if (temp < md->start_used_ptr) md->start_used_ptr = temp;
997 #ifdef SUPPORT_UTF
998 if (utf) { BACKCHAR(temp); }
999 #endif
1000 GETCHARTEST(d, temp);
1001 #ifdef SUPPORT_UCP
1002 if ((md->poptions & PCRE_UCP) != 0)
1003 {
1004 if (d == '_') left_word = TRUE; else
1005 {
1006 int cat = UCD_CATEGORY(d);
1007 left_word = (cat == ucp_L || cat == ucp_N);
1008 }
1009 }
1010 else
1011 #endif
1012 left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
1013 }
1014 else left_word = FALSE;
1015
1016 if (clen > 0)
1017 {
1018 #ifdef SUPPORT_UCP
1019 if ((md->poptions & PCRE_UCP) != 0)
1020 {
1021 if (c == '_') right_word = TRUE; else
1022 {
1023 int cat = UCD_CATEGORY(c);
1024 right_word = (cat == ucp_L || cat == ucp_N);
1025 }
1026 }
1027 else
1028 #endif
1029 right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
1030 }
1031 else right_word = FALSE;
1032
1033 if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
1034 { ADD_ACTIVE(state_offset + 1, 0); }
1035 }
1036 break;
1037
1038
1039 /*-----------------------------------------------------------------*/
1040 /* Check the next character by Unicode property. We will get here only
1041 if the support is in the binary; otherwise a compile-time error occurs.
1042 */
1043
1044 #ifdef SUPPORT_UCP
1045 case OP_PROP:
1046 case OP_NOTPROP:
1047 if (clen > 0)
1048 {
1049 BOOL OK;
1050 const ucd_record * prop = GET_UCD(c);
1051 switch(code[1])
1052 {
1053 case PT_ANY:
1054 OK = TRUE;
1055 break;
1056
1057 case PT_LAMP:
1058 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1059 prop->chartype == ucp_Lt;
1060 break;
1061
1062 case PT_GC:
1063 OK = PRIV(ucp_gentype)[prop->chartype] == code[2];
1064 break;
1065
1066 case PT_PC:
1067 OK = prop->chartype == code[2];
1068 break;
1069
1070 case PT_SC:
1071 OK = prop->script == code[2];
1072 break;
1073
1074 /* These are specials for combination cases. */
1075
1076 case PT_ALNUM:
1077 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1078 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1079 break;
1080
1081 case PT_SPACE: /* Perl space */
1082 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1083 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1084 break;
1085
1086 case PT_PXSPACE: /* POSIX space */
1087 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1088 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1089 c == CHAR_FF || c == CHAR_CR;
1090 break;
1091
1092 case PT_WORD:
1093 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1094 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1095 c == CHAR_UNDERSCORE;
1096 break;
1097
1098 /* Should never occur, but keep compilers from grumbling. */
1099
1100 default:
1101 OK = codevalue != OP_PROP;
1102 break;
1103 }
1104
1105 if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1106 }
1107 break;
1108 #endif
1109
1110
1111
1112 /* ========================================================================== */
1113 /* These opcodes likewise inspect the subject character, but have an
1114 argument that is not a data character. It is one of these opcodes:
1115 OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_WHITESPACE,
1116 OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
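/* For example, \d+ is compiled as OP_TYPEPLUS followed by OP_DIGIT, so d is
OP_DIGIT and the opcode after the repeat is at state_offset + 2. As the cases
below show, the counted forms (OP_TYPEEXACT, OP_TYPEUPTO and its variants)
carry a repeat count of IMM2_SIZE code units between the opcode and its
argument, which is why they continue at state_offset + 1 + IMM2_SIZE + 1. */
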
1117
1118 case OP_TYPEPLUS:
1119 case OP_TYPEMINPLUS:
1120 case OP_TYPEPOSPLUS:
1121 count = current_state->count; /* Already matched */
1122 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1123 if (clen > 0)
1124 {
1125 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1126 (c < 256 &&
1127 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1128 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1129 {
1130 if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1131 {
1132 active_count--; /* Remove non-match possibility */
1133 next_active_state--;
1134 }
1135 count++;
1136 ADD_NEW(state_offset, count);
1137 }
1138 }
1139 break;
1140
1141 /*-----------------------------------------------------------------*/
1142 case OP_TYPEQUERY:
1143 case OP_TYPEMINQUERY:
1144 case OP_TYPEPOSQUERY:
1145 ADD_ACTIVE(state_offset + 2, 0);
1146 if (clen > 0)
1147 {
1148 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1149 (c < 256 &&
1150 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1151 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1152 {
1153 if (codevalue == OP_TYPEPOSQUERY)
1154 {
1155 active_count--; /* Remove non-match possibility */
1156 next_active_state--;
1157 }
1158 ADD_NEW(state_offset + 2, 0);
1159 }
1160 }
1161 break;
1162
1163 /*-----------------------------------------------------------------*/
1164 case OP_TYPESTAR:
1165 case OP_TYPEMINSTAR:
1166 case OP_TYPEPOSSTAR:
1167 ADD_ACTIVE(state_offset + 2, 0);
1168 if (clen > 0)
1169 {
1170 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1171 (c < 256 &&
1172 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1173 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1174 {
1175 if (codevalue == OP_TYPEPOSSTAR)
1176 {
1177 active_count--; /* Remove non-match possibility */
1178 next_active_state--;
1179 }
1180 ADD_NEW(state_offset, 0);
1181 }
1182 }
1183 break;
1184
1185 /*-----------------------------------------------------------------*/
1186 case OP_TYPEEXACT:
1187 count = current_state->count; /* Number already matched */
1188 if (clen > 0)
1189 {
1190 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1191 (c < 256 &&
1192 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1193 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1194 {
1195 if (++count >= GET2(code, 1))
1196 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 1, 0); }
1197 else
1198 { ADD_NEW(state_offset, count); }
1199 }
1200 }
1201 break;
1202
1203 /*-----------------------------------------------------------------*/
1204 case OP_TYPEUPTO:
1205 case OP_TYPEMINUPTO:
1206 case OP_TYPEPOSUPTO:
1207 ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0);
1208 count = current_state->count; /* Number already matched */
1209 if (clen > 0)
1210 {
1211 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1212 (c < 256 &&
1213 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1214 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1215 {
1216 if (codevalue == OP_TYPEPOSUPTO)
1217 {
1218 active_count--; /* Remove non-match possibility */
1219 next_active_state--;
1220 }
1221 if (++count >= GET2(code, 1))
1222 { ADD_NEW(state_offset + 2 + IMM2_SIZE, 0); }
1223 else
1224 { ADD_NEW(state_offset, count); }
1225 }
1226 }
1227 break;
1228
1229 /* ========================================================================== */
1230 /* These are virtual opcodes that are used when something like OP_TYPEPLUS
1231 has OP_PROP, OP_NOTPROP, OP_ANYNL, OP_EXTUNI, OP_VSPACE, OP_NOT_VSPACE,
1232 OP_HSPACE, or OP_NOT_HSPACE as its argument. Handling them separately keeps
1233 the code above fast for the other cases. The argument is in the d variable. */
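/* For example, \p{Lu}+ is compiled as OP_TYPEPLUS, OP_PROP, a property type
and a property value, so the property cases below read the type from code[2]
and the value from code[3], and the opcode after the repeat is at
state_offset + 4. */
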
1234
1235 #ifdef SUPPORT_UCP
1236 case OP_PROP_EXTRA + OP_TYPEPLUS:
1237 case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1238 case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1239 count = current_state->count; /* Already matched */
1240 if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1241 if (clen > 0)
1242 {
1243 BOOL OK;
1244 const ucd_record * prop = GET_UCD(c);
1245 switch(code[2])
1246 {
1247 case PT_ANY:
1248 OK = TRUE;
1249 break;
1250
1251 case PT_LAMP:
1252 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1253 prop->chartype == ucp_Lt;
1254 break;
1255
1256 case PT_GC:
1257 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1258 break;
1259
1260 case PT_PC:
1261 OK = prop->chartype == code[3];
1262 break;
1263
1264 case PT_SC:
1265 OK = prop->script == code[3];
1266 break;
1267
1268 /* These are specials for combination cases. */
1269
1270 case PT_ALNUM:
1271 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1272 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1273 break;
1274
1275 case PT_SPACE: /* Perl space */
1276 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1277 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1278 break;
1279
1280 case PT_PXSPACE: /* POSIX space */
1281 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1282 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1283 c == CHAR_FF || c == CHAR_CR;
1284 break;
1285
1286 case PT_WORD:
1287 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1288 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1289 c == CHAR_UNDERSCORE;
1290 break;
1291
1292 /* Should never occur, but keep compilers from grumbling. */
1293
1294 default:
1295 OK = d != OP_PROP;
1296 break;
1297 }
1298
1299 if (OK == (d == OP_PROP))
1300 {
1301 if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1302 {
1303 active_count--; /* Remove non-match possibility */
1304 next_active_state--;
1305 }
1306 count++;
1307 ADD_NEW(state_offset, count);
1308 }
1309 }
1310 break;
1311
1312 /*-----------------------------------------------------------------*/
1313 case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1314 case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1315 case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1316 count = current_state->count; /* Already matched */
1317 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1318 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1319 {
1320 const pcre_uchar *nptr = ptr + clen;
1321 int ncount = 0;
1322 if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1323 {
1324 active_count--; /* Remove non-match possibility */
1325 next_active_state--;
1326 }
1327 while (nptr < end_subject)
1328 {
1329 int nd;
1330 int ndlen = 1;
1331 GETCHARLEN(nd, nptr, ndlen);
1332 if (UCD_CATEGORY(nd) != ucp_M) break;
1333 ncount++;
1334 nptr += ndlen;
1335 }
1336 count++;
1337 ADD_NEW_DATA(-state_offset, count, ncount);
1338 }
1339 break;
1340 #endif
1341
1342 /*-----------------------------------------------------------------*/
1343 case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1344 case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1345 case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1346 count = current_state->count; /* Already matched */
1347 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1348 if (clen > 0)
1349 {
1350 int ncount = 0;
1351 switch (c)
1352 {
1353 case 0x000b:
1354 case 0x000c:
1355 case 0x0085:
1356 case 0x2028:
1357 case 0x2029:
1358 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1359 goto ANYNL01;
1360
1361 case 0x000d:
1362 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1363 /* Fall through */
1364
1365 ANYNL01:
1366 case 0x000a:
1367 if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1368 {
1369 active_count--; /* Remove non-match possibility */
1370 next_active_state--;
1371 }
1372 count++;
1373 ADD_NEW_DATA(-state_offset, count, ncount);
1374 break;
1375
1376 default:
1377 break;
1378 }
1379 }
1380 break;
1381
1382 /*-----------------------------------------------------------------*/
1383 case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1384 case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1385 case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1386 count = current_state->count; /* Already matched */
1387 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1388 if (clen > 0)
1389 {
1390 BOOL OK;
1391 switch (c)
1392 {
1393 case 0x000a:
1394 case 0x000b:
1395 case 0x000c:
1396 case 0x000d:
1397 case 0x0085:
1398 case 0x2028:
1399 case 0x2029:
1400 OK = TRUE;
1401 break;
1402
1403 default:
1404 OK = FALSE;
1405 break;
1406 }
1407
1408 if (OK == (d == OP_VSPACE))
1409 {
1410 if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1411 {
1412 active_count--; /* Remove non-match possibility */
1413 next_active_state--;
1414 }
1415 count++;
1416 ADD_NEW_DATA(-state_offset, count, 0);
1417 }
1418 }
1419 break;
1420
1421 /*-----------------------------------------------------------------*/
1422 case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1423 case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1424 case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1425 count = current_state->count; /* Already matched */
1426 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1427 if (clen > 0)
1428 {
1429 BOOL OK;
1430 switch (c)
1431 {
1432 case 0x09: /* HT */
1433 case 0x20: /* SPACE */
1434 case 0xa0: /* NBSP */
1435 case 0x1680: /* OGHAM SPACE MARK */
1436 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1437 case 0x2000: /* EN QUAD */
1438 case 0x2001: /* EM QUAD */
1439 case 0x2002: /* EN SPACE */
1440 case 0x2003: /* EM SPACE */
1441 case 0x2004: /* THREE-PER-EM SPACE */
1442 case 0x2005: /* FOUR-PER-EM SPACE */
1443 case 0x2006: /* SIX-PER-EM SPACE */
1444 case 0x2007: /* FIGURE SPACE */
1445 case 0x2008: /* PUNCTUATION SPACE */
1446 case 0x2009: /* THIN SPACE */
1447 case 0x200A: /* HAIR SPACE */
1448 case 0x202f: /* NARROW NO-BREAK SPACE */
1449 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1450 case 0x3000: /* IDEOGRAPHIC SPACE */
1451 OK = TRUE;
1452 break;
1453
1454 default:
1455 OK = FALSE;
1456 break;
1457 }
1458
1459 if (OK == (d == OP_HSPACE))
1460 {
1461 if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1462 {
1463 active_count--; /* Remove non-match possibility */
1464 next_active_state--;
1465 }
1466 count++;
1467 ADD_NEW_DATA(-state_offset, count, 0);
1468 }
1469 }
1470 break;
1471
1472 /*-----------------------------------------------------------------*/
1473 #ifdef SUPPORT_UCP
1474 case OP_PROP_EXTRA + OP_TYPEQUERY:
1475 case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1476 case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
1477 count = 4;
1478 goto QS1;
1479
1480 case OP_PROP_EXTRA + OP_TYPESTAR:
1481 case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1482 case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1483 count = 0;
1484
1485 QS1:
1486
1487 ADD_ACTIVE(state_offset + 4, 0);
1488 if (clen > 0)
1489 {
1490 BOOL OK;
1491 const ucd_record * prop = GET_UCD(c);
1492 switch(code[2])
1493 {
1494 case PT_ANY:
1495 OK = TRUE;
1496 break;
1497
1498 case PT_LAMP:
1499 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1500 prop->chartype == ucp_Lt;
1501 break;
1502
1503 case PT_GC:
1504 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1505 break;
1506
1507 case PT_PC:
1508 OK = prop->chartype == code[3];
1509 break;
1510
1511 case PT_SC:
1512 OK = prop->script == code[3];
1513 break;
1514
1515 /* These are specials for combination cases. */
1516
1517 case PT_ALNUM:
1518 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1519 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1520 break;
1521
1522 case PT_SPACE: /* Perl space */
1523 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1524 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1525 break;
1526
1527 case PT_PXSPACE: /* POSIX space */
1528 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1529 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1530 c == CHAR_FF || c == CHAR_CR;
1531 break;
1532
1533 case PT_WORD:
1534 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1535 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1536 c == CHAR_UNDERSCORE;
1537 break;
1538
1539 /* Should never occur, but keep compilers from grumbling. */
1540
1541 default:
1542 OK = d != OP_PROP;
1543 break;
1544 }
1545
1546 if (OK == (d == OP_PROP))
1547 {
1548 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1549 codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1550 {
1551 active_count--; /* Remove non-match possibility */
1552 next_active_state--;
1553 }
1554 ADD_NEW(state_offset + count, 0);
1555 }
1556 }
1557 break;
1558
1559 /*-----------------------------------------------------------------*/
1560 case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1561 case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1562 case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1563 count = 2;
1564 goto QS2;
1565
1566 case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1567 case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1568 case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1569 count = 0;
1570
1571 QS2:
1572
1573 ADD_ACTIVE(state_offset + 2, 0);
1574 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1575 {
1576 const pcre_uchar *nptr = ptr + clen;
1577 int ncount = 0;
1578 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1579 codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1580 {
1581 active_count--; /* Remove non-match possibility */
1582 next_active_state--;
1583 }
1584 while (nptr < end_subject)
1585 {
1586 int nd;
1587 int ndlen = 1;
1588 GETCHARLEN(nd, nptr, ndlen);
1589 if (UCD_CATEGORY(nd) != ucp_M) break;
1590 ncount++;
1591 nptr += ndlen;
1592 }
1593 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1594 }
1595 break;
1596 #endif
1597
1598 /*-----------------------------------------------------------------*/
1599 case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1600 case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1601 case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1602 count = 2;
1603 goto QS3;
1604
1605 case OP_ANYNL_EXTRA + OP_TYPESTAR:
1606 case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1607 case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1608 count = 0;
1609
1610 QS3:
1611 ADD_ACTIVE(state_offset + 2, 0);
1612 if (clen > 0)
1613 {
1614 int ncount = 0;
1615 switch (c)
1616 {
1617 case 0x000b:
1618 case 0x000c:
1619 case 0x0085:
1620 case 0x2028:
1621 case 0x2029:
1622 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1623 goto ANYNL02;
1624
1625 case 0x000d:
1626 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1627 /* Fall through */
1628
1629 ANYNL02:
1630 case 0x000a:
1631 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1632 codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1633 {
1634 active_count--; /* Remove non-match possibility */
1635 next_active_state--;
1636 }
1637 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1638 break;
1639
1640 default:
1641 break;
1642 }
1643 }
1644 break;
1645
1646 /*-----------------------------------------------------------------*/
1647 case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1648 case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1649 case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1650 count = 2;
1651 goto QS4;
1652
1653 case OP_VSPACE_EXTRA + OP_TYPESTAR:
1654 case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1655 case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1656 count = 0;
1657
1658 QS4:
1659 ADD_ACTIVE(state_offset + 2, 0);
1660 if (clen > 0)
1661 {
1662 BOOL OK;
1663 switch (c)
1664 {
1665 case 0x000a:
1666 case 0x000b:
1667 case 0x000c:
1668 case 0x000d:
1669 case 0x0085:
1670 case 0x2028:
1671 case 0x2029:
1672 OK = TRUE;
1673 break;
1674
1675 default:
1676 OK = FALSE;
1677 break;
1678 }
1679 if (OK == (d == OP_VSPACE))
1680 {
1681 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1682 codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1683 {
1684 active_count--; /* Remove non-match possibility */
1685 next_active_state--;
1686 }
1687 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1688 }
1689 }
1690 break;
1691
1692 /*-----------------------------------------------------------------*/
1693 case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1694 case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1695 case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1696 count = 2;
1697 goto QS5;
1698
1699 case OP_HSPACE_EXTRA + OP_TYPESTAR:
1700 case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1701 case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1702 count = 0;
1703
1704 QS5:
1705 ADD_ACTIVE(state_offset + 2, 0);
1706 if (clen > 0)
1707 {
1708 BOOL OK;
1709 switch (c)
1710 {
1711 case 0x09: /* HT */
1712 case 0x20: /* SPACE */
1713 case 0xa0: /* NBSP */
1714 case 0x1680: /* OGHAM SPACE MARK */
1715 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1716 case 0x2000: /* EN QUAD */
1717 case 0x2001: /* EM QUAD */
1718 case 0x2002: /* EN SPACE */
1719 case 0x2003: /* EM SPACE */
1720 case 0x2004: /* THREE-PER-EM SPACE */
1721 case 0x2005: /* FOUR-PER-EM SPACE */
1722 case 0x2006: /* SIX-PER-EM SPACE */
1723 case 0x2007: /* FIGURE SPACE */
1724 case 0x2008: /* PUNCTUATION SPACE */
1725 case 0x2009: /* THIN SPACE */
1726 case 0x200A: /* HAIR SPACE */
1727 case 0x202f: /* NARROW NO-BREAK SPACE */
1728 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1729 case 0x3000: /* IDEOGRAPHIC SPACE */
1730 OK = TRUE;
1731 break;
1732
1733 default:
1734 OK = FALSE;
1735 break;
1736 }
1737
1738 if (OK == (d == OP_HSPACE))
1739 {
1740 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1741 codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1742 {
1743 active_count--; /* Remove non-match possibility */
1744 next_active_state--;
1745 }
1746 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1747 }
1748 }
1749 break;
1750
1751 /*-----------------------------------------------------------------*/
1752 #ifdef SUPPORT_UCP
1753 case OP_PROP_EXTRA + OP_TYPEEXACT:
1754 case OP_PROP_EXTRA + OP_TYPEUPTO:
1755 case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1756 case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1757 if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1758 { ADD_ACTIVE(state_offset + 1 + IMM2_SIZE + 3, 0); }
1759 count = current_state->count; /* Number already matched */
1760 if (clen > 0)
1761 {
1762 BOOL OK;
1763 const ucd_record * prop = GET_UCD(c);
1764 switch(code[1 + IMM2_SIZE + 1])
1765 {
1766 case PT_ANY:
1767 OK = TRUE;
1768 break;
1769
1770 case PT_LAMP:
1771 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1772 prop->chartype == ucp_Lt;
1773 break;
1774
1775 case PT_GC:
1776 OK = PRIV(ucp_gentype)[prop->chartype] == code[1 + IMM2_SIZE + 2];
1777 break;
1778
1779 case PT_PC:
1780 OK = prop->chartype == code[1 + IMM2_SIZE + 2];
1781 break;
1782
1783 case PT_SC:
1784 OK = prop->script == code[1 + IMM2_SIZE + 2];
1785 break;
1786
1787 /* These are specials for combination cases. */
1788
1789 case PT_ALNUM:
1790 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1791 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1792 break;
1793
1794 case PT_SPACE: /* Perl space */
1795 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1796 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1797 break;
1798
1799 case PT_PXSPACE: /* POSIX space */
1800 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1801 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1802 c == CHAR_FF || c == CHAR_CR;
1803 break;
1804
1805 case PT_WORD:
1806 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1807 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1808 c == CHAR_UNDERSCORE;
1809 break;
1810
1811 /* Should never occur, but keep compilers from grumbling. */
1812
1813 default:
1814 OK = d != OP_PROP;
1815 break;
1816 }
1817
1818 if (OK == (d == OP_PROP))
1819 {
1820 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
1821 {
1822 active_count--; /* Remove non-match possibility */
1823 next_active_state--;
1824 }
1825 if (++count >= GET2(code, 1))
1826 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 3, 0); }
1827 else
1828 { ADD_NEW(state_offset, count); }
1829 }
1830 }
1831 break;
1832
1833 /*-----------------------------------------------------------------*/
1834 case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
1835 case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
1836 case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
1837 case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
1838 if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
1839 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1840 count = current_state->count; /* Number already matched */
1841 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1842 {
1843 const pcre_uchar *nptr = ptr + clen;
1844 int ncount = 0;
1845 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
1846 {
1847 active_count--; /* Remove non-match possibility */
1848 next_active_state--;
1849 }
1850 while (nptr < end_subject)
1851 {
1852 int nd;
1853 int ndlen = 1;
1854 GETCHARLEN(nd, nptr, ndlen);
1855 if (UCD_CATEGORY(nd) != ucp_M) break;
1856 ncount++;
1857 nptr += ndlen;
1858 }
1859 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
1860 reset_could_continue = TRUE;
1861 if (++count >= GET2(code, 1))
1862 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1863 else
1864 { ADD_NEW_DATA(-state_offset, count, ncount); }
1865 }
1866 break;
1867 #endif
1868
1869 /*-----------------------------------------------------------------*/
1870 case OP_ANYNL_EXTRA + OP_TYPEEXACT:
1871 case OP_ANYNL_EXTRA + OP_TYPEUPTO:
1872 case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
1873 case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
1874 if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
1875 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1876 count = current_state->count; /* Number already matched */
1877 if (clen > 0)
1878 {
1879 int ncount = 0;
1880 switch (c)
1881 {
1882 case 0x000b:
1883 case 0x000c:
1884 case 0x0085:
1885 case 0x2028:
1886 case 0x2029:
1887 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1888 goto ANYNL03;
1889
1890 case 0x000d:
1891 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1892 /* Fall through */
1893
1894 ANYNL03:
1895 case 0x000a:
1896 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
1897 {
1898 active_count--; /* Remove non-match possibility */
1899 next_active_state--;
1900 }
1901 if (++count >= GET2(code, 1))
1902 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1903 else
1904 { ADD_NEW_DATA(-state_offset, count, ncount); }
1905 break;
1906
1907 default:
1908 break;
1909 }
1910 }
1911 break;
1912
1913 /*-----------------------------------------------------------------*/
1914 case OP_VSPACE_EXTRA + OP_TYPEEXACT:
1915 case OP_VSPACE_EXTRA + OP_TYPEUPTO:
1916 case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
1917 case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
1918 if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
1919 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1920 count = current_state->count; /* Number already matched */
1921 if (clen > 0)
1922 {
1923 BOOL OK;
1924 switch (c)
1925 {
1926 case 0x000a:
1927 case 0x000b:
1928 case 0x000c:
1929 case 0x000d:
1930 case 0x0085:
1931 case 0x2028:
1932 case 0x2029:
1933 OK = TRUE;
1934 break;
1935
1936 default:
1937 OK = FALSE;
1938 }
1939
1940 if (OK == (d == OP_VSPACE))
1941 {
1942 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
1943 {
1944 active_count--; /* Remove non-match possibility */
1945 next_active_state--;
1946 }
1947 if (++count >= GET2(code, 1))
1948 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
1949 else
1950 { ADD_NEW_DATA(-state_offset, count, 0); }
1951 }
1952 }
1953 break;
1954
1955 /*-----------------------------------------------------------------*/
1956 case OP_HSPACE_EXTRA + OP_TYPEEXACT:
1957 case OP_HSPACE_EXTRA + OP_TYPEUPTO:
1958 case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
1959 case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
1960 if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
1961 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1962 count = current_state->count; /* Number already matched */
1963 if (clen > 0)
1964 {
1965 BOOL OK;
1966 switch (c)
1967 {
1968 case 0x09: /* HT */
1969 case 0x20: /* SPACE */
1970 case 0xa0: /* NBSP */
1971 case 0x1680: /* OGHAM SPACE MARK */
1972 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1973 case 0x2000: /* EN QUAD */
1974 case 0x2001: /* EM QUAD */
1975 case 0x2002: /* EN SPACE */
1976 case 0x2003: /* EM SPACE */
1977 case 0x2004: /* THREE-PER-EM SPACE */
1978 case 0x2005: /* FOUR-PER-EM SPACE */
1979 case 0x2006: /* SIX-PER-EM SPACE */
1980 case 0x2007: /* FIGURE SPACE */
1981 case 0x2008: /* PUNCTUATION SPACE */
1982 case 0x2009: /* THIN SPACE */
1983 case 0x200A: /* HAIR SPACE */
1984 case 0x202f: /* NARROW NO-BREAK SPACE */
1985 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1986 case 0x3000: /* IDEOGRAPHIC SPACE */
1987 OK = TRUE;
1988 break;
1989
1990 default:
1991 OK = FALSE;
1992 break;
1993 }
1994
1995 if (OK == (d == OP_HSPACE))
1996 {
1997 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
1998 {
1999 active_count--; /* Remove non-match possibility */
2000 next_active_state--;
2001 }
2002 if (++count >= GET2(code, 1))
2003 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
2004 else
2005 { ADD_NEW_DATA(-state_offset, count, 0); }
2006 }
2007 }
2008 break;
2009
2010 /* ========================================================================== */
2011 /* These opcodes are followed by a character that is usually compared
2012 to the current subject character; it is loaded into d. We still get
2013 here even if there is no subject character, because in some cases zero
2014 repetitions are permitted. */
2015
2016 /*-----------------------------------------------------------------*/
2017 case OP_CHAR:
2018 if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
2019 break;
2020
2021 /*-----------------------------------------------------------------*/
2022 case OP_CHARI:
2023 if (clen == 0) break;
2024
2025 #ifdef SUPPORT_UTF
2026 if (utf)
2027 {
2028 if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
2029 {
2030 unsigned int othercase;
2031 if (c < 128)
2032 othercase = fcc[c];
2033 else
2034 /* If we have Unicode property support, we can use it to test the
2035 other case of the character. */
2036 #ifdef SUPPORT_UCP
2037 othercase = UCD_OTHERCASE(c);
2038 #else
2039 othercase = NOTACHAR;
2040 #endif
2041
2042 if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
2043 }
2044 }
2045 else
2046 #endif /* SUPPORT_UTF */
2047 /* Not UTF mode */
2048 {
2049 if (TABLE_GET(c, lcc, c) == TABLE_GET(d, lcc, d))
2050 { ADD_NEW(state_offset + 2, 0); }
2051 }
2052 break;
2053
2054
2055 #ifdef SUPPORT_UCP
2056 /*-----------------------------------------------------------------*/
2057 /* This is a tricky one because it can match more than one character.
2058 Find out how many characters to skip, and then set up a negative state
2059 to wait for them to pass before continuing. */
2060
2061 case OP_EXTUNI:
2062 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
2063 {
2064 const pcre_uchar *nptr = ptr + clen;
2065 int ncount = 0;
2066 while (nptr < end_subject)
2067 {
2068 int nclen = 1;
2069 GETCHARLEN(c, nptr, nclen);
2070 if (UCD_CATEGORY(c) != ucp_M) break;
2071 ncount++;
2072 nptr += nclen;
2073 }
2074 if (nptr >= end_subject && (md->moptions & PCRE_PARTIAL_HARD) != 0)
2075 reset_could_continue = TRUE;
2076 ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2077 }
2078 break;
2079 #endif
2080
2081 /*-----------------------------------------------------------------*/
2082 /* This is tricky like EXTUNI because it too can match more than one
2083 character (when CR is followed by LF). In this case, set up a negative
2084 state to wait for one character to pass before continuing. */
2085
2086 case OP_ANYNL:
2087 if (clen > 0) switch(c)
2088 {
2089 case 0x000b:
2090 case 0x000c:
2091 case 0x0085:
2092 case 0x2028:
2093 case 0x2029:
2094 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
2095 /* Fall through */
2096 case 0x000a:
2097 ADD_NEW(state_offset + 1, 0);
2098 break;
2099
2100 case 0x000d:
2101 if (ptr + 1 >= end_subject)
2102 {
2103 ADD_NEW(state_offset + 1, 0);
2104 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
2105 reset_could_continue = TRUE;
2106 }
2107 else if (ptr[1] == 0x0a)
2108 {
2109 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2110 }
2111 else
2112 {
2113 ADD_NEW(state_offset + 1, 0);
2114 }
2115 break;
2116 }
2117 break;
2118
2119 /*-----------------------------------------------------------------*/
2120 case OP_NOT_VSPACE:
2121 if (clen > 0) switch(c)
2122 {
2123 case 0x000a:
2124 case 0x000b:
2125 case 0x000c:
2126 case 0x000d:
2127 case 0x0085:
2128 case 0x2028:
2129 case 0x2029:
2130 break;
2131
2132 default:
2133 ADD_NEW(state_offset + 1, 0);
2134 break;
2135 }
2136 break;
2137
2138 /*-----------------------------------------------------------------*/
2139 case OP_VSPACE:
2140 if (clen > 0) switch(c)
2141 {
2142 case 0x000a:
2143 case 0x000b:
2144 case 0x000c:
2145 case 0x000d:
2146 case 0x0085:
2147 case 0x2028:
2148 case 0x2029:
2149 ADD_NEW(state_offset + 1, 0);
2150 break;
2151
2152 default: break;
2153 }
2154 break;
2155
2156 /*-----------------------------------------------------------------*/
2157 case OP_NOT_HSPACE:
2158 if (clen > 0) switch(c)
2159 {
2160 case 0x09: /* HT */
2161 case 0x20: /* SPACE */
2162 case 0xa0: /* NBSP */
2163 case 0x1680: /* OGHAM SPACE MARK */
2164 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2165 case 0x2000: /* EN QUAD */
2166 case 0x2001: /* EM QUAD */
2167 case 0x2002: /* EN SPACE */
2168 case 0x2003: /* EM SPACE */
2169 case 0x2004: /* THREE-PER-EM SPACE */
2170 case 0x2005: /* FOUR-PER-EM SPACE */
2171 case 0x2006: /* SIX-PER-EM SPACE */
2172 case 0x2007: /* FIGURE SPACE */
2173 case 0x2008: /* PUNCTUATION SPACE */
2174 case 0x2009: /* THIN SPACE */
2175 case 0x200A: /* HAIR SPACE */
2176 case 0x202f: /* NARROW NO-BREAK SPACE */
2177 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2178 case 0x3000: /* IDEOGRAPHIC SPACE */
2179 break;
2180
2181 default:
2182 ADD_NEW(state_offset + 1, 0);
2183 break;
2184 }
2185 break;
2186
2187 /*-----------------------------------------------------------------*/
2188 case OP_HSPACE:
2189 if (clen > 0) switch(c)
2190 {
2191 case 0x09: /* HT */
2192 case 0x20: /* SPACE */
2193 case 0xa0: /* NBSP */
2194 case 0x1680: /* OGHAM SPACE MARK */
2195 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2196 case 0x2000: /* EN QUAD */
2197 case 0x2001: /* EM QUAD */
2198 case 0x2002: /* EN SPACE */
2199 case 0x2003: /* EM SPACE */
2200 case 0x2004: /* THREE-PER-EM SPACE */
2201 case 0x2005: /* FOUR-PER-EM SPACE */
2202 case 0x2006: /* SIX-PER-EM SPACE */
2203 case 0x2007: /* FIGURE SPACE */
2204 case 0x2008: /* PUNCTUATION SPACE */
2205 case 0x2009: /* THIN SPACE */
2206 case 0x200A: /* HAIR SPACE */
2207 case 0x202f: /* NARROW NO-BREAK SPACE */
2208 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2209 case 0x3000: /* IDEOGRAPHIC SPACE */
2210 ADD_NEW(state_offset + 1, 0);
2211 break;
2212 }
2213 break;
2214
2215 /*-----------------------------------------------------------------*/
2216 /* Match a negated single character casefully. This is only used for
2217 one-byte characters, that is, we know that d < 256. The character we are
2218 checking (c) can be multibyte. */
2219
2220 case OP_NOT:
2221 if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2222 break;
2223
2224 /*-----------------------------------------------------------------*/
2225 /* Match a negated single character caselessly. This is only used for
2226 one-byte characters, that is, we know that d < 256. The character we are
2227 checking (c) can be multibyte. */
2228
2229 case OP_NOTI:
2230 if (clen > 0 && c != d && c != fcc[d])
2231 { ADD_NEW(state_offset + dlen + 1, 0); }
2232 break;
2233
2234 /*-----------------------------------------------------------------*/
2235 case OP_PLUSI:
2236 case OP_MINPLUSI:
2237 case OP_POSPLUSI:
2238 case OP_NOTPLUSI:
2239 case OP_NOTMINPLUSI:
2240 case OP_NOTPOSPLUSI:
2241 caseless = TRUE;
2242 codevalue -= OP_STARI - OP_STAR;
2243
2244 /* Fall through */
2245 case OP_PLUS:
2246 case OP_MINPLUS:
2247 case OP_POSPLUS:
2248 case OP_NOTPLUS:
2249 case OP_NOTMINPLUS:
2250 case OP_NOTPOSPLUS:
2251 count = current_state->count; /* Already matched */
2252 if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2253 if (clen > 0)
2254 {
2255 unsigned int otherd = NOTACHAR;
2256 if (caseless)
2257 {
2258 #ifdef SUPPORT_UTF
2259 if (utf && d >= 128)
2260 {
2261 #ifdef SUPPORT_UCP
2262 otherd = UCD_OTHERCASE(d);
2263 #endif /* SUPPORT_UCP */
2264 }
2265 else
2266 #endif /* SUPPORT_UTF */
2267 otherd = TABLE_GET(d, fcc, d);
2268 }
2269 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2270 {
2271 if (count > 0 &&
2272 (codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2273 {
2274 active_count--; /* Remove non-match possibility */
2275 next_active_state--;
2276 }
2277 count++;
2278 ADD_NEW(state_offset, count);
2279 }
2280 }
2281 break;
2282
2283 /*-----------------------------------------------------------------*/
2284 case OP_QUERYI:
2285 case OP_MINQUERYI:
2286 case OP_POSQUERYI:
2287 case OP_NOTQUERYI:
2288 case OP_NOTMINQUERYI:
2289 case OP_NOTPOSQUERYI:
2290 caseless = TRUE;
2291 codevalue -= OP_STARI - OP_STAR;
2292 /* Fall through */
2293 case OP_QUERY:
2294 case OP_MINQUERY:
2295 case OP_POSQUERY:
2296 case OP_NOTQUERY:
2297 case OP_NOTMINQUERY:
2298 case OP_NOTPOSQUERY:
2299 ADD_ACTIVE(state_offset + dlen + 1, 0);
2300 if (clen > 0)
2301 {
2302 unsigned int otherd = NOTACHAR;
2303 if (caseless)
2304 {
2305 #ifdef SUPPORT_UTF
2306 if (utf && d >= 128)
2307 {
2308 #ifdef SUPPORT_UCP
2309 otherd = UCD_OTHERCASE(d);
2310 #endif /* SUPPORT_UCP */
2311 }
2312 else
2313 #endif /* SUPPORT_UTF */
2314 otherd = TABLE_GET(d, fcc, d);
2315 }
2316 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2317 {
2318 if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2319 {
2320 active_count--; /* Remove non-match possibility */
2321 next_active_state--;
2322 }
2323 ADD_NEW(state_offset + dlen + 1, 0);
2324 }
2325 }
2326 break;
2327
2328 /*-----------------------------------------------------------------*/
2329 case OP_STARI:
2330 case OP_MINSTARI:
2331 case OP_POSSTARI:
2332 case OP_NOTSTARI:
2333 case OP_NOTMINSTARI:
2334 case OP_NOTPOSSTARI:
2335 caseless = TRUE;
2336 codevalue -= OP_STARI - OP_STAR;
2337 /* Fall through */
2338 case OP_STAR:
2339 case OP_MINSTAR:
2340 case OP_POSSTAR:
2341 case OP_NOTSTAR:
2342 case OP_NOTMINSTAR:
2343 case OP_NOTPOSSTAR:
2344 ADD_ACTIVE(state_offset + dlen + 1, 0);
2345 if (clen > 0)
2346 {
2347 unsigned int otherd = NOTACHAR;
2348 if (caseless)
2349 {
2350 #ifdef SUPPORT_UTF
2351 if (utf && d >= 128)
2352 {
2353 #ifdef SUPPORT_UCP
2354 otherd = UCD_OTHERCASE(d);
2355 #endif /* SUPPORT_UCP */
2356 }
2357 else
2358 #endif /* SUPPORT_UTF */
2359 otherd = TABLE_GET(d, fcc, d);
2360 }
2361 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2362 {
2363 if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2364 {
2365 active_count--; /* Remove non-match possibility */
2366 next_active_state--;
2367 }
2368 ADD_NEW(state_offset, 0);
2369 }
2370 }
2371 break;
2372
2373 /*-----------------------------------------------------------------*/
2374 case OP_EXACTI:
2375 case OP_NOTEXACTI:
2376 caseless = TRUE;
2377 codevalue -= OP_STARI - OP_STAR;
2378 /* Fall through */
2379 case OP_EXACT:
2380 case OP_NOTEXACT:
2381 count = current_state->count; /* Number already matched */
2382 if (clen > 0)
2383 {
2384 unsigned int otherd = NOTACHAR;
2385 if (caseless)
2386 {
2387 #ifdef SUPPORT_UTF
2388 if (utf && d >= 128)
2389 {
2390 #ifdef SUPPORT_UCP
2391 otherd = UCD_OTHERCASE(d);
2392 #endif /* SUPPORT_UCP */
2393 }
2394 else
2395 #endif /* SUPPORT_UTF */
2396 otherd = TABLE_GET(d, fcc, d);
2397 }
2398 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2399 {
2400 if (++count >= GET2(code, 1))
2401 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2402 else
2403 { ADD_NEW(state_offset, count); }
2404 }
2405 }
2406 break;
2407
2408 /*-----------------------------------------------------------------*/
2409 case OP_UPTOI:
2410 case OP_MINUPTOI:
2411 case OP_POSUPTOI:
2412 case OP_NOTUPTOI:
2413 case OP_NOTMINUPTOI:
2414 case OP_NOTPOSUPTOI:
2415 caseless = TRUE;
2416 codevalue -= OP_STARI - OP_STAR;
2417 /* Fall through */
2418 case OP_UPTO:
2419 case OP_MINUPTO:
2420 case OP_POSUPTO:
2421 case OP_NOTUPTO:
2422 case OP_NOTMINUPTO:
2423 case OP_NOTPOSUPTO:
2424 ADD_ACTIVE(state_offset + dlen + 1 + IMM2_SIZE, 0);
2425 count = current_state->count; /* Number already matched */
2426 if (clen > 0)
2427 {
2428 unsigned int otherd = NOTACHAR;
2429 if (caseless)
2430 {
2431 #ifdef SUPPORT_UTF
2432 if (utf && d >= 128)
2433 {
2434 #ifdef SUPPORT_UCP
2435 otherd = UCD_OTHERCASE(d);
2436 #endif /* SUPPORT_UCP */
2437 }
2438 else
2439 #endif /* SUPPORT_UTF */
2440 otherd = TABLE_GET(d, fcc, d);
2441 }
2442 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2443 {
2444 if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2445 {
2446 active_count--; /* Remove non-match possibility */
2447 next_active_state--;
2448 }
2449 if (++count >= GET2(code, 1))
2450 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2451 else
2452 { ADD_NEW(state_offset, count); }
2453 }
2454 }
2455 break;
2456
2457
2458 /* ========================================================================== */
2459 /* These are the class-handling opcodes */
2460
2461 case OP_CLASS:
2462 case OP_NCLASS:
2463 case OP_XCLASS:
2464 {
2465 BOOL isinclass = FALSE;
2466 int next_state_offset;
2467 const pcre_uchar *ecode;
2468
2469 /* For a simple class, there is always just a 32-byte table, and we
2470 can set isinclass from it. */
2471
2472 if (codevalue != OP_XCLASS)
2473 {
2474 ecode = code + 1 + (32 / sizeof(pcre_uchar));
2475 if (clen > 0)
2476 {
2477 isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2478 ((((pcre_uint8 *)(code + 1))[c/8] & (1 << (c&7))) != 0);
2479 }
2480 }
2481
2482 /* An extended class may have a table or a list of single characters,
2483 ranges, or both, and it may be positive or negative. There's a
2484 function that sorts all this out. */
2485
2486 else
2487 {
2488 ecode = code + GET(code, 1);
2489 if (clen > 0) isinclass = PRIV(xclass)(c, code + 1 + LINK_SIZE, utf);
2490 }
2491
2492 /* At this point, isinclass is set for all kinds of class, and ecode
2493 points to the byte after the end of the class. If there is a
2494 quantifier, this is where it will be. */
2495
2496 next_state_offset = (int)(ecode - start_code);
2497
2498 switch (*ecode)
2499 {
2500 case OP_CRSTAR:
2501 case OP_CRMINSTAR:
2502 ADD_ACTIVE(next_state_offset + 1, 0);
2503 if (isinclass) { ADD_NEW(state_offset, 0); }
2504 break;
2505
2506 case OP_CRPLUS:
2507 case OP_CRMINPLUS:
2508 count = current_state->count; /* Already matched */
2509 if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2510 if (isinclass) { count++; ADD_NEW(state_offset, count); }
2511 break;
2512
2513 case OP_CRQUERY:
2514 case OP_CRMINQUERY:
2515 ADD_ACTIVE(next_state_offset + 1, 0);
2516 if (isinclass) { ADD_NEW(next_state_offset + 1, 0); }
2517 break;
2518
2519 case OP_CRRANGE:
2520 case OP_CRMINRANGE:
2521 count = current_state->count; /* Already matched */
2522 if (count >= GET2(ecode, 1))
2523 { ADD_ACTIVE(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2524 if (isinclass)
2525 {
2526 int max = GET2(ecode, 1 + IMM2_SIZE);
2527 if (++count >= max && max != 0) /* Max 0 => no limit */
2528 { ADD_NEW(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2529 else
2530 { ADD_NEW(state_offset, count); }
2531 }
2532 break;
2533
2534 default:
2535 if (isinclass) { ADD_NEW(next_state_offset, 0); }
2536 break;
2537 }
2538 }
2539 break;
2540
2541 /* ========================================================================== */
2542 /* These are the opcodes for fancy brackets of various kinds. We have
2543 to use recursion in order to handle them. The "always failing" assertion
2544 (?!) is optimised to OP_FAIL when compiling, so we have to support that,
2545 though the other "backtracking verbs" are not supported. */
2546
2547 case OP_FAIL:
2548 forced_fail++; /* Count FAILs for multiple states */
2549 break;
2550
2551 case OP_ASSERT:
2552 case OP_ASSERT_NOT:
2553 case OP_ASSERTBACK:
2554 case OP_ASSERTBACK_NOT:
2555 {
2556 int rc;
2557 int local_offsets[2];
2558 int local_workspace[1000];
2559 const pcre_uchar *endasscode = code + GET(code, 1);
2560
2561 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2562
2563 rc = internal_dfa_exec(
2564 md, /* static match data */
2565 code, /* this subexpression's code */
2566 ptr, /* where we currently are */
2567 (int)(ptr - start_subject), /* start offset */
2568 local_offsets, /* offset vector */
2569 sizeof(local_offsets)/sizeof(int), /* size of same */
2570 local_workspace, /* workspace vector */
2571 sizeof(local_workspace)/sizeof(int), /* size of same */
2572 rlevel); /* function recursion level */
2573
2574 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2575 if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2576 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2577 }
2578 break;
2579
2580 /*-----------------------------------------------------------------*/
2581 case OP_COND:
2582 case OP_SCOND:
2583 {
2584 int local_offsets[1000];
2585 int local_workspace[1000];
2586 int codelink = GET(code, 1);
2587 int condcode;
2588
2589 /* Because of the way auto-callout works during compile, a callout item
2590 is inserted between OP_COND and an assertion condition. This does not
2591 happen for the other conditions. */
2592
2593 if (code[LINK_SIZE+1] == OP_CALLOUT)
2594 {
2595 rrc = 0;
2596 if (PUBL(callout) != NULL)
2597 {
2598 PUBL(callout_block) cb;
2599 cb.version = 1; /* Version 1 of the callout block */
2600 cb.callout_number = code[LINK_SIZE+2];
2601 cb.offset_vector = offsets;
2602 #ifdef COMPILE_PCRE8
2603 cb.subject = (PCRE_SPTR)start_subject;
2604 #else
2605 cb.subject = (PCRE_SPTR16)start_subject;
2606 #endif
2607 cb.subject_length = (int)(end_subject - start_subject);
2608 cb.start_match = (int)(current_subject - start_subject);
2609 cb.current_position = (int)(ptr - start_subject);
2610 cb.pattern_position = GET(code, LINK_SIZE + 3);
2611 cb.next_item_length = GET(code, 3 + 2*LINK_SIZE);
2612 cb.capture_top = 1;
2613 cb.capture_last = -1;
2614 cb.callout_data = md->callout_data;
2615 cb.mark = NULL; /* No (*MARK) support */
2616 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2617 }
2618 if (rrc > 0) break; /* Fail this thread */
2619 code += PRIV(OP_lengths)[OP_CALLOUT]; /* Skip callout data */
2620 }
2621
2622 condcode = code[LINK_SIZE+1];
2623
2624 /* Back reference conditions are not supported */
2625
2626 if (condcode == OP_CREF || condcode == OP_NCREF)
2627 return PCRE_ERROR_DFA_UCOND;
2628
2629 /* The DEFINE condition is always false */
2630
2631 if (condcode == OP_DEF)
2632 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2633
2634 /* The only supported version of OP_RREF is for the value RREF_ANY,
2635 which means "test if in any recursion". We can't test for specifically
2636 recursed groups. */
2637
2638 else if (condcode == OP_RREF || condcode == OP_NRREF)
2639 {
2640 int value = GET2(code, LINK_SIZE + 2);
2641 if (value != RREF_ANY) return PCRE_ERROR_DFA_UCOND;
2642 if (md->recursive != NULL)
2643 { ADD_ACTIVE(state_offset + LINK_SIZE + 2 + IMM2_SIZE, 0); }
2644 else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2645 }
2646
2647 /* Otherwise, the condition is an assertion */
2648
2649 else
2650 {
2651 int rc;
2652 const pcre_uchar *asscode = code + LINK_SIZE + 1;
2653 const pcre_uchar *endasscode = asscode + GET(asscode, 1);
2654
2655 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2656
2657 rc = internal_dfa_exec(
2658 md, /* fixed match data */
2659 asscode, /* this subexpression's code */
2660 ptr, /* where we currently are */
2661 (int)(ptr - start_subject), /* start offset */
2662 local_offsets, /* offset vector */
2663 sizeof(local_offsets)/sizeof(int), /* size of same */
2664 local_workspace, /* workspace vector */
2665 sizeof(local_workspace)/sizeof(int), /* size of same */
2666 rlevel); /* function recursion level */
2667
2668 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2669 if ((rc >= 0) ==
2670 (condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2671 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2672 else
2673 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2674 }
2675 }
2676 break;
2677
2678 /*-----------------------------------------------------------------*/
2679 case OP_RECURSE:
2680 {
2681 dfa_recursion_info *ri;
2682 int local_offsets[1000];
2683 int local_workspace[1000];
2684 const pcre_uchar *callpat = start_code + GET(code, 1);
2685 int recno = (callpat == md->start_code)? 0 :
2686 GET2(callpat, 1 + LINK_SIZE);
2687 int rc;
2688
2689 DPRINTF(("%.*sStarting regex recursion\n", rlevel*2-2, SP));
2690
2691 /* Check for repeating a recursion without advancing the subject
2692 pointer. This should catch convoluted mutual recursions. (Some simple
2693 cases are caught at compile time.) */
2694
2695 for (ri = md->recursive; ri != NULL; ri = ri->prevrec)
2696 if (recno == ri->group_num && ptr == ri->subject_position)
2697 return PCRE_ERROR_RECURSELOOP;
2698
2699 /* Remember this recursion and where we started it so as to
2700 catch infinite loops. */
2701
2702 new_recursive.group_num = recno;
2703 new_recursive.subject_position = ptr;
2704 new_recursive.prevrec = md->recursive;
2705 md->recursive = &new_recursive;
2706
2707 rc = internal_dfa_exec(
2708 md, /* fixed match data */
2709 callpat, /* this subexpression's code */
2710 ptr, /* where we currently are */
2711 (int)(ptr - start_subject), /* start offset */
2712 local_offsets, /* offset vector */
2713 sizeof(local_offsets)/sizeof(int), /* size of same */
2714 local_workspace, /* workspace vector */
2715 sizeof(local_workspace)/sizeof(int), /* size of same */
2716 rlevel); /* function recursion level */
2717
2718 md->recursive = new_recursive.prevrec; /* Done this recursion */
2719
2720 DPRINTF(("%.*sReturn from regex recursion: rc=%d\n", rlevel*2-2, SP,
2721 rc));
2722
2723 /* Ran out of internal offsets */
2724
2725 if (rc == 0) return PCRE_ERROR_DFA_RECURSE;
2726
2727 /* For each successful matched substring, set up the next state with a
2728 count of characters to skip before trying it. Note that the count is in
2729 characters, not bytes. */
2730
2731 if (rc > 0)
2732 {
2733 for (rc = rc*2 - 2; rc >= 0; rc -= 2)
2734 {
2735 int charcount = local_offsets[rc+1] - local_offsets[rc];
2736 #ifdef SUPPORT_UTF
2737 const pcre_uchar *p = start_subject + local_offsets[rc];
2738 const pcre_uchar *pp = start_subject + local_offsets[rc+1];
2739 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2740 #endif
2741 if (charcount > 0)
2742 {
2743 ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0, (charcount - 1));
2744 }
2745 else
2746 {
2747 ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
2748 }
2749 }
2750 }
2751 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2752 }
2753 break;
2754
2755 /*-----------------------------------------------------------------*/
2756 case OP_BRAPOS:
2757 case OP_SBRAPOS:
2758 case OP_CBRAPOS:
2759 case OP_SCBRAPOS:
2760 case OP_BRAPOSZERO:
2761 {
2762 int charcount, matched_count;
2763 const pcre_uchar *local_ptr = ptr;
2764 BOOL allow_zero;
2765
2766 if (codevalue == OP_BRAPOSZERO)
2767 {
2768 allow_zero = TRUE;
2769 codevalue = *(++code); /* Codevalue will be one of above BRAs */
2770 }
2771 else allow_zero = FALSE;
2772
2773 /* Loop to match the subpattern as many times as possible as if it were
2774 a complete pattern. */
2775
2776 for (matched_count = 0;; matched_count++)
2777 {
2778 int local_offsets[2];
2779 int local_workspace[1000];
2780
2781 int rc = internal_dfa_exec(
2782 md, /* fixed match data */
2783 code, /* this subexpression's code */
2784 local_ptr, /* where we currently are */
2785 (int)(ptr - start_subject), /* start offset */
2786 local_offsets, /* offset vector */
2787 sizeof(local_offsets)/sizeof(int), /* size of same */
2788 local_workspace, /* workspace vector */
2789 sizeof(local_workspace)/sizeof(int), /* size of same */
2790 rlevel); /* function recursion level */
2791
2792 /* Failed to match */
2793
2794 if (rc < 0)
2795 {
2796 if (rc != PCRE_ERROR_NOMATCH) return rc;
2797 break;
2798 }
2799
2800 /* Matched: break the loop if zero characters matched. */
2801
2802 charcount = local_offsets[1] - local_offsets[0];
2803 if (charcount == 0) break;
2804 local_ptr += charcount; /* Advance temporary position ptr */
2805 }
2806
2807 /* At this point we have matched the subpattern matched_count
2808 times, and local_ptr is pointing to the character after the end of the
2809 last match. */
2810
2811 if (matched_count > 0 || allow_zero)
2812 {
2813 const pcre_uchar *end_subpattern = code;
2814 int next_state_offset;
2815
2816 do { end_subpattern += GET(end_subpattern, 1); }
2817 while (*end_subpattern == OP_ALT);
2818 next_state_offset =
2819 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2820
2821 /* Optimization: if there are no more active states, and there
2822 are no new states yet set up, then skip over the subject string
2823 right here, to save looping. Otherwise, set up the new state to swing
2824 into action when the end of the matched substring is reached. */
2825
2826 if (i + 1 >= active_count && new_count == 0)
2827 {
2828 ptr = local_ptr;
2829 clen = 0;
2830 ADD_NEW(next_state_offset, 0);
2831 }
2832 else
2833 {
2834 const pcre_uchar *p = ptr;
2835 const pcre_uchar *pp = local_ptr;
2836 charcount = (int)(pp - p);
2837 #ifdef SUPPORT_UTF
2838 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2839 #endif
2840 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2841 }
2842 }
2843 }
2844 break;
2845
2846 /*-----------------------------------------------------------------*/
2847 case OP_ONCE:
2848 case OP_ONCE_NC:
2849 {
2850 int local_offsets[2];
2851 int local_workspace[1000];
2852
2853 int rc = internal_dfa_exec(
2854 md, /* fixed match data */
2855 code, /* this subexpression's code */
2856 ptr, /* where we currently are */
2857 (int)(ptr - start_subject), /* start offset */
2858 local_offsets, /* offset vector */
2859 sizeof(local_offsets)/sizeof(int), /* size of same */
2860 local_workspace, /* workspace vector */
2861 sizeof(local_workspace)/sizeof(int), /* size of same */
2862 rlevel); /* function recursion level */
2863
2864 if (rc >= 0)
2865 {
2866 const pcre_uchar *end_subpattern = code;
2867 int charcount = local_offsets[1] - local_offsets[0];
2868 int next_state_offset, repeat_state_offset;
2869
2870 do { end_subpattern += GET(end_subpattern, 1); }
2871 while (*end_subpattern == OP_ALT);
2872 next_state_offset =
2873 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2874
2875 /* If the end of this subpattern is KETRMAX or KETRMIN, we must
2876 arrange for the repeat state also to be added to the relevant list.
2877 Calculate the offset, or set -1 for no repeat. */
2878
2879 repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
2880 *end_subpattern == OP_KETRMIN)?
2881 (int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
2882
2883 /* If we have matched an empty string, add the next state at the
2884 current character pointer. This is important so that the duplicate
2885 checking kicks in, which is what breaks infinite loops that match an
2886 empty string. */
2887
2888 if (charcount == 0)
2889 {
2890 ADD_ACTIVE(next_state_offset, 0);
2891 }
2892
2893 /* Optimization: if there are no more active states, and there
2894 are no new states yet set up, then skip over the subject string
2895 right here, to save looping. Otherwise, set up the new state to swing
2896 into action when the end of the matched substring is reached. */
2897
2898 else if (i + 1 >= active_count && new_count == 0)
2899 {
2900 ptr += charcount;
2901 clen = 0;
2902 ADD_NEW(next_state_offset, 0);
2903
2904 /* If we are adding a repeat state at the new character position,
2905 we must fudge things so that it is the only current state.
2906 Otherwise, it might be a duplicate of one we processed before, and
2907 that would cause it to be skipped. */
2908
2909 if (repeat_state_offset >= 0)
2910 {
2911 next_active_state = active_states;
2912 active_count = 0;
2913 i = -1;
2914 ADD_ACTIVE(repeat_state_offset, 0);
2915 }
2916 }
2917 else
2918 {
2919 #ifdef SUPPORT_UTF
2920 const pcre_uchar *p = start_subject + local_offsets[0];
2921 const pcre_uchar *pp = start_subject + local_offsets[1];
2922 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2923 #endif
2924 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2925 if (repeat_state_offset >= 0)
2926 { ADD_NEW_DATA(-repeat_state_offset, 0, (charcount - 1)); }
2927 }
2928 }
2929 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2930 }
2931 break;
2932
2933
2934 /* ========================================================================== */
2935 /* Handle callouts */
2936
2937 case OP_CALLOUT:
2938 rrc = 0;
2939 if (PUBL(callout) != NULL)
2940 {
2941 PUBL(callout_block) cb;
2942 cb.version = 1; /* Version 1 of the callout block */
2943 cb.callout_number = code[1];
2944 cb.offset_vector = offsets;
2945 #ifdef COMPILE_PCRE8
2946 cb.subject = (PCRE_SPTR)start_subject;
2947 #else
2948 cb.subject = (PCRE_SPTR16)start_subject;
2949 #endif
2950 cb.subject_length = (int)(end_subject - start_subject);
2951 cb.start_match = (int)(current_subject - start_subject);
2952 cb.current_position = (int)(ptr - start_subject);
2953 cb.pattern_position = GET(code, 2);
2954 cb.next_item_length = GET(code, 2 + LINK_SIZE);
2955 cb.capture_top = 1;
2956 cb.capture_last = -1;
2957 cb.callout_data = md->callout_data;
2958 cb.mark = NULL; /* No (*MARK) support */
2959 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2960 }
2961 if (rrc == 0)
2962 { ADD_ACTIVE(state_offset + PRIV(OP_lengths)[OP_CALLOUT], 0); }
2963 break;
2964
2965
2966 /* ========================================================================== */
2967 default: /* Unsupported opcode */
2968 return PCRE_ERROR_DFA_UITEM;
2969 }
2970
2971 NEXT_ACTIVE_STATE: continue;
2972
2973 } /* End of loop scanning active states */
2974
2975 /* We have finished the processing at the current subject character. If no
2976 new states have been set for the next character, we have found all the
2977 matches that we are going to find. If we are at the top level and partial
2978 matching has been requested, check for appropriate conditions.
2979
2980 The "forced_fail" variable counts the number of (*F) encountered for the
2981 character. If it is equal to the original active_count (saved in
2982 workspace[1]) it means that (*F) was found on every active state. In this
2983 case we don't want to give a partial match.
2984
2985 The "could_continue" variable is true if a state could have continued but
2986 for the fact that the end of the subject was reached. */
2987
2988 if (new_count <= 0)
2989 {
2990 if (rlevel == 1 && /* Top level, and */
2991 could_continue && /* Some could go on */
2992 forced_fail != workspace[1] && /* Not all forced fail & */
2993 ( /* either... */
2994 (md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
2995 || /* or... */
2996 ((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
2997 match_count < 0) /* no matches */
2998 ) && /* And... */
2999 (
3000 ptr >= end_subject || /* Reached end of subject or */
3001 partial_newline /* a partial newline */
3002 ) &&
3003 ptr > md->start_used_ptr) /* Inspected non-empty string */
3004 {
3005 if (offsetcount >= 2)
3006 {
3007 offsets[0] = (int)(md->start_used_ptr - start_subject);
3008 offsets[1] = (int)(end_subject - start_subject);
3009 }
3010 match_count = PCRE_ERROR_PARTIAL;
3011 }
3012
3013 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
3014 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel, match_count,
3015 rlevel*2-2, SP));
3016 break; /* In effect, "return", but see the comment below */
3017 }
3018
3019 /* One or more states are active for the next character. */
3020
3021 ptr += clen; /* Advance to next subject character */
3022 } /* Loop to move along the subject string */
3023
3024 /* Control gets here from "break" a few lines above. We do it this way because
3025 if we use "return" above, we have compiler trouble. Some compilers warn if
3026 there's nothing here because they think the function doesn't return a value. On
3027 the other hand, if we put a dummy statement here, some more clever compilers
3028 complain that it can't be reached. Sigh. */
3029
3030 return match_count;
3031 }
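/* Illustrative sketch, not part of PCRE. The OP_RECURSE, OP_BRAPOS and
OP_ONCE cases above convert a matched byte length into a character count
before setting up a negative ("skip") state. The hypothetical helper below
shows the same idea for the 8-bit UTF-8 case only: start from the byte
length and subtract every continuation byte (10xxxxxx). It assumes the
input is valid UTF-8; the name utf8_char_count is an assumption made for
this example. */

```c
/* Hypothetical helper: number of UTF-8 characters in the range [p, pp). */
static int utf8_char_count(const unsigned char *p, const unsigned char *pp)
{
int charcount = (int)(pp - p);          /* Start with the byte count */
while (p < pp)
  if ((*p++ & 0xc0) == 0x80) charcount--;  /* Continuation byte: not a new char */
return charcount;
}
```

/* For a match of N characters, the cases above store N - 1 as the state's
data, so that the state waits for the remaining characters to pass. */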
3032
3033
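/* Illustrative sketch, not part of PCRE. For OP_CLASS and OP_NCLASS the
code above tests the current character against a 32-byte (256-bit) table.
The hypothetical function below restates that test on its own, assuming an
8-bit table laid out as in the class-handling case: bit (c & 7) of byte
c/8 is set when c is in the class, and a character above 255 can match
only a negated class. The name in_class_bitmap and its parameters are
assumptions made for this example. */

```c
/* Hypothetical helper: nonzero if c matches the class described by map. */
static int in_class_bitmap(const unsigned char map[32], unsigned int c,
  int negated)
{
if (c > 255) return negated != 0;       /* Outside the table: only NCLASS */
return (map[c/8] & (1 << (c&7))) != 0;  /* Look up the bit for c */
}
```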
3034
3035
3036 /*************************************************
3037 * Execute a Regular Expression - DFA engine *
3038 *************************************************/
3039
3040 /* This external function applies a compiled re to a subject string using a DFA
3041 engine. This function calls the internal function multiple times if the pattern
3042 is not anchored.
3043
3044 Arguments:
3045 argument_re points to the compiled expression
3046 extra_data points to extra data or is NULL
3047 subject points to the subject string
3048 length length of subject string (may contain binary zeros)
3049 start_offset where to start in the subject string
3050 options option bits
3051 offsets vector of match offsets
3052 offsetcount size of same
3053 workspace workspace vector
3054 wscount size of same
3055
3056 Returns: > 0 => number of match offset pairs placed in offsets
3057 = 0 => offsets overflowed; longest matches are present
3058 -1 => failed to match
3059 < -1 => some kind of unexpected problem
3060 */
3061
3062 #ifdef COMPILE_PCRE8
3063 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3064 pcre_dfa_exec(const pcre *argument_re, const pcre_extra *extra_data,
3065 const char *subject, int length, int start_offset, int options, int *offsets,
3066 int offsetcount, int *workspace, int wscount)
3067 #else
3068 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3069 pcre16_dfa_exec(const pcre16 *argument_re, const pcre16_extra *extra_data,
3070 PCRE_SPTR16 subject, int length, int start_offset, int options, int *offsets,
3071 int offsetcount, int *workspace, int wscount)
3072 #endif
3073 {
REAL_PCRE *re = (REAL_PCRE *)argument_re;
dfa_match_data match_block;
dfa_match_data *md = &match_block;
BOOL utf, anchored, startline, firstline;
const pcre_uchar *current_subject, *end_subject;
const pcre_study_data *study = NULL;

const pcre_uchar *req_char_ptr;
const pcre_uint8 *start_bits = NULL;
BOOL has_first_char = FALSE;
BOOL has_req_char = FALSE;
pcre_uchar first_char = 0;
pcre_uchar first_char2 = 0;
pcre_uchar req_char = 0;
pcre_uchar req_char2 = 0;
int newline;

/* Plausibility checks */

if ((options & ~PUBLIC_DFA_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
if (re == NULL || subject == NULL || workspace == NULL ||
   (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
if (wscount < 20) return PCRE_ERROR_DFA_WSSIZE;
if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;

/* We need to find the pointer to any study data before we test for byte
flipping, so we scan the extra_data block first. This may set two fields in the
match block, so we must initialize them beforehand. However, the other fields
in the match block must not be set until after the byte flipping. */

md->tables = re->tables;
md->callout_data = NULL;

if (extra_data != NULL)
  {
  unsigned int flags = extra_data->flags;
  if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
    study = (const pcre_study_data *)extra_data->study_data;
  if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0) return PCRE_ERROR_DFA_UMLIMIT;
  if ((flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION) != 0)
    return PCRE_ERROR_DFA_UMLIMIT;
  if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
    md->callout_data = extra_data->callout_data;
  if ((flags & PCRE_EXTRA_TABLES) != 0)
    md->tables = extra_data->tables;
  }

/* Check that the first field in the block is the magic number. If it is not,
return with PCRE_ERROR_BADMAGIC. However, if the magic number is equal to
REVERSED_MAGIC_NUMBER we return with PCRE_ERROR_BADENDIANNESS, which
means that the pattern is likely compiled with different endianness. */

if (re->magic_number != MAGIC_NUMBER)
  return re->magic_number == REVERSED_MAGIC_NUMBER?
    PCRE_ERROR_BADENDIANNESS:PCRE_ERROR_BADMAGIC;
if ((re->flags & PCRE_MODE) == 0) return PCRE_ERROR_BADMODE;

/* Set some local values */

current_subject = (const pcre_uchar *)subject + start_offset;
end_subject = (const pcre_uchar *)subject + length;
req_char_ptr = current_subject - 1;

#ifdef SUPPORT_UTF
/* PCRE_UTF16 has the same value as PCRE_UTF8. */
utf = (re->options & PCRE_UTF8) != 0;
#else
utf = FALSE;
#endif

anchored = (options & (PCRE_ANCHORED|PCRE_DFA_RESTART)) != 0 ||
  (re->options & PCRE_ANCHORED) != 0;

/* The remaining fixed data for passing around. */

md->start_code = (const pcre_uchar *)argument_re +
    re->name_table_offset + re->name_count * re->name_entry_size;
md->start_subject = (const pcre_uchar *)subject;
md->end_subject = end_subject;
md->start_offset = start_offset;
md->moptions = options;
md->poptions = re->options;

/* If the BSR option is not set at match time, copy what was set
at compile time. */

if ((md->moptions & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) == 0)
  {
  if ((re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) != 0)
    md->moptions |= re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE);
#ifdef BSR_ANYCRLF
  else md->moptions |= PCRE_BSR_ANYCRLF;
#endif
  }

/* Handle different types of newline. The three bits give eight cases. If
nothing is set at run time, whatever was used at compile time applies. */

switch ((((options & PCRE_NEWLINE_BITS) == 0)? re->options : (pcre_uint32)options) &
         PCRE_NEWLINE_BITS)
  {
  case 0: newline = NEWLINE; break;              /* Compile-time default */
  case PCRE_NEWLINE_CR: newline = CHAR_CR; break;
  case PCRE_NEWLINE_LF: newline = CHAR_NL; break;
  case PCRE_NEWLINE_CR+
       PCRE_NEWLINE_LF: newline = (CHAR_CR << 8) | CHAR_NL; break;
  case PCRE_NEWLINE_ANY: newline = -1; break;
  case PCRE_NEWLINE_ANYCRLF: newline = -2; break;
  default: return PCRE_ERROR_BADNEWLINE;
  }

if (newline == -2)
  {
  md->nltype = NLTYPE_ANYCRLF;
  }
else if (newline < 0)
  {
  md->nltype = NLTYPE_ANY;
  }
else
  {
  md->nltype = NLTYPE_FIXED;
  if (newline > 255)
    {
    md->nllen = 2;
    md->nl[0] = (newline >> 8) & 255;
    md->nl[1] = newline & 255;
    }
  else
    {
    md->nllen = 1;
    md->nl[0] = newline;
    }
  }

/* Check a UTF string if required. If the string is invalid and the offsets
vector has room for at least two values, pass back the error offset in
offsets[0] and the error code in offsets[1]. */

#ifdef SUPPORT_UTF
if (utf && (options & PCRE_NO_UTF8_CHECK) == 0)
  {
  int erroroffset;
  int errorcode = PRIV(valid_utf)((pcre_uchar *)subject, length, &erroroffset);
  if (errorcode != 0)
    {
    if (offsetcount >= 2)
      {
      offsets[0] = erroroffset;
      offsets[1] = errorcode;
      }
    return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0)?
      PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
    }
  if (start_offset > 0 && start_offset < length &&
        NOT_FIRSTCHAR(((PCRE_PUCHAR)subject)[start_offset]))
    return PCRE_ERROR_BADUTF8_OFFSET;
  }
#endif

/* If the exec call supplied NULL for tables, use the inbuilt ones. This
is a feature that makes it possible to save compiled regexes and re-use them
in other programs later. */

if (md->tables == NULL) md->tables = PRIV(default_tables);

/* The "must be at the start of a line" flags are used in a loop when finding
where to start. */

startline = (re->flags & PCRE_STARTLINE) != 0;
firstline = (re->options & PCRE_FIRSTLINE) != 0;

/* Set up the first character to match, if available. The first_char value is
never set for an anchored regular expression, but the anchoring may be forced
at run time, so we have to test for anchoring. The first char may be unset for
an unanchored pattern, of course. If there's no first char and the pattern was
studied, there may be a bitmap of possible first characters. */

if (!anchored)
  {
  if ((re->flags & PCRE_FIRSTSET) != 0)
    {
    has_first_char = TRUE;
    first_char = first_char2 = (pcre_uchar)(re->first_char);
    if ((re->flags & PCRE_FCH_CASELESS) != 0)
      {
      first_char2 = TABLE_GET(first_char, md->tables + fcc_offset, first_char);
#if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
      if (utf && first_char > 127)
        first_char2 = UCD_OTHERCASE(first_char);
#endif
      }
    }
  else
    {
    if (!startline && study != NULL &&
         (study->flags & PCRE_STUDY_MAPPED) != 0)
      start_bits = study->start_bits;
    }
  }

/* For anchored or unanchored matches, there may be a "last known required
character" set. */

if ((re->flags & PCRE_REQCHSET) != 0)
  {
  has_req_char = TRUE;
  req_char = req_char2 = (pcre_uchar)(re->req_char);
  if ((re->flags & PCRE_RCH_CASELESS) != 0)
    {
    req_char2 = TABLE_GET(req_char, md->tables + fcc_offset, req_char);
#if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
    if (utf && req_char > 127)
      req_char2 = UCD_OTHERCASE(req_char);
#endif
    }
  }

/* Call the main matching function, looping for a non-anchored regex after a
failed match. If not restarting, perform certain optimizations at the start of
a match. */

for (;;)
  {
  int rc;

  if ((options & PCRE_DFA_RESTART) == 0)
    {
    const pcre_uchar *save_end_subject = end_subject;

    /* If firstline is TRUE, the start of the match is constrained to the first
    line of a multiline string. Implement this by temporarily adjusting
    end_subject so that we stop scanning at a newline. If the match fails at
    the newline, later code breaks this loop. */

    if (firstline)
      {
      PCRE_PUCHAR t = current_subject;
#ifdef SUPPORT_UTF
      if (utf)
        {
        while (t < md->end_subject && !IS_NEWLINE(t))
          {
          t++;
          ACROSSCHAR(t < end_subject, *t, t++);
          }
        }
      else
#endif
      while (t < md->end_subject && !IS_NEWLINE(t)) t++;
      end_subject = t;
      }

    /* There are some optimizations that avoid running the match if a known
    starting point is not found. However, there is an option that disables
    these, for testing and for ensuring that all callouts do actually occur.
    The option can be set in the regex by (*NO_START_OPT) or passed in
    match-time options. */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
      {
      /* Advance to a known first char. */

      if (has_first_char)
        {
        if (first_char != first_char2)
          while (current_subject < end_subject &&
              *current_subject != first_char && *current_subject != first_char2)
            current_subject++;
        else
          while (current_subject < end_subject &&
                 *current_subject != first_char)
            current_subject++;
        }

      /* Or to just after a linebreak for a multiline match if possible */

      else if (startline)
        {
        if (current_subject > md->start_subject + start_offset)
          {
#ifdef SUPPORT_UTF
          if (utf)
            {
            while (current_subject < end_subject &&
                   !WAS_NEWLINE(current_subject))
              {
              current_subject++;
              ACROSSCHAR(current_subject < end_subject, *current_subject,
                current_subject++);
              }
            }
          else
#endif
          while (current_subject < end_subject && !WAS_NEWLINE(current_subject))
            current_subject++;

          /* If we have just passed a CR and the newline option is ANY or
          ANYCRLF, and we are now at a LF, advance the match position by one
          more character. */

          if (current_subject[-1] == CHAR_CR &&
               (md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
               current_subject < end_subject &&
               *current_subject == CHAR_NL)
            current_subject++;
          }
        }

      /* Or to a non-unique first char after study */

      else if (start_bits != NULL)
        {
        while (current_subject < end_subject)
          {
          register unsigned int c = *current_subject;
#ifndef COMPILE_PCRE8
          if (c > 255) c = 255;
#endif
          if ((start_bits[c/8] & (1 << (c&7))) == 0)
            {
            current_subject++;
#if defined SUPPORT_UTF && defined COMPILE_PCRE8
            /* In non 8-bit mode, the iteration will stop for
            characters > 255 at the beginning or not stop at all. */
            if (utf)
              ACROSSCHAR(current_subject < end_subject, *current_subject,
                current_subject++);
#endif
            }
          else break;
          }
        }
      }

    /* Restore fudged end_subject */

    end_subject = save_end_subject;

    /* The following two optimizations are disabled for partial matching or if
    disabling is explicitly requested (and of course, by the test above, this
    code is not obeyed when restarting after a partial match). */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0 &&
        (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
      {
      /* If the pattern was studied, a minimum subject length may be set. This
      is a lower bound; there may be no string of exactly that length that
      matches the pattern. Although the value is, strictly, in characters, we
      treat it as code units (bytes in the 8-bit library) to avoid spending
      too much time in this optimization. */

      if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
          (pcre_uint32)(end_subject - current_subject) < study->minlength)
        return PCRE_ERROR_NOMATCH;

      /* If req_char is set, we know that that character must appear in the
      subject for the match to succeed. If the first character is set, req_char
      must be later in the subject; otherwise the test starts at the match
      point. This optimization can save a huge amount of work in patterns with
      nested unlimited repeats that aren't going to match. Writing separate
      code for cased/caseless versions makes it go faster, as does using an
      autoincrement and backing off on a match.

      HOWEVER: when the subject string is very, very long, searching to its end
      can take a long time, and give bad performance on quite ordinary
      patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
      string... so we don't do this when the string is sufficiently long. */

      if (has_req_char && end_subject - current_subject < REQ_BYTE_MAX)
        {
        register PCRE_PUCHAR p = current_subject + (has_first_char? 1:0);

        /* We don't need to repeat the search if we haven't yet reached the
        place we found it at last time. */

        if (p > req_char_ptr)
          {
          if (req_char != req_char2)
            {
            while (p < end_subject)
              {
              register int pp = *p++;
              if (pp == req_char || pp == req_char2) { p--; break; }
              }
            }
          else
            {
            while (p < end_subject)
              {
              if (*p++ == req_char) { p--; break; }
              }
            }

          /* If we can't find the required character, break the matching loop,
          which will cause a return of PCRE_ERROR_NOMATCH. */

          if (p >= end_subject) break;

          /* If we have found the required character, save the point where we
          found it, so that we don't search again next time round the loop if
          the start hasn't passed this character yet. */

          req_char_ptr = p;
          }
        }
      }
    }   /* End of optimizations that are done when not restarting */

  /* OK, now we can do the business */

  md->start_used_ptr = current_subject;
  md->recursive = NULL;

  rc = internal_dfa_exec(
    md,                                /* fixed match data */
    md->start_code,                    /* this subexpression's code */
    current_subject,                   /* where we currently are */
    start_offset,                      /* start offset in subject */
    offsets,                           /* offset vector */
    offsetcount,                       /* size of same */
    workspace,                         /* workspace vector */
    wscount,                           /* size of same */
    0);                                /* function recurse level */

  /* Anything other than "no match" means we are done, always; otherwise, carry
  on only if not anchored. */

  if (rc != PCRE_ERROR_NOMATCH || anchored) return rc;

  /* Advance to the next subject character unless we are at the end of a line
  and firstline is set. */

  if (firstline && IS_NEWLINE(current_subject)) break;
  current_subject++;
#ifdef SUPPORT_UTF
  if (utf)
    {
    ACROSSCHAR(current_subject < end_subject, *current_subject,
      current_subject++);
    }
#endif
  if (current_subject > end_subject) break;

  /* If we have just passed a CR and we are now at a LF, and the pattern does
  not contain any explicit matches for \r or \n, and the newline option is CRLF
  or ANY or ANYCRLF, advance the match position by one more character. */

  if (current_subject[-1] == CHAR_CR &&
      current_subject < end_subject &&
      *current_subject == CHAR_NL &&
      (re->flags & PCRE_HASCRORLF) == 0 &&
        (md->nltype == NLTYPE_ANY ||
         md->nltype == NLTYPE_ANYCRLF ||
         md->nllen == 2))
    current_subject++;

  }   /* "Bumpalong" loop */

return PCRE_ERROR_NOMATCH;
}
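
The newline handling in the function above packs the selected convention into a single int (a negative sentinel, one character, or two characters with the first in the high byte) and then unpacks it into the match block. The following is a minimal standalone sketch of that unpacking step only. The names nl_kind, nl_info, and decode_newline are hypothetical stand-ins invented for illustration; they are not part of PCRE, and the enum values are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the NLTYPE_* constants (values are assumptions). */
enum nl_kind { NL_FIXED, NL_ANY, NL_ANYCRLF };

/* Hypothetical, cut-down version of the newline fields in the match block. */
struct nl_info {
  enum nl_kind kind;
  int len;                /* 1 or 2 when kind == NL_FIXED, otherwise 0 */
  unsigned char seq[2];   /* the fixed newline sequence, if any */
};

/* Decode a packed newline value:
     -2      => any of CR, LF, or CRLF
     -1      => any newline sequence
     0..255  => a single fixed character
     > 255   => two fixed characters, the first in the high byte */
void decode_newline(int packed, struct nl_info *out)
{
  assert(out != NULL);
  out->len = 0;
  out->seq[0] = out->seq[1] = 0;

  if (packed == -2)
    out->kind = NL_ANYCRLF;
  else if (packed < 0)
    out->kind = NL_ANY;
  else
    {
    out->kind = NL_FIXED;
    if (packed > 255)
      {
      out->len = 2;
      out->seq[0] = (unsigned char)((packed >> 8) & 255);
      out->seq[1] = (unsigned char)(packed & 255);
      }
    else
      {
      out->len = 1;
      out->seq[0] = (unsigned char)packed;
      }
    }
}
```

Calling decode_newline(('\r' << 8) | '\n', &info) would fill info with a fixed two-character sequence, which mirrors the PCRE_NEWLINE_CR+PCRE_NEWLINE_LF case of the switch.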

/* End of pcre_dfa_exec.c */
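
The required-character pre-scan in the bumpalong loop remembers where the character was last found, so that later start positions do not rescan the same text. The sketch below isolates that caching idea under simplifying assumptions: 8-bit code units only, a NULL pointer (rather than "one before the start") as the "not yet found" marker, and no REQ_BYTE_MAX cutoff. The helper name required_char_present is hypothetical and does not exist in PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Report whether c1 or c2 occurs in [p, end). The position of the most recent
   hit is cached in *last_hit; while the start position p has not moved past
   that hit, the cached answer is reused and no scan is done. Pass c1 == c2
   for a caseful search. */
int required_char_present(const unsigned char *p, const unsigned char *end,
                          unsigned char c1, unsigned char c2,
                          const unsigned char **last_hit)
{
  assert(last_hit != NULL);

  /* The earlier hit is still at or ahead of the start point: nothing to do. */
  if (*last_hit != NULL && p <= *last_hit) return 1;

  /* Otherwise scan forward from p for either form of the character. */
  while (p < end && *p != c1 && *p != c2) p++;

  if (p >= end) return 0;   /* not found: no later start can match either */

  *last_hit = p;            /* remember the hit for subsequent start points */
  return 1;
}
```

A caller trying successive start positions would stop as soon as this returns 0, which corresponds to breaking out of the matching loop and returning PCRE_ERROR_NOMATCH in the function above.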
