Contents of /code/trunk/pcre_dfa_exec.c

Revision 894
Fri Jan 20 11:58:19 2012 UTC, by ph10
File MIME type: text/plain
File size: 121668 byte(s)
Avoid compiler warning when UTF not configured.
1 /*************************************************
2 * Perl-Compatible Regular Expressions *
3 *************************************************/
4
5 /* PCRE is a library of functions to support regular expressions whose syntax
6 and semantics are as close as possible to those of the Perl 5 language (but see
7 below for why this module is different).
8
9 Written by Philip Hazel
10 Copyright (c) 1997-2012 University of Cambridge
11
12 -----------------------------------------------------------------------------
13 Redistribution and use in source and binary forms, with or without
14 modification, are permitted provided that the following conditions are met:
15
16 * Redistributions of source code must retain the above copyright notice,
17 this list of conditions and the following disclaimer.
18
19 * Redistributions in binary form must reproduce the above copyright
20 notice, this list of conditions and the following disclaimer in the
21 documentation and/or other materials provided with the distribution.
22
23 * Neither the name of the University of Cambridge nor the names of its
24 contributors may be used to endorse or promote products derived from
25 this software without specific prior written permission.
26
27 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
28 AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
29 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
30 ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
31 LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
32 CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
33 SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
34 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
35 CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
36 ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
37 POSSIBILITY OF SUCH DAMAGE.
38 -----------------------------------------------------------------------------
39 */
40
41
42 /* This module contains the external function pcre_dfa_exec(), which is an
43 alternative matching function that uses a sort of DFA algorithm (not a true
44 FSM). This is NOT Perl-compatible, but it has advantages in certain
45 applications. */
46
47
48 /* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
49 the performance of his patterns greatly. I could not use it as it stood, as it
50 was not thread safe, and made assumptions about pattern sizes. Also, it caused
51 test 7 to loop, and test 9 to crash with a segfault.
52
53 The issue is the check for duplicate states, which is done by a simple linear
54 search up the state list. (Grep for "duplicate" below to find the code.) For
55 many patterns, there will never be many states active at one time, so a simple
56 linear search is fine. In patterns that have many active states, it might be a
57 bottleneck. The suggested code used an indexing scheme to remember which states
58 had previously been used for each character, and avoided the linear search when
59 it knew there was no chance of a duplicate. This was implemented when adding
60 states to the state lists.
61
62 I wrote some thread-safe, not-limited code to try something similar at the time
63 of checking for duplicates (instead of when adding states), using index vectors
64 on the stack. It did give a 13% improvement with one specially constructed
65 pattern for certain subject strings, but on other strings and on many of the
66 simpler patterns in the test suite it did worse. The major problem, I think,
67 was the extra time to initialize the index. This had to be done for each call
68 of internal_dfa_exec(). (The supplied patch used a static vector, initialized
69 only once - I suspect this was the cause of the problems with the tests.)
70
71 Overall, I concluded that the gains in some cases did not outweigh the losses
72 in others, so I abandoned this code. */
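
/* ILLUSTRATIVE SKETCH, not part of PCRE. The note above describes the
duplicate check as a simple linear search up the state list. The stand-alone
fragment below is a minimal sketch of that search, so that its cost is easy to
see: checking entry i takes up to i comparisons, so a list of n active states
may need on the order of n*n comparisons per subject character. The names
dup_state and dup_is_duplicate are hypothetical; the real check is written
inline in internal_dfa_exec() and operates on stateblock. */

```c
/* Hypothetical stand-in for the module's stateblock. Only offset and count
   take part in the duplicate test; data is ignored, as in the real check. */
typedef struct dup_state {
  int offset;   /* Offset to opcode */
  int count;    /* Count for repeats */
  int data;     /* Extra data, not compared */
} dup_state;

/* Return 1 if states[i] has the same offset and count as an earlier entry,
   otherwise 0. */
static int
dup_is_duplicate(const dup_state *states, int i)
{
  int j;
  for (j = 0; j < i; j++)
    {
    if (states[j].offset == states[i].offset &&
        states[j].count == states[i].count)
      return 1;
    }
  return 0;
}
```

/* With few active states the loop above is cheap, which is consistent with
the conclusion in the note that the linear search is adequate for many
patterns. */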
73
74
75
76 #ifdef HAVE_CONFIG_H
77 #include "config.h"
78 #endif
79
80 #define NLBLOCK md /* Block containing newline information */
81 #define PSSTART start_subject /* Field containing processed string start */
82 #define PSEND end_subject /* Field containing processed string end */
83
84 #include "pcre_internal.h"
85
86
87 /* For use to indent debugging output */
88
89 #define SP " "
90
91
92 /*************************************************
93 * Code parameters and static tables *
94 *************************************************/
95
96 /* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
97 into others, under special conditions. A gap of 20 between the blocks should be
98 enough. The resulting opcodes don't have to be less than 256 because they are
99 never stored, so we push them well clear of the normal opcodes. */
100
101 #define OP_PROP_EXTRA 300
102 #define OP_EXTUNI_EXTRA 320
103 #define OP_ANYNL_EXTRA 340
104 #define OP_HSPACE_EXTRA 360
105 #define OP_VSPACE_EXTRA 380
106
107
108 /* This table identifies those opcodes that are followed immediately by a
109 character that is to be tested in some way. This makes it possible to
110 centralize the loading of these characters. In the case of Type * etc, the
111 "character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
112 small value. Non-zero values in the table are the offsets from the opcode where
113 the character is to be found. ***NOTE*** If the start of this table is
114 modified, the three tables that follow must also be modified. */
115
116 static const pcre_uint8 coptable[] = {
117 0, /* End */
118 0, 0, 0, 0, 0, /* \A, \G, \K, \B, \b */
119 0, 0, 0, 0, 0, 0, /* \D, \d, \S, \s, \W, \w */
120 0, 0, 0, /* Any, AllAny, Anybyte */
121 0, 0, /* \P, \p */
122 0, 0, 0, 0, 0, /* \R, \H, \h, \V, \v */
123 0, /* \X */
124 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
125 1, /* Char */
126 1, /* Chari */
127 1, /* not */
128 1, /* noti */
129 /* Positive single-char repeats */
130 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
131 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto, minupto */
132 1+IMM2_SIZE, /* exact */
133 1, 1, 1, 1+IMM2_SIZE, /* *+, ++, ?+, upto+ */
134 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
135 1+IMM2_SIZE, 1+IMM2_SIZE, /* upto I, minupto I */
136 1+IMM2_SIZE, /* exact I */
137 1, 1, 1, 1+IMM2_SIZE, /* *+I, ++I, ?+I, upto+I */
138 /* Negative single-char repeats - only for chars < 256 */
139 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
140 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto, minupto */
141 1+IMM2_SIZE, /* NOT exact */
142 1, 1, 1, 1+IMM2_SIZE, /* NOT *+, ++, ?+, upto+ */
143 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
144 1+IMM2_SIZE, 1+IMM2_SIZE, /* NOT upto I, minupto I */
145 1+IMM2_SIZE, /* NOT exact I */
146 1, 1, 1, 1+IMM2_SIZE, /* NOT *+I, ++I, ?+I, upto+I */
147 /* Positive type repeats */
148 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
149 1+IMM2_SIZE, 1+IMM2_SIZE, /* Type upto, minupto */
150 1+IMM2_SIZE, /* Type exact */
151 1, 1, 1, 1+IMM2_SIZE, /* Type *+, ++, ?+, upto+ */
152 /* Character class & ref repeats */
153 0, 0, 0, 0, 0, 0, /* *, *?, +, +?, ?, ?? */
154 0, 0, /* CRRANGE, CRMINRANGE */
155 0, /* CLASS */
156 0, /* NCLASS */
157 0, /* XCLASS - variable length */
158 0, /* REF */
159 0, /* REFI */
160 0, /* RECURSE */
161 0, /* CALLOUT */
162 0, /* Alt */
163 0, /* Ket */
164 0, /* KetRmax */
165 0, /* KetRmin */
166 0, /* KetRpos */
167 0, /* Reverse */
168 0, /* Assert */
169 0, /* Assert not */
170 0, /* Assert behind */
171 0, /* Assert behind not */
172 0, 0, /* ONCE, ONCE_NC */
173 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
174 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
175 0, 0, /* CREF, NCREF */
176 0, 0, /* RREF, NRREF */
177 0, /* DEF */
178 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
179 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
180 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
181 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
182 0, 0 /* CLOSE, SKIPZERO */
183 };
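
/* ILLUSTRATIVE SKETCH, not part of PCRE. The table above gives, for each
opcode, the offset at which its inline character is found, or zero if there is
none. The fragment below shows the same lookup on a tiny made-up opcode set.
All DEMO_ names, the opcode numbering, and the two-byte count field are
assumptions chosen for the example; they are not the values used by PCRE. */

```c
#define DEMO_IMM2_SIZE 2   /* Assumed size of an inline repeat count */

/* Hypothetical opcodes, numbered so that they can index the table. */
enum {
  DEMO_OP_END,     /* No inline character */
  DEMO_OP_CHAR,    /* Character follows the opcode directly */
  DEMO_OP_STAR,    /* Character follows the opcode directly */
  DEMO_OP_UPTO,    /* Count comes first, then the character */
  DEMO_OP_CLASS    /* No inline character */
};

/* Offset from the opcode to its inline character; 0 means "none". */
static const unsigned char demo_coptable[] = {
  0,                  /* End */
  1,                  /* Char */
  1,                  /* Star */
  1+DEMO_IMM2_SIZE,   /* Upto */
  0                   /* Class */
};

/* Return the inline character of the item at code, or -1 if it has none. */
static int
demo_inline_char(const unsigned char *code)
{
  int off = demo_coptable[code[0]];
  return (off > 0)? code[off] : -1;
}
```

/* Keeping the offsets in one table lets the main loop load the character in
a single place instead of repeating the load in every opcode case. */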
184
185 /* This table identifies those opcodes that inspect a character. It is used to
186 remember the fact that a character could have been inspected when the end of
187 the subject is reached. ***NOTE*** If the start of this table is modified, the
188 two tables that follow must also be modified. */
189
190 static const pcre_uint8 poptable[] = {
191 0, /* End */
192 0, 0, 0, 1, 1, /* \A, \G, \K, \B, \b */
193 1, 1, 1, 1, 1, 1, /* \D, \d, \S, \s, \W, \w */
194 1, 1, 1, /* Any, AllAny, Anybyte */
195 1, 1, /* \P, \p */
196 1, 1, 1, 1, 1, /* \R, \H, \h, \V, \v */
197 1, /* \X */
198 0, 0, 0, 0, 0, 0, /* \Z, \z, ^, ^M, $, $M */
199 1, /* Char */
200 1, /* Chari */
201 1, /* not */
202 1, /* noti */
203 /* Positive single-char repeats */
204 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
205 1, 1, 1, /* upto, minupto, exact */
206 1, 1, 1, 1, /* *+, ++, ?+, upto+ */
207 1, 1, 1, 1, 1, 1, /* *I, *?I, +I, +?I, ?I, ??I */
208 1, 1, 1, /* upto I, minupto I, exact I */
209 1, 1, 1, 1, /* *+I, ++I, ?+I, upto+I */
210 /* Negative single-char repeats - only for chars < 256 */
211 1, 1, 1, 1, 1, 1, /* NOT *, *?, +, +?, ?, ?? */
212 1, 1, 1, /* NOT upto, minupto, exact */
213 1, 1, 1, 1, /* NOT *+, ++, ?+, upto+ */
214 1, 1, 1, 1, 1, 1, /* NOT *I, *?I, +I, +?I, ?I, ??I */
215 1, 1, 1, /* NOT upto I, minupto I, exact I */
216 1, 1, 1, 1, /* NOT *+I, ++I, ?+I, upto+I */
217 /* Positive type repeats */
218 1, 1, 1, 1, 1, 1, /* Type *, *?, +, +?, ?, ?? */
219 1, 1, 1, /* Type upto, minupto, exact */
220 1, 1, 1, 1, /* Type *+, ++, ?+, upto+ */
221 /* Character class & ref repeats */
222 1, 1, 1, 1, 1, 1, /* *, *?, +, +?, ?, ?? */
223 1, 1, /* CRRANGE, CRMINRANGE */
224 1, /* CLASS */
225 1, /* NCLASS */
226 1, /* XCLASS - variable length */
227 0, /* REF */
228 0, /* REFI */
229 0, /* RECURSE */
230 0, /* CALLOUT */
231 0, /* Alt */
232 0, /* Ket */
233 0, /* KetRmax */
234 0, /* KetRmin */
235 0, /* KetRpos */
236 0, /* Reverse */
237 0, /* Assert */
238 0, /* Assert not */
239 0, /* Assert behind */
240 0, /* Assert behind not */
241 0, 0, /* ONCE, ONCE_NC */
242 0, 0, 0, 0, 0, /* BRA, BRAPOS, CBRA, CBRAPOS, COND */
243 0, 0, 0, 0, 0, /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
244 0, 0, /* CREF, NCREF */
245 0, 0, /* RREF, NRREF */
246 0, /* DEF */
247 0, 0, 0, /* BRAZERO, BRAMINZERO, BRAPOSZERO */
248 0, 0, 0, /* MARK, PRUNE, PRUNE_ARG */
249 0, 0, 0, 0, /* SKIP, SKIP_ARG, THEN, THEN_ARG */
250 0, 0, 0, 0, /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
251 0, 0 /* CLOSE, SKIPZERO */
252 };
253
254 /* These two tables allow compact code for testing \D, \d, \S, \s, \W,
255 and \w. */
256
257 static const pcre_uint8 toptable1[] = {
258 0, 0, 0, 0, 0, 0,
259 ctype_digit, ctype_digit,
260 ctype_space, ctype_space,
261 ctype_word, ctype_word,
262 0, 0 /* OP_ANY, OP_ALLANY */
263 };
264
265 static const pcre_uint8 toptable2[] = {
266 0, 0, 0, 0, 0, 0,
267 ctype_digit, 0,
268 ctype_space, 0,
269 ctype_word, 0,
270 1, 1 /* OP_ANY, OP_ALLANY */
271 };
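
/* ILLUSTRATIVE SKETCH, not part of PCRE. The two tables above are combined
as ((ctypes[c] & toptable1[op]) ^ toptable2[op]) != 0. The first table selects
the class bit to look at; the second flips the result for the negated types
and forces a non-zero result for the "any" entries. The fragment below
reproduces this on a renumbered opcode set. The DEMO_ names, the bit values,
and the use of <ctype.h> to build the class bits are assumptions made for the
example. */

```c
#include <ctype.h>

/* Hypothetical opcode numbering, starting at zero for the example. */
enum {
  DEMO_NOT_DIGIT, DEMO_DIGIT,
  DEMO_NOT_SPACE, DEMO_SPACE,
  DEMO_NOT_WORD,  DEMO_WORD,
  DEMO_ANY
};

/* Assumed class bits for the example. */
#define DEMO_CT_SPACE 0x01
#define DEMO_CT_DIGIT 0x04
#define DEMO_CT_WORD  0x10

static const unsigned char demo_top1[] = {
  DEMO_CT_DIGIT, DEMO_CT_DIGIT,
  DEMO_CT_SPACE, DEMO_CT_SPACE,
  DEMO_CT_WORD,  DEMO_CT_WORD,
  0 };

static const unsigned char demo_top2[] = {
  DEMO_CT_DIGIT, 0,
  DEMO_CT_SPACE, 0,
  DEMO_CT_WORD,  0,
  1 };

/* Build the class bits for one character (stand-in for a ctypes[] lookup). */
static int
demo_ctype(unsigned char c)
{
  int bits = 0;
  if (isspace(c)) bits |= DEMO_CT_SPACE;
  if (isdigit(c)) bits |= DEMO_CT_DIGIT;
  if (isalnum(c) || c == '_') bits |= DEMO_CT_WORD;
  return bits;
}

/* One expression handles the positive, negated, and "any" cases. */
static int
demo_type_matches(int op, unsigned char c)
{
  return ((demo_ctype(c) & demo_top1[op]) ^ demo_top2[op]) != 0;
}
```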
272
273
274 /* Structure for holding data about a particular state, which is in effect the
275 current data for an active path through the match tree. It must consist
276 entirely of ints because the working vector we are passed, and which we put
277 these structures in, is a vector of ints. */
278
279 typedef struct stateblock {
280 int offset; /* Offset to opcode */
281 int count; /* Count for repeats */
282 int data; /* Some use extra data */
283 } stateblock;
284
285 #define INTS_PER_STATEBLOCK (sizeof(stateblock)/sizeof(int))
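
/* ILLUSTRATIVE SKETCH, not part of PCRE. internal_dfa_exec() keeps the first
two ints of the workspace for restart information and divides the remainder
into two state lists of equal size. The fragment below repeats that
arithmetic on its own, so the number of states available per list can be
worked out for a given workspace size. The ws_ names are hypothetical, and the
sketch assumes wscount is at least 2 and that the structure has no padding. */

```c
/* Hypothetical copy of the three-int state record. */
typedef struct ws_stateblock {
  int offset;
  int count;
  int data;
} ws_stateblock;

#define WS_INTS_PER_STATEBLOCK ((int)(sizeof(ws_stateblock)/sizeof(int)))

/* Return how many states fit in each of the two lists when the caller
   supplies a workspace of wscount ints. */
static int
ws_states_per_list(int wscount)
{
  const int ints_per_pair = 2 * WS_INTS_PER_STATEBLOCK;
  wscount -= 2;                       /* Two ints reserved at the start */
  return (wscount - (wscount % ints_per_pair)) / ints_per_pair;
}
```

/* For example, with three ints per state a 20-int workspace leaves 18 ints,
which is room for 3 states in each list. */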
286
287
288 #ifdef PCRE_DEBUG
289 /*************************************************
290 * Print character string *
291 *************************************************/
292
293 /* Character string printing function for debugging.
294
295 Arguments:
296   p       points to string
297   length  number of bytes
298   f       where to print
299
300 Returns:  nothing
301 */
302
303 static void
304 pchars(const pcre_uchar *p, int length, FILE *f)
305 {
306 int c;
307 while (length-- > 0)
308 {
309 if (isprint(c = *(p++)))
310 fprintf(f, "%c", c);
311 else
312 fprintf(f, "\\x%02x", c);
313 }
314 }
315 #endif
316
317
318
319 /*************************************************
320 * Execute a Regular Expression - DFA engine *
321 *************************************************/
322
323 /* This internal function applies a compiled pattern to a subject string,
324 starting at a given point, using a DFA engine. This function is called from the
325 external one, possibly multiple times if the pattern is not anchored. The
326 function calls itself recursively for some kinds of subpattern.
327
328 Arguments:
329   md               the match_data block with fixed information
330   this_start_code  the opening bracket of this subexpression's code
331   current_subject  where we currently are in the subject string
332   start_offset     start offset in the subject string
333   offsets          vector to contain the matching string offsets
334   offsetcount      size of same
335   workspace        vector of workspace
336   wscount          size of same
337   rlevel           function call recursion level
338
339 Returns:  > 0 => number of match offset pairs placed in offsets
340           = 0 => offsets overflowed; longest matches are present
341            -1 => failed to match
342          < -1 => some kind of unexpected problem
343
344 The following macros are used for adding states to the two state vectors (one
345 for the current character, one for the following character). */
346
347 #define ADD_ACTIVE(x,y) \
348 if (active_count++ < wscount) \
349 { \
350 next_active_state->offset = (x); \
351 next_active_state->count = (y); \
352 next_active_state++; \
353 DPRINTF(("%.*sADD_ACTIVE(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
354 } \
355 else return PCRE_ERROR_DFA_WSSIZE
356
357 #define ADD_ACTIVE_DATA(x,y,z) \
358 if (active_count++ < wscount) \
359 { \
360 next_active_state->offset = (x); \
361 next_active_state->count = (y); \
362 next_active_state->data = (z); \
363 next_active_state++; \
364 DPRINTF(("%.*sADD_ACTIVE_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
365 } \
366 else return PCRE_ERROR_DFA_WSSIZE
367
368 #define ADD_NEW(x,y) \
369 if (new_count++ < wscount) \
370 { \
371 next_new_state->offset = (x); \
372 next_new_state->count = (y); \
373 next_new_state++; \
374 DPRINTF(("%.*sADD_NEW(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
375 } \
376 else return PCRE_ERROR_DFA_WSSIZE
377
378 #define ADD_NEW_DATA(x,y,z) \
379 if (new_count++ < wscount) \
380 { \
381 next_new_state->offset = (x); \
382 next_new_state->count = (y); \
383 next_new_state->data = (z); \
384 next_new_state++; \
385 DPRINTF(("%.*sADD_NEW_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
386 } \
387 else return PCRE_ERROR_DFA_WSSIZE
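
/* ILLUSTRATIVE SKETCH, not part of PCRE. Each of the four macros above is a
bounded append: if the list still has room, the state is stored and the "next"
pointer advances; otherwise the enclosing function returns an error. The
fragment below expresses the same logic as a function, which may be easier to
read than the macro form. The add_ names and the value of ADD_DEMO_ERROR are
hypothetical. Note that, as in the macros, the counter is incremented even
when the append fails. */

```c
#define ADD_DEMO_ERROR (-1)   /* Hypothetical stand-in for the error code */

typedef struct add_state {
  int offset;
  int count;
  int data;
} add_state;

/* Hypothetical bundle of the three variables a macro call touches. */
typedef struct add_list {
  add_state *next;    /* Where the next state will be written */
  int count;          /* Number of appends attempted so far */
  int capacity;       /* Maximum number of states in the list */
} add_list;

/* Append (offset, count) if there is room; return 0 on success or
   ADD_DEMO_ERROR when the list is full. */
static int
add_demo_state(add_list *list, int offset, int count)
{
  if (list->count++ < list->capacity)
    {
    list->next->offset = offset;
    list->next->count = count;
    list->next++;
    return 0;
    }
  return ADD_DEMO_ERROR;
}
```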
388
389 /* And now, here is the code */
390
391 static int
392 internal_dfa_exec(
393 dfa_match_data *md,
394 const pcre_uchar *this_start_code,
395 const pcre_uchar *current_subject,
396 int start_offset,
397 int *offsets,
398 int offsetcount,
399 int *workspace,
400 int wscount,
401 int rlevel)
402 {
403 stateblock *active_states, *new_states, *temp_states;
404 stateblock *next_active_state, *next_new_state;
405
406 const pcre_uint8 *ctypes, *lcc, *fcc;
407 const pcre_uchar *ptr;
408 const pcre_uchar *end_code, *first_op;
409
410 dfa_recursion_info new_recursive;
411
412 int active_count, new_count, match_count;
413
414 /* Some fields in the md block are frequently referenced, so we load them into
415 independent variables in the hope that this will perform better. */
416
417 const pcre_uchar *start_subject = md->start_subject;
418 const pcre_uchar *end_subject = md->end_subject;
419 const pcre_uchar *start_code = md->start_code;
420
421 #ifdef SUPPORT_UTF
422 BOOL utf = (md->poptions & PCRE_UTF8) != 0;
423 #else
424 BOOL utf = FALSE;
425 #endif
426
427 rlevel++;
428 offsetcount &= (-2);
429
430 wscount -= 2;
431 wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
432 (2 * INTS_PER_STATEBLOCK);
433
434 DPRINTF(("\n%.*s---------------------\n"
435 "%.*sCall to internal_dfa_exec f=%d\n",
436 rlevel*2-2, SP, rlevel*2-2, SP, rlevel));
437
438 ctypes = md->tables + ctypes_offset;
439 lcc = md->tables + lcc_offset;
440 fcc = md->tables + fcc_offset;
441
442 match_count = PCRE_ERROR_NOMATCH; /* A negative number */
443
444 active_states = (stateblock *)(workspace + 2);
445 next_new_state = new_states = active_states + wscount;
446 new_count = 0;
447
448 first_op = this_start_code + 1 + LINK_SIZE +
449 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
450 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
451 ? IMM2_SIZE:0);
452
453 /* The first thing in any (sub) pattern is a bracket of some sort. Push all
454 the alternative states onto the list, and find out where the end is. This
455 makes it possible to use this function recursively, when we want to stop at a
456 matching internal ket rather than at the end.
457
458 If the first opcode in the first alternative is OP_REVERSE, we are dealing with
459 a backward assertion. In that case, we have to find out the maximum amount to
460 move back, and set up each alternative appropriately. */
461
462 if (*first_op == OP_REVERSE)
463 {
464 int max_back = 0;
465 int gone_back;
466
467 end_code = this_start_code;
468 do
469 {
470 int back = GET(end_code, 2+LINK_SIZE);
471 if (back > max_back) max_back = back;
472 end_code += GET(end_code, 1);
473 }
474 while (*end_code == OP_ALT);
475
476 /* If we can't go back the amount required for the longest lookbehind
477 pattern, go back as far as we can; some alternatives may still be viable. */
478
479 #ifdef SUPPORT_UTF
480 /* In character mode we have to step back character by character */
481
482 if (utf)
483 {
484 for (gone_back = 0; gone_back < max_back; gone_back++)
485 {
486 if (current_subject <= start_subject) break;
487 current_subject--;
488 ACROSSCHAR(current_subject > start_subject, *current_subject, current_subject--);
489 }
490 }
491 else
492 #endif
493
494 /* In byte-mode we can do this quickly. */
495
496 {
497 gone_back = (current_subject - max_back < start_subject)?
498 (int)(current_subject - start_subject) : max_back;
499 current_subject -= gone_back;
500 }
501
502 /* Save the earliest consulted character */
503
504 if (current_subject < md->start_used_ptr)
505 md->start_used_ptr = current_subject;
506
507 /* Now we can process the individual branches. */
508
509 end_code = this_start_code;
510 do
511 {
512 int back = GET(end_code, 2+LINK_SIZE);
513 if (back <= gone_back)
514 {
515 int bstate = (int)(end_code - start_code + 2 + 2*LINK_SIZE);
516 ADD_NEW_DATA(-bstate, 0, gone_back - back);
517 }
518 end_code += GET(end_code, 1);
519 }
520 while (*end_code == OP_ALT);
521 }
522
523 /* This is the code for a "normal" subpattern (not a backward assertion). The
524 start of a whole pattern is always one of these. If we are at the top level,
525 we may be asked to restart matching from the same point that we reached for a
526 previous partial match. We still have to scan through the top-level branches to
527 find the end state. */
528
529 else
530 {
531 end_code = this_start_code;
532
533 /* Restarting */
534
535 if (rlevel == 1 && (md->moptions & PCRE_DFA_RESTART) != 0)
536 {
537 do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
538 new_count = workspace[1];
539 if (!workspace[0])
540 memcpy(new_states, active_states, new_count * sizeof(stateblock));
541 }
542
543 /* Not restarting */
544
545 else
546 {
547 int length = 1 + LINK_SIZE +
548 ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
549 *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)
550 ? IMM2_SIZE:0);
551 do
552 {
553 ADD_NEW((int)(end_code - start_code + length), 0);
554 end_code += GET(end_code, 1);
555 length = 1 + LINK_SIZE;
556 }
557 while (*end_code == OP_ALT);
558 }
559 }
560
561 workspace[0] = 0; /* Bit indicating which vector is current */
562
563 DPRINTF(("%.*sEnd state = %d\n", rlevel*2-2, SP, (int)(end_code - start_code)));
564
565 /* Loop for scanning the subject */
566
567 ptr = current_subject;
568 for (;;)
569 {
570 int i, j;
571 int clen, dlen;
572 unsigned int c, d;
573 int forced_fail = 0;
574 BOOL could_continue = FALSE;
575
576 /* Make the new state list into the active state list and empty the
577 new state list. */
578
579 temp_states = active_states;
580 active_states = new_states;
581 new_states = temp_states;
582 active_count = new_count;
583 new_count = 0;
584
585 workspace[0] ^= 1; /* Remember for the restarting feature */
586 workspace[1] = active_count;
587
588 #ifdef PCRE_DEBUG
589 printf("%.*sNext character: rest of subject = \"", rlevel*2-2, SP);
590 pchars(ptr, STRLEN_UC(ptr), stdout);
591 printf("\"\n");
592
593 printf("%.*sActive states: ", rlevel*2-2, SP);
594 for (i = 0; i < active_count; i++)
595 printf("%d/%d ", active_states[i].offset, active_states[i].count);
596 printf("\n");
597 #endif
598
599 /* Set the pointers for adding new states */
600
601 next_active_state = active_states + active_count;
602 next_new_state = new_states;
603
604 /* Load the current character from the subject outside the loop, as many
605 different states may want to look at it, and we assume that at least one
606 will. */
607
608 if (ptr < end_subject)
609 {
610 clen = 1; /* Number of bytes in the character */
611 #ifdef SUPPORT_UTF
612 if (utf) { GETCHARLEN(c, ptr, clen); } else
613 #endif /* SUPPORT_UTF */
614 c = *ptr;
615 }
616 else
617 {
618 clen = 0; /* This indicates the end of the subject */
619 c = NOTACHAR; /* This value should never actually be used */
620 }
621
622 /* Scan up the active states and act on each one. The result of an action
623 may be to add more states to the currently active list (e.g. on hitting a
624 parenthesis) or it may be to put states on the new list, for considering
625 when we move the character pointer on. */
626
627 for (i = 0; i < active_count; i++)
628 {
629 stateblock *current_state = active_states + i;
630 BOOL caseless = FALSE;
631 const pcre_uchar *code;
632 int state_offset = current_state->offset;
633 int count, codevalue, rrc;
634
635 #ifdef PCRE_DEBUG
636 printf ("%.*sProcessing state %d c=", rlevel*2-2, SP, state_offset);
637 if (clen == 0) printf("EOL\n");
638 else if (c > 32 && c < 127) printf("'%c'\n", c);
639 else printf("0x%02x\n", c);
640 #endif
641
642 /* A negative offset is a special case meaning "hold off going to this
643 (negated) state until the number of characters in the data field have
644 been skipped". */
645
646 if (state_offset < 0)
647 {
648 if (current_state->data > 0)
649 {
650 DPRINTF(("%.*sSkipping this character\n", rlevel*2-2, SP));
651 ADD_NEW_DATA(state_offset, current_state->count,
652 current_state->data - 1);
653 continue;
654 }
655 else
656 {
657 current_state->offset = state_offset = -state_offset;
658 }
659 }
660
661 /* Check for a duplicate state with the same count, and skip if found.
662 See the note at the head of this module about the possibility of improving
663 performance here. */
664
665 for (j = 0; j < i; j++)
666 {
667 if (active_states[j].offset == state_offset &&
668 active_states[j].count == current_state->count)
669 {
670 DPRINTF(("%.*sDuplicate state: skipped\n", rlevel*2-2, SP));
671 goto NEXT_ACTIVE_STATE;
672 }
673 }
674
675 /* The state offset is the offset to the opcode */
676
677 code = start_code + state_offset;
678 codevalue = *code;
679
680 /* If this opcode inspects a character, but we are at the end of the
681 subject, remember the fact for use when testing for a partial match. */
682
683 if (clen == 0 && poptable[codevalue] != 0)
684 could_continue = TRUE;
685
686 /* If this opcode is followed by an inline character, load it. It is
687 tempting to test for the presence of a subject character here, but that
688 is wrong, because sometimes zero repetitions of the subject are
689 permitted.
690
691 We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
692 argument that is not a data character - but is always one byte long. We
693 have to take special action to deal with \P, \p, \H, \h, \V, \v and \X in
694 this case. To keep the other cases fast, convert these ones to new opcodes.
695 */
696
697 if (coptable[codevalue] > 0)
698 {
699 dlen = 1;
700 #ifdef SUPPORT_UTF
701 if (utf) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
702 #endif /* SUPPORT_UTF */
703 d = code[coptable[codevalue]];
704 if (codevalue >= OP_TYPESTAR)
705 {
706 switch(d)
707 {
708 case OP_ANYBYTE: return PCRE_ERROR_DFA_UITEM;
709 case OP_NOTPROP:
710 case OP_PROP: codevalue += OP_PROP_EXTRA; break;
711 case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
712 case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
713 case OP_NOT_HSPACE:
714 case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
715 case OP_NOT_VSPACE:
716 case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
717 default: break;
718 }
719 }
720 }
721 else
722 {
723 dlen = 0; /* Not strictly necessary, but compilers moan */
724 d = NOTACHAR; /* if these variables are not set. */
725 }
726
727
728 /* Now process the individual opcodes */
729
730 switch (codevalue)
731 {
732 /* ========================================================================== */
733 /* These cases are never obeyed. This is a fudge that causes a compile-
734 time error if the vectors coptable or poptable, which are indexed by
735 opcode, are not the correct length. It seems to be the only way to do
736 such a check at compile time, as the sizeof() operator does not work
737 in the C preprocessor. */
738
739 case OP_TABLE_LENGTH:
740 case OP_TABLE_LENGTH +
741 ((sizeof(coptable) == OP_TABLE_LENGTH) &&
742 (sizeof(poptable) == OP_TABLE_LENGTH)):
743 break;
744
745 /* ========================================================================== */
746 /* Reached a closing bracket. If not at the end of the pattern, carry
747 on with the next opcode. For repeating opcodes, also add the repeat
748 state. Note that KETRPOS will always be encountered at the end of the
749 subpattern, because the possessive subpattern repeats are always handled
750 using recursive calls. Thus, it never adds any new states.
751
752 At the end of the (sub)pattern, unless we have an empty string and
753 PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
754 start of the subject, save the match data, shifting up all previous
755 matches so we always have the longest first. */
756
757 case OP_KET:
758 case OP_KETRMIN:
759 case OP_KETRMAX:
760 case OP_KETRPOS:
761 if (code != end_code)
762 {
763 ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
764 if (codevalue != OP_KET)
765 {
766 ADD_ACTIVE(state_offset - GET(code, 1), 0);
767 }
768 }
769 else
770 {
771 if (ptr > current_subject ||
772 ((md->moptions & PCRE_NOTEMPTY) == 0 &&
773 ((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
774 current_subject > start_subject + md->start_offset)))
775 {
776 if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
777 else if (match_count > 0 && ++match_count * 2 > offsetcount)
778 match_count = 0;
779 count = ((match_count == 0)? offsetcount : match_count * 2) - 2;
780 if (count > 0) memmove(offsets + 2, offsets, count * sizeof(int));
781 if (offsetcount >= 2)
782 {
783 offsets[0] = (int)(current_subject - start_subject);
784 offsets[1] = (int)(ptr - start_subject);
785 DPRINTF(("%.*sSet matched string = \"%.*s\"\n", rlevel*2-2, SP,
786 offsets[1] - offsets[0], current_subject));
787 }
788 if ((md->moptions & PCRE_DFA_SHORTEST) != 0)
789 {
790 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
791 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel,
792 match_count, rlevel*2-2, SP));
793 return match_count;
794 }
795 }
796 }
797 break;
798
799 /* ========================================================================== */
800 /* These opcodes add to the current list of states without looking
801 at the current character. */
802
803 /*-----------------------------------------------------------------*/
804 case OP_ALT:
805 do { code += GET(code, 1); } while (*code == OP_ALT);
806 ADD_ACTIVE((int)(code - start_code), 0);
807 break;
808
809 /*-----------------------------------------------------------------*/
810 case OP_BRA:
811 case OP_SBRA:
812 do
813 {
814 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
815 code += GET(code, 1);
816 }
817 while (*code == OP_ALT);
818 break;
819
820 /*-----------------------------------------------------------------*/
821 case OP_CBRA:
822 case OP_SCBRA:
823 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE + IMM2_SIZE), 0);
824 code += GET(code, 1);
825 while (*code == OP_ALT)
826 {
827 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
828 code += GET(code, 1);
829 }
830 break;
831
832 /*-----------------------------------------------------------------*/
833 case OP_BRAZERO:
834 case OP_BRAMINZERO:
835 ADD_ACTIVE(state_offset + 1, 0);
836 code += 1 + GET(code, 2);
837 while (*code == OP_ALT) code += GET(code, 1);
838 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
839 break;
840
841 /*-----------------------------------------------------------------*/
842 case OP_SKIPZERO:
843 code += 1 + GET(code, 2);
844 while (*code == OP_ALT) code += GET(code, 1);
845 ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
846 break;
847
848 /*-----------------------------------------------------------------*/
849 case OP_CIRC:
850 if (ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0)
851 { ADD_ACTIVE(state_offset + 1, 0); }
852 break;
853
854 /*-----------------------------------------------------------------*/
855 case OP_CIRCM:
856 if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
857 (ptr != end_subject && WAS_NEWLINE(ptr)))
858 { ADD_ACTIVE(state_offset + 1, 0); }
859 break;
860
861 /*-----------------------------------------------------------------*/
862 case OP_EOD:
863 if (ptr >= end_subject)
864 {
865 if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
866 could_continue = TRUE;
867 else { ADD_ACTIVE(state_offset + 1, 0); }
868 }
869 break;
870
871 /*-----------------------------------------------------------------*/
872 case OP_SOD:
873 if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
874 break;
875
876 /*-----------------------------------------------------------------*/
877 case OP_SOM:
878 if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
879 break;
880
881
882 /* ========================================================================== */
883 /* These opcodes inspect the next subject character, and sometimes
884 the previous one as well, but do not have an argument. The variable
885 clen contains the length of the current character and is zero if we are
886 at the end of the subject. */
887
888 /*-----------------------------------------------------------------*/
889 case OP_ANY:
890 if (clen > 0 && !IS_NEWLINE(ptr))
891 { ADD_NEW(state_offset + 1, 0); }
892 break;
893
894 /*-----------------------------------------------------------------*/
895 case OP_ALLANY:
896 if (clen > 0)
897 { ADD_NEW(state_offset + 1, 0); }
898 break;
899
900 /*-----------------------------------------------------------------*/
901 case OP_EODN:
902 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
903 could_continue = TRUE;
904 else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
905 { ADD_ACTIVE(state_offset + 1, 0); }
906 break;
907
908 /*-----------------------------------------------------------------*/
909 case OP_DOLL:
910 if ((md->moptions & PCRE_NOTEOL) == 0)
911 {
912 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
913 could_continue = TRUE;
914 else if (clen == 0 ||
915 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
916 (ptr == end_subject - md->nllen)
917 ))
918 { ADD_ACTIVE(state_offset + 1, 0); }
919 }
920 break;
921
922 /*-----------------------------------------------------------------*/
923 case OP_DOLLM:
924 if ((md->moptions & PCRE_NOTEOL) == 0)
925 {
926 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
927 could_continue = TRUE;
928 else if (clen == 0 ||
929 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
930 { ADD_ACTIVE(state_offset + 1, 0); }
931 }
932 else if (IS_NEWLINE(ptr))
933 { ADD_ACTIVE(state_offset + 1, 0); }
934 break;
935
936 /*-----------------------------------------------------------------*/
937
938 case OP_DIGIT:
939 case OP_WHITESPACE:
940 case OP_WORDCHAR:
941 if (clen > 0 && c < 256 &&
942 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
943 { ADD_NEW(state_offset + 1, 0); }
944 break;
945
946 /*-----------------------------------------------------------------*/
947 case OP_NOT_DIGIT:
948 case OP_NOT_WHITESPACE:
949 case OP_NOT_WORDCHAR:
950 if (clen > 0 && (c >= 256 ||
951 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
952 { ADD_NEW(state_offset + 1, 0); }
953 break;
954
955 /*-----------------------------------------------------------------*/
956 case OP_WORD_BOUNDARY:
957 case OP_NOT_WORD_BOUNDARY:
958 {
959 int left_word, right_word;
960
961 if (ptr > start_subject)
962 {
963 const pcre_uchar *temp = ptr - 1;
964 if (temp < md->start_used_ptr) md->start_used_ptr = temp;
965 #ifdef SUPPORT_UTF
966 if (utf) { BACKCHAR(temp); }
967 #endif
968 GETCHARTEST(d, temp);
969 #ifdef SUPPORT_UCP
970 if ((md->poptions & PCRE_UCP) != 0)
971 {
972 if (d == '_') left_word = TRUE; else
973 {
974 int cat = UCD_CATEGORY(d);
975 left_word = (cat == ucp_L || cat == ucp_N);
976 }
977 }
978 else
979 #endif
980 left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
981 }
982 else left_word = FALSE;
983
984 if (clen > 0)
985 {
986 #ifdef SUPPORT_UCP
987 if ((md->poptions & PCRE_UCP) != 0)
988 {
989 if (c == '_') right_word = TRUE; else
990 {
991 int cat = UCD_CATEGORY(c);
992 right_word = (cat == ucp_L || cat == ucp_N);
993 }
994 }
995 else
996 #endif
997 right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
998 }
999 else right_word = FALSE;
1000
1001 if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
1002 { ADD_ACTIVE(state_offset + 1, 0); }
1003 }
1004 break;
1005
1006
1007 /*-----------------------------------------------------------------*/
1008 /* Check the next character by Unicode property. We will get here only
1009 if the support is in the binary; otherwise a compile-time error occurs.
1010 */
1011
1012 #ifdef SUPPORT_UCP
1013 case OP_PROP:
1014 case OP_NOTPROP:
1015 if (clen > 0)
1016 {
1017 BOOL OK;
1018 const ucd_record * prop = GET_UCD(c);
1019 switch(code[1])
1020 {
1021 case PT_ANY:
1022 OK = TRUE;
1023 break;
1024
1025 case PT_LAMP:
1026 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1027 prop->chartype == ucp_Lt;
1028 break;
1029
1030 case PT_GC:
1031 OK = PRIV(ucp_gentype)[prop->chartype] == code[2];
1032 break;
1033
1034 case PT_PC:
1035 OK = prop->chartype == code[2];
1036 break;
1037
1038 case PT_SC:
1039 OK = prop->script == code[2];
1040 break;
1041
1042 /* These are specials for combination cases. */
1043
1044 case PT_ALNUM:
1045 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1046 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1047 break;
1048
1049 case PT_SPACE: /* Perl space */
1050 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1051 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1052 break;
1053
1054 case PT_PXSPACE: /* POSIX space */
1055 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1056 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1057 c == CHAR_FF || c == CHAR_CR;
1058 break;
1059
1060 case PT_WORD:
1061 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1062 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1063 c == CHAR_UNDERSCORE;
1064 break;
1065
1066 /* Should never occur, but keep compilers from grumbling. */
1067
1068 default:
1069 OK = codevalue != OP_PROP;
1070 break;
1071 }
1072
1073 if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1074 }
1075 break;
1076 #endif
1077
1078
1079
1080 /* ========================================================================== */
1081 /* These opcodes likewise inspect the subject character, but have an
1082 argument that is not a data character. It is one of these opcodes:
1083 OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_WHITESPACE,
1084 OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
1085
1086 case OP_TYPEPLUS:
1087 case OP_TYPEMINPLUS:
1088 case OP_TYPEPOSPLUS:
1089 count = current_state->count; /* Already matched */
1090 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1091 if (clen > 0)
1092 {
1093 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1094 (c < 256 &&
1095 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1096 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1097 {
1098 if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1099 {
1100 active_count--; /* Remove non-match possibility */
1101 next_active_state--;
1102 }
1103 count++;
1104 ADD_NEW(state_offset, count);
1105 }
1106 }
1107 break;
1108
1109 /*-----------------------------------------------------------------*/
1110 case OP_TYPEQUERY:
1111 case OP_TYPEMINQUERY:
1112 case OP_TYPEPOSQUERY:
1113 ADD_ACTIVE(state_offset + 2, 0);
1114 if (clen > 0)
1115 {
1116 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1117 (c < 256 &&
1118 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1119 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1120 {
1121 if (codevalue == OP_TYPEPOSQUERY)
1122 {
1123 active_count--; /* Remove non-match possibility */
1124 next_active_state--;
1125 }
1126 ADD_NEW(state_offset + 2, 0);
1127 }
1128 }
1129 break;
1130
1131 /*-----------------------------------------------------------------*/
1132 case OP_TYPESTAR:
1133 case OP_TYPEMINSTAR:
1134 case OP_TYPEPOSSTAR:
1135 ADD_ACTIVE(state_offset + 2, 0);
1136 if (clen > 0)
1137 {
1138 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1139 (c < 256 &&
1140 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1141 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1142 {
1143 if (codevalue == OP_TYPEPOSSTAR)
1144 {
1145 active_count--; /* Remove non-match possibility */
1146 next_active_state--;
1147 }
1148 ADD_NEW(state_offset, 0);
1149 }
1150 }
1151 break;
1152
1153 /*-----------------------------------------------------------------*/
1154 case OP_TYPEEXACT:
1155 count = current_state->count; /* Number already matched */
1156 if (clen > 0)
1157 {
1158 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1159 (c < 256 &&
1160 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1161 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1162 {
1163 if (++count >= GET2(code, 1))
1164 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 1, 0); }
1165 else
1166 { ADD_NEW(state_offset, count); }
1167 }
1168 }
1169 break;
1170
1171 /*-----------------------------------------------------------------*/
1172 case OP_TYPEUPTO:
1173 case OP_TYPEMINUPTO:
1174 case OP_TYPEPOSUPTO:
1175 ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0);
1176 count = current_state->count; /* Number already matched */
1177 if (clen > 0)
1178 {
1179 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1180 (c < 256 &&
1181 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1182 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1183 {
1184 if (codevalue == OP_TYPEPOSUPTO)
1185 {
1186 active_count--; /* Remove non-match possibility */
1187 next_active_state--;
1188 }
1189 if (++count >= GET2(code, 1))
1190 { ADD_NEW(state_offset + 2 + IMM2_SIZE, 0); }
1191 else
1192 { ADD_NEW(state_offset, count); }
1193 }
1194 }
1195 break;
1196
1197 /* ========================================================================== */
1198 /* These are virtual opcodes that are used when something like
1199 OP_TYPEPLUS has OP_PROP, OP_NOTPROP, OP_ANYNL, OP_EXTUNI, OP_VSPACE,
1200 OP_NOT_VSPACE, OP_HSPACE, or OP_NOT_HSPACE as its argument. This keeps the
1201 code above fast for the other cases. The argument is in the d variable. */
1202
1203 #ifdef SUPPORT_UCP
1204 case OP_PROP_EXTRA + OP_TYPEPLUS:
1205 case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1206 case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1207 count = current_state->count; /* Already matched */
1208 if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1209 if (clen > 0)
1210 {
1211 BOOL OK;
1212 const ucd_record * prop = GET_UCD(c);
1213 switch(code[2])
1214 {
1215 case PT_ANY:
1216 OK = TRUE;
1217 break;
1218
1219 case PT_LAMP:
1220 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1221 prop->chartype == ucp_Lt;
1222 break;
1223
1224 case PT_GC:
1225 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1226 break;
1227
1228 case PT_PC:
1229 OK = prop->chartype == code[3];
1230 break;
1231
1232 case PT_SC:
1233 OK = prop->script == code[3];
1234 break;
1235
1236 /* These are specials for combination cases. */
1237
1238 case PT_ALNUM:
1239 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1240 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1241 break;
1242
1243 case PT_SPACE: /* Perl space */
1244 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1245 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1246 break;
1247
1248 case PT_PXSPACE: /* POSIX space */
1249 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1250 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1251 c == CHAR_FF || c == CHAR_CR;
1252 break;
1253
1254 case PT_WORD:
1255 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1256 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1257 c == CHAR_UNDERSCORE;
1258 break;
1259
1260 /* Should never occur, but keep compilers from grumbling. */
1261
1262 default:
1263 OK = d != OP_PROP;
1264 break;
1265 }
1266
1267 if (OK == (d == OP_PROP))
1268 {
1269 if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1270 {
1271 active_count--; /* Remove non-match possibility */
1272 next_active_state--;
1273 }
1274 count++;
1275 ADD_NEW(state_offset, count);
1276 }
1277 }
1278 break;
1279
1280 /*-----------------------------------------------------------------*/
1281 case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1282 case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1283 case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1284 count = current_state->count; /* Already matched */
1285 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1286 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1287 {
1288 const pcre_uchar *nptr = ptr + clen;
1289 int ncount = 0;
1290 if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1291 {
1292 active_count--; /* Remove non-match possibility */
1293 next_active_state--;
1294 }
1295 while (nptr < end_subject)
1296 {
1297 int nd;
1298 int ndlen = 1;
1299 GETCHARLEN(nd, nptr, ndlen);
1300 if (UCD_CATEGORY(nd) != ucp_M) break;
1301 ncount++;
1302 nptr += ndlen;
1303 }
1304 count++;
1305 ADD_NEW_DATA(-state_offset, count, ncount);
1306 }
1307 break;
1308 #endif
1309
1310 /*-----------------------------------------------------------------*/
1311 case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1312 case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1313 case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1314 count = current_state->count; /* Already matched */
1315 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1316 if (clen > 0)
1317 {
1318 int ncount = 0;
1319 switch (c)
1320 {
1321 case 0x000b:
1322 case 0x000c:
1323 case 0x0085:
1324 case 0x2028:
1325 case 0x2029:
1326 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1327 goto ANYNL01;
1328
1329 case 0x000d:
1330 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1331 /* Fall through */
1332
1333 ANYNL01:
1334 case 0x000a:
1335 if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1336 {
1337 active_count--; /* Remove non-match possibility */
1338 next_active_state--;
1339 }
1340 count++;
1341 ADD_NEW_DATA(-state_offset, count, ncount);
1342 break;
1343
1344 default:
1345 break;
1346 }
1347 }
1348 break;
1349
1350 /*-----------------------------------------------------------------*/
1351 case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1352 case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1353 case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1354 count = current_state->count; /* Already matched */
1355 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1356 if (clen > 0)
1357 {
1358 BOOL OK;
1359 switch (c)
1360 {
1361 case 0x000a:
1362 case 0x000b:
1363 case 0x000c:
1364 case 0x000d:
1365 case 0x0085:
1366 case 0x2028:
1367 case 0x2029:
1368 OK = TRUE;
1369 break;
1370
1371 default:
1372 OK = FALSE;
1373 break;
1374 }
1375
1376 if (OK == (d == OP_VSPACE))
1377 {
1378 if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1379 {
1380 active_count--; /* Remove non-match possibility */
1381 next_active_state--;
1382 }
1383 count++;
1384 ADD_NEW_DATA(-state_offset, count, 0);
1385 }
1386 }
1387 break;
1388
1389 /*-----------------------------------------------------------------*/
1390 case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1391 case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1392 case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1393 count = current_state->count; /* Already matched */
1394 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1395 if (clen > 0)
1396 {
1397 BOOL OK;
1398 switch (c)
1399 {
1400 case 0x09: /* HT */
1401 case 0x20: /* SPACE */
1402 case 0xa0: /* NBSP */
1403 case 0x1680: /* OGHAM SPACE MARK */
1404 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1405 case 0x2000: /* EN QUAD */
1406 case 0x2001: /* EM QUAD */
1407 case 0x2002: /* EN SPACE */
1408 case 0x2003: /* EM SPACE */
1409 case 0x2004: /* THREE-PER-EM SPACE */
1410 case 0x2005: /* FOUR-PER-EM SPACE */
1411 case 0x2006: /* SIX-PER-EM SPACE */
1412 case 0x2007: /* FIGURE SPACE */
1413 case 0x2008: /* PUNCTUATION SPACE */
1414 case 0x2009: /* THIN SPACE */
1415 case 0x200A: /* HAIR SPACE */
1416 case 0x202f: /* NARROW NO-BREAK SPACE */
1417 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1418 case 0x3000: /* IDEOGRAPHIC SPACE */
1419 OK = TRUE;
1420 break;
1421
1422 default:
1423 OK = FALSE;
1424 break;
1425 }
1426
1427 if (OK == (d == OP_HSPACE))
1428 {
1429 if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1430 {
1431 active_count--; /* Remove non-match possibility */
1432 next_active_state--;
1433 }
1434 count++;
1435 ADD_NEW_DATA(-state_offset, count, 0);
1436 }
1437 }
1438 break;
1439
1440 /*-----------------------------------------------------------------*/
1441 #ifdef SUPPORT_UCP
1442 case OP_PROP_EXTRA + OP_TYPEQUERY:
1443 case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1444 case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
1445 count = 4;
1446 goto QS1;
1447
1448 case OP_PROP_EXTRA + OP_TYPESTAR:
1449 case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1450 case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1451 count = 0;
1452
1453 QS1:
1454
1455 ADD_ACTIVE(state_offset + 4, 0);
1456 if (clen > 0)
1457 {
1458 BOOL OK;
1459 const ucd_record * prop = GET_UCD(c);
1460 switch(code[2])
1461 {
1462 case PT_ANY:
1463 OK = TRUE;
1464 break;
1465
1466 case PT_LAMP:
1467 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1468 prop->chartype == ucp_Lt;
1469 break;
1470
1471 case PT_GC:
1472 OK = PRIV(ucp_gentype)[prop->chartype] == code[3];
1473 break;
1474
1475 case PT_PC:
1476 OK = prop->chartype == code[3];
1477 break;
1478
1479 case PT_SC:
1480 OK = prop->script == code[3];
1481 break;
1482
1483 /* These are specials for combination cases. */
1484
1485 case PT_ALNUM:
1486 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1487 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1488 break;
1489
1490 case PT_SPACE: /* Perl space */
1491 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1492 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1493 break;
1494
1495 case PT_PXSPACE: /* POSIX space */
1496 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1497 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1498 c == CHAR_FF || c == CHAR_CR;
1499 break;
1500
1501 case PT_WORD:
1502 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1503 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1504 c == CHAR_UNDERSCORE;
1505 break;
1506
1507 /* Should never occur, but keep compilers from grumbling. */
1508
1509 default:
1510 OK = d != OP_PROP;
1511 break;
1512 }
1513
1514 if (OK == (d == OP_PROP))
1515 {
1516 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1517 codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1518 {
1519 active_count--; /* Remove non-match possibility */
1520 next_active_state--;
1521 }
1522 ADD_NEW(state_offset + count, 0);
1523 }
1524 }
1525 break;
1526
1527 /*-----------------------------------------------------------------*/
1528 case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1529 case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1530 case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1531 count = 2;
1532 goto QS2;
1533
1534 case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1535 case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1536 case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1537 count = 0;
1538
1539 QS2:
1540
1541 ADD_ACTIVE(state_offset + 2, 0);
1542 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1543 {
1544 const pcre_uchar *nptr = ptr + clen;
1545 int ncount = 0;
1546 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1547 codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1548 {
1549 active_count--; /* Remove non-match possibility */
1550 next_active_state--;
1551 }
1552 while (nptr < end_subject)
1553 {
1554 int nd;
1555 int ndlen = 1;
1556 GETCHARLEN(nd, nptr, ndlen);
1557 if (UCD_CATEGORY(nd) != ucp_M) break;
1558 ncount++;
1559 nptr += ndlen;
1560 }
1561 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1562 }
1563 break;
1564 #endif
1565
1566 /*-----------------------------------------------------------------*/
1567 case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1568 case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1569 case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1570 count = 2;
1571 goto QS3;
1572
1573 case OP_ANYNL_EXTRA + OP_TYPESTAR:
1574 case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1575 case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1576 count = 0;
1577
1578 QS3:
1579 ADD_ACTIVE(state_offset + 2, 0);
1580 if (clen > 0)
1581 {
1582 int ncount = 0;
1583 switch (c)
1584 {
1585 case 0x000b:
1586 case 0x000c:
1587 case 0x0085:
1588 case 0x2028:
1589 case 0x2029:
1590 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1591 goto ANYNL02;
1592
1593 case 0x000d:
1594 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1595 /* Fall through */
1596
1597 ANYNL02:
1598 case 0x000a:
1599 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1600 codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1601 {
1602 active_count--; /* Remove non-match possibility */
1603 next_active_state--;
1604 }
1605 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1606 break;
1607
1608 default:
1609 break;
1610 }
1611 }
1612 break;
1613
1614 /*-----------------------------------------------------------------*/
1615 case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1616 case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1617 case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1618 count = 2;
1619 goto QS4;
1620
1621 case OP_VSPACE_EXTRA + OP_TYPESTAR:
1622 case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1623 case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1624 count = 0;
1625
1626 QS4:
1627 ADD_ACTIVE(state_offset + 2, 0);
1628 if (clen > 0)
1629 {
1630 BOOL OK;
1631 switch (c)
1632 {
1633 case 0x000a:
1634 case 0x000b:
1635 case 0x000c:
1636 case 0x000d:
1637 case 0x0085:
1638 case 0x2028:
1639 case 0x2029:
1640 OK = TRUE;
1641 break;
1642
1643 default:
1644 OK = FALSE;
1645 break;
1646 }
1647 if (OK == (d == OP_VSPACE))
1648 {
1649 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1650 codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1651 {
1652 active_count--; /* Remove non-match possibility */
1653 next_active_state--;
1654 }
1655 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1656 }
1657 }
1658 break;
1659
1660 /*-----------------------------------------------------------------*/
1661 case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1662 case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1663 case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1664 count = 2;
1665 goto QS5;
1666
1667 case OP_HSPACE_EXTRA + OP_TYPESTAR:
1668 case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1669 case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1670 count = 0;
1671
1672 QS5:
1673 ADD_ACTIVE(state_offset + 2, 0);
1674 if (clen > 0)
1675 {
1676 BOOL OK;
1677 switch (c)
1678 {
1679 case 0x09: /* HT */
1680 case 0x20: /* SPACE */
1681 case 0xa0: /* NBSP */
1682 case 0x1680: /* OGHAM SPACE MARK */
1683 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1684 case 0x2000: /* EN QUAD */
1685 case 0x2001: /* EM QUAD */
1686 case 0x2002: /* EN SPACE */
1687 case 0x2003: /* EM SPACE */
1688 case 0x2004: /* THREE-PER-EM SPACE */
1689 case 0x2005: /* FOUR-PER-EM SPACE */
1690 case 0x2006: /* SIX-PER-EM SPACE */
1691 case 0x2007: /* FIGURE SPACE */
1692 case 0x2008: /* PUNCTUATION SPACE */
1693 case 0x2009: /* THIN SPACE */
1694 case 0x200A: /* HAIR SPACE */
1695 case 0x202f: /* NARROW NO-BREAK SPACE */
1696 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1697 case 0x3000: /* IDEOGRAPHIC SPACE */
1698 OK = TRUE;
1699 break;
1700
1701 default:
1702 OK = FALSE;
1703 break;
1704 }
1705
1706 if (OK == (d == OP_HSPACE))
1707 {
1708 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1709 codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1710 {
1711 active_count--; /* Remove non-match possibility */
1712 next_active_state--;
1713 }
1714 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1715 }
1716 }
1717 break;
1718
1719 /*-----------------------------------------------------------------*/
1720 #ifdef SUPPORT_UCP
1721 case OP_PROP_EXTRA + OP_TYPEEXACT:
1722 case OP_PROP_EXTRA + OP_TYPEUPTO:
1723 case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1724 case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1725 if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1726 { ADD_ACTIVE(state_offset + 1 + IMM2_SIZE + 3, 0); }
1727 count = current_state->count; /* Number already matched */
1728 if (clen > 0)
1729 {
1730 BOOL OK;
1731 const ucd_record * prop = GET_UCD(c);
1732 switch(code[1 + IMM2_SIZE + 1])
1733 {
1734 case PT_ANY:
1735 OK = TRUE;
1736 break;
1737
1738 case PT_LAMP:
1739 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1740 prop->chartype == ucp_Lt;
1741 break;
1742
1743 case PT_GC:
1744 OK = PRIV(ucp_gentype)[prop->chartype] == code[1 + IMM2_SIZE + 2];
1745 break;
1746
1747 case PT_PC:
1748 OK = prop->chartype == code[1 + IMM2_SIZE + 2];
1749 break;
1750
1751 case PT_SC:
1752 OK = prop->script == code[1 + IMM2_SIZE + 2];
1753 break;
1754
1755 /* These are specials for combination cases. */
1756
1757 case PT_ALNUM:
1758 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1759 PRIV(ucp_gentype)[prop->chartype] == ucp_N;
1760 break;
1761
1762 case PT_SPACE: /* Perl space */
1763 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1764 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1765 break;
1766
1767 case PT_PXSPACE: /* POSIX space */
1768 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
1769 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1770 c == CHAR_FF || c == CHAR_CR;
1771 break;
1772
1773 case PT_WORD:
1774 OK = PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
1775 PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
1776 c == CHAR_UNDERSCORE;
1777 break;
1778
1779 /* Should never occur, but keep compilers from grumbling. */
1780
1781 default:
1782 OK = d != OP_PROP;
1783 break;
1784 }
1785
1786 if (OK == (d == OP_PROP))
1787 {
1788 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
1789 {
1790 active_count--; /* Remove non-match possibility */
1791 next_active_state--;
1792 }
1793 if (++count >= GET2(code, 1))
1794 { ADD_NEW(state_offset + 1 + IMM2_SIZE + 3, 0); }
1795 else
1796 { ADD_NEW(state_offset, count); }
1797 }
1798 }
1799 break;
1800
1801 /*-----------------------------------------------------------------*/
1802 case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
1803 case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
1804 case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
1805 case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
1806 if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
1807 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1808 count = current_state->count; /* Number already matched */
1809 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1810 {
1811 const pcre_uchar *nptr = ptr + clen;
1812 int ncount = 0;
1813 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
1814 {
1815 active_count--; /* Remove non-match possibility */
1816 next_active_state--;
1817 }
1818 while (nptr < end_subject)
1819 {
1820 int nd;
1821 int ndlen = 1;
1822 GETCHARLEN(nd, nptr, ndlen);
1823 if (UCD_CATEGORY(nd) != ucp_M) break;
1824 ncount++;
1825 nptr += ndlen;
1826 }
1827 if (++count >= GET2(code, 1))
1828 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1829 else
1830 { ADD_NEW_DATA(-state_offset, count, ncount); }
1831 }
1832 break;
1833 #endif
1834
1835 /*-----------------------------------------------------------------*/
1836 case OP_ANYNL_EXTRA + OP_TYPEEXACT:
1837 case OP_ANYNL_EXTRA + OP_TYPEUPTO:
1838 case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
1839 case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
1840 if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
1841 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1842 count = current_state->count; /* Number already matched */
1843 if (clen > 0)
1844 {
1845 int ncount = 0;
1846 switch (c)
1847 {
1848 case 0x000b:
1849 case 0x000c:
1850 case 0x0085:
1851 case 0x2028:
1852 case 0x2029:
1853 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1854 goto ANYNL03;
1855
1856 case 0x000d:
1857 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1858 /* Fall through */
1859
1860 ANYNL03:
1861 case 0x000a:
1862 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
1863 {
1864 active_count--; /* Remove non-match possibility */
1865 next_active_state--;
1866 }
1867 if (++count >= GET2(code, 1))
1868 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, ncount); }
1869 else
1870 { ADD_NEW_DATA(-state_offset, count, ncount); }
1871 break;
1872
1873 default:
1874 break;
1875 }
1876 }
1877 break;
1878
1879 /*-----------------------------------------------------------------*/
1880 case OP_VSPACE_EXTRA + OP_TYPEEXACT:
1881 case OP_VSPACE_EXTRA + OP_TYPEUPTO:
1882 case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
1883 case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
1884 if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
1885 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1886 count = current_state->count; /* Number already matched */
1887 if (clen > 0)
1888 {
1889 BOOL OK;
1890 switch (c)
1891 {
1892 case 0x000a:
1893 case 0x000b:
1894 case 0x000c:
1895 case 0x000d:
1896 case 0x0085:
1897 case 0x2028:
1898 case 0x2029:
1899 OK = TRUE;
1900 break;
1901
1902 default:
1903 OK = FALSE;
1904 }
1905
1906 if (OK == (d == OP_VSPACE))
1907 {
1908 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
1909 {
1910 active_count--; /* Remove non-match possibility */
1911 next_active_state--;
1912 }
1913 if (++count >= GET2(code, 1))
1914 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
1915 else
1916 { ADD_NEW_DATA(-state_offset, count, 0); }
1917 }
1918 }
1919 break;
1920
1921 /*-----------------------------------------------------------------*/
1922 case OP_HSPACE_EXTRA + OP_TYPEEXACT:
1923 case OP_HSPACE_EXTRA + OP_TYPEUPTO:
1924 case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
1925 case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
1926 if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
1927 { ADD_ACTIVE(state_offset + 2 + IMM2_SIZE, 0); }
1928 count = current_state->count; /* Number already matched */
1929 if (clen > 0)
1930 {
1931 BOOL OK;
1932 switch (c)
1933 {
1934 case 0x09: /* HT */
1935 case 0x20: /* SPACE */
1936 case 0xa0: /* NBSP */
1937 case 0x1680: /* OGHAM SPACE MARK */
1938 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1939 case 0x2000: /* EN QUAD */
1940 case 0x2001: /* EM QUAD */
1941 case 0x2002: /* EN SPACE */
1942 case 0x2003: /* EM SPACE */
1943 case 0x2004: /* THREE-PER-EM SPACE */
1944 case 0x2005: /* FOUR-PER-EM SPACE */
1945 case 0x2006: /* SIX-PER-EM SPACE */
1946 case 0x2007: /* FIGURE SPACE */
1947 case 0x2008: /* PUNCTUATION SPACE */
1948 case 0x2009: /* THIN SPACE */
1949 case 0x200A: /* HAIR SPACE */
1950 case 0x202f: /* NARROW NO-BREAK SPACE */
1951 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1952 case 0x3000: /* IDEOGRAPHIC SPACE */
1953 OK = TRUE;
1954 break;
1955
1956 default:
1957 OK = FALSE;
1958 break;
1959 }
1960
1961 if (OK == (d == OP_HSPACE))
1962 {
1963 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
1964 {
1965 active_count--; /* Remove non-match possibility */
1966 next_active_state--;
1967 }
1968 if (++count >= GET2(code, 1))
1969 { ADD_NEW_DATA(-(state_offset + 2 + IMM2_SIZE), 0, 0); }
1970 else
1971 { ADD_NEW_DATA(-state_offset, count, 0); }
1972 }
1973 }
1974 break;
1975
1976 /* ========================================================================== */
1977 /* These opcodes are followed by a character that is usually compared
1978 to the current subject character; it is loaded into d. We still get
1979 here even if there is no subject character, because in some cases zero
1980 repetitions are permitted. */
1981
1982 /*-----------------------------------------------------------------*/
1983 case OP_CHAR:
1984 if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
1985 break;
1986
1987 /*-----------------------------------------------------------------*/
1988 case OP_CHARI:
1989 if (clen == 0) break;
1990
1991 #ifdef SUPPORT_UTF
1992 if (utf)
1993 {
1994 if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
1995 {
1996 unsigned int othercase;
1997 if (c < 128)
1998 othercase = fcc[c];
1999 else
2000 /* If we have Unicode property support, we can use it to test the
2001 other case of the character. */
2002 #ifdef SUPPORT_UCP
2003 othercase = UCD_OTHERCASE(c);
2004 #else
2005 othercase = NOTACHAR;
2006 #endif
2007
2008 if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
2009 }
2010 }
2011 else
2012 #endif /* SUPPORT_UTF */
2013 /* Not UTF mode */
2014 {
2015 if (TABLE_GET(c, lcc, c) == TABLE_GET(d, lcc, d))
2016 { ADD_NEW(state_offset + 2, 0); }
2017 }
2018 break;
2019
2020
2021 #ifdef SUPPORT_UCP
2022 /*-----------------------------------------------------------------*/
2023 /* This is a tricky one because it can match more than one character.
2024 Find out how many characters to skip, and then set up a negative state
2025 to wait for them to pass before continuing. */
2026
2027 case OP_EXTUNI:
2028 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
2029 {
2030 const pcre_uchar *nptr = ptr + clen;
2031 int ncount = 0;
2032 while (nptr < end_subject)
2033 {
2034 int nd, nclen = 1; /* Use a local so that c is not overwritten */
2035 GETCHARLEN(nd, nptr, nclen);
2036 if (UCD_CATEGORY(nd) != ucp_M) break;
2037 ncount++;
2038 nptr += nclen;
2039 }
2040 ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2041 }
2042 break;
2043 #endif
2044
2045 /*-----------------------------------------------------------------*/
2046 /* This is tricky, like EXTUNI, because it too can match more than one
2047 character (when CR is followed by LF). In this case, set up a negative
2048 state to wait for one character to pass before continuing. */
2049
2050 case OP_ANYNL:
2051 if (clen > 0) switch(c)
2052 {
2053 case 0x000b:
2054 case 0x000c:
2055 case 0x0085:
2056 case 0x2028:
2057 case 0x2029:
2058 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
2059 /* Fall through */
2060 case 0x000a:
2061 ADD_NEW(state_offset + 1, 0);
2062 break;
2063
2064 case 0x000d:
2065 if (ptr + 1 < end_subject && ptr[1] == 0x0a)
2066 {
2067 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2068 }
2069 else
2070 {
2071 ADD_NEW(state_offset + 1, 0);
2072 }
2073 break;
2074 }
2075 break;
2076
2077 /*-----------------------------------------------------------------*/
2078 case OP_NOT_VSPACE:
2079 if (clen > 0) switch(c)
2080 {
2081 case 0x000a:
2082 case 0x000b:
2083 case 0x000c:
2084 case 0x000d:
2085 case 0x0085:
2086 case 0x2028:
2087 case 0x2029:
2088 break;
2089
2090 default:
2091 ADD_NEW(state_offset + 1, 0);
2092 break;
2093 }
2094 break;
2095
2096 /*-----------------------------------------------------------------*/
2097 case OP_VSPACE:
2098 if (clen > 0) switch(c)
2099 {
2100 case 0x000a:
2101 case 0x000b:
2102 case 0x000c:
2103 case 0x000d:
2104 case 0x0085:
2105 case 0x2028:
2106 case 0x2029:
2107 ADD_NEW(state_offset + 1, 0);
2108 break;
2109
2110 default: break;
2111 }
2112 break;
2113
2114 /*-----------------------------------------------------------------*/
2115 case OP_NOT_HSPACE:
2116 if (clen > 0) switch(c)
2117 {
2118 case 0x09: /* HT */
2119 case 0x20: /* SPACE */
2120 case 0xa0: /* NBSP */
2121 case 0x1680: /* OGHAM SPACE MARK */
2122 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2123 case 0x2000: /* EN QUAD */
2124 case 0x2001: /* EM QUAD */
2125 case 0x2002: /* EN SPACE */
2126 case 0x2003: /* EM SPACE */
2127 case 0x2004: /* THREE-PER-EM SPACE */
2128 case 0x2005: /* FOUR-PER-EM SPACE */
2129 case 0x2006: /* SIX-PER-EM SPACE */
2130 case 0x2007: /* FIGURE SPACE */
2131 case 0x2008: /* PUNCTUATION SPACE */
2132 case 0x2009: /* THIN SPACE */
2133 case 0x200A: /* HAIR SPACE */
2134 case 0x202f: /* NARROW NO-BREAK SPACE */
2135 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2136 case 0x3000: /* IDEOGRAPHIC SPACE */
2137 break;
2138
2139 default:
2140 ADD_NEW(state_offset + 1, 0);
2141 break;
2142 }
2143 break;
2144
2145 /*-----------------------------------------------------------------*/
2146 case OP_HSPACE:
2147 if (clen > 0) switch(c)
2148 {
2149 case 0x09: /* HT */
2150 case 0x20: /* SPACE */
2151 case 0xa0: /* NBSP */
2152 case 0x1680: /* OGHAM SPACE MARK */
2153 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2154 case 0x2000: /* EN QUAD */
2155 case 0x2001: /* EM QUAD */
2156 case 0x2002: /* EN SPACE */
2157 case 0x2003: /* EM SPACE */
2158 case 0x2004: /* THREE-PER-EM SPACE */
2159 case 0x2005: /* FOUR-PER-EM SPACE */
2160 case 0x2006: /* SIX-PER-EM SPACE */
2161 case 0x2007: /* FIGURE SPACE */
2162 case 0x2008: /* PUNCTUATION SPACE */
2163 case 0x2009: /* THIN SPACE */
2164 case 0x200A: /* HAIR SPACE */
2165 case 0x202f: /* NARROW NO-BREAK SPACE */
2166 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2167 case 0x3000: /* IDEOGRAPHIC SPACE */
2168 ADD_NEW(state_offset + 1, 0);
2169 break;
2170 }
2171 break;
2172
2173 /*-----------------------------------------------------------------*/
2174 /* Match a negated single character casefully. This is only used for
2175 one-byte characters, that is, we know that d < 256. The character we are
2176 checking (c) can be multibyte. */
2177
2178 case OP_NOT:
2179 if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2180 break;
2181
2182 /*-----------------------------------------------------------------*/
2183 /* Match a negated single character caselessly. This is only used for
2184 one-byte characters, that is, we know that d < 256. The character we are
2185 checking (c) can be multibyte. */
2186
2187 case OP_NOTI:
2188 if (clen > 0 && c != d && c != fcc[d])
2189 { ADD_NEW(state_offset + dlen + 1, 0); }
2190 break;
2191
2192 /*-----------------------------------------------------------------*/
2193 case OP_PLUSI:
2194 case OP_MINPLUSI:
2195 case OP_POSPLUSI:
2196 case OP_NOTPLUSI:
2197 case OP_NOTMINPLUSI:
2198 case OP_NOTPOSPLUSI:
2199 caseless = TRUE;
2200 codevalue -= OP_STARI - OP_STAR;
2201
2202 /* Fall through */
2203 case OP_PLUS:
2204 case OP_MINPLUS:
2205 case OP_POSPLUS:
2206 case OP_NOTPLUS:
2207 case OP_NOTMINPLUS:
2208 case OP_NOTPOSPLUS:
2209 count = current_state->count; /* Already matched */
2210 if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2211 if (clen > 0)
2212 {
2213 unsigned int otherd = NOTACHAR;
2214 if (caseless)
2215 {
2216 #ifdef SUPPORT_UTF
2217 if (utf && d >= 128)
2218 {
2219 #ifdef SUPPORT_UCP
2220 otherd = UCD_OTHERCASE(d);
2221 #endif /* SUPPORT_UCP */
2222 }
2223 else
2224 #endif /* SUPPORT_UTF */
2225 otherd = TABLE_GET(d, fcc, d);
2226 }
2227 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2228 {
2229 if (count > 0 &&
2230 (codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2231 {
2232 active_count--; /* Remove non-match possibility */
2233 next_active_state--;
2234 }
2235 count++;
2236 ADD_NEW(state_offset, count);
2237 }
2238 }
2239 break;
2240
2241 /*-----------------------------------------------------------------*/
2242 case OP_QUERYI:
2243 case OP_MINQUERYI:
2244 case OP_POSQUERYI:
2245 case OP_NOTQUERYI:
2246 case OP_NOTMINQUERYI:
2247 case OP_NOTPOSQUERYI:
2248 caseless = TRUE;
2249 codevalue -= OP_STARI - OP_STAR;
2250 /* Fall through */
2251 case OP_QUERY:
2252 case OP_MINQUERY:
2253 case OP_POSQUERY:
2254 case OP_NOTQUERY:
2255 case OP_NOTMINQUERY:
2256 case OP_NOTPOSQUERY:
2257 ADD_ACTIVE(state_offset + dlen + 1, 0);
2258 if (clen > 0)
2259 {
2260 unsigned int otherd = NOTACHAR;
2261 if (caseless)
2262 {
2263 #ifdef SUPPORT_UTF
2264 if (utf && d >= 128)
2265 {
2266 #ifdef SUPPORT_UCP
2267 otherd = UCD_OTHERCASE(d);
2268 #endif /* SUPPORT_UCP */
2269 }
2270 else
2271 #endif /* SUPPORT_UTF */
2272 otherd = TABLE_GET(d, fcc, d);
2273 }
2274 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2275 {
2276 if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2277 {
2278 active_count--; /* Remove non-match possibility */
2279 next_active_state--;
2280 }
2281 ADD_NEW(state_offset + dlen + 1, 0);
2282 }
2283 }
2284 break;
2285
2286 /*-----------------------------------------------------------------*/
2287 case OP_STARI:
2288 case OP_MINSTARI:
2289 case OP_POSSTARI:
2290 case OP_NOTSTARI:
2291 case OP_NOTMINSTARI:
2292 case OP_NOTPOSSTARI:
2293 caseless = TRUE;
2294 codevalue -= OP_STARI - OP_STAR;
2295 /* Fall through */
2296 case OP_STAR:
2297 case OP_MINSTAR:
2298 case OP_POSSTAR:
2299 case OP_NOTSTAR:
2300 case OP_NOTMINSTAR:
2301 case OP_NOTPOSSTAR:
2302 ADD_ACTIVE(state_offset + dlen + 1, 0);
2303 if (clen > 0)
2304 {
2305 unsigned int otherd = NOTACHAR;
2306 if (caseless)
2307 {
2308 #ifdef SUPPORT_UTF
2309 if (utf && d >= 128)
2310 {
2311 #ifdef SUPPORT_UCP
2312 otherd = UCD_OTHERCASE(d);
2313 #endif /* SUPPORT_UCP */
2314 }
2315 else
2316 #endif /* SUPPORT_UTF */
2317 otherd = TABLE_GET(d, fcc, d);
2318 }
2319 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2320 {
2321 if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2322 {
2323 active_count--; /* Remove non-match possibility */
2324 next_active_state--;
2325 }
2326 ADD_NEW(state_offset, 0);
2327 }
2328 }
2329 break;
2330
2331 /*-----------------------------------------------------------------*/
2332 case OP_EXACTI:
2333 case OP_NOTEXACTI:
2334 caseless = TRUE;
2335 codevalue -= OP_STARI - OP_STAR;
2336 /* Fall through */
2337 case OP_EXACT:
2338 case OP_NOTEXACT:
2339 count = current_state->count; /* Number already matched */
2340 if (clen > 0)
2341 {
2342 unsigned int otherd = NOTACHAR;
2343 if (caseless)
2344 {
2345 #ifdef SUPPORT_UTF
2346 if (utf && d >= 128)
2347 {
2348 #ifdef SUPPORT_UCP
2349 otherd = UCD_OTHERCASE(d);
2350 #endif /* SUPPORT_UCP */
2351 }
2352 else
2353 #endif /* SUPPORT_UTF */
2354 otherd = TABLE_GET(d, fcc, d);
2355 }
2356 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2357 {
2358 if (++count >= GET2(code, 1))
2359 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2360 else
2361 { ADD_NEW(state_offset, count); }
2362 }
2363 }
2364 break;
2365
2366 /*-----------------------------------------------------------------*/
2367 case OP_UPTOI:
2368 case OP_MINUPTOI:
2369 case OP_POSUPTOI:
2370 case OP_NOTUPTOI:
2371 case OP_NOTMINUPTOI:
2372 case OP_NOTPOSUPTOI:
2373 caseless = TRUE;
2374 codevalue -= OP_STARI - OP_STAR;
2375 /* Fall through */
2376 case OP_UPTO:
2377 case OP_MINUPTO:
2378 case OP_POSUPTO:
2379 case OP_NOTUPTO:
2380 case OP_NOTMINUPTO:
2381 case OP_NOTPOSUPTO:
2382 ADD_ACTIVE(state_offset + dlen + 1 + IMM2_SIZE, 0);
2383 count = current_state->count; /* Number already matched */
2384 if (clen > 0)
2385 {
2386 unsigned int otherd = NOTACHAR;
2387 if (caseless)
2388 {
2389 #ifdef SUPPORT_UTF
2390 if (utf && d >= 128)
2391 {
2392 #ifdef SUPPORT_UCP
2393 otherd = UCD_OTHERCASE(d);
2394 #endif /* SUPPORT_UCP */
2395 }
2396 else
2397 #endif /* SUPPORT_UTF */
2398 otherd = TABLE_GET(d, fcc, d);
2399 }
2400 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2401 {
2402 if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2403 {
2404 active_count--; /* Remove non-match possibility */
2405 next_active_state--;
2406 }
2407 if (++count >= GET2(code, 1))
2408 { ADD_NEW(state_offset + dlen + 1 + IMM2_SIZE, 0); }
2409 else
2410 { ADD_NEW(state_offset, count); }
2411 }
2412 }
2413 break;
2414
2415
2416 /* ========================================================================== */
2417 /* These are the class-handling opcodes */
2418
2419 case OP_CLASS:
2420 case OP_NCLASS:
2421 case OP_XCLASS:
2422 {
2423 BOOL isinclass = FALSE;
2424 int next_state_offset;
2425 const pcre_uchar *ecode;
2426
2427 /* For a simple class, there is always just a 32-byte table, and we
2428 can set isinclass from it. */
2429
2430 if (codevalue != OP_XCLASS)
2431 {
2432 ecode = code + 1 + (32 / sizeof(pcre_uchar));
2433 if (clen > 0)
2434 {
2435 isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2436 ((((pcre_uint8 *)(code + 1))[c/8] & (1 << (c&7))) != 0);
2437 }
2438 }
2439
2440 /* An extended class may have a table or a list of single characters,
2441 ranges, or both, and it may be positive or negative. There's a
2442 function that sorts all this out. */
2443
2444 else
2445 {
2446 ecode = code + GET(code, 1);
2447 if (clen > 0) isinclass = PRIV(xclass)(c, code + 1 + LINK_SIZE, utf);
2448 }
2449
2450 /* At this point, isinclass is set for all kinds of class, and ecode
2451 points to the byte after the end of the class. If there is a
2452 quantifier, this is where it will be. */
2453
2454 next_state_offset = (int)(ecode - start_code);
2455
2456 switch (*ecode)
2457 {
2458 case OP_CRSTAR:
2459 case OP_CRMINSTAR:
2460 ADD_ACTIVE(next_state_offset + 1, 0);
2461 if (isinclass) { ADD_NEW(state_offset, 0); }
2462 break;
2463
2464 case OP_CRPLUS:
2465 case OP_CRMINPLUS:
2466 count = current_state->count; /* Already matched */
2467 if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2468 if (isinclass) { count++; ADD_NEW(state_offset, count); }
2469 break;
2470
2471 case OP_CRQUERY:
2472 case OP_CRMINQUERY:
2473 ADD_ACTIVE(next_state_offset + 1, 0);
2474 if (isinclass) { ADD_NEW(next_state_offset + 1, 0); }
2475 break;
2476
2477 case OP_CRRANGE:
2478 case OP_CRMINRANGE:
2479 count = current_state->count; /* Already matched */
2480 if (count >= GET2(ecode, 1))
2481 { ADD_ACTIVE(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2482 if (isinclass)
2483 {
2484 int max = GET2(ecode, 1 + IMM2_SIZE);
2485 if (++count >= max && max != 0) /* Max 0 => no limit */
2486 { ADD_NEW(next_state_offset + 1 + 2 * IMM2_SIZE, 0); }
2487 else
2488 { ADD_NEW(state_offset, count); }
2489 }
2490 break;
2491
2492 default:
2493 if (isinclass) { ADD_NEW(next_state_offset, 0); }
2494 break;
2495 }
2496 }
2497 break;
2498
2499 /* ========================================================================== */
2500 /* These are the opcodes for fancy brackets of various kinds. We have
2501 to use recursion in order to handle them. The "always failing" assertion
2502 (?!) is optimised to OP_FAIL when compiling, so we have to support that,
2503 though the other "backtracking verbs" are not supported. */
2504
2505 case OP_FAIL:
2506 forced_fail++; /* Count FAILs for multiple states */
2507 break;
2508
2509 case OP_ASSERT:
2510 case OP_ASSERT_NOT:
2511 case OP_ASSERTBACK:
2512 case OP_ASSERTBACK_NOT:
2513 {
2514 int rc;
2515 int local_offsets[2];
2516 int local_workspace[1000];
2517 const pcre_uchar *endasscode = code + GET(code, 1);
2518
2519 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2520
2521 rc = internal_dfa_exec(
2522 md, /* static match data */
2523 code, /* this subexpression's code */
2524 ptr, /* where we currently are */
2525 (int)(ptr - start_subject), /* start offset */
2526 local_offsets, /* offset vector */
2527 sizeof(local_offsets)/sizeof(int), /* size of same */
2528 local_workspace, /* workspace vector */
2529 sizeof(local_workspace)/sizeof(int), /* size of same */
2530 rlevel); /* function recursion level */
2531
2532 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2533 if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2534 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2535 }
2536 break;
2537
2538 /*-----------------------------------------------------------------*/
2539 case OP_COND:
2540 case OP_SCOND:
2541 {
2542 int local_offsets[1000];
2543 int local_workspace[1000];
2544 int codelink = GET(code, 1);
2545 int condcode;
2546
2547 /* Because of the way auto-callout works during compile, a callout item
2548 is inserted between OP_COND and an assertion condition. This does not
2549 happen for the other conditions. */
2550
2551 if (code[LINK_SIZE+1] == OP_CALLOUT)
2552 {
2553 rrc = 0;
2554 if (PUBL(callout) != NULL)
2555 {
2556 PUBL(callout_block) cb;
2557 cb.version = 1; /* Version 1 of the callout block */
2558 cb.callout_number = code[LINK_SIZE+2];
2559 cb.offset_vector = offsets;
2560 #ifdef COMPILE_PCRE8
2561 cb.subject = (PCRE_SPTR)start_subject;
2562 #else
2563 cb.subject = (PCRE_SPTR16)start_subject;
2564 #endif
2565 cb.subject_length = (int)(end_subject - start_subject);
2566 cb.start_match = (int)(current_subject - start_subject);
2567 cb.current_position = (int)(ptr - start_subject);
2568 cb.pattern_position = GET(code, LINK_SIZE + 3);
2569 cb.next_item_length = GET(code, 3 + 2*LINK_SIZE);
2570 cb.capture_top = 1;
2571 cb.capture_last = -1;
2572 cb.callout_data = md->callout_data;
2573 cb.mark = NULL; /* No (*MARK) support */
2574 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2575 }
2576 if (rrc > 0) break; /* Fail this thread */
2577 code += PRIV(OP_lengths)[OP_CALLOUT]; /* Skip callout data */
2578 }
2579
2580 condcode = code[LINK_SIZE+1];
2581
2582 /* Back reference conditions are not supported */
2583
2584 if (condcode == OP_CREF || condcode == OP_NCREF)
2585 return PCRE_ERROR_DFA_UCOND;
2586
2587 /* The DEFINE condition is always false */
2588
2589 if (condcode == OP_DEF)
2590 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2591
2592 /* The only supported version of OP_RREF is for the value RREF_ANY,
2593 which means "test if in any recursion". We can't test for specifically
2594 recursed groups. */
2595
2596 else if (condcode == OP_RREF || condcode == OP_NRREF)
2597 {
2598 int value = GET2(code, LINK_SIZE + 2);
2599 if (value != RREF_ANY) return PCRE_ERROR_DFA_UCOND;
2600 if (md->recursive != NULL)
2601 { ADD_ACTIVE(state_offset + LINK_SIZE + 2 + IMM2_SIZE, 0); }
2602 else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2603 }
2604
2605 /* Otherwise, the condition is an assertion */
2606
2607 else
2608 {
2609 int rc;
2610 const pcre_uchar *asscode = code + LINK_SIZE + 1;
2611 const pcre_uchar *endasscode = asscode + GET(asscode, 1);
2612
2613 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2614
2615 rc = internal_dfa_exec(
2616 md, /* fixed match data */
2617 asscode, /* this subexpression's code */
2618 ptr, /* where we currently are */
2619 (int)(ptr - start_subject), /* start offset */
2620 local_offsets, /* offset vector */
2621 sizeof(local_offsets)/sizeof(int), /* size of same */
2622 local_workspace, /* workspace vector */
2623 sizeof(local_workspace)/sizeof(int), /* size of same */
2624 rlevel); /* function recursion level */
2625
2626 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2627 if ((rc >= 0) ==
2628 (condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2629 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2630 else
2631 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2632 }
2633 }
2634 break;
2635
2636 /*-----------------------------------------------------------------*/
2637 case OP_RECURSE:
2638 {
2639 dfa_recursion_info *ri;
2640 int local_offsets[1000];
2641 int local_workspace[1000];
2642 const pcre_uchar *callpat = start_code + GET(code, 1);
2643 int recno = (callpat == md->start_code)? 0 :
2644 GET2(callpat, 1 + LINK_SIZE);
2645 int rc;
2646
2647 DPRINTF(("%.*sStarting regex recursion\n", rlevel*2-2, SP));
2648
2649 /* Check for repeating a recursion without advancing the subject
2650 pointer. This should catch convoluted mutual recursions. (Some simple
2651 cases are caught at compile time.) */
2652
2653 for (ri = md->recursive; ri != NULL; ri = ri->prevrec)
2654 if (recno == ri->group_num && ptr == ri->subject_position)
2655 return PCRE_ERROR_RECURSELOOP;
2656
2657 /* Remember this recursion and where we started it so as to
2658 catch infinite loops. */
2659
2660 new_recursive.group_num = recno;
2661 new_recursive.subject_position = ptr;
2662 new_recursive.prevrec = md->recursive;
2663 md->recursive = &new_recursive;
2664
2665 rc = internal_dfa_exec(
2666 md, /* fixed match data */
2667 callpat, /* this subexpression's code */
2668 ptr, /* where we currently are */
2669 (int)(ptr - start_subject), /* start offset */
2670 local_offsets, /* offset vector */
2671 sizeof(local_offsets)/sizeof(int), /* size of same */
2672 local_workspace, /* workspace vector */
2673 sizeof(local_workspace)/sizeof(int), /* size of same */
2674 rlevel); /* function recursion level */
2675
2676 md->recursive = new_recursive.prevrec; /* Done this recursion */
2677
2678 DPRINTF(("%.*sReturn from regex recursion: rc=%d\n", rlevel*2-2, SP,
2679 rc));
2680
2681 /* Ran out of internal offsets */
2682
2683 if (rc == 0) return PCRE_ERROR_DFA_RECURSE;
2684
2685 /* For each successful matched substring, set up the next state with a
2686 count of characters to skip before trying it. Note that the count is in
2687 characters, not bytes. */
2688
2689 if (rc > 0)
2690 {
2691 for (rc = rc*2 - 2; rc >= 0; rc -= 2)
2692 {
2693 int charcount = local_offsets[rc+1] - local_offsets[rc];
2694 #ifdef SUPPORT_UTF
2695 const pcre_uchar *p = start_subject + local_offsets[rc];
2696 const pcre_uchar *pp = start_subject + local_offsets[rc+1];
2697 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2698 #endif
2699 if (charcount > 0)
2700 {
2701 ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0, (charcount - 1));
2702 }
2703 else
2704 {
2705 ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
2706 }
2707 }
2708 }
2709 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2710 }
2711 break;
2712
2713 /*-----------------------------------------------------------------*/
2714 case OP_BRAPOS:
2715 case OP_SBRAPOS:
2716 case OP_CBRAPOS:
2717 case OP_SCBRAPOS:
2718 case OP_BRAPOSZERO:
2719 {
2720 int charcount, matched_count;
2721 const pcre_uchar *local_ptr = ptr;
2722 BOOL allow_zero;
2723
2724 if (codevalue == OP_BRAPOSZERO)
2725 {
2726 allow_zero = TRUE;
2727 codevalue = *(++code); /* Codevalue will be one of above BRAs */
2728 }
2729 else allow_zero = FALSE;
2730
2731 /* Loop to match the subpattern as many times as possible as if it were
2732 a complete pattern. */
2733
2734 for (matched_count = 0;; matched_count++)
2735 {
2736 int local_offsets[2];
2737 int local_workspace[1000];
2738
2739 int rc = internal_dfa_exec(
2740 md, /* fixed match data */
2741 code, /* this subexpression's code */
2742 local_ptr, /* where we currently are */
2743 (int)(ptr - start_subject), /* start offset */
2744 local_offsets, /* offset vector */
2745 sizeof(local_offsets)/sizeof(int), /* size of same */
2746 local_workspace, /* workspace vector */
2747 sizeof(local_workspace)/sizeof(int), /* size of same */
2748 rlevel); /* function recursion level */
2749
2750 /* Failed to match */
2751
2752 if (rc < 0)
2753 {
2754 if (rc != PCRE_ERROR_NOMATCH) return rc;
2755 break;
2756 }
2757
2758 /* Matched: break the loop if zero characters matched. */
2759
2760 charcount = local_offsets[1] - local_offsets[0];
2761 if (charcount == 0) break;
2762 local_ptr += charcount; /* Advance temporary position ptr */
2763 }
2764
2765 /* At this point we have matched the subpattern matched_count
2766 times, and local_ptr is pointing to the character after the end of the
2767 last match. */
2768
2769 if (matched_count > 0 || allow_zero)
2770 {
2771 const pcre_uchar *end_subpattern = code;
2772 int next_state_offset;
2773
2774 do { end_subpattern += GET(end_subpattern, 1); }
2775 while (*end_subpattern == OP_ALT);
2776 next_state_offset =
2777 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2778
2779 /* Optimization: if there are no more active states, and there
2780 are no new states yet set up, then skip over the subject string
2781 right here, to save looping. Otherwise, set up the new state to swing
2782 into action when the end of the matched substring is reached. */
2783
2784 if (i + 1 >= active_count && new_count == 0)
2785 {
2786 ptr = local_ptr;
2787 clen = 0;
2788 ADD_NEW(next_state_offset, 0);
2789 }
2790 else
2791 {
2792 const pcre_uchar *p = ptr;
2793 const pcre_uchar *pp = local_ptr;
2794 charcount = (int)(pp - p);
2795 #ifdef SUPPORT_UTF
2796 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2797 #endif
2798 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2799 }
2800 }
2801 }
2802 break;
2803
2804 /*-----------------------------------------------------------------*/
2805 case OP_ONCE:
2806 case OP_ONCE_NC:
2807 {
2808 int local_offsets[2];
2809 int local_workspace[1000];
2810
2811 int rc = internal_dfa_exec(
2812 md, /* fixed match data */
2813 code, /* this subexpression's code */
2814 ptr, /* where we currently are */
2815 (int)(ptr - start_subject), /* start offset */
2816 local_offsets, /* offset vector */
2817 sizeof(local_offsets)/sizeof(int), /* size of same */
2818 local_workspace, /* workspace vector */
2819 sizeof(local_workspace)/sizeof(int), /* size of same */
2820 rlevel); /* function recursion level */
2821
2822 if (rc >= 0)
2823 {
2824 const pcre_uchar *end_subpattern = code;
2825 int charcount = local_offsets[1] - local_offsets[0];
2826 int next_state_offset, repeat_state_offset;
2827
2828 do { end_subpattern += GET(end_subpattern, 1); }
2829 while (*end_subpattern == OP_ALT);
2830 next_state_offset =
2831 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2832
2833 /* If the end of this subpattern is KETRMAX or KETRMIN, we must
2834 arrange for the repeat state also to be added to the relevant list.
2835 Calculate the offset, or set -1 for no repeat. */
2836
2837 repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
2838 *end_subpattern == OP_KETRMIN)?
2839 (int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
2840
2841 /* If we have matched an empty string, add the next state at the
2842 current character pointer. This is important so that the duplicate
2843 checking kicks in, which is what breaks infinite loops that match an
2844 empty string. */
2845
2846 if (charcount == 0)
2847 {
2848 ADD_ACTIVE(next_state_offset, 0);
2849 }
2850
2851 /* Optimization: if there are no more active states, and there
2852 are no new states yet set up, then skip over the subject string
2853 right here, to save looping. Otherwise, set up the new state to swing
2854 into action when the end of the matched substring is reached. */
2855
2856 else if (i + 1 >= active_count && new_count == 0)
2857 {
2858 ptr += charcount;
2859 clen = 0;
2860 ADD_NEW(next_state_offset, 0);
2861
2862 /* If we are adding a repeat state at the new character position,
2863 we must fudge things so that it is the only current state.
2864 Otherwise, it might be a duplicate of one we processed before, and
2865 that would cause it to be skipped. */
2866
2867 if (repeat_state_offset >= 0)
2868 {
2869 next_active_state = active_states;
2870 active_count = 0;
2871 i = -1;
2872 ADD_ACTIVE(repeat_state_offset, 0);
2873 }
2874 }
2875 else
2876 {
2877 #ifdef SUPPORT_UTF
2878 const pcre_uchar *p = start_subject + local_offsets[0];
2879 const pcre_uchar *pp = start_subject + local_offsets[1];
2880 while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
2881 #endif
2882 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2883 if (repeat_state_offset >= 0)
2884 { ADD_NEW_DATA(-repeat_state_offset, 0, (charcount - 1)); }
2885 }
2886 }
2887 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2888 }
2889 break;
2890
2891
2892 /* ========================================================================== */
2893 /* Handle callouts */
2894
2895 case OP_CALLOUT:
2896 rrc = 0;
2897 if (PUBL(callout) != NULL)
2898 {
2899 PUBL(callout_block) cb;
2900 cb.version = 1; /* Version 1 of the callout block */
2901 cb.callout_number = code[1];
2902 cb.offset_vector = offsets;
2903 #ifdef COMPILE_PCRE8
2904 cb.subject = (PCRE_SPTR)start_subject;
2905 #else
2906 cb.subject = (PCRE_SPTR16)start_subject;
2907 #endif
2908 cb.subject_length = (int)(end_subject - start_subject);
2909 cb.start_match = (int)(current_subject - start_subject);
2910 cb.current_position = (int)(ptr - start_subject);
2911 cb.pattern_position = GET(code, 2);
2912 cb.next_item_length = GET(code, 2 + LINK_SIZE);
2913 cb.capture_top = 1;
2914 cb.capture_last = -1;
2915 cb.callout_data = md->callout_data;
2916 cb.mark = NULL; /* No (*MARK) support */
2917 if ((rrc = (*PUBL(callout))(&cb)) < 0) return rrc; /* Abandon */
2918 }
2919 if (rrc == 0)
2920 { ADD_ACTIVE(state_offset + PRIV(OP_lengths)[OP_CALLOUT], 0); }
2921 break;
2922
2923
2924 /* ========================================================================== */
2925 default: /* Unsupported opcode */
2926 return PCRE_ERROR_DFA_UITEM;
2927 }
2928
2929 NEXT_ACTIVE_STATE: continue;
2930
2931 } /* End of loop scanning active states */
2932
2933 /* We have finished the processing at the current subject character. If no
2934 new states have been set for the next character, we have found all the
2935 matches that we are going to find. If we are at the top level and partial
2936 matching has been requested, check for appropriate conditions.
2937
2938 The "forced_fail" variable counts the number of (*F) encountered for the
2939 character. If it is equal to the original active_count (saved in
2940 workspace[1]) it means that (*F) was found on every active state. In this
2941 case we don't want to give a partial match.
2942
2943 The "could_continue" variable is true if a state could have continued but
2944 for the fact that the end of the subject was reached. */
2945
2946 if (new_count <= 0)
2947 {
2948 if (rlevel == 1 && /* Top level, and */
2949 could_continue && /* Some could go on */
2950 forced_fail != workspace[1] && /* Not all forced fail & */
2951 ( /* either... */
2952 (md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
2953 || /* or... */
2954 ((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
2955 match_count < 0) /* no matches */
2956 ) && /* And... */
2957 ptr >= end_subject && /* Reached end of subject */
2958 ptr > md->start_used_ptr) /* Inspected non-empty string */
2959 {
2960 if (offsetcount >= 2)
2961 {
2962 offsets[0] = (int)(md->start_used_ptr - start_subject);
2963 offsets[1] = (int)(end_subject - start_subject);
2964 }
2965 match_count = PCRE_ERROR_PARTIAL;
2966 }
2967
2968 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
2969 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel, match_count,
2970 rlevel*2-2, SP));
2971 break; /* In effect, "return", but see the comment below */
2972 }
2973
2974 /* One or more states are active for the next character. */
2975
2976 ptr += clen; /* Advance to next subject character */
2977 } /* Loop to move along the subject string */
2978
2979 /* Control gets here from "break" a few lines above. We do it this way because
2980 if we use "return" above, we have compiler trouble. Some compilers warn if
2981 there's nothing here because they think the function doesn't return a value. On
2982 the other hand, if we put a dummy statement here, some more clever compilers
2983 complain that it can't be reached. Sigh. */
2984
2985 return match_count;
2986 }
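
/* Illustration only, not part of PCRE. The OP_RECURSE, OP_BRAPOS and OP_ONCE
cases above turn a matched length in code units into a count of characters by
subtracting one for every unit for which NOT_FIRSTCHAR is true. The sketch
below shows the same idea for the 8-bit library, on the assumption that
NOT_FIRSTCHAR tests for a UTF-8 continuation byte (10xxxxxx). The names
is_continuation_byte and count_utf8_chars are hypothetical. */

```c
#include <assert.h>

/* Assumed equivalent of NOT_FIRSTCHAR for 8-bit UTF-8 subjects. */
static int is_continuation_byte(unsigned char b)
{
return (b & 0xc0) == 0x80;
}

/* Count the characters in the byte range [p, pp): start from the byte count
and discount every continuation byte, as the cases above do. */
static int count_utf8_chars(const unsigned char *p, const unsigned char *pp)
{
int charcount;
assert(p <= pp);
charcount = (int)(pp - p);
while (p < pp) if (is_continuation_byte(*p++)) charcount--;
return charcount;
}
```

/* For the six bytes 61 C3 A9 E2 82 AC (the characters "a", U+00E9, U+20AC)
count_utf8_chars returns 3. The cases above then pass charcount - 1 as the
data value of the negative state, that is, the number of further characters to
skip before the state becomes live. */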
2987
2988
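
/* Illustration only, not part of PCRE. The class-handling opcodes in the
function above test membership in a simple class with a 32-byte bit table:
character c is in the class if bit (c & 7) of byte c/8 is set, and a character
above 255 is taken to match only for OP_NCLASS. A minimal sketch of that
lookup follows; class_bitmap, class_add and class_contains are hypothetical
names, and the is_nclass flag stands in for the codevalue == OP_NCLASS test. */

```c
#include <assert.h>

typedef struct { unsigned char bits[32]; } class_bitmap;

/* Set the bit for a character in the range 0..255. */
static void class_add(class_bitmap *m, unsigned int c)
{
assert(c < 256);
m->bits[c/8] |= (unsigned char)(1u << (c&7));
}

/* Mirror of the isinclass computation: characters above 255 are outside the
table, so the result for them depends only on whether the class is an NCLASS. */
static int class_contains(const class_bitmap *m, unsigned int c, int is_nclass)
{
if (c > 255) return is_nclass != 0;
return (m->bits[c/8] & (1u << (c&7))) != 0;
}
```

/* 256 characters at one bit each give the 32 bytes, which is why ecode is
advanced by 32 / sizeof(pcre_uchar) code units past the opcode. */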
2989
2990
2991 /*************************************************
2992 * Execute a Regular Expression - DFA engine *
2993 *************************************************/
2994
2995 /* This external function applies a compiled re to a subject string using a DFA
2996 engine. This function calls the internal function multiple times if the pattern
2997 is not anchored.
2998
2999 Arguments:
3000 argument_re points to the compiled expression
3001 extra_data points to extra data or is NULL
3002 subject points to the subject string
3003 length length of subject string (may contain binary zeros)
3004 start_offset where to start in the subject string
3005 options option bits
3006 offsets vector of match offsets
3007 offsetcount size of same
3008 workspace workspace vector
3009 wscount size of same
3010
3011 Returns: > 0 => number of match offset pairs placed in offsets
3012 = 0 => offsets overflowed; longest matches are present
3013 -1 => failed to match
3014 < -1 => some kind of unexpected problem
3015 */
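
/* Illustration only, not part of PCRE. A caller might sort the return value
into the four ranges listed above as sketched below. The dfa_outcome type and
classify_dfa_result are hypothetical names. Note that the internal function
can also produce PCRE_ERROR_PARTIAL, which is negative, so a caller that
requests partial matching should test for it before treating a value below -1
as a failure. */

```c
typedef enum
{
DFA_MATCHED,            /* > 0: that many offset pairs were stored */
DFA_MATCHED_TRUNCATED,  /* = 0: offsets overflowed; longest matches kept */
DFA_NOMATCH,            /* -1: failed to match */
DFA_ERROR               /* < -1: some other condition */
} dfa_outcome;

static dfa_outcome classify_dfa_result(int rc)
{
if (rc > 0) return DFA_MATCHED;
if (rc == 0) return DFA_MATCHED_TRUNCATED;
if (rc == -1) return DFA_NOMATCH;
return DFA_ERROR;
}
```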
3016
3017 #ifdef COMPILE_PCRE8
3018 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3019 pcre_dfa_exec(const pcre *argument_re, const pcre_extra *extra_data,
3020 const char *subject, int length, int start_offset, int options, int *offsets,
3021 int offsetcount, int *workspace, int wscount)
3022 #else
3023 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
3024 pcre16_dfa_exec(const pcre16 *argument_re, const pcre16_extra *extra_data,
3025 PCRE_SPTR16 subject, int length, int start_offset, int options, int *offsets,
3026 int offsetcount, int *workspace, int wscount)
3027 #endif
3028 {
3029 REAL_PCRE *re = (REAL_PCRE *)argument_re;
3030 dfa_match_data match_block;
3031 dfa_match_data *md = &match_block;
3032 BOOL utf, anchored, startline, firstline;
3033 const pcre_uchar *current_subject, *end_subject;
3034 const pcre_study_data *study = NULL;
3035
3036 const pcre_uchar *req_char_ptr;
3037 const pcre_uint8 *start_bits = NULL;
3038 BOOL has_first_char = FALSE;
3039 BOOL has_req_char = FALSE;
3040 pcre_uchar first_char = 0;
3041 pcre_uchar first_char2 = 0;
3042 pcre_uchar req_char = 0;
3043 pcre_uchar req_char2 = 0;
3044 int newline;
3045
3046 /* Plausibility checks */
3047
3048 if ((options & ~PUBLIC_DFA_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
3049 if (re == NULL || subject == NULL || workspace == NULL ||
3050 (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
3051 if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
3052 if (wscount < 20) return PCRE_ERROR_DFA_WSSIZE;
3053 if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
3054
3055 /* We need to find the pointer to any study data before we test for byte
3056 flipping, so we scan the extra_data block first. This may set two fields in the
3057 match block, so we must initialize them beforehand. However, the other fields
3058 in the match block must not be set until after the byte flipping. */
3059
3060 md->tables = re->tables;
3061 md->callout_data = NULL;
3062
3063 if (extra_data != NULL)
3064 {
3065 unsigned int flags = extra_data->flags;
3066 if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
3067 study = (const pcre_study_data *)extra_data->study_data;
3068 if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0) return PCRE_ERROR_DFA_UMLIMIT;
3069 if ((flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION) != 0)
3070 return PCRE_ERROR_DFA_UMLIMIT;
3071 if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
3072 md->callout_data = extra_data->callout_data;
3073 if ((flags & PCRE_EXTRA_TABLES) != 0)
3074 md->tables = extra_data->tables;
3075 }
3076
/* Check that the first field in the block is the magic number. If it is not,
return with PCRE_ERROR_BADMAGIC. However, if the magic number is equal to
REVERSED_MAGIC_NUMBER we return with PCRE_ERROR_BADENDIANNESS, which
means that the pattern is likely compiled with different endianness. */

if (re->magic_number != MAGIC_NUMBER)
  return re->magic_number == REVERSED_MAGIC_NUMBER?
    PCRE_ERROR_BADENDIANNESS:PCRE_ERROR_BADMAGIC;
if ((re->flags & PCRE_MODE) == 0) return PCRE_ERROR_BADMODE;

/* Set some local values */

current_subject = (const pcre_uchar *)subject + start_offset;
end_subject = (const pcre_uchar *)subject + length;
req_char_ptr = current_subject - 1;

#ifdef SUPPORT_UTF
/* PCRE_UTF16 has the same value as PCRE_UTF8. */
utf = (re->options & PCRE_UTF8) != 0;
#else
utf = FALSE;
#endif

anchored = (options & (PCRE_ANCHORED|PCRE_DFA_RESTART)) != 0 ||
  (re->options & PCRE_ANCHORED) != 0;

/* The remaining fixed data for passing around. */

md->start_code = (const pcre_uchar *)argument_re +
    re->name_table_offset + re->name_count * re->name_entry_size;
md->start_subject = (const pcre_uchar *)subject;
md->end_subject = end_subject;
md->start_offset = start_offset;
md->moptions = options;
md->poptions = re->options;

/* If the BSR option is not set at match time, copy what was set
at compile time. */

if ((md->moptions & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) == 0)
  {
  if ((re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) != 0)
    md->moptions |= re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE);
#ifdef BSR_ANYCRLF
  else md->moptions |= PCRE_BSR_ANYCRLF;
#endif
  }

/* Handle different types of newline. The three bits give eight cases. If
nothing is set at run time, whatever was used at compile time applies. */

switch ((((options & PCRE_NEWLINE_BITS) == 0)? re->options : (pcre_uint32)options) &
         PCRE_NEWLINE_BITS)
  {
  case 0: newline = NEWLINE; break;   /* Compile-time default */
  case PCRE_NEWLINE_CR: newline = CHAR_CR; break;
  case PCRE_NEWLINE_LF: newline = CHAR_NL; break;
  case PCRE_NEWLINE_CR+
       PCRE_NEWLINE_LF: newline = (CHAR_CR << 8) | CHAR_NL; break;
  case PCRE_NEWLINE_ANY: newline = -1; break;
  case PCRE_NEWLINE_ANYCRLF: newline = -2; break;
  default: return PCRE_ERROR_BADNEWLINE;
  }

if (newline == -2)
  {
  md->nltype = NLTYPE_ANYCRLF;
  }
else if (newline < 0)
  {
  md->nltype = NLTYPE_ANY;
  }
else
  {
  md->nltype = NLTYPE_FIXED;
  if (newline > 255)
    {
    md->nllen = 2;
    md->nl[0] = (newline >> 8) & 255;
    md->nl[1] = newline & 255;
    }
  else
    {
    md->nllen = 1;
    md->nl[0] = newline;
    }
  }

/* Check a UTF-8 string if required. If the string is invalid and the offsets
vector has at least two elements, the error offset and the error code are
passed back in offsets[0] and offsets[1]. */

#ifdef SUPPORT_UTF
if (utf && (options & PCRE_NO_UTF8_CHECK) == 0)
  {
  int erroroffset;
  int errorcode = PRIV(valid_utf)((pcre_uchar *)subject, length, &erroroffset);
  if (errorcode != 0)
    {
    if (offsetcount >= 2)
      {
      offsets[0] = erroroffset;
      offsets[1] = errorcode;
      }
    return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0)?
      PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
    }
  if (start_offset > 0 && start_offset < length &&
        NOT_FIRSTCHAR(((PCRE_PUCHAR)subject)[start_offset]))
    return PCRE_ERROR_BADUTF8_OFFSET;
  }
#endif

/* If the exec call supplied NULL for tables, use the inbuilt ones. This
is a feature that makes it possible to save compiled regexes and re-use them
in other programs later. */

if (md->tables == NULL) md->tables = PRIV(default_tables);

/* The "must be at the start of a line" flags are used in a loop when finding
where to start. */

startline = (re->flags & PCRE_STARTLINE) != 0;
firstline = (re->options & PCRE_FIRSTLINE) != 0;

/* Set up the first character to match, if available. The first_char value is
never set for an anchored regular expression, but the anchoring may be forced
at run time, so we have to test for anchoring. The first char may be unset for
an unanchored pattern, of course. If there's no first char and the pattern was
studied, there may be a bitmap of possible first characters. */

if (!anchored)
  {
  if ((re->flags & PCRE_FIRSTSET) != 0)
    {
    has_first_char = TRUE;
    first_char = first_char2 = re->first_char;
    if ((re->flags & PCRE_FCH_CASELESS) != 0)
      {
      first_char2 = TABLE_GET(first_char, md->tables + fcc_offset, first_char);
#if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
      if (utf && first_char > 127)
        first_char2 = UCD_OTHERCASE(first_char);
#endif
      }
    }
  else
    {
    if (!startline && study != NULL &&
         (study->flags & PCRE_STUDY_MAPPED) != 0)
      start_bits = study->start_bits;
    }
  }

/* For anchored or unanchored matches, there may be a "last known required
character" set. */

if ((re->flags & PCRE_REQCHSET) != 0)
  {
  has_req_char = TRUE;
  req_char = req_char2 = re->req_char;
  if ((re->flags & PCRE_RCH_CASELESS) != 0)
    {
    req_char2 = TABLE_GET(req_char, md->tables + fcc_offset, req_char);
#if defined SUPPORT_UCP && !(defined COMPILE_PCRE8)
    if (utf && req_char > 127)
      req_char2 = UCD_OTHERCASE(req_char);
#endif
    }
  }

/* Call the main matching function, looping for a non-anchored regex after a
failed match. If not restarting, perform certain optimizations at the start of
a match. */

for (;;)
  {
  int rc;

  if ((options & PCRE_DFA_RESTART) == 0)
    {
    const pcre_uchar *save_end_subject = end_subject;

    /* If firstline is TRUE, the start of the match is constrained to the first
    line of a multiline string. Implement this by temporarily adjusting
    end_subject so that we stop scanning at a newline. If the match fails at
    the newline, later code breaks this loop. */

    if (firstline)
      {
      PCRE_PUCHAR t = current_subject;
#ifdef SUPPORT_UTF
      if (utf)
        {
        while (t < md->end_subject && !IS_NEWLINE(t))
          {
          t++;
          ACROSSCHAR(t < end_subject, *t, t++);
          }
        }
      else
#endif
      while (t < md->end_subject && !IS_NEWLINE(t)) t++;
      end_subject = t;
      }

    /* There are some optimizations that avoid running the match if a known
    starting point is not found. However, there is an option that disables
    these, for testing and for ensuring that all callouts do actually occur.
    The option can be set in the regex by (*NO_START_OPT) or passed in
    match-time options. */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
      {
      /* Advance to a known first char. */

      if (has_first_char)
        {
        if (first_char != first_char2)
          while (current_subject < end_subject &&
              *current_subject != first_char && *current_subject != first_char2)
            current_subject++;
        else
          while (current_subject < end_subject &&
                 *current_subject != first_char)
            current_subject++;
        }

      /* Or to just after a linebreak for a multiline match if possible */

      else if (startline)
        {
        if (current_subject > md->start_subject + start_offset)
          {
#ifdef SUPPORT_UTF
          if (utf)
            {
            while (current_subject < end_subject &&
                   !WAS_NEWLINE(current_subject))
              {
              current_subject++;
              ACROSSCHAR(current_subject < end_subject, *current_subject,
                current_subject++);
              }
            }
          else
#endif
          while (current_subject < end_subject && !WAS_NEWLINE(current_subject))
            current_subject++;

          /* If we have just passed a CR and the newline option is ANY or
          ANYCRLF, and we are now at a LF, advance the match position by one
          more character. */

          if (current_subject[-1] == CHAR_CR &&
               (md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
               current_subject < end_subject &&
               *current_subject == CHAR_NL)
            current_subject++;
          }
        }

      /* Or to a non-unique first char after study */

      else if (start_bits != NULL)
        {
        while (current_subject < end_subject)
          {
          register unsigned int c = *current_subject;
#ifndef COMPILE_PCRE8
          if (c > 255) c = 255;
#endif
          if ((start_bits[c/8] & (1 << (c&7))) == 0)
            {
            current_subject++;
#if defined SUPPORT_UTF && defined COMPILE_PCRE8
            /* In non 8-bit mode, the iteration will stop for
            characters > 255 at the beginning or not stop at all. */
            if (utf)
              ACROSSCHAR(current_subject < end_subject, *current_subject,
                current_subject++);
#endif
            }
          else break;
          }
        }
      }

    /* Restore fudged end_subject */

    end_subject = save_end_subject;

    /* The following two optimizations are disabled for partial matching or if
    disabling is explicitly requested (and of course, by the test above, this
    code is not obeyed when restarting after a partial match). */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0 &&
        (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
      {
      /* If the pattern was studied, a minimum subject length may be set. This
      is a lower bound; there may be no string of exactly that length that
      matches the pattern. Although the value is, strictly, in characters, we
      treat it as bytes to avoid spending too much time in this optimization. */

      if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
          (pcre_uint32)(end_subject - current_subject) < study->minlength)
        return PCRE_ERROR_NOMATCH;

      /* If req_char is set, we know that that character must appear in the
      subject for the match to succeed. If the first character is set, req_char
      must be later in the subject; otherwise the test starts at the match
      point. This optimization can save a huge amount of work in patterns with
      nested unlimited repeats that aren't going to match. Writing separate
      code for cased/caseless versions makes it go faster, as does using an
      autoincrement and backing off on a match.

      HOWEVER: when the subject string is very, very long, searching to its end
      can take a long time, and give bad performance on quite ordinary
      patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
      string... so we don't do this when the string is sufficiently long. */

      if (has_req_char && end_subject - current_subject < REQ_BYTE_MAX)
        {
        register PCRE_PUCHAR p = current_subject + (has_first_char? 1:0);

        /* We don't need to repeat the search if we haven't yet reached the
        place we found it at last time. */

        if (p > req_char_ptr)
          {
          if (req_char != req_char2)
            {
            while (p < end_subject)
              {
              register int pp = *p++;
              if (pp == req_char || pp == req_char2) { p--; break; }
              }
            }
          else
            {
            while (p < end_subject)
              {
              if (*p++ == req_char) { p--; break; }
              }
            }

          /* If we can't find the required character, break the matching loop,
          which will cause a return of PCRE_ERROR_NOMATCH. */

          if (p >= end_subject) break;

          /* If we have found the required character, save the point where we
          found it, so that we don't search again next time round the loop if
          the start hasn't passed this character yet. */

          req_char_ptr = p;
          }
        }
      }
    }   /* End of optimizations that are done when not restarting */

  /* OK, now we can do the business */

  md->start_used_ptr = current_subject;
  md->recursive = NULL;

  rc = internal_dfa_exec(
    md,                                /* fixed match data */
    md->start_code,                    /* this subexpression's code */
    current_subject,                   /* where we currently are */
    start_offset,                      /* start offset in subject */
    offsets,                           /* offset vector */
    offsetcount,                       /* size of same */
    workspace,                         /* workspace vector */
    wscount,                           /* size of same */
    0);                                /* function recurse level */

  /* Anything other than "no match" means we are done, always; otherwise, carry
  on only if not anchored. */

  if (rc != PCRE_ERROR_NOMATCH || anchored) return rc;

  /* Advance to the next subject character unless we are at the end of a line
  and firstline is set. */

  if (firstline && IS_NEWLINE(current_subject)) break;
  current_subject++;
#ifdef SUPPORT_UTF
  if (utf)
    {
    ACROSSCHAR(current_subject < end_subject, *current_subject,
      current_subject++);
    }
#endif
  if (current_subject > end_subject) break;

  /* If we have just passed a CR and we are now at a LF, and the pattern does
  not contain any explicit matches for \r or \n, and the newline option is CRLF
  or ANY or ANYCRLF, advance the match position by one more character. */

  if (current_subject[-1] == CHAR_CR &&
      current_subject < end_subject &&
      *current_subject == CHAR_NL &&
      (re->flags & PCRE_HASCRORLF) == 0 &&
        (md->nltype == NLTYPE_ANY ||
         md->nltype == NLTYPE_ANYCRLF ||
         md->nllen == 2))
    current_subject++;

  }   /* "Bumpalong" loop */

return PCRE_ERROR_NOMATCH;
}

/* End of pcre_dfa_exec.c */
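
The start-bits optimization in the function above ("Or to a non-unique first char after study") can be sketched in isolation as follows. This is only a minimal illustration, assuming 8-bit, non-UTF subjects and a 32-byte bitmap with one bit per possible first byte. The helper names `set_start_bit` and `skip_to_possible_start` are hypothetical and are not part of PCRE.

```c
#include <stddef.h>

/* Hypothetical helper: record that byte c may start a match. The bitmap has
256 bits (32 bytes); bit (c & 7) of byte (c / 8) stands for character c. */

static void
set_start_bit(unsigned char *bits, unsigned char c)
{
bits[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Hypothetical helper: advance p until it points at a byte whose bit is set
in the bitmap, or until it reaches end. This mirrors the test
(start_bits[c/8] & (1 << (c&7))) == 0 in the loop above, without the UTF
and 16-bit special cases. */

static const unsigned char *
skip_to_possible_start(const unsigned char *p, const unsigned char *end,
  const unsigned char *bits)
{
while (p < end)
  {
  unsigned int c = *p;
  if ((bits[c / 8] & (1u << (c & 7))) != 0) break;
  p++;
  }
return p;
}
```

A caller would fill the bitmap once per pattern and then call `skip_to_possible_start()` before each match attempt, so that the matcher is never run at a position whose first byte cannot begin a match.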