Contents of /code/trunk/pcre_dfa_exec.c

Revision 602
Wed May 25 08:29:03 2011 UTC by ph10
File MIME type: text/plain
File size: 114806 byte(s)
Remove OP_OPT by handling /i and /m entirely at compile time. Fixes bug with 
patterns like /(?i:([^b]))(?1)/, where the /i option was mishandled.
/*************************************************
*      Perl-Compatible Regular Expressions       *
*************************************************/

/* PCRE is a library of functions to support regular expressions whose syntax
and semantics are as close as possible to those of the Perl 5 language (but see
below for why this module is different).

                       Written by Philip Hazel
           Copyright (c) 1997-2011 University of Cambridge

-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
      this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    * Neither the name of the University of Cambridge nor the names of its
      contributors may be used to endorse or promote products derived from
      this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
-----------------------------------------------------------------------------
*/


/* This module contains the external function pcre_dfa_exec(), which is an
alternative matching function that uses a sort of DFA algorithm (not a true
FSM). This is NOT Perl-compatible, but it has advantages in certain
applications. */


/* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
the performance of his patterns greatly. I could not use it as it stood, as it
was not thread safe, and made assumptions about pattern sizes. Also, it caused
test 7 to loop, and test 9 to crash with a segfault.

The issue is the check for duplicate states, which is done by a simple linear
search up the state list. (Grep for "duplicate" below to find the code.) For
many patterns, there will never be many states active at one time, so a simple
linear search is fine. In patterns that have many active states, it might be a
bottleneck. The suggested code used an indexing scheme to remember which states
had previously been used for each character, and avoided the linear search when
it knew there was no chance of a duplicate. This was implemented when adding
states to the state lists.

I wrote some thread-safe, not-limited code to try something similar at the time
of checking for duplicates (instead of when adding states), using index vectors
on the stack. It did give a 13% improvement with one specially constructed
pattern for certain subject strings, but on other strings and on many of the
simpler patterns in the test suite it did worse. The major problem, I think,
was the extra time to initialize the index. This had to be done for each call
of internal_dfa_exec(). (The supplied patch used a static vector, initialized
only once - I suspect this was the cause of the problems with the tests.)

Overall, I concluded that the gains in some cases did not outweigh the losses
in others, so I abandoned this code. */

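/* ILLUSTRATION (not part of PCRE): a minimal, standalone sketch of the simple
linear duplicate check described in the performance note. The names demo_state
and demo_is_duplicate are hypothetical and exist only for this example; the
real check compares the offset and count fields of stateblock entries inside
the matching loop. */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a state record: opcode offset plus repeat count. */
typedef struct demo_state {
  int offset;
  int count;
} demo_state;

/* Return 1 if states[i] repeats an earlier entry (same offset and same
count), 0 otherwise. The scan costs O(i) comparisons per call, which is why
it can become a bottleneck when many states are active at once. */
static int
demo_is_duplicate(const demo_state *states, size_t i)
{
size_t j;
assert(states != NULL);
for (j = 0; j < i; j++)
  {
  if (states[j].offset == states[i].offset &&
      states[j].count == states[i].count)
    return 1;
  }
return 0;
}
```

/* With the list {10,0}, {12,0}, {10,1}, {12,0}, only the last entry is
reported as a duplicate: {10,1} shares an offset with {10,0} but has a
different count, so it is kept. */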


#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#define NLBLOCK md             /* Block containing newline information */
#define PSSTART start_subject  /* Field containing processed string start */
#define PSEND   end_subject    /* Field containing processed string end */

#include "pcre_internal.h"


/* For use to indent debugging output */

#define SP " "


/*************************************************
*      Code parameters and static tables         *
*************************************************/

/* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
into others, under special conditions. A gap of 20 between the blocks should be
enough. The resulting opcodes don't have to be less than 256 because they are
never stored, so we push them well clear of the normal opcodes. */

#define OP_PROP_EXTRA       300
#define OP_EXTUNI_EXTRA     320
#define OP_ANYNL_EXTRA      340
#define OP_HSPACE_EXTRA     360
#define OP_VSPACE_EXTRA     380

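/* ILLUSTRATION (not part of PCRE): a small sketch of why a gap of 20 between
the *_EXTRA bases is enough. The opcode tables in this file list 13 "Type"
repeat opcodes, so shifting each of them by one base never reaches the next
base. The name demo_shift and the value used for the first type-repeat opcode
in the usage note are assumptions made for this example only. */

```c
#include <assert.h>

#define DEMO_TYPE_REPEATS  13   /* Type *, *?, +, +?, ?, ??, upto, minupto,
                                   exact, *+, ++, ?+, upto+ */
#define DEMO_EXTRA_GAP     20   /* Spacing between the *_EXTRA bases */

/* Shift a type-repeat opcode into a private range. 'first' is the (assumed)
value of the first type-repeat opcode, 'index' selects one of the repeats in
the block, and 'extra' is one of the *_EXTRA bases such as 300 or 320. */
static int
demo_shift(int first, int index, int extra)
{
assert(index >= 0 && index < DEMO_TYPE_REPEATS);
return first + index + extra;
}
```

/* Assuming the first type-repeat opcode were 85, the block shifted by 300
would occupy 385..397 and the block shifted by 320 would start at 405, so the
ranges do not overlap and all lie above 255. */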

/* This table identifies those opcodes that are followed immediately by a
character that is to be tested in some way. This makes it possible to
centralize the loading of these characters. In the case of Type * etc, the
"character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
small value. Non-zero values in the table are the offsets from the opcode where
the character is to be found. ***NOTE*** If the start of this table is
modified, the three tables that follow must also be modified. */

static const uschar coptable[] = {
  0,                             /* End                                    */
  0, 0, 0, 0, 0,                 /* \A, \G, \K, \B, \b                     */
  0, 0, 0, 0, 0, 0,              /* \D, \d, \S, \s, \W, \w                 */
  0, 0, 0,                       /* Any, AllAny, Anybyte                   */
  0, 0,                          /* \P, \p                                 */
  0, 0, 0, 0, 0,                 /* \R, \H, \h, \V, \v                     */
  0,                             /* \X                                     */
  0, 0, 0, 0, 0, 0,              /* \Z, \z, ^, ^M, $, $M                   */
  1,                             /* Char                                   */
  1,                             /* Chari                                  */
  1,                             /* not                                    */
  1,                             /* noti                                   */
  /* Positive single-char repeats                                          */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ??                    */
  3, 3, 3,                       /* upto, minupto, exact                   */
  1, 1, 1, 3,                    /* *+, ++, ?+, upto+                      */
  1, 1, 1, 1, 1, 1,              /* *I, *?I, +I, +?I, ?I, ??I              */
  3, 3, 3,                       /* upto I, minupto I, exact I             */
  1, 1, 1, 3,                    /* *+I, ++I, ?+I, upto+I                  */
  /* Negative single-char repeats - only for chars < 256                   */
  1, 1, 1, 1, 1, 1,              /* NOT *, *?, +, +?, ?, ??                */
  3, 3, 3,                       /* NOT upto, minupto, exact               */
  1, 1, 1, 3,                    /* NOT *+, ++, ?+, upto+                  */
  1, 1, 1, 1, 1, 1,              /* NOT *I, *?I, +I, +?I, ?I, ??I          */
  3, 3, 3,                       /* NOT upto I, minupto I, exact I         */
  1, 1, 1, 3,                    /* NOT *+I, ++I, ?+I, upto+I              */
  /* Positive type repeats                                                 */
  1, 1, 1, 1, 1, 1,              /* Type *, *?, +, +?, ?, ??               */
  3, 3, 3,                       /* Type upto, minupto, exact              */
  1, 1, 1, 3,                    /* Type *+, ++, ?+, upto+                 */
  /* Character class & ref repeats                                         */
  0, 0, 0, 0, 0, 0,              /* *, *?, +, +?, ?, ??                    */
  0, 0,                          /* CRRANGE, CRMINRANGE                    */
  0,                             /* CLASS                                  */
  0,                             /* NCLASS                                 */
  0,                             /* XCLASS - variable length               */
  0,                             /* REF                                    */
  0,                             /* REFI                                   */
  0,                             /* RECURSE                                */
  0,                             /* CALLOUT                                */
  0,                             /* Alt                                    */
  0,                             /* Ket                                    */
  0,                             /* KetRmax                                */
  0,                             /* KetRmin                                */
  0,                             /* Assert                                 */
  0,                             /* Assert not                             */
  0,                             /* Assert behind                          */
  0,                             /* Assert behind not                      */
  0,                             /* Reverse                                */
  0, 0, 0, 0,                    /* ONCE, BRA, CBRA, COND                  */
  0, 0, 0,                       /* SBRA, SCBRA, SCOND                     */
  0, 0,                          /* CREF, NCREF                            */
  0, 0,                          /* RREF, NRREF                            */
  0,                             /* DEF                                    */
  0, 0,                          /* BRAZERO, BRAMINZERO                    */
  0, 0, 0,                       /* MARK, PRUNE, PRUNE_ARG,                */
  0, 0, 0, 0,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG,        */
  0, 0, 0, 0, 0                  /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO  */
};

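/* ILLUSTRATION (not part of PCRE): a toy version of the lookup that coptable
makes possible. The three-opcode instruction set, the table demo_coptable and
the function demo_inline_char are all hypothetical; only the idea (zero means
"no inline character", non-zero is the offset from the opcode to the
character) is taken from the comment above the table. */

```c
#include <assert.h>

enum { DEMO_END = 0, DEMO_CHAR = 1, DEMO_UPTO = 2, DEMO_OPCOUNT = 3 };

/* 0 = no inline character; otherwise the offset from the opcode to it. */
static const unsigned char demo_coptable[DEMO_OPCOUNT] = {
  0,   /* End                                                     */
  1,   /* Char: the character follows the opcode directly         */
  3    /* upto: a two-byte count comes first, then the character  */
};

/* Return the inline character for the opcode at code[0], or -1 if that
opcode has none. One table lookup replaces a per-opcode switch. */
static int
demo_inline_char(const unsigned char *code)
{
unsigned char off;
assert(code != 0 && code[0] < DEMO_OPCOUNT);
off = demo_coptable[code[0]];
return (off > 0)? code[off] : -1;
}
```

/* For the byte sequence { DEMO_UPTO, 0, 5, 'y' } the lookup skips the count
bytes and yields 'y'; for { DEMO_END } it yields -1. */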
/* This table identifies those opcodes that inspect a character. It is used to
remember the fact that a character could have been inspected when the end of
the subject is reached. ***NOTE*** If the start of this table is modified, the
two tables that follow must also be modified. */

static const uschar poptable[] = {
  0,                             /* End                                    */
  0, 0, 0, 1, 1,                 /* \A, \G, \K, \B, \b                     */
  1, 1, 1, 1, 1, 1,              /* \D, \d, \S, \s, \W, \w                 */
  1, 1, 1,                       /* Any, AllAny, Anybyte                   */
  1, 1,                          /* \P, \p                                 */
  1, 1, 1, 1, 1,                 /* \R, \H, \h, \V, \v                     */
  1,                             /* \X                                     */
  0, 0, 0, 0, 0, 0,              /* \Z, \z, ^, ^M, $, $M                   */
  1,                             /* Char                                   */
  1,                             /* Chari                                  */
  1,                             /* not                                    */
  1,                             /* noti                                   */
  /* Positive single-char repeats                                          */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ??                    */
  1, 1, 1,                       /* upto, minupto, exact                   */
  1, 1, 1, 1,                    /* *+, ++, ?+, upto+                      */
  1, 1, 1, 1, 1, 1,              /* *I, *?I, +I, +?I, ?I, ??I              */
  1, 1, 1,                       /* upto I, minupto I, exact I             */
  1, 1, 1, 1,                    /* *+I, ++I, ?+I, upto+I                  */
  /* Negative single-char repeats - only for chars < 256                   */
  1, 1, 1, 1, 1, 1,              /* NOT *, *?, +, +?, ?, ??                */
  1, 1, 1,                       /* NOT upto, minupto, exact               */
  1, 1, 1, 1,                    /* NOT *+, ++, ?+, upto+                  */
  1, 1, 1, 1, 1, 1,              /* NOT *I, *?I, +I, +?I, ?I, ??I          */
  1, 1, 1,                       /* NOT upto I, minupto I, exact I         */
  1, 1, 1, 1,                    /* NOT *+I, ++I, ?+I, upto+I              */
  /* Positive type repeats                                                 */
  1, 1, 1, 1, 1, 1,              /* Type *, *?, +, +?, ?, ??               */
  1, 1, 1,                       /* Type upto, minupto, exact              */
  1, 1, 1, 1,                    /* Type *+, ++, ?+, upto+                 */
  /* Character class & ref repeats                                         */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ??                    */
  1, 1,                          /* CRRANGE, CRMINRANGE                    */
  1,                             /* CLASS                                  */
  1,                             /* NCLASS                                 */
  1,                             /* XCLASS - variable length               */
  0,                             /* REF                                    */
  0,                             /* REFI                                   */
  0,                             /* RECURSE                                */
  0,                             /* CALLOUT                                */
  0,                             /* Alt                                    */
  0,                             /* Ket                                    */
  0,                             /* KetRmax                                */
  0,                             /* KetRmin                                */
  0,                             /* Assert                                 */
  0,                             /* Assert not                             */
  0,                             /* Assert behind                          */
  0,                             /* Assert behind not                      */
  0,                             /* Reverse                                */
  0, 0, 0, 0,                    /* ONCE, BRA, CBRA, COND                  */
  0, 0, 0,                       /* SBRA, SCBRA, SCOND                     */
  0, 0,                          /* CREF, NCREF                            */
  0, 0,                          /* RREF, NRREF                            */
  0,                             /* DEF                                    */
  0, 0,                          /* BRAZERO, BRAMINZERO                    */
  0, 0, 0,                       /* MARK, PRUNE, PRUNE_ARG,                */
  0, 0, 0, 0,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG,        */
  0, 0, 0, 0, 0                  /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO  */
};

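/* ILLUSTRATION (not part of PCRE): a toy sketch of how a poptable-style flag
can record that a match "could continue" when the subject runs out. The
opcode set, demo_poptable and demo_could_continue are hypothetical; the rule
itself (at end of subject, any active opcode that inspects a character means
more input might still match) follows the comment above the table. */

```c
#include <assert.h>

enum { DEMO_END = 0, DEMO_CIRC = 1, DEMO_CHAR = 2, DEMO_OPCOUNT = 3 };

/* 1 = the opcode inspects a subject character, 0 = it does not. */
static const unsigned char demo_poptable[DEMO_OPCOUNT] = {
  0,   /* End: inspects nothing          */
  0,   /* ^: tests the position only     */
  1    /* Char: inspects a character     */
};

/* clen is the length of the current character, 0 at the end of the subject.
Return 1 if some active opcode wanted a character that is not there. */
static int
demo_could_continue(const int *active_ops, int n, int clen)
{
int i;
if (clen != 0) return 0;
for (i = 0; i < n; i++)
  {
  assert(active_ops[i] >= 0 && active_ops[i] < DEMO_OPCOUNT);
  if (demo_poptable[active_ops[i]] != 0) return 1;
  }
return 0;
}
```

/* A caller that supports partial matching would consult this flag only after
the normal match attempt has failed at the end of the subject. */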
/* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W,
and \w */

static const uschar toptable1[] = {
  0, 0, 0, 0, 0, 0,
  ctype_digit, ctype_digit,
  ctype_space, ctype_space,
  ctype_word,  ctype_word,
  0, 0                            /* OP_ANY, OP_ALLANY */
};

static const uschar toptable2[] = {
  0, 0, 0, 0, 0, 0,
  ctype_digit, 0,
  ctype_space, 0,
  ctype_word,  0,
  1, 1                            /* OP_ANY, OP_ALLANY */
};

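/* ILLUSTRATION (not part of PCRE): a standalone sketch of the mask-and-flip
test that the two tables support, ((ctypes[c] & toptable1[op]) ^
toptable2[op]) != 0. The first table selects the class bit; the second flips
the result for the negated opcodes. The DEMO_ bit values, the opcode numbers
6..11 and the helper demo_ctype (built on <ctype.h>) are assumptions made for
this example rather than PCRE's own definitions. */

```c
#include <assert.h>
#include <ctype.h>

#define DEMO_SPACE 0x01
#define DEMO_DIGIT 0x04
#define DEMO_WORD  0x10

/* Assumed opcode order: \D, \d, \S, \s, \W, \w at positions 6..11. */
enum { DEMO_NOT_DIGIT = 6, DEMO_IS_DIGIT, DEMO_NOT_SPACE, DEMO_IS_SPACE,
       DEMO_NOT_WORD, DEMO_IS_WORD };

static const unsigned char demo_top1[] = {
  0, 0, 0, 0, 0, 0,
  DEMO_DIGIT, DEMO_DIGIT, DEMO_SPACE, DEMO_SPACE, DEMO_WORD, DEMO_WORD };

static const unsigned char demo_top2[] = {
  0, 0, 0, 0, 0, 0,
  DEMO_DIGIT, 0, DEMO_SPACE, 0, DEMO_WORD, 0 };

/* Build the class bits for one character from the C library predicates. */
static unsigned char
demo_ctype(int c)
{
unsigned char bits = 0;
if (isspace(c)) bits |= DEMO_SPACE;
if (isdigit(c)) bits |= DEMO_DIGIT;
if (isalnum(c) || c == '_') bits |= DEMO_WORD;
return bits;
}

/* One expression serves all six opcodes, positive and negated. */
static int
demo_type_match(int op, int c)
{
assert(op >= DEMO_NOT_DIGIT && op <= DEMO_IS_WORD && c >= 0 && c < 256);
return ((demo_ctype(c) & demo_top1[op]) ^ demo_top2[op]) != 0;
}
```

/* For '5', the \d opcode keeps the digit bit and matches, whereas the \D
opcode XORs the same bit away and fails. */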

/* Structure for holding data about a particular state, which is in effect the
current data for an active path through the match tree. It must consist
entirely of ints because the working vector we are passed, and which we put
these structures in, is a vector of ints. */

typedef struct stateblock {
  int offset;                     /* Offset to opcode */
  int count;                      /* Count for repeats */
  int data;                       /* Some use extra data */
} stateblock;

#define INTS_PER_STATEBLOCK  (sizeof(stateblock)/sizeof(int))

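/* ILLUSTRATION (not part of PCRE): a sketch of how an int workspace can be
carved into two equal state lists, following the arithmetic that
internal_dfa_exec() applies to wscount in this file (two ints are set aside
first, then the rest is rounded down to a whole number of state pairs). The
names demo_stateblock and demo_states_per_list are hypothetical, and the
guard for very small workspaces is an addition for the sketch. */

```c
#include <assert.h>

typedef struct demo_stateblock {
  int offset;
  int count;
  int data;
} demo_stateblock;

#define DEMO_INTS_PER_STATEBLOCK (sizeof(demo_stateblock)/sizeof(int))

/* Return how many states fit in each of the two lists for a workspace of
wscount ints. */
static int
demo_states_per_list(int wscount)
{
int per_pair = (int)(DEMO_INTS_PER_STATEBLOCK * 2);
wscount -= 2;                     /* Two ints hold the restart data */
if (wscount < per_pair) return 0; /* Guard added for this sketch */
return (wscount - (wscount % per_pair)) / per_pair;
}
```

/* With three ints per state, a workspace of 100 ints leaves 98 usable ints,
rounds down to 96, and gives 16 states for each list. */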

#ifdef PCRE_DEBUG
/*************************************************
*             Print character string             *
*************************************************/

/* Character string printing function for debugging.

Arguments:
  p            points to string
  length       number of bytes
  f            where to print

Returns:       nothing
*/

static void
pchars(unsigned char *p, int length, FILE *f)
{
int c;
while (length-- > 0)
  {
  if (isprint(c = *(p++)))
    fprintf(f, "%c", c);
  else
    fprintf(f, "\\x%02x", c);
  }
}
#endif

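/* ILLUSTRATION (not part of PCRE): a buffer-based variant of the escaping
done by pchars(), offered only as a sketch so that the output format can be
checked without a FILE. The function demo_escape and its truncation policy
(stop before an item that does not fit) are assumptions of this example. */

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Write 'length' bytes from p into out, printing printable bytes as they are
and all others as \xhh. Returns the number of characters written, not
counting the terminating zero. */
static size_t
demo_escape(const unsigned char *p, int length, char *out, size_t outsize)
{
size_t used = 0;
assert(p != NULL && out != NULL && outsize > 0);
out[0] = '\0';
while (length-- > 0)
  {
  int c = *p++;
  int n = isprint(c)?
    snprintf(out + used, outsize - used, "%c", c) :
    snprintf(out + used, outsize - used, "\\x%02x", c);
  if (n < 0 || (size_t)n >= outsize - used)
    {
    out[used] = '\0';             /* Drop a partially written item */
    break;
    }
  used += (size_t)n;
  }
return used;
}
```

/* The three bytes 'a', newline, 0x01 come out as the nine characters
a\x0a\x01. */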


/*************************************************
*    Execute a Regular Expression - DFA engine   *
*************************************************/

/* This internal function applies a compiled pattern to a subject string,
starting at a given point, using a DFA engine. This function is called from the
external one, possibly multiple times if the pattern is not anchored. The
function calls itself recursively for some kinds of subpattern.

Arguments:
  md                the match_data block with fixed information
  this_start_code   the opening bracket of this subexpression's code
  current_subject   where we currently are in the subject string
  start_offset      start offset in the subject string
  offsets           vector to contain the matching string offsets
  offsetcount       size of same
  workspace         vector of workspace
  wscount           size of same
  rlevel            function call recursion level
  recursing         regex recursive call level

Returns:            > 0 => number of match offset pairs placed in offsets
                    = 0 => offsets overflowed; longest matches are present
                     -1 => failed to match
                   < -1 => some kind of unexpected problem

The following macros are used for adding states to the two state vectors (one
for the current character, one for the following character). */

#define ADD_ACTIVE(x,y) \
  if (active_count++ < wscount) \
    { \
    next_active_state->offset = (x); \
    next_active_state->count  = (y); \
    next_active_state++; \
    DPRINTF(("%.*sADD_ACTIVE(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_ACTIVE_DATA(x,y,z) \
  if (active_count++ < wscount) \
    { \
    next_active_state->offset = (x); \
    next_active_state->count  = (y); \
    next_active_state->data   = (z); \
    next_active_state++; \
    DPRINTF(("%.*sADD_ACTIVE_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_NEW(x,y) \
  if (new_count++ < wscount) \
    { \
    next_new_state->offset = (x); \
    next_new_state->count  = (y); \
    next_new_state++; \
    DPRINTF(("%.*sADD_NEW(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_NEW_DATA(x,y,z) \
  if (new_count++ < wscount) \
    { \
    next_new_state->offset = (x); \
    next_new_state->count  = (y); \
    next_new_state->data   = (z); \
    next_new_state++; \
    DPRINTF(("%.*sADD_NEW_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

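/* ILLUSTRATION (not part of PCRE): the bounded append performed by the ADD_
macros, written out as an ordinary function so that the control flow is
easier to follow. The types demo_entry and demo_list, the function
demo_add_state and the value given to DEMO_ERROR_WSSIZE are all hypothetical
placeholders for this sketch. As in the macros, the counter is incremented
even when the append fails. */

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ERROR_WSSIZE (-19)   /* Placeholder error code for this sketch */

typedef struct demo_entry {
  int offset;
  int count;
  int data;
} demo_entry;

typedef struct demo_list {
  demo_entry *next;               /* Where the next state will be written */
  int used;                       /* States requested so far */
  int capacity;                   /* States that fit in this list */
} demo_list;

/* Append one state, or report that the workspace is too small. */
static int
demo_add_state(demo_list *list, int offset, int count, int data)
{
assert(list != NULL && list->next != NULL);
if (list->used++ < list->capacity)
  {
  list->next->offset = offset;
  list->next->count  = count;
  list->next->data   = data;
  list->next++;
  return 0;
  }
return DEMO_ERROR_WSSIZE;
}
```

/* The macros return from the enclosing function directly instead of handing
back a status, which is why their uses in one-line if statements are wrapped
in braces: the trailing else would otherwise bind to the caller's if. */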
/* And now, here is the code */

static int
internal_dfa_exec(
  dfa_match_data *md,
  const uschar *this_start_code,
  const uschar *current_subject,
  int start_offset,
  int *offsets,
  int offsetcount,
  int *workspace,
  int wscount,
  int rlevel,
  int recursing)
{
stateblock *active_states, *new_states, *temp_states;
stateblock *next_active_state, *next_new_state;

const uschar *ctypes, *lcc, *fcc;
const uschar *ptr;
const uschar *end_code, *first_op;

int active_count, new_count, match_count;

/* Some fields in the md block are frequently referenced, so we load them into
independent variables in the hope that this will perform better. */

const uschar *start_subject = md->start_subject;
const uschar *end_subject = md->end_subject;
const uschar *start_code = md->start_code;

#ifdef SUPPORT_UTF8
BOOL utf8 = (md->poptions & PCRE_UTF8) != 0;
#else
BOOL utf8 = FALSE;
#endif

rlevel++;
offsetcount &= (-2);

wscount -= 2;
wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
          (2 * INTS_PER_STATEBLOCK);

DPRINTF(("\n%.*s---------------------\n"
  "%.*sCall to internal_dfa_exec f=%d r=%d\n",
  rlevel*2-2, SP, rlevel*2-2, SP, rlevel, recursing));

ctypes = md->tables + ctypes_offset;
lcc = md->tables + lcc_offset;
fcc = md->tables + fcc_offset;

match_count = PCRE_ERROR_NOMATCH;   /* A negative number */

active_states = (stateblock *)(workspace + 2);
next_new_state = new_states = active_states + wscount;
new_count = 0;

first_op = this_start_code + 1 + LINK_SIZE +
  ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA)? 2:0);

/* The first thing in any (sub) pattern is a bracket of some sort. Push all
the alternative states onto the list, and find out where the end is. This
makes it possible to use this function recursively, when we want to stop at a
matching internal ket rather than at the end.

If the first opcode in the first alternative is OP_REVERSE, we are dealing with
a backward assertion. In that case, we have to find out the maximum amount to
move back, and set up each alternative appropriately. */

if (*first_op == OP_REVERSE)
  {
  int max_back = 0;
  int gone_back;

  end_code = this_start_code;
  do
    {
    int back = GET(end_code, 2+LINK_SIZE);
    if (back > max_back) max_back = back;
    end_code += GET(end_code, 1);
    }
  while (*end_code == OP_ALT);

  /* If we can't go back the amount required for the longest lookbehind
  pattern, go back as far as we can; some alternatives may still be viable. */

#ifdef SUPPORT_UTF8
  /* In character mode we have to step back character by character */

  if (utf8)
    {
    for (gone_back = 0; gone_back < max_back; gone_back++)
      {
      if (current_subject <= start_subject) break;
      current_subject--;
      while (current_subject > start_subject &&
             (*current_subject & 0xc0) == 0x80)
        current_subject--;
      }
    }
  else
#endif

  /* In byte-mode we can do this quickly. */

    {
    gone_back = (current_subject - max_back < start_subject)?
      (int)(current_subject - start_subject) : max_back;
    current_subject -= gone_back;
    }

  /* Save the earliest consulted character */

  if (current_subject < md->start_used_ptr)
    md->start_used_ptr = current_subject;

  /* Now we can process the individual branches. */

  end_code = this_start_code;
  do
    {
    int back = GET(end_code, 2+LINK_SIZE);
    if (back <= gone_back)
      {
      int bstate = (int)(end_code - start_code + 2 + 2*LINK_SIZE);
      ADD_NEW_DATA(-bstate, 0, gone_back - back);
      }
    end_code += GET(end_code, 1);
    }
  while (*end_code == OP_ALT);
  }

/* This is the code for a "normal" subpattern (not a backward assertion). The
start of a whole pattern is always one of these. If we are at the top level,
we may be asked to restart matching from the same point that we reached for a
previous partial match. We still have to scan through the top-level branches to
find the end state. */

else
  {
  end_code = this_start_code;

  /* Restarting */

  if (rlevel == 1 && (md->moptions & PCRE_DFA_RESTART) != 0)
    {
    do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
    new_count = workspace[1];
    if (!workspace[0])
      memcpy(new_states, active_states, new_count * sizeof(stateblock));
    }

  /* Not restarting */

  else
    {
    int length = 1 + LINK_SIZE +
      ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA)? 2:0);
    do
      {
      ADD_NEW((int)(end_code - start_code + length), 0);
      end_code += GET(end_code, 1);
      length = 1 + LINK_SIZE;
      }
    while (*end_code == OP_ALT);
    }
  }

workspace[0] = 0;    /* Bit indicating which vector is current */

DPRINTF(("%.*sEnd state = %d\n", rlevel*2-2, SP, end_code - start_code));

/* Loop for scanning the subject */

ptr = current_subject;
for (;;)
  {
  int i, j;
  int clen, dlen;
  unsigned int c, d;
  int forced_fail = 0;
  BOOL could_continue = FALSE;

  /* Make the new state list into the active state list and empty the
  new state list. */

  temp_states = active_states;
  active_states = new_states;
  new_states = temp_states;
  active_count = new_count;
  new_count = 0;

  workspace[0] ^= 1;              /* Remember for the restarting feature */
  workspace[1] = active_count;

#ifdef PCRE_DEBUG
  printf("%.*sNext character: rest of subject = \"", rlevel*2-2, SP);
  pchars((uschar *)ptr, strlen((char *)ptr), stdout);
  printf("\"\n");

  printf("%.*sActive states: ", rlevel*2-2, SP);
  for (i = 0; i < active_count; i++)
    printf("%d/%d ", active_states[i].offset, active_states[i].count);
  printf("\n");
#endif

  /* Set the pointers for adding new states */

  next_active_state = active_states + active_count;
  next_new_state = new_states;

  /* Load the current character from the subject outside the loop, as many
  different states may want to look at it, and we assume that at least one
  will. */

  if (ptr < end_subject)
    {
    clen = 1;        /* Number of bytes in the character */
#ifdef SUPPORT_UTF8
    if (utf8) { GETCHARLEN(c, ptr, clen); } else
#endif  /* SUPPORT_UTF8 */
    c = *ptr;
    }
  else
    {
    clen = 0;        /* This indicates the end of the subject */
    c = NOTACHAR;    /* This value should never actually be used */
    }

  /* Scan up the active states and act on each one. The result of an action
  may be to add more states to the currently active list (e.g. on hitting a
  parenthesis) or it may be to put states on the new list, for considering
  when we move the character pointer on. */

  for (i = 0; i < active_count; i++)
    {
    stateblock *current_state = active_states + i;
    BOOL caseless = FALSE;
    const uschar *code;
    int state_offset = current_state->offset;
    int count, codevalue, rrc;

#ifdef PCRE_DEBUG
    printf ("%.*sProcessing state %d c=", rlevel*2-2, SP, state_offset);
    if (clen == 0) printf("EOL\n");
      else if (c > 32 && c < 127) printf("'%c'\n", c);
        else printf("0x%02x\n", c);
#endif

    /* A negative offset is a special case meaning "hold off going to this
    (negated) state until the number of characters in the data field have
    been skipped". */

    if (state_offset < 0)
      {
      if (current_state->data > 0)
        {
        DPRINTF(("%.*sSkipping this character\n", rlevel*2-2, SP));
        ADD_NEW_DATA(state_offset, current_state->count,
          current_state->data - 1);
        continue;
        }
      else
        {
        current_state->offset = state_offset = -state_offset;
        }
      }

    /* Check for a duplicate state with the same count, and skip if found.
    See the note at the head of this module about the possibility of improving
    performance here. */

    for (j = 0; j < i; j++)
      {
      if (active_states[j].offset == state_offset &&
          active_states[j].count == current_state->count)
        {
        DPRINTF(("%.*sDuplicate state: skipped\n", rlevel*2-2, SP));
        goto NEXT_ACTIVE_STATE;
        }
      }

    /* The state offset is the offset to the opcode */

    code = start_code + state_offset;
    codevalue = *code;

    /* If this opcode inspects a character, but we are at the end of the
    subject, remember the fact for use when testing for a partial match. */

    if (clen == 0 && poptable[codevalue] != 0)
      could_continue = TRUE;

    /* If this opcode is followed by an inline character, load it. It is
    tempting to test for the presence of a subject character here, but that
    is wrong, because sometimes zero repetitions of the subject are
    permitted.

    We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
    argument that is not a data character - but is always one byte long. We
    have to take special action to deal with \P, \p, \H, \h, \V, \v and \X in
    this case. To keep the other cases fast, convert these ones to new opcodes.
    */

    if (coptable[codevalue] > 0)
      {
      dlen = 1;
#ifdef SUPPORT_UTF8
      if (utf8) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
#endif  /* SUPPORT_UTF8 */
      d = code[coptable[codevalue]];
      if (codevalue >= OP_TYPESTAR)
        {
        switch(d)
          {
          case OP_ANYBYTE: return PCRE_ERROR_DFA_UITEM;
          case OP_NOTPROP:
          case OP_PROP: codevalue += OP_PROP_EXTRA; break;
          case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
          case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
          case OP_NOT_HSPACE:
          case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
          case OP_NOT_VSPACE:
          case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
          default: break;
          }
        }
      }
    else
      {
      dlen = 0;         /* Not strictly necessary, but compilers moan */
      d = NOTACHAR;     /* if these variables are not set. */
      }


    /* Now process the individual opcodes */

    switch (codevalue)
      {
/* ========================================================================== */
      /* These cases are never obeyed. This is a fudge that causes a compile-
      time error if the vectors coptable or poptable, which are indexed by
      opcode, are not the correct length. It seems to be the only way to do
      such a check at compile time, as the sizeof() operator does not work
      in the C preprocessor. */

      case OP_TABLE_LENGTH:
      case OP_TABLE_LENGTH +
        ((sizeof(coptable) == OP_TABLE_LENGTH) &&
         (sizeof(poptable) == OP_TABLE_LENGTH)):
      break;

/* ========================================================================== */
      /* Reached a closing bracket. If not at the end of the pattern, carry
      on with the next opcode. Otherwise, unless we have an empty string and
      PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
      start of the subject, save the match data, shifting up all previous
      matches so we always have the longest first. */

      case OP_KET:
      case OP_KETRMIN:
      case OP_KETRMAX:
      if (code != end_code)
        {
        ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
        if (codevalue != OP_KET)
          {
          ADD_ACTIVE(state_offset - GET(code, 1), 0);
          }
        }
      else
        {
        if (ptr > current_subject ||
            ((md->moptions & PCRE_NOTEMPTY) == 0 &&
              ((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
                current_subject > start_subject + md->start_offset)))
          {
          if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
            else if (match_count > 0 && ++match_count * 2 >= offsetcount)
              match_count = 0;
          count = ((match_count == 0)? offsetcount : match_count * 2) - 2;
          if (count > 0) memmove(offsets + 2, offsets, count * sizeof(int));
          if (offsetcount >= 2)
            {
            offsets[0] = (int)(current_subject - start_subject);
            offsets[1] = (int)(ptr - start_subject);
            DPRINTF(("%.*sSet matched string = \"%.*s\"\n", rlevel*2-2, SP,
              offsets[1] - offsets[0], current_subject));
            }
          if ((md->moptions & PCRE_DFA_SHORTEST) != 0)
            {
            DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
              "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel,
              match_count, rlevel*2-2, SP));
            return match_count;
            }
          }
        }
      break;

/* ========================================================================== */
      /* These opcodes add to the current list of states without looking
      at the current character. */

      /*-----------------------------------------------------------------*/
      case OP_ALT:
      do { code += GET(code, 1); } while (*code == OP_ALT);
      ADD_ACTIVE((int)(code - start_code), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_BRA:
      case OP_SBRA:
      do
        {
        ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
        code += GET(code, 1);
        }
      while (*code == OP_ALT);
      break;

      /*-----------------------------------------------------------------*/
      case OP_CBRA:
      case OP_SCBRA:
      ADD_ACTIVE((int)(code - start_code + 3 + LINK_SIZE),  0);
      code += GET(code, 1);
      while (*code == OP_ALT)
        {
        ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE),  0);
        code += GET(code, 1);
        }
      break;

      /*-----------------------------------------------------------------*/
      case OP_BRAZERO:
      case OP_BRAMINZERO:
      ADD_ACTIVE(state_offset + 1, 0);
      code += 1 + GET(code, 2);
      while (*code == OP_ALT) code += GET(code, 1);
      ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_SKIPZERO:
      code += 1 + GET(code, 2);
      while (*code == OP_ALT) code += GET(code, 1);
      ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_CIRC:
      if (ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0)
        { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_CIRCM:
      if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
          (ptr != end_subject && WAS_NEWLINE(ptr)))
        { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_EOD:
      if (ptr >= end_subject)
        {
        if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
          could_continue = TRUE;
        else { ADD_ACTIVE(state_offset + 1, 0); }
        }
      break;

      /*-----------------------------------------------------------------*/
      case OP_SOD:
      if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_SOM:
      if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
      break;


863 /* ========================================================================== */
864 /* These opcodes inspect the next subject character, and sometimes
865 the previous one as well, but do not have an argument. The variable
866 clen contains the length of the current character and is zero if we are
867 at the end of the subject. */
868
869 /*-----------------------------------------------------------------*/
870 case OP_ANY:
871 if (clen > 0 && !IS_NEWLINE(ptr))
872 { ADD_NEW(state_offset + 1, 0); }
873 break;
874
875 /*-----------------------------------------------------------------*/
876 case OP_ALLANY:
877 if (clen > 0)
878 { ADD_NEW(state_offset + 1, 0); }
879 break;
880
881 /*-----------------------------------------------------------------*/
882 case OP_EODN:
883 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
884 could_continue = TRUE;
885 else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
886 { ADD_ACTIVE(state_offset + 1, 0); }
887 break;
888
889 /*-----------------------------------------------------------------*/
890 case OP_DOLL:
891 if ((md->moptions & PCRE_NOTEOL) == 0)
892 {
893 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
894 could_continue = TRUE;
895 else if (clen == 0 ||
896 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
897 (ptr == end_subject - md->nllen)
898 ))
899 { ADD_ACTIVE(state_offset + 1, 0); }
900 }
901 break;
902
903 /*-----------------------------------------------------------------*/
904 case OP_DOLLM:
905 if ((md->moptions & PCRE_NOTEOL) == 0)
906 {
907 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
908 could_continue = TRUE;
909 else if (clen == 0 ||
910 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
911 { ADD_ACTIVE(state_offset + 1, 0); }
912 }
913 else if (IS_NEWLINE(ptr))
914 { ADD_ACTIVE(state_offset + 1, 0); }
915 break;
916
917 /*-----------------------------------------------------------------*/
918
919 case OP_DIGIT:
920 case OP_WHITESPACE:
921 case OP_WORDCHAR:
922 if (clen > 0 && c < 256 &&
923 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
924 { ADD_NEW(state_offset + 1, 0); }
925 break;
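
/* Explanatory sketch of the class test above. It relies on two lookup
tables, toptable1[] and toptable2[], which are defined elsewhere in this file
and are not shown in this section. Assuming toptable1[op] holds the ctype bit
for the class, and toptable2[op] holds that same bit for a negated opcode and
zero for a positive one, the expression works out as follows for digits:

  OP_DIGIT:     (ctypes[c] & ctype_digit) ^ 0            is non-zero iff c is a digit
  OP_NOT_DIGIT: (ctypes[c] & ctype_digit) ^ ctype_digit  is non-zero iff c is not a digit

Under that assumption a single expression serves both the positive and the
negated opcodes, which is why the two groups of cases share it. */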
926
927 /*-----------------------------------------------------------------*/
928 case OP_NOT_DIGIT:
929 case OP_NOT_WHITESPACE:
930 case OP_NOT_WORDCHAR:
931 if (clen > 0 && (c >= 256 ||
932 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
933 { ADD_NEW(state_offset + 1, 0); }
934 break;
935
936 /*-----------------------------------------------------------------*/
937 case OP_WORD_BOUNDARY:
938 case OP_NOT_WORD_BOUNDARY:
939 {
940 int left_word, right_word;
941
942 if (ptr > start_subject)
943 {
944 const uschar *temp = ptr - 1;
945 if (temp < md->start_used_ptr) md->start_used_ptr = temp;
946 #ifdef SUPPORT_UTF8
947 if (utf8) BACKCHAR(temp);
948 #endif
949 GETCHARTEST(d, temp);
950 #ifdef SUPPORT_UCP
951 if ((md->poptions & PCRE_UCP) != 0)
952 {
953 if (d == '_') left_word = TRUE; else
954 {
955 int cat = UCD_CATEGORY(d);
956 left_word = (cat == ucp_L || cat == ucp_N);
957 }
958 }
959 else
960 #endif
961 left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
962 }
963 else left_word = FALSE;
964
965 if (clen > 0)
966 {
967 #ifdef SUPPORT_UCP
968 if ((md->poptions & PCRE_UCP) != 0)
969 {
970 if (c == '_') right_word = TRUE; else
971 {
972 int cat = UCD_CATEGORY(c);
973 right_word = (cat == ucp_L || cat == ucp_N);
974 }
975 }
976 else
977 #endif
978 right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
979 }
980 else right_word = FALSE;
981
982 if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
983 { ADD_ACTIVE(state_offset + 1, 0); }
984 }
985 break;
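
/* Worked example of the final test above, derived from the code. A state is
added when (left_word == right_word) equals (codevalue == OP_NOT_WORD_BOUNDARY):

  left_word  right_word  OP_WORD_BOUNDARY  OP_NOT_WORD_BOUNDARY
  FALSE      FALSE       no state          state added
  FALSE      TRUE        state added       no state
  TRUE       FALSE       state added       no state
  TRUE       TRUE        no state          state added

For instance, with the subject "ab cd" and ptr at the space, left_word is
TRUE ('b') and right_word is FALSE (' '), so only OP_WORD_BOUNDARY advances.
This assumes the default character tables, in which letters are word
characters and the space is not. */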
986
987
988 /*-----------------------------------------------------------------*/
989 /* Check the next character by Unicode property. We will get here only
990 if the support is in the binary; otherwise a compile-time error occurs.
991 */
992
993 #ifdef SUPPORT_UCP
994 case OP_PROP:
995 case OP_NOTPROP:
996 if (clen > 0)
997 {
998 BOOL OK;
999 const ucd_record * prop = GET_UCD(c);
1000 switch(code[1])
1001 {
1002 case PT_ANY:
1003 OK = TRUE;
1004 break;
1005
1006 case PT_LAMP:
1007 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1008 prop->chartype == ucp_Lt;
1009 break;
1010
1011 case PT_GC:
1012 OK = _pcre_ucp_gentype[prop->chartype] == code[2];
1013 break;
1014
1015 case PT_PC:
1016 OK = prop->chartype == code[2];
1017 break;
1018
1019 case PT_SC:
1020 OK = prop->script == code[2];
1021 break;
1022
1023 /* These are specials for combination cases. */
1024
1025 case PT_ALNUM:
1026 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1027 _pcre_ucp_gentype[prop->chartype] == ucp_N;
1028 break;
1029
1030 case PT_SPACE: /* Perl space */
1031 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1032 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1033 break;
1034
1035 case PT_PXSPACE: /* POSIX space */
1036 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1037 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1038 c == CHAR_FF || c == CHAR_CR;
1039 break;
1040
1041 case PT_WORD:
1042 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1043 _pcre_ucp_gentype[prop->chartype] == ucp_N ||
1044 c == CHAR_UNDERSCORE;
1045 break;
1046
1047 /* Should never occur, but keep compilers from grumbling. */
1048
1049 default:
1050 OK = codevalue != OP_PROP;
1051 break;
1052 }
1053
1054 if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1055 }
1056 break;
1057 #endif
1058
1059
1060
1061 /* ========================================================================== */
1062 /* These opcodes likewise inspect the subject character, but have an
1063 argument that is not a data character. It is one of these opcodes:
1064 OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_WHITESPACE,
1065 OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
1066
1067 case OP_TYPEPLUS:
1068 case OP_TYPEMINPLUS:
1069 case OP_TYPEPOSPLUS:
1070 count = current_state->count; /* Already matched */
1071 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1072 if (clen > 0)
1073 {
1074 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1075 (c < 256 &&
1076 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1077 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1078 {
1079 if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1080 {
1081 active_count--; /* Remove non-match possibility */
1082 next_active_state--;
1083 }
1084 count++;
1085 ADD_NEW(state_offset, count);
1086 }
1087 }
1088 break;
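
/* Note on the possessive case above. ADD_ACTIVE is a macro defined earlier
in this file; assuming it appends one entry to the active list and increments
both active_count and next_active_state, the two decrements discard the
"stop repeating here" state that was added a few lines earlier. As a
hypothetical illustration, consider a possessive digit repeat (OP_TYPEPOSPLUS
with OP_DIGIT as its argument) applied to the subject "12a":

  at '1': count is 0, so no exit state is added; '1' matches, count becomes 1
  at '2': the exit state is added, then removed because '2' matches; count becomes 2
  at 'a': the exit state is added and kept, because 'a' does not match

The only way out of the repeat is therefore after the longest run of digits. */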
1089
1090 /*-----------------------------------------------------------------*/
1091 case OP_TYPEQUERY:
1092 case OP_TYPEMINQUERY:
1093 case OP_TYPEPOSQUERY:
1094 ADD_ACTIVE(state_offset + 2, 0);
1095 if (clen > 0)
1096 {
1097 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1098 (c < 256 &&
1099 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1100 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1101 {
1102 if (codevalue == OP_TYPEPOSQUERY)
1103 {
1104 active_count--; /* Remove non-match possibility */
1105 next_active_state--;
1106 }
1107 ADD_NEW(state_offset + 2, 0);
1108 }
1109 }
1110 break;
1111
1112 /*-----------------------------------------------------------------*/
1113 case OP_TYPESTAR:
1114 case OP_TYPEMINSTAR:
1115 case OP_TYPEPOSSTAR:
1116 ADD_ACTIVE(state_offset + 2, 0);
1117 if (clen > 0)
1118 {
1119 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1120 (c < 256 &&
1121 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1122 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1123 {
1124 if (codevalue == OP_TYPEPOSSTAR)
1125 {
1126 active_count--; /* Remove non-match possibility */
1127 next_active_state--;
1128 }
1129 ADD_NEW(state_offset, 0);
1130 }
1131 }
1132 break;
1133
1134 /*-----------------------------------------------------------------*/
1135 case OP_TYPEEXACT:
1136 count = current_state->count; /* Number already matched */
1137 if (clen > 0)
1138 {
1139 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1140 (c < 256 &&
1141 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1142 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1143 {
1144 if (++count >= GET2(code, 1))
1145 { ADD_NEW(state_offset + 4, 0); }
1146 else
1147 { ADD_NEW(state_offset, count); }
1148 }
1149 }
1150 break;
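
/* Note on the offsets used above. They follow from the layout of the compiled
item: assuming a one-byte opcode, a two-byte repeat count (the value read by
GET2(code, 1)), and a one-byte type, the next item starts at state_offset + 4.
As a hypothetical illustration, for a repeat count of 3 with OP_DIGIT as the
type and the subject "1234":

  at '1': ++count gives 1, which is below 3, so the same state is re-added with count 1
  at '2': ++count gives 2, so the same state is re-added with count 2
  at '3': ++count gives 3, so the state at state_offset + 4 is added with count 0 */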
1151
1152 /*-----------------------------------------------------------------*/
1153 case OP_TYPEUPTO:
1154 case OP_TYPEMINUPTO:
1155 case OP_TYPEPOSUPTO:
1156 ADD_ACTIVE(state_offset + 4, 0);
1157 count = current_state->count; /* Number already matched */
1158 if (clen > 0)
1159 {
1160 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1161 (c < 256 &&
1162 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1163 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1164 {
1165 if (codevalue == OP_TYPEPOSUPTO)
1166 {
1167 active_count--; /* Remove non-match possibility */
1168 next_active_state--;
1169 }
1170 if (++count >= GET2(code, 1))
1171 { ADD_NEW(state_offset + 4, 0); }
1172 else
1173 { ADD_NEW(state_offset, count); }
1174 }
1175 }
1176 break;
1177
1178 /* ========================================================================== */
1179 /* These are virtual opcodes that are used when something like
1180 OP_TYPEPLUS has OP_PROP, OP_NOTPROP, OP_ANYNL, OP_EXTUNI, or one of the
1181 OP_(NOT_)HSPACE or OP_(NOT_)VSPACE opcodes as its argument. Handling them
1182 separately keeps the code above fast for the other cases. The argument is in the d variable. */
1183
1184 #ifdef SUPPORT_UCP
1185 case OP_PROP_EXTRA + OP_TYPEPLUS:
1186 case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1187 case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1188 count = current_state->count; /* Already matched */
1189 if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1190 if (clen > 0)
1191 {
1192 BOOL OK;
1193 const ucd_record * prop = GET_UCD(c);
1194 switch(code[2])
1195 {
1196 case PT_ANY:
1197 OK = TRUE;
1198 break;
1199
1200 case PT_LAMP:
1201 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1202 prop->chartype == ucp_Lt;
1203 break;
1204
1205 case PT_GC:
1206 OK = _pcre_ucp_gentype[prop->chartype] == code[3];
1207 break;
1208
1209 case PT_PC:
1210 OK = prop->chartype == code[3];
1211 break;
1212
1213 case PT_SC:
1214 OK = prop->script == code[3];
1215 break;
1216
1217 /* These are specials for combination cases. */
1218
1219 case PT_ALNUM:
1220 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1221 _pcre_ucp_gentype[prop->chartype] == ucp_N;
1222 break;
1223
1224 case PT_SPACE: /* Perl space */
1225 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1226 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1227 break;
1228
1229 case PT_PXSPACE: /* POSIX space */
1230 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1231 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1232 c == CHAR_FF || c == CHAR_CR;
1233 break;
1234
1235 case PT_WORD:
1236 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1237 _pcre_ucp_gentype[prop->chartype] == ucp_N ||
1238 c == CHAR_UNDERSCORE;
1239 break;
1240
1241 /* Should never occur, but keep compilers from grumbling. */
1242
1243 default:
1244 OK = d != OP_PROP;
1245 break;
1246 }
1247
1248 if (OK == (d == OP_PROP))
1249 {
1250 if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1251 {
1252 active_count--; /* Remove non-match possibility */
1253 next_active_state--;
1254 }
1255 count++;
1256 ADD_NEW(state_offset, count);
1257 }
1258 }
1259 break;
1260
1261 /*-----------------------------------------------------------------*/
1262 case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1263 case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1264 case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1265 count = current_state->count; /* Already matched */
1266 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1267 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1268 {
1269 const uschar *nptr = ptr + clen;
1270 int ncount = 0;
1271 if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1272 {
1273 active_count--; /* Remove non-match possibility */
1274 next_active_state--;
1275 }
1276 while (nptr < end_subject)
1277 {
1278 int nd;
1279 int ndlen = 1;
1280 GETCHARLEN(nd, nptr, ndlen);
1281 if (UCD_CATEGORY(nd) != ucp_M) break;
1282 ncount++;
1283 nptr += ndlen;
1284 }
1285 count++;
1286 ADD_NEW_DATA(-state_offset, count, ncount);
1287 }
1288 break;
1289 #endif
1290
1291 /*-----------------------------------------------------------------*/
1292 case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1293 case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1294 case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1295 count = current_state->count; /* Already matched */
1296 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1297 if (clen > 0)
1298 {
1299 int ncount = 0;
1300 switch (c)
1301 {
1302 case 0x000b:
1303 case 0x000c:
1304 case 0x0085:
1305 case 0x2028:
1306 case 0x2029:
1307 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1308 goto ANYNL01;
1309
1310 case 0x000d:
1311 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1312 /* Fall through */
1313
1314 ANYNL01:
1315 case 0x000a:
1316 if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1317 {
1318 active_count--; /* Remove non-match possibility */
1319 next_active_state--;
1320 }
1321 count++;
1322 ADD_NEW_DATA(-state_offset, count, ncount);
1323 break;
1324
1325 default:
1326 break;
1327 }
1328 }
1329 break;
1330
1331 /*-----------------------------------------------------------------*/
1332 case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1333 case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1334 case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1335 count = current_state->count; /* Already matched */
1336 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1337 if (clen > 0)
1338 {
1339 BOOL OK;
1340 switch (c)
1341 {
1342 case 0x000a:
1343 case 0x000b:
1344 case 0x000c:
1345 case 0x000d:
1346 case 0x0085:
1347 case 0x2028:
1348 case 0x2029:
1349 OK = TRUE;
1350 break;
1351
1352 default:
1353 OK = FALSE;
1354 break;
1355 }
1356
1357 if (OK == (d == OP_VSPACE))
1358 {
1359 if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1360 {
1361 active_count--; /* Remove non-match possibility */
1362 next_active_state--;
1363 }
1364 count++;
1365 ADD_NEW_DATA(-state_offset, count, 0);
1366 }
1367 }
1368 break;
1369
1370 /*-----------------------------------------------------------------*/
1371 case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1372 case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1373 case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1374 count = current_state->count; /* Already matched */
1375 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1376 if (clen > 0)
1377 {
1378 BOOL OK;
1379 switch (c)
1380 {
1381 case 0x09: /* HT */
1382 case 0x20: /* SPACE */
1383 case 0xa0: /* NBSP */
1384 case 0x1680: /* OGHAM SPACE MARK */
1385 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1386 case 0x2000: /* EN QUAD */
1387 case 0x2001: /* EM QUAD */
1388 case 0x2002: /* EN SPACE */
1389 case 0x2003: /* EM SPACE */
1390 case 0x2004: /* THREE-PER-EM SPACE */
1391 case 0x2005: /* FOUR-PER-EM SPACE */
1392 case 0x2006: /* SIX-PER-EM SPACE */
1393 case 0x2007: /* FIGURE SPACE */
1394 case 0x2008: /* PUNCTUATION SPACE */
1395 case 0x2009: /* THIN SPACE */
1396 case 0x200A: /* HAIR SPACE */
1397 case 0x202f: /* NARROW NO-BREAK SPACE */
1398 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1399 case 0x3000: /* IDEOGRAPHIC SPACE */
1400 OK = TRUE;
1401 break;
1402
1403 default:
1404 OK = FALSE;
1405 break;
1406 }
1407
1408 if (OK == (d == OP_HSPACE))
1409 {
1410 if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1411 {
1412 active_count--; /* Remove non-match possibility */
1413 next_active_state--;
1414 }
1415 count++;
1416 ADD_NEW_DATA(-state_offset, count, 0);
1417 }
1418 }
1419 break;
1420
1421 /*-----------------------------------------------------------------*/
1422 #ifdef SUPPORT_UCP
1423 case OP_PROP_EXTRA + OP_TYPEQUERY:
1424 case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1425 case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
1426 count = 4;
1427 goto QS1;
1428
1429 case OP_PROP_EXTRA + OP_TYPESTAR:
1430 case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1431 case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1432 count = 0;
1433
1434 QS1:
1435
1436 ADD_ACTIVE(state_offset + 4, 0);
1437 if (clen > 0)
1438 {
1439 BOOL OK;
1440 const ucd_record * prop = GET_UCD(c);
1441 switch(code[2])
1442 {
1443 case PT_ANY:
1444 OK = TRUE;
1445 break;
1446
1447 case PT_LAMP:
1448 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1449 prop->chartype == ucp_Lt;
1450 break;
1451
1452 case PT_GC:
1453 OK = _pcre_ucp_gentype[prop->chartype] == code[3];
1454 break;
1455
1456 case PT_PC:
1457 OK = prop->chartype == code[3];
1458 break;
1459
1460 case PT_SC:
1461 OK = prop->script == code[3];
1462 break;
1463
1464 /* These are specials for combination cases. */
1465
1466 case PT_ALNUM:
1467 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1468 _pcre_ucp_gentype[prop->chartype] == ucp_N;
1469 break;
1470
1471 case PT_SPACE: /* Perl space */
1472 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1473 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1474 break;
1475
1476 case PT_PXSPACE: /* POSIX space */
1477 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1478 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1479 c == CHAR_FF || c == CHAR_CR;
1480 break;
1481
1482 case PT_WORD:
1483 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1484 _pcre_ucp_gentype[prop->chartype] == ucp_N ||
1485 c == CHAR_UNDERSCORE;
1486 break;
1487
1488 /* Should never occur, but keep compilers from grumbling. */
1489
1490 default:
1491 OK = d != OP_PROP;
1492 break;
1493 }
1494
1495 if (OK == (d == OP_PROP))
1496 {
1497 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1498 codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1499 {
1500 active_count--; /* Remove non-match possibility */
1501 next_active_state--;
1502 }
1503 ADD_NEW(state_offset + count, 0);
1504 }
1505 }
1506 break;
1507
1508 /*-----------------------------------------------------------------*/
1509 case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1510 case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1511 case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1512 count = 2;
1513 goto QS2;
1514
1515 case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1516 case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1517 case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1518 count = 0;
1519
1520 QS2:
1521
1522 ADD_ACTIVE(state_offset + 2, 0);
1523 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1524 {
1525 const uschar *nptr = ptr + clen;
1526 int ncount = 0;
1527 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1528 codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1529 {
1530 active_count--; /* Remove non-match possibility */
1531 next_active_state--;
1532 }
1533 while (nptr < end_subject)
1534 {
1535 int nd;
1536 int ndlen = 1;
1537 GETCHARLEN(nd, nptr, ndlen);
1538 if (UCD_CATEGORY(nd) != ucp_M) break;
1539 ncount++;
1540 nptr += ndlen;
1541 }
1542 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1543 }
1544 break;
1545 #endif
1546
1547 /*-----------------------------------------------------------------*/
1548 case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1549 case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1550 case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1551 count = 2;
1552 goto QS3;
1553
1554 case OP_ANYNL_EXTRA + OP_TYPESTAR:
1555 case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1556 case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1557 count = 0;
1558
1559 QS3:
1560 ADD_ACTIVE(state_offset + 2, 0);
1561 if (clen > 0)
1562 {
1563 int ncount = 0;
1564 switch (c)
1565 {
1566 case 0x000b:
1567 case 0x000c:
1568 case 0x0085:
1569 case 0x2028:
1570 case 0x2029:
1571 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1572 goto ANYNL02;
1573
1574 case 0x000d:
1575 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1576 /* Fall through */
1577
1578 ANYNL02:
1579 case 0x000a:
1580 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1581 codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1582 {
1583 active_count--; /* Remove non-match possibility */
1584 next_active_state--;
1585 }
1586 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1587 break;
1588
1589 default:
1590 break;
1591 }
1592 }
1593 break;
1594
1595 /*-----------------------------------------------------------------*/
1596 case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1597 case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1598 case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1599 count = 2;
1600 goto QS4;
1601
1602 case OP_VSPACE_EXTRA + OP_TYPESTAR:
1603 case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1604 case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1605 count = 0;
1606
1607 QS4:
1608 ADD_ACTIVE(state_offset + 2, 0);
1609 if (clen > 0)
1610 {
1611 BOOL OK;
1612 switch (c)
1613 {
1614 case 0x000a:
1615 case 0x000b:
1616 case 0x000c:
1617 case 0x000d:
1618 case 0x0085:
1619 case 0x2028:
1620 case 0x2029:
1621 OK = TRUE;
1622 break;
1623
1624 default:
1625 OK = FALSE;
1626 break;
1627 }
1628 if (OK == (d == OP_VSPACE))
1629 {
1630 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1631 codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1632 {
1633 active_count--; /* Remove non-match possibility */
1634 next_active_state--;
1635 }
1636 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1637 }
1638 }
1639 break;
1640
1641 /*-----------------------------------------------------------------*/
1642 case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1643 case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1644 case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1645 count = 2;
1646 goto QS5;
1647
1648 case OP_HSPACE_EXTRA + OP_TYPESTAR:
1649 case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1650 case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1651 count = 0;
1652
1653 QS5:
1654 ADD_ACTIVE(state_offset + 2, 0);
1655 if (clen > 0)
1656 {
1657 BOOL OK;
1658 switch (c)
1659 {
1660 case 0x09: /* HT */
1661 case 0x20: /* SPACE */
1662 case 0xa0: /* NBSP */
1663 case 0x1680: /* OGHAM SPACE MARK */
1664 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1665 case 0x2000: /* EN QUAD */
1666 case 0x2001: /* EM QUAD */
1667 case 0x2002: /* EN SPACE */
1668 case 0x2003: /* EM SPACE */
1669 case 0x2004: /* THREE-PER-EM SPACE */
1670 case 0x2005: /* FOUR-PER-EM SPACE */
1671 case 0x2006: /* SIX-PER-EM SPACE */
1672 case 0x2007: /* FIGURE SPACE */
1673 case 0x2008: /* PUNCTUATION SPACE */
1674 case 0x2009: /* THIN SPACE */
1675 case 0x200A: /* HAIR SPACE */
1676 case 0x202f: /* NARROW NO-BREAK SPACE */
1677 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1678 case 0x3000: /* IDEOGRAPHIC SPACE */
1679 OK = TRUE;
1680 break;
1681
1682 default:
1683 OK = FALSE;
1684 break;
1685 }
1686
1687 if (OK == (d == OP_HSPACE))
1688 {
1689 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1690 codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1691 {
1692 active_count--; /* Remove non-match possibility */
1693 next_active_state--;
1694 }
1695 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1696 }
1697 }
1698 break;
1699
1700 /*-----------------------------------------------------------------*/
1701 #ifdef SUPPORT_UCP
1702 case OP_PROP_EXTRA + OP_TYPEEXACT:
1703 case OP_PROP_EXTRA + OP_TYPEUPTO:
1704 case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1705 case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1706 if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1707 { ADD_ACTIVE(state_offset + 6, 0); }
1708 count = current_state->count; /* Number already matched */
1709 if (clen > 0)
1710 {
1711 BOOL OK;
1712 const ucd_record * prop = GET_UCD(c);
1713 switch(code[4])
1714 {
1715 case PT_ANY:
1716 OK = TRUE;
1717 break;
1718
1719 case PT_LAMP:
1720 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1721 prop->chartype == ucp_Lt;
1722 break;
1723
1724 case PT_GC:
1725 OK = _pcre_ucp_gentype[prop->chartype] == code[5];
1726 break;
1727
1728 case PT_PC:
1729 OK = prop->chartype == code[5];
1730 break;
1731
1732 case PT_SC:
1733 OK = prop->script == code[5];
1734 break;
1735
1736 /* These are specials for combination cases. */
1737
1738 case PT_ALNUM:
1739 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1740 _pcre_ucp_gentype[prop->chartype] == ucp_N;
1741 break;
1742
1743 case PT_SPACE: /* Perl space */
1744 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1745 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1746 break;
1747
1748 case PT_PXSPACE: /* POSIX space */
1749 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1750 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1751 c == CHAR_FF || c == CHAR_CR;
1752 break;
1753
1754 case PT_WORD:
1755 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1756 _pcre_ucp_gentype[prop->chartype] == ucp_N ||
1757 c == CHAR_UNDERSCORE;
1758 break;
1759
1760 /* Should never occur, but keep compilers from grumbling. */
1761
1762 default:
1763 OK = d != OP_PROP;
1764 break;
1765 }
1766
1767 if (OK == (d == OP_PROP))
1768 {
1769 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
1770 {
1771 active_count--; /* Remove non-match possibility */
1772 next_active_state--;
1773 }
1774 if (++count >= GET2(code, 1))
1775 { ADD_NEW(state_offset + 6, 0); }
1776 else
1777 { ADD_NEW(state_offset, count); }
1778 }
1779 }
1780 break;
1781
1782 /*-----------------------------------------------------------------*/
1783 case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
1784 case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
1785 case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
1786 case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
1787 if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
1788 { ADD_ACTIVE(state_offset + 4, 0); }
1789 count = current_state->count; /* Number already matched */
1790 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1791 {
1792 const uschar *nptr = ptr + clen;
1793 int ncount = 0;
1794 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
1795 {
1796 active_count--; /* Remove non-match possibility */
1797 next_active_state--;
1798 }
1799 while (nptr < end_subject)
1800 {
1801 int nd;
1802 int ndlen = 1;
1803 GETCHARLEN(nd, nptr, ndlen);
1804 if (UCD_CATEGORY(nd) != ucp_M) break;
1805 ncount++;
1806 nptr += ndlen;
1807 }
1808 if (++count >= GET2(code, 1))
1809 { ADD_NEW_DATA(-(state_offset + 4), 0, ncount); }
1810 else
1811 { ADD_NEW_DATA(-state_offset, count, ncount); }
1812 }
1813 break;
1814 #endif
1815
1816 /*-----------------------------------------------------------------*/
1817 case OP_ANYNL_EXTRA + OP_TYPEEXACT:
1818 case OP_ANYNL_EXTRA + OP_TYPEUPTO:
1819 case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
1820 case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
1821 if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
1822 { ADD_ACTIVE(state_offset + 4, 0); }
1823 count = current_state->count; /* Number already matched */
1824 if (clen > 0)
1825 {
1826 int ncount = 0;
1827 switch (c)
1828 {
1829 case 0x000b:
1830 case 0x000c:
1831 case 0x0085:
1832 case 0x2028:
1833 case 0x2029:
1834 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1835 goto ANYNL03;
1836
1837 case 0x000d:
1838 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1839 /* Fall through */
1840
1841 ANYNL03:
1842 case 0x000a:
1843 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
1844 {
1845 active_count--; /* Remove non-match possibility */
1846 next_active_state--;
1847 }
1848 if (++count >= GET2(code, 1))
1849 { ADD_NEW_DATA(-(state_offset + 4), 0, ncount); }
1850 else
1851 { ADD_NEW_DATA(-state_offset, count, ncount); }
1852 break;
1853
1854 default:
1855 break;
1856 }
1857 }
1858 break;
1859
1860 /*-----------------------------------------------------------------*/
1861 case OP_VSPACE_EXTRA + OP_TYPEEXACT:
1862 case OP_VSPACE_EXTRA + OP_TYPEUPTO:
1863 case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
1864 case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
1865 if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
1866 { ADD_ACTIVE(state_offset + 4, 0); }
1867 count = current_state->count; /* Number already matched */
1868 if (clen > 0)
1869 {
1870 BOOL OK;
1871 switch (c)
1872 {
1873 case 0x000a:
1874 case 0x000b:
1875 case 0x000c:
1876 case 0x000d:
1877 case 0x0085:
1878 case 0x2028:
1879 case 0x2029:
1880 OK = TRUE;
1881 break;
1882
1883 default:
1884 OK = FALSE;
1885 }
1886
1887 if (OK == (d == OP_VSPACE))
1888 {
1889 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
1890 {
1891 active_count--; /* Remove non-match possibility */
1892 next_active_state--;
1893 }
1894 if (++count >= GET2(code, 1))
1895 { ADD_NEW_DATA(-(state_offset + 4), 0, 0); }
1896 else
1897 { ADD_NEW_DATA(-state_offset, count, 0); }
1898 }
1899 }
1900 break;
1901
1902 /*-----------------------------------------------------------------*/
1903 case OP_HSPACE_EXTRA + OP_TYPEEXACT:
1904 case OP_HSPACE_EXTRA + OP_TYPEUPTO:
1905 case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
1906 case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
1907 if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
1908 { ADD_ACTIVE(state_offset + 4, 0); }
1909 count = current_state->count; /* Number already matched */
1910 if (clen > 0)
1911 {
1912 BOOL OK;
1913 switch (c)
1914 {
1915 case 0x09: /* HT */
1916 case 0x20: /* SPACE */
1917 case 0xa0: /* NBSP */
1918 case 0x1680: /* OGHAM SPACE MARK */
1919 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1920 case 0x2000: /* EN QUAD */
1921 case 0x2001: /* EM QUAD */
1922 case 0x2002: /* EN SPACE */
1923 case 0x2003: /* EM SPACE */
1924 case 0x2004: /* THREE-PER-EM SPACE */
1925 case 0x2005: /* FOUR-PER-EM SPACE */
1926 case 0x2006: /* SIX-PER-EM SPACE */
1927 case 0x2007: /* FIGURE SPACE */
1928 case 0x2008: /* PUNCTUATION SPACE */
1929 case 0x2009: /* THIN SPACE */
1930 case 0x200A: /* HAIR SPACE */
1931 case 0x202f: /* NARROW NO-BREAK SPACE */
1932 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1933 case 0x3000: /* IDEOGRAPHIC SPACE */
1934 OK = TRUE;
1935 break;
1936
1937 default:
1938 OK = FALSE;
1939 break;
1940 }
1941
1942 if (OK == (d == OP_HSPACE))
1943 {
1944 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
1945 {
1946 active_count--; /* Remove non-match possibility */
1947 next_active_state--;
1948 }
1949 if (++count >= GET2(code, 1))
1950 { ADD_NEW_DATA(-(state_offset + 4), 0, 0); }
1951 else
1952 { ADD_NEW_DATA(-state_offset, count, 0); }
1953 }
1954 }
1955 break;
1956
1957 /* ========================================================================== */
1958 /* These opcodes are followed by a character that is usually compared
1959 to the current subject character; it is loaded into d. We still get
1960 here even if there is no subject character, because in some cases zero
1961 repetitions are permitted. */
1962
1963 /*-----------------------------------------------------------------*/
1964 case OP_CHAR:
1965 if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
1966 break;
1967
1968 /*-----------------------------------------------------------------*/
1969 case OP_CHARI:
1970 if (clen == 0) break;
1971
1972 #ifdef SUPPORT_UTF8
1973 if (utf8)
1974 {
1975 if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
1976 {
1977 unsigned int othercase;
1978 if (c < 128) othercase = fcc[c]; else
1979
1980 /* If we have Unicode property support, we can use it to test the
1981 other case of the character. */
1982
1983 #ifdef SUPPORT_UCP
1984 othercase = UCD_OTHERCASE(c);
1985 #else
1986 othercase = NOTACHAR;
1987 #endif
1988
1989 if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
1990 }
1991 }
1992 else
1993 #endif /* SUPPORT_UTF8 */
1994
1995 /* Non-UTF-8 mode */
1996 {
1997 if (lcc[c] == lcc[d]) { ADD_NEW(state_offset + 2, 0); }
1998 }
1999 break;
2000
2001
2002 #ifdef SUPPORT_UCP
2003 /*-----------------------------------------------------------------*/
2004 /* This is a tricky one because it can match more than one character.
2005 Find out how many characters to skip, and then set up a negative state
2006 to wait for them to pass before continuing. */
2007
2008 case OP_EXTUNI:
2009 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
2010 {
2011 const uschar *nptr = ptr + clen;
2012 int ncount = 0;
2013 while (nptr < end_subject)
2014 {
2015 int nclen = 1;
2016 GETCHARLEN(c, nptr, nclen);
2017 if (UCD_CATEGORY(c) != ucp_M) break;
2018 ncount++;
2019 nptr += nclen;
2020 }
2021 ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2022 }
2023 break;
2024 #endif
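
/* Example of the negative-state mechanism used by OP_EXTUNI above. This is a
sketch: the handling of negative offsets lives in the main loop, outside this
section. For the subject characters U+0065, U+0301, U+0323, U+0078 (the letter
'e', two combining marks, then 'x'), the code at 'e' finds that c is not a
mark, scans forward over the two marks, and stops at 'x' with ncount equal
to 2. It then adds a state with offset -(state_offset + 1) and data 2, so that
the following opcode is not examined until the two marks have been passed. */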
2025
2026 /*-----------------------------------------------------------------*/
2027 /* This is tricky like EXTUNI because it too can match more than one
2028 character (when CR is followed by LF). In this case, set up a negative
2029 state to wait for one character to pass before continuing. */
2030
2031 case OP_ANYNL:
2032 if (clen > 0) switch(c)
2033 {
2034 case 0x000b:
2035 case 0x000c:
2036 case 0x0085:
2037 case 0x2028:
2038 case 0x2029:
2039 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
2040 /* Fall through */
2041 case 0x000a:
2042 ADD_NEW(state_offset + 1, 0);
2043 break;
2044
2045 case 0x000d:
2046 if (ptr + 1 < end_subject && ptr[1] == 0x0a)
2047 {
2048 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2049 }
2050 else
2051 {
2052 ADD_NEW(state_offset + 1, 0);
2053 }
2054 break;
2055 }
2056 break;
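
/* Summary of the OP_ANYNL cases above, derived from the code:

  LF (0x0a)                              one character, always matches
  CR (0x0d) followed by LF               two characters, via a negative state with data 1
  CR not followed by LF                  one character
  VT, FF, NEL, LS, PS (0x0b, 0x0c,       one character, unless PCRE_BSR_ANYCRLF
    0x85, 0x2028, 0x2029)                is set, in which case there is no match */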
2057
2058 /*-----------------------------------------------------------------*/
2059 case OP_NOT_VSPACE:
2060 if (clen > 0) switch(c)
2061 {
2062 case 0x000a:
2063 case 0x000b:
2064 case 0x000c:
2065 case 0x000d:
2066 case 0x0085:
2067 case 0x2028:
2068 case 0x2029:
2069 break;
2070
2071 default:
2072 ADD_NEW(state_offset + 1, 0);
2073 break;
2074 }
2075 break;
2076
2077 /*-----------------------------------------------------------------*/
2078 case OP_VSPACE:
2079 if (clen > 0) switch(c)
2080 {
2081 case 0x000a:
2082 case 0x000b:
2083 case 0x000c:
2084 case 0x000d:
2085 case 0x0085:
2086 case 0x2028:
2087 case 0x2029:
2088 ADD_NEW(state_offset + 1, 0);
2089 break;
2090
2091 default: break;
2092 }
2093 break;
2094
2095 /*-----------------------------------------------------------------*/
2096 case OP_NOT_HSPACE:
2097 if (clen > 0) switch(c)
2098 {
2099 case 0x09: /* HT */
2100 case 0x20: /* SPACE */
2101 case 0xa0: /* NBSP */
2102 case 0x1680: /* OGHAM SPACE MARK */
2103 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2104 case 0x2000: /* EN QUAD */
2105 case 0x2001: /* EM QUAD */
2106 case 0x2002: /* EN SPACE */
2107 case 0x2003: /* EM SPACE */
2108 case 0x2004: /* THREE-PER-EM SPACE */
2109 case 0x2005: /* FOUR-PER-EM SPACE */
2110 case 0x2006: /* SIX-PER-EM SPACE */
2111 case 0x2007: /* FIGURE SPACE */
2112 case 0x2008: /* PUNCTUATION SPACE */
2113 case 0x2009: /* THIN SPACE */
2114 case 0x200A: /* HAIR SPACE */
2115 case 0x202f: /* NARROW NO-BREAK SPACE */
2116 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2117 case 0x3000: /* IDEOGRAPHIC SPACE */
2118 break;
2119
2120 default:
2121 ADD_NEW(state_offset + 1, 0);
2122 break;
2123 }
2124 break;
2125
2126 /*-----------------------------------------------------------------*/
2127 case OP_HSPACE:
2128 if (clen > 0) switch(c)
2129 {
2130 case 0x09: /* HT */
2131 case 0x20: /* SPACE */
2132 case 0xa0: /* NBSP */
2133 case 0x1680: /* OGHAM SPACE MARK */
2134 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2135 case 0x2000: /* EN QUAD */
2136 case 0x2001: /* EM QUAD */
2137 case 0x2002: /* EN SPACE */
2138 case 0x2003: /* EM SPACE */
2139 case 0x2004: /* THREE-PER-EM SPACE */
2140 case 0x2005: /* FOUR-PER-EM SPACE */
2141 case 0x2006: /* SIX-PER-EM SPACE */
2142 case 0x2007: /* FIGURE SPACE */
2143 case 0x2008: /* PUNCTUATION SPACE */
2144 case 0x2009: /* THIN SPACE */
2145 case 0x200A: /* HAIR SPACE */
2146 case 0x202f: /* NARROW NO-BREAK SPACE */
2147 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2148 case 0x3000: /* IDEOGRAPHIC SPACE */
2149 ADD_NEW(state_offset + 1, 0);
2150 break;
2151 }
2152 break;
2153
2154 /*-----------------------------------------------------------------*/
2155 /* Match a negated single character casefully. This is only used for
2156 one-byte characters, that is, we know that d < 256. The character we are
2157 checking (c) can be multibyte. */
2158
2159 case OP_NOT:
2160 if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2161 break;
2162
2163 /*-----------------------------------------------------------------*/
2164 /* Match a negated single character caselessly. This is only used for
2165 one-byte characters, that is, we know that d < 256. The character we are
2166 checking (c) can be multibyte. */
2167
2168 case OP_NOTI:
2169 if (clen > 0 && c != d && c != fcc[d])
2170 { ADD_NEW(state_offset + dlen + 1, 0); }
2171 break;
2172
2173 /*-----------------------------------------------------------------*/
2174 case OP_PLUSI:
2175 case OP_MINPLUSI:
2176 case OP_POSPLUSI:
2177 case OP_NOTPLUSI:
2178 case OP_NOTMINPLUSI:
2179 case OP_NOTPOSPLUSI:
2180 caseless = TRUE;
2181 codevalue -= OP_STARI - OP_STAR;
2182
2183 /* Fall through */
2184 case OP_PLUS:
2185 case OP_MINPLUS:
2186 case OP_POSPLUS:
2187 case OP_NOTPLUS:
2188 case OP_NOTMINPLUS:
2189 case OP_NOTPOSPLUS:
2190 count = current_state->count; /* Already matched */
2191 if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2192 if (clen > 0)
2193 {
2194 unsigned int otherd = NOTACHAR;
2195 if (caseless)
2196 {
2197 #ifdef SUPPORT_UTF8
2198 if (utf8 && d >= 128)
2199 {
2200 #ifdef SUPPORT_UCP
2201 otherd = UCD_OTHERCASE(d);
2202 #endif /* SUPPORT_UCP */
2203 }
2204 else
2205 #endif /* SUPPORT_UTF8 */
2206 otherd = fcc[d];
2207 }
2208 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2209 {
2210 if (count > 0 &&
2211 (codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2212 {
2213 active_count--; /* Remove non-match possibility */
2214 next_active_state--;
2215 }
2216 count++;
2217 ADD_NEW(state_offset, count);
2218 }
2219 }
2220 break;
2221
2222 /*-----------------------------------------------------------------*/
2223 case OP_QUERYI:
2224 case OP_MINQUERYI:
2225 case OP_POSQUERYI:
2226 case OP_NOTQUERYI:
2227 case OP_NOTMINQUERYI:
2228 case OP_NOTPOSQUERYI:
2229 caseless = TRUE;
2230 codevalue -= OP_STARI - OP_STAR;
2231 /* Fall through */
2232 case OP_QUERY:
2233 case OP_MINQUERY:
2234 case OP_POSQUERY:
2235 case OP_NOTQUERY:
2236 case OP_NOTMINQUERY:
2237 case OP_NOTPOSQUERY:
2238 ADD_ACTIVE(state_offset + dlen + 1, 0);
2239 if (clen > 0)
2240 {
2241 unsigned int otherd = NOTACHAR;
2242 if (caseless)
2243 {
2244 #ifdef SUPPORT_UTF8
2245 if (utf8 && d >= 128)
2246 {
2247 #ifdef SUPPORT_UCP
2248 otherd = UCD_OTHERCASE(d);
2249 #endif /* SUPPORT_UCP */
2250 }
2251 else
2252 #endif /* SUPPORT_UTF8 */
2253 otherd = fcc[d];
2254 }
2255 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2256 {
2257 if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2258 {
2259 active_count--; /* Remove non-match possibility */
2260 next_active_state--;
2261 }
2262 ADD_NEW(state_offset + dlen + 1, 0);
2263 }
2264 }
2265 break;
2266
2267 /*-----------------------------------------------------------------*/
2268 case OP_STARI:
2269 case OP_MINSTARI:
2270 case OP_POSSTARI:
2271 case OP_NOTSTARI:
2272 case OP_NOTMINSTARI:
2273 case OP_NOTPOSSTARI:
2274 caseless = TRUE;
2275 codevalue -= OP_STARI - OP_STAR;
2276 /* Fall through */
2277 case OP_STAR:
2278 case OP_MINSTAR:
2279 case OP_POSSTAR:
2280 case OP_NOTSTAR:
2281 case OP_NOTMINSTAR:
2282 case OP_NOTPOSSTAR:
2283 ADD_ACTIVE(state_offset + dlen + 1, 0);
2284 if (clen > 0)
2285 {
2286 unsigned int otherd = NOTACHAR;
2287 if (caseless)
2288 {
2289 #ifdef SUPPORT_UTF8
2290 if (utf8 && d >= 128)
2291 {
2292 #ifdef SUPPORT_UCP
2293 otherd = UCD_OTHERCASE(d);
2294 #endif /* SUPPORT_UCP */
2295 }
2296 else
2297 #endif /* SUPPORT_UTF8 */
2298 otherd = fcc[d];
2299 }
2300 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2301 {
2302 if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2303 {
2304 active_count--; /* Remove non-match possibility */
2305 next_active_state--;
2306 }
2307 ADD_NEW(state_offset, 0);
2308 }
2309 }
2310 break;
2311
2312 /*-----------------------------------------------------------------*/
2313 case OP_EXACTI:
2314 case OP_NOTEXACTI:
2315 caseless = TRUE;
2316 codevalue -= OP_STARI - OP_STAR;
2317 /* Fall through */
2318 case OP_EXACT:
2319 case OP_NOTEXACT:
2320 count = current_state->count; /* Number already matched */
2321 if (clen > 0)
2322 {
2323 unsigned int otherd = NOTACHAR;
2324 if (caseless)
2325 {
2326 #ifdef SUPPORT_UTF8
2327 if (utf8 && d >= 128)
2328 {
2329 #ifdef SUPPORT_UCP
2330 otherd = UCD_OTHERCASE(d);
2331 #endif /* SUPPORT_UCP */
2332 }
2333 else
2334 #endif /* SUPPORT_UTF8 */
2335 otherd = fcc[d];
2336 }
2337 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2338 {
2339 if (++count >= GET2(code, 1))
2340 { ADD_NEW(state_offset + dlen + 3, 0); }
2341 else
2342 { ADD_NEW(state_offset, count); }
2343 }
2344 }
2345 break;
2346
2347 /*-----------------------------------------------------------------*/
2348 case OP_UPTOI:
2349 case OP_MINUPTOI:
2350 case OP_POSUPTOI:
2351 case OP_NOTUPTOI:
2352 case OP_NOTMINUPTOI:
2353 case OP_NOTPOSUPTOI:
2354 caseless = TRUE;
2355 codevalue -= OP_STARI - OP_STAR;
2356 /* Fall through */
2357 case OP_UPTO:
2358 case OP_MINUPTO:
2359 case OP_POSUPTO:
2360 case OP_NOTUPTO:
2361 case OP_NOTMINUPTO:
2362 case OP_NOTPOSUPTO:
2363 ADD_ACTIVE(state_offset + dlen + 3, 0);
2364 count = current_state->count; /* Number already matched */
2365 if (clen > 0)
2366 {
2367 unsigned int otherd = NOTACHAR;
2368 if (caseless)
2369 {
2370 #ifdef SUPPORT_UTF8
2371 if (utf8 && d >= 128)
2372 {
2373 #ifdef SUPPORT_UCP
2374 otherd = UCD_OTHERCASE(d);
2375 #endif /* SUPPORT_UCP */
2376 }
2377 else
2378 #endif /* SUPPORT_UTF8 */
2379 otherd = fcc[d];
2380 }
2381 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2382 {
2383 if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2384 {
2385 active_count--; /* Remove non-match possibility */
2386 next_active_state--;
2387 }
2388 if (++count >= GET2(code, 1))
2389 { ADD_NEW(state_offset + dlen + 3, 0); }
2390 else
2391 { ADD_NEW(state_offset, count); }
2392 }
2393 }
2394 break;
2395
2396
2397 /* ========================================================================== */
2398 /* These are the class-handling opcodes */
2399
2400 case OP_CLASS:
2401 case OP_NCLASS:
2402 case OP_XCLASS:
2403 {
2404 BOOL isinclass = FALSE;
2405 int next_state_offset;
2406 const uschar *ecode;
2407
2408 /* For a simple class, there is always just a 32-byte table, and we
2409 can set isinclass from it. */
2410
2411 if (codevalue != OP_XCLASS)
2412 {
2413 ecode = code + 33;
2414 if (clen > 0)
2415 {
2416 isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2417 ((code[1 + c/8] & (1 << (c&7))) != 0);
2418 }
2419 }
2420
2421 /* An extended class may have a table or a list of single characters,
2422 ranges, or both, and it may be positive or negative. There's a
2423 function that sorts all this out. */
2424
2425 else
2426 {
2427 ecode = code + GET(code, 1);
2428 if (clen > 0) isinclass = _pcre_xclass(c, code + 1 + LINK_SIZE);
2429 }
2430
2431 /* At this point, isinclass is set for all kinds of class, and ecode
2432 points to the byte after the end of the class. If there is a
2433 quantifier, this is where it will be. */
2434
2435 next_state_offset = (int)(ecode - start_code);
2436
2437 switch (*ecode)
2438 {
2439 case OP_CRSTAR:
2440 case OP_CRMINSTAR:
2441 ADD_ACTIVE(next_state_offset + 1, 0);
2442 if (isinclass) { ADD_NEW(state_offset, 0); }
2443 break;
2444
2445 case OP_CRPLUS:
2446 case OP_CRMINPLUS:
2447 count = current_state->count; /* Already matched */
2448 if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2449 if (isinclass) { count++; ADD_NEW(state_offset, count); }
2450 break;
2451
2452 case OP_CRQUERY:
2453 case OP_CRMINQUERY:
2454 ADD_ACTIVE(next_state_offset + 1, 0);
2455 if (isinclass) { ADD_NEW(next_state_offset + 1, 0); }
2456 break;
2457
2458 case OP_CRRANGE:
2459 case OP_CRMINRANGE:
2460 count = current_state->count; /* Already matched */
2461 if (count >= GET2(ecode, 1))
2462 { ADD_ACTIVE(next_state_offset + 5, 0); }
2463 if (isinclass)
2464 {
2465 int max = GET2(ecode, 3);
2466 if (++count >= max && max != 0) /* Max 0 => no limit */
2467 { ADD_NEW(next_state_offset + 5, 0); }
2468 else
2469 { ADD_NEW(state_offset, count); }
2470 }
2471 break;
2472
2473 default:
2474 if (isinclass) { ADD_NEW(next_state_offset, 0); }
2475 break;
2476 }
2477 }
2478 break;
2479
2480 /* ========================================================================== */
2481 /* These are the opcodes for fancy brackets of various kinds. We have
2482 to use recursion in order to handle them. The "always failing" assertion
2483 (?!) is optimised to OP_FAIL when compiling, so we have to support that,
2484 though the other "backtracking verbs" are not supported. */
2485
2486 case OP_FAIL:
2487 forced_fail++; /* Count FAILs for multiple states */
2488 break;
2489
2490 case OP_ASSERT:
2491 case OP_ASSERT_NOT:
2492 case OP_ASSERTBACK:
2493 case OP_ASSERTBACK_NOT:
2494 {
2495 int rc;
2496 int local_offsets[2];
2497 int local_workspace[1000];
2498 const uschar *endasscode = code + GET(code, 1);
2499
2500 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2501
2502 rc = internal_dfa_exec(
2503 md, /* static match data */
2504 code, /* this subexpression's code */
2505 ptr, /* where we currently are */
2506 (int)(ptr - start_subject), /* start offset */
2507 local_offsets, /* offset vector */
2508 sizeof(local_offsets)/sizeof(int), /* size of same */
2509 local_workspace, /* workspace vector */
2510 sizeof(local_workspace)/sizeof(int), /* size of same */
2511 rlevel, /* function recursion level */
2512 recursing); /* pass on regex recursion */
2513
2514 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2515 if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2516 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2517 }
2518 break;
2519
2520 /*-----------------------------------------------------------------*/
2521 case OP_COND:
2522 case OP_SCOND:
2523 {
2524 int local_offsets[1000];
2525 int local_workspace[1000];
2526 int codelink = GET(code, 1);
2527 int condcode;
2528
2529 /* Because of the way auto-callout works during compile, a callout item
2530 is inserted between OP_COND and an assertion condition. This does not
2531 happen for the other conditions. */
2532
2533 if (code[LINK_SIZE+1] == OP_CALLOUT)
2534 {
2535 rrc = 0;
2536 if (pcre_callout != NULL)
2537 {
2538 pcre_callout_block cb;
2539 cb.version = 1; /* Version 1 of the callout block */
2540 cb.callout_number = code[LINK_SIZE+2];
2541 cb.offset_vector = offsets;
2542 cb.subject = (PCRE_SPTR)start_subject;
2543 cb.subject_length = (int)(end_subject - start_subject);
2544 cb.start_match = (int)(current_subject - start_subject);
2545 cb.current_position = (int)(ptr - start_subject);
2546 cb.pattern_position = GET(code, LINK_SIZE + 3);
2547 cb.next_item_length = GET(code, 3 + 2*LINK_SIZE);
2548 cb.capture_top = 1;
2549 cb.capture_last = -1;
2550 cb.callout_data = md->callout_data;
2551 if ((rrc = (*pcre_callout)(&cb)) < 0) return rrc; /* Abandon */
2552 }
2553 if (rrc > 0) break; /* Fail this thread */
2554 code += _pcre_OP_lengths[OP_CALLOUT]; /* Skip callout data */
2555 }
2556
2557 condcode = code[LINK_SIZE+1];
2558
2559 /* Back reference conditions are not supported */
2560
2561 if (condcode == OP_CREF || condcode == OP_NCREF)
2562 return PCRE_ERROR_DFA_UCOND;
2563
2564 /* The DEFINE condition is always false */
2565
2566 if (condcode == OP_DEF)
2567 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2568
2569 /* The only supported version of OP_RREF is for the value RREF_ANY,
2570 which means "test if in any recursion". We can't test for specifically
2571 recursed groups. */
2572
2573 else if (condcode == OP_RREF || condcode == OP_NRREF)
2574 {
2575 int value = GET2(code, LINK_SIZE+2);
2576 if (value != RREF_ANY) return PCRE_ERROR_DFA_UCOND;
2577 if (recursing > 0)
2578 { ADD_ACTIVE(state_offset + LINK_SIZE + 4, 0); }
2579 else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2580 }
2581
2582 /* Otherwise, the condition is an assertion */
2583
2584 else
2585 {
2586 int rc;
2587 const uschar *asscode = code + LINK_SIZE + 1;
2588 const uschar *endasscode = asscode + GET(asscode, 1);
2589
2590 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2591
2592 rc = internal_dfa_exec(
2593 md, /* fixed match data */
2594 asscode, /* this subexpression's code */
2595 ptr, /* where we currently are */
2596 (int)(ptr - start_subject), /* start offset */
2597 local_offsets, /* offset vector */
2598 sizeof(local_offsets)/sizeof(int), /* size of same */
2599 local_workspace, /* workspace vector */
2600 sizeof(local_workspace)/sizeof(int), /* size of same */
2601 rlevel, /* function recursion level */
2602 recursing); /* pass on regex recursion */
2603
2604 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2605 if ((rc >= 0) ==
2606 (condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2607 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2608 else
2609 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2610 }
2611 }
2612 break;
2613
2614 /*-----------------------------------------------------------------*/
2615 case OP_RECURSE:
2616 {
2617 int local_offsets[1000];
2618 int local_workspace[1000];
2619 int rc;
2620
2621 DPRINTF(("%.*sStarting regex recursion %d\n", rlevel*2-2, SP,
2622 recursing + 1));
2623
2624 rc = internal_dfa_exec(
2625 md, /* fixed match data */
2626 start_code + GET(code, 1), /* this subexpression's code */
2627 ptr, /* where we currently are */
2628 (int)(ptr - start_subject), /* start offset */
2629 local_offsets, /* offset vector */
2630 sizeof(local_offsets)/sizeof(int), /* size of same */
2631 local_workspace, /* workspace vector */
2632 sizeof(local_workspace)/sizeof(int), /* size of same */
2633 rlevel, /* function recursion level */
2634 recursing + 1); /* regex recurse level */
2635
2636 DPRINTF(("%.*sReturn from regex recursion %d: rc=%d\n", rlevel*2-2, SP,
2637 recursing + 1, rc));
2638
2639 /* Ran out of internal offsets */
2640
2641 if (rc == 0) return PCRE_ERROR_DFA_RECURSE;
2642
2643 /* For each successful matched substring, set up the next state with a
2644 count of characters to skip before trying it. Note that the count is in
2645 characters, not bytes. */
2646
2647 if (rc > 0)
2648 {
2649 for (rc = rc*2 - 2; rc >= 0; rc -= 2)
2650 {
2651 const uschar *p = start_subject + local_offsets[rc];
2652 const uschar *pp = start_subject + local_offsets[rc+1];
2653 int charcount = local_offsets[rc+1] - local_offsets[rc];
2654 while (p < pp) if ((*p++ & 0xc0) == 0x80) charcount--;
2655 if (charcount > 0)
2656 {
2657 ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0, (charcount - 1));
2658 }
2659 else
2660 {
2661 ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
2662 }
2663 }
2664 }
2665 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2666 }
2667 break;
2668
2669 /*-----------------------------------------------------------------*/
2670 case OP_ONCE:
2671 {
2672 int local_offsets[2];
2673 int local_workspace[1000];
2674
2675 int rc = internal_dfa_exec(
2676 md, /* fixed match data */
2677 code, /* this subexpression's code */
2678 ptr, /* where we currently are */
2679 (int)(ptr - start_subject), /* start offset */
2680 local_offsets, /* offset vector */
2681 sizeof(local_offsets)/sizeof(int), /* size of same */
2682 local_workspace, /* workspace vector */
2683 sizeof(local_workspace)/sizeof(int), /* size of same */
2684 rlevel, /* function recursion level */
2685 recursing); /* pass on regex recursion */
2686
2687 if (rc >= 0)
2688 {
2689 const uschar *end_subpattern = code;
2690 int charcount = local_offsets[1] - local_offsets[0];
2691 int next_state_offset, repeat_state_offset;
2692
2693 do { end_subpattern += GET(end_subpattern, 1); }
2694 while (*end_subpattern == OP_ALT);
2695 next_state_offset =
2696 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2697
2698 /* If the end of this subpattern is KETRMAX or KETRMIN, we must
2699 arrange for the repeat state also to be added to the relevant list.
2700 Calculate the offset, or set -1 for no repeat. */
2701
2702 repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
2703 *end_subpattern == OP_KETRMIN)?
2704 (int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
2705
2706 /* If we have matched an empty string, add the next state at the
2707 current character pointer. This is important so that the duplicate
2708 checking kicks in, which is what breaks infinite loops that match an
2709 empty string. */
2710
2711 if (charcount == 0)
2712 {
2713 ADD_ACTIVE(next_state_offset, 0);
2714 }
2715
2716 /* Optimization: if there are no more active states, and there
2717 are no new states yet set up, then skip over the subject string
2718 right here, to save looping. Otherwise, set up the new state to swing
2719 into action when the end of the substring is reached. */
2720
2721 else if (i + 1 >= active_count && new_count == 0)
2722 {
2723 ptr += charcount;
2724 clen = 0;
2725 ADD_NEW(next_state_offset, 0);
2726
2727 /* If we are adding a repeat state at the new character position,
2728 we must fudge things so that it is the only current state.
2729 Otherwise, it might be a duplicate of one we processed before, and
2730 that would cause it to be skipped. */
2731
2732 if (repeat_state_offset >= 0)
2733 {
2734 next_active_state = active_states;
2735 active_count = 0;
2736 i = -1;
2737 ADD_ACTIVE(repeat_state_offset, 0);
2738 }
2739 }
2740 else
2741 {
2742 const uschar *p = start_subject + local_offsets[0];
2743 const uschar *pp = start_subject + local_offsets[1];
2744 while (p < pp) if ((*p++ & 0xc0) == 0x80) charcount--;
2745 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2746 if (repeat_state_offset >= 0)
2747 { ADD_NEW_DATA(-repeat_state_offset, 0, (charcount - 1)); }
2748 }
2749
2750 }
2751 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2752 }
2753 break;
2754
2755
2756 /* ========================================================================== */
2757 /* Handle callouts */
2758
2759 case OP_CALLOUT:
2760 rrc = 0;
2761 if (pcre_callout != NULL)
2762 {
2763 pcre_callout_block cb;
2764 cb.version = 1; /* Version 1 of the callout block */
2765 cb.callout_number = code[1];
2766 cb.offset_vector = offsets;
2767 cb.subject = (PCRE_SPTR)start_subject;
2768 cb.subject_length = (int)(end_subject - start_subject);
2769 cb.start_match = (int)(current_subject - start_subject);
2770 cb.current_position = (int)(ptr - start_subject);
2771 cb.pattern_position = GET(code, 2);
2772 cb.next_item_length = GET(code, 2 + LINK_SIZE);
2773 cb.capture_top = 1;
2774 cb.capture_last = -1;
2775 cb.callout_data = md->callout_data;
2776 if ((rrc = (*pcre_callout)(&cb)) < 0) return rrc; /* Abandon */
2777 }
2778 if (rrc == 0)
2779 { ADD_ACTIVE(state_offset + _pcre_OP_lengths[OP_CALLOUT], 0); }
2780 break;
2781
2782
2783 /* ========================================================================== */
2784 default: /* Unsupported opcode */
2785 return PCRE_ERROR_DFA_UITEM;
2786 }
2787
2788 NEXT_ACTIVE_STATE: continue;
2789
2790 } /* End of loop scanning active states */
2791
2792 /* We have finished the processing at the current subject character. If no
2793 new states have been set for the next character, we have found all the
2794 matches that we are going to find. If we are at the top level and partial
2795 matching has been requested, check for appropriate conditions.
2796
2797 The "forced_fail" variable counts the number of (*F) encountered for the
2798 character. If it is equal to the original active_count (saved in
2799 workspace[1]) it means that (*F) was found on every active state. In this
2800 case we don't want to give a partial match.
2801
2802 The "could_continue" variable is true if a state could have continued but
2803 for the fact that the end of the subject was reached. */
2804
2805 if (new_count <= 0)
2806 {
2807 if (rlevel == 1 && /* Top level, and */
2808 could_continue && /* Some could go on */
2809 forced_fail != workspace[1] && /* Not all forced fail & */
2810 ( /* either... */
2811 (md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
2812 || /* or... */
2813 ((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
2814 match_count < 0) /* no matches */
2815 ) && /* And... */
2816 ptr >= end_subject && /* Reached end of subject */
2817 ptr > md->start_used_ptr) /* Inspected non-empty string */
2818 {
2819 if (offsetcount >= 2)
2820 {
2821 offsets[0] = (int)(md->start_used_ptr - start_subject);
2822 offsets[1] = (int)(end_subject - start_subject);
2823 }
2824 match_count = PCRE_ERROR_PARTIAL;
2825 }
2826
2827 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
2828 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel, match_count,
2829 rlevel*2-2, SP));
2830 break; /* In effect, "return", but see the comment below */
2831 }
2832
2833 /* One or more states are active for the next character. */
2834
2835 ptr += clen; /* Advance to next subject character */
2836 } /* Loop to move along the subject string */
2837
2838 /* Control gets here from "break" a few lines above. We do it this way because
2839 if we use "return" above, we have compiler trouble. Some compilers warn if
2840 there's nothing here because they think the function doesn't return a value. On
2841 the other hand, if we put a dummy statement here, some more clever compilers
2842 complain that it can't be reached. Sigh. */
2843
2844 return match_count;
2845 }
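
/* Illustrative sketch, not part of PCRE: the OP_RECURSE and OP_ONCE cases
above turn a matched byte length into a character count by subtracting one for
each UTF-8 continuation byte. The helper name utf8_char_count is hypothetical,
and the sketch assumes the byte range holds valid UTF-8. */

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in the byte range [p, pp). Start from the byte length
   and subtract one for every continuation byte (bit pattern 10xxxxxx), which
   is the same test used in the matching code above. */
static int utf8_char_count(const unsigned char *p, const unsigned char *pp)
{
  int charcount;
  assert(p != NULL && pp != NULL && p <= pp);
  charcount = (int)(pp - p);
  while (p < pp) if ((*p++ & 0xc0) == 0x80) charcount--;
  return charcount;
}
```

/* For single-byte subjects the loop never finds a continuation byte, so the
result is simply the byte length. */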
2846
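
/* Illustrative sketch, not part of PCRE: the OP_CLASS/OP_NCLASS case above
tests membership in a 32-byte bitmap, one bit per character value below 256.
The helper names are hypothetical. The sketch assumes, as the code above
suggests, that the bitmap of a negated class already holds the complemented
bits, so the "negated" flag only matters for characters above 255. */

```c
#include <assert.h>
#include <string.h>

/* Empty the 256-bit class map. */
static void class_map_clear(unsigned char map[32])
{
  memset(map, 0, 32);
}

/* Add a character value below 256 to the map. */
static void class_map_add(unsigned char map[32], unsigned int c)
{
  assert(c < 256);
  map[c/8] |= (unsigned char)(1 << (c&7));
}

/* Membership test: characters above 255 have no bit in the map, so they
   match only when the class is a negated one. */
static int class_map_test(const unsigned char map[32], unsigned int c,
  int negated)
{
  if (c > 255) return negated != 0;
  return (map[c/8] & (1 << (c&7))) != 0;
}
```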
2847
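
/* Illustrative sketch, not part of PCRE: pcre_dfa_exec() below packs a fixed
two-byte newline such as CRLF into one int as (CHAR_CR << 8) | CHAR_NL and
later splits it into md->nl[] and md->nllen. The helper name
unpack_fixed_newline is hypothetical and covers only the fixed-newline case,
not the ANY or ANYCRLF values (-1 and -2). */

```c
#include <assert.h>

/* Split a packed fixed newline into its bytes and return the length (1 or 2).
   A value above 255 is taken to hold two bytes, first byte in the high part. */
static int unpack_fixed_newline(int newline, unsigned char nl[2])
{
  assert(newline >= 0 && newline <= 0xffff);
  if (newline > 255)
    {
    nl[0] = (unsigned char)((newline >> 8) & 255);
    nl[1] = (unsigned char)(newline & 255);
    return 2;
    }
  nl[0] = (unsigned char)newline;
  return 1;
}
```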
2848
2849
2850 /*************************************************
2851 * Execute a Regular Expression - DFA engine *
2852 *************************************************/
2853
2854 /* This external function applies a compiled re to a subject string using a DFA
2855 engine. This function calls the internal function multiple times if the pattern
2856 is not anchored.
2857
2858 Arguments:
2859 argument_re points to the compiled expression
2860 extra_data points to extra data or is NULL
2861 subject points to the subject string
2862 length length of subject string (may contain binary zeros)
2863 start_offset where to start in the subject string
2864 options option bits
2865 offsets vector of match offsets
2866 offsetcount size of same
2867 workspace workspace vector
2868 wscount size of same
2869
2870 Returns: > 0 => number of match offset pairs placed in offsets
2871 = 0 => offsets overflowed; longest matches are present
2872 -1 => failed to match
2873 < -1 => some kind of unexpected problem
2874 */
2875
2876 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
2877 pcre_dfa_exec(const pcre *argument_re, const pcre_extra *extra_data,
2878 const char *subject, int length, int start_offset, int options, int *offsets,
2879 int offsetcount, int *workspace, int wscount)
2880 {
2881 real_pcre *re = (real_pcre *)argument_re;
2882 dfa_match_data match_block;
2883 dfa_match_data *md = &match_block;
2884 BOOL utf8, anchored, startline, firstline;
2885 const uschar *current_subject, *end_subject, *lcc;
2886
2887 pcre_study_data internal_study;
2888 const pcre_study_data *study = NULL;
2889 real_pcre internal_re;
2890
2891 const uschar *req_byte_ptr;
2892 const uschar *start_bits = NULL;
2893 BOOL first_byte_caseless = FALSE;
2894 BOOL req_byte_caseless = FALSE;
2895 int first_byte = -1;
2896 int req_byte = -1;
2897 int req_byte2 = -1;
2898 int newline;
2899
2900 /* Plausibility checks */
2901
2902 if ((options & ~PUBLIC_DFA_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
2903 if (re == NULL || subject == NULL || workspace == NULL ||
2904 (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
2905 if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
2906 if (wscount < 20) return PCRE_ERROR_DFA_WSSIZE;
2907 if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
2908
2909 /* We need to find the pointer to any study data before we test for byte
2910 flipping, so we scan the extra_data block first. This may set two fields in the
2911 match block, so we must initialize them beforehand. However, the other fields
2912 in the match block must not be set until after the byte flipping. */
2913
2914 md->tables = re->tables;
2915 md->callout_data = NULL;
2916
2917 if (extra_data != NULL)
2918 {
2919 unsigned int flags = extra_data->flags;
2920 if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
2921 study = (const pcre_study_data *)extra_data->study_data;
2922 if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0) return PCRE_ERROR_DFA_UMLIMIT;
2923 if ((flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION) != 0)
2924 return PCRE_ERROR_DFA_UMLIMIT;
2925 if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
2926 md->callout_data = extra_data->callout_data;
2927 if ((flags & PCRE_EXTRA_TABLES) != 0)
2928 md->tables = extra_data->tables;
2929 }
2930
2931 /* Check that the first field in the block is the magic number. If it is not,
2932 test for a regex that was compiled on a host of opposite endianness. If this is
2933 the case, flipped values are put in internal_re and internal_study if there was
2934 study data too. */
2935
2936 if (re->magic_number != MAGIC_NUMBER)
2937 {
2938 re = _pcre_try_flipped(re, &internal_re, study, &internal_study);
2939 if (re == NULL) return PCRE_ERROR_BADMAGIC;
2940 if (study != NULL) study = &internal_study;
2941 }
2942
2943 /* Set some local values */
2944
2945 current_subject = (const unsigned char *)subject + start_offset;
2946 end_subject = (const unsigned char *)subject + length;
2947 req_byte_ptr = current_subject - 1;
2948
2949 #ifdef SUPPORT_UTF8
2950 utf8 = (re->options & PCRE_UTF8) != 0;
2951 #else
2952 utf8 = FALSE;
2953 #endif
2954
2955 anchored = (options & (PCRE_ANCHORED|PCRE_DFA_RESTART)) != 0 ||
2956 (re->options & PCRE_ANCHORED) != 0;
2957
2958 /* The remaining fixed data for passing around. */
2959
2960 md->start_code = (const uschar *)argument_re +
2961 re->name_table_offset + re->name_count * re->name_entry_size;
2962 md->start_subject = (const unsigned char *)subject;
2963 md->end_subject = end_subject;
2964 md->start_offset = start_offset;
2965 md->moptions = options;
2966 md->poptions = re->options;
2967
2968 /* If the BSR option is not set at match time, copy what was set
2969 at compile time. */
2970
2971 if ((md->moptions & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) == 0)
2972 {
2973 if ((re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) != 0)
2974 md->moptions |= re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE);
2975 #ifdef BSR_ANYCRLF
2976 else md->moptions |= PCRE_BSR_ANYCRLF;
2977 #endif
2978 }
2979
2980 /* Handle different types of newline. The three bits give eight cases. If
2981 nothing is set at run time, whatever was used at compile time applies. */
2982
2983 switch ((((options & PCRE_NEWLINE_BITS) == 0)? re->options : (pcre_uint32)options) &
2984 PCRE_NEWLINE_BITS)
2985 {
2986 case 0: newline = NEWLINE; break; /* Compile-time default */
2987 case PCRE_NEWLINE_CR: newline = CHAR_CR; break;
2988 case PCRE_NEWLINE_LF: newline = CHAR_NL; break;
2989 case PCRE_NEWLINE_CR+
2990 PCRE_NEWLINE_LF: newline = (CHAR_CR << 8) | CHAR_NL; break;
2991 case PCRE_NEWLINE_ANY: newline = -1; break;
2992 case PCRE_NEWLINE_ANYCRLF: newline = -2; break;
2993 default: return PCRE_ERROR_BADNEWLINE;
2994 }
2995
2996 if (newline == -2)
2997 {
2998 md->nltype = NLTYPE_ANYCRLF;
2999 }
3000 else if (newline < 0)
3001 {
3002 md->nltype = NLTYPE_ANY;
3003 }
3004 else
3005 {
3006 md->nltype = NLTYPE_FIXED;
3007 if (newline > 255)
3008 {
3009 md->nllen = 2;
3010 md->nl[0] = (newline >> 8) & 255;
3011 md->nl[1] = newline & 255;
3012 }
3013 else
3014 {
3015 md->nllen = 1;
3016 md->nl[0] = newline;
3017 }
3018 }
3019
3020 /* Check a UTF-8 string if required. On failure, if offsetcount is at least 2,
3021 the offset returned by _pcre_valid_utf8() and the error code are passed back in offsets[0] and offsets[1]. */
3022
3023 #ifdef SUPPORT_UTF8
3024 if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
3025 {
3026 int errorcode;
3027 int tb = _pcre_valid_utf8((uschar *)subject, length, &errorcode);
3028 if (tb >= 0)
3029 {
3030 if (offsetcount >= 2)
3031 {
3032 offsets[0] = tb;
3033 offsets[1] = errorcode;
3034 }
3035 return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0)?
3036 PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
3037 }
3038 if (start_offset > 0 && start_offset < length)
3039 {
3040 tb = ((USPTR)subject)[start_offset] & 0xc0;
3041 if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
3042 }
3043 }
3044 #endif
3045
3046 /* If the exec call supplied NULL for tables, use the inbuilt ones. This
3047 is a feature that makes it possible to save compiled regexes and re-use them
3048 in other programs later. */
3049
3050 if (md->tables == NULL) md->tables = _pcre_default_tables;
3051
3052 /* The lower casing table and the "must be at the start of a line" flag are
3053 used in a loop when finding where to start. */
3054
3055 lcc = md->tables + lcc_offset;
3056 startline = (re->flags & PCRE_STARTLINE) != 0;
3057 firstline = (re->options & PCRE_FIRSTLINE) != 0;
3058
3059 /* Set up the first character to match, if available. The first_byte value is
3060 never set for an anchored regular expression, but the anchoring may be forced
3061 at run time, so we have to test for anchoring. The first char may be unset for
3062 an unanchored pattern, of course. If there's no first char and the pattern was
3063 studied, there may be a bitmap of possible first characters. */
3064
3065 if (!anchored)
3066 {
3067 if ((re->flags & PCRE_FIRSTSET) != 0)
3068 {
3069 first_byte = re->first_byte & 255;
3070 if ((first_byte_caseless = ((re->first_byte & REQ_CASELESS) != 0)) == TRUE)
3071 first_byte = lcc[first_byte];
3072 }
3073 else
3074 {
3075 if (!startline && study != NULL &&
3076 (study->flags & PCRE_STUDY_MAPPED) != 0)
3077 start_bits = study->start_bits;
3078 }
3079 }
3080
3081 /* For anchored or unanchored matches, there may be a "last known required
3082 character" set. */
3083
3084 if ((re->flags & PCRE_REQCHSET) != 0)
3085 {
3086 req_byte = re->req_byte & 255;
3087 req_byte_caseless = (re->req_byte & REQ_CASELESS) != 0;
3088 req_byte2 = (md->tables + fcc_offset)[req_byte]; /* case flipped */
3089 }
3090
/* Call the main matching function, looping for a non-anchored regex after a
failed match. If not restarting, perform certain optimizations at the start of
a match. */

for (;;)
  {
  int rc;

  if ((options & PCRE_DFA_RESTART) == 0)
    {
    const uschar *save_end_subject = end_subject;

    /* If firstline is TRUE, the start of the match is constrained to the first
    line of a multiline string. Implement this by temporarily adjusting
    end_subject so that we stop scanning at a newline. If the match fails at
    the newline, later code breaks this loop. */

    if (firstline)
      {
      USPTR t = current_subject;
#ifdef SUPPORT_UTF8
      if (utf8)
        {
        while (t < md->end_subject && !IS_NEWLINE(t))
          {
          t++;
          while (t < end_subject && (*t & 0xc0) == 0x80) t++;
          }
        }
      else
#endif
      while (t < md->end_subject && !IS_NEWLINE(t)) t++;
      end_subject = t;
      }
3125
    /* There are some optimizations that avoid running the match if a known
    starting point is not found. However, there is an option that disables
    these, for testing and for ensuring that all callouts do actually occur.
    The option can be set in the regex by (*NO_START_OPT) or passed in
    match-time options. */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
      {
      /* Advance to a known first byte. */

      if (first_byte >= 0)
        {
        if (first_byte_caseless)
          while (current_subject < end_subject &&
                 lcc[*current_subject] != first_byte)
            current_subject++;
        else
          while (current_subject < end_subject &&
                 *current_subject != first_byte)
            current_subject++;
        }

      /* Or to just after a linebreak for a multiline match if possible */

      else if (startline)
        {
        if (current_subject > md->start_subject + start_offset)
          {
#ifdef SUPPORT_UTF8
          if (utf8)
            {
            while (current_subject < end_subject &&
                   !WAS_NEWLINE(current_subject))
              {
              current_subject++;
              while (current_subject < end_subject &&
                     (*current_subject & 0xc0) == 0x80)
                current_subject++;
              }
            }
          else
#endif
          while (current_subject < end_subject && !WAS_NEWLINE(current_subject))
            current_subject++;

          /* If we have just passed a CR and the newline option is ANY or
          ANYCRLF, and we are now at a LF, advance the match position by one
          more character. */

          if (current_subject[-1] == CHAR_CR &&
               (md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
               current_subject < end_subject &&
               *current_subject == CHAR_NL)
            current_subject++;
          }
        }

      /* Or to a non-unique first char after study */

      else if (start_bits != NULL)
        {
        while (current_subject < end_subject)
          {
          register unsigned int c = *current_subject;
          if ((start_bits[c/8] & (1 << (c&7))) == 0)
            {
            current_subject++;
#ifdef SUPPORT_UTF8
            if (utf8)
              while (current_subject < end_subject &&
                     (*current_subject & 0xc0) == 0x80) current_subject++;
#endif
            }
          else break;
          }
        }
      }

    /* Restore fudged end_subject */

    end_subject = save_end_subject;
3207
    /* The following two optimizations are disabled for partial matching or if
    disabling is explicitly requested, either in the pattern or in the
    match-time options (and of course, by the test above, this code is not
    obeyed when restarting after a partial match). */

    if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0 &&
        (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
      {
      /* If the pattern was studied, a minimum subject length may be set. This
      is a lower bound; there may be no string of that length that actually
      matches the pattern. Although the value is, strictly, in characters, we
      treat it as bytes to avoid spending too much time in this optimization. */

      if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
          (pcre_uint32)(end_subject - current_subject) < study->minlength)
        return PCRE_ERROR_NOMATCH;
3223
      /* If req_byte is set, we know that that character must appear in the
      subject for the match to succeed. If the first character is set, req_byte
      must be later in the subject; otherwise the test starts at the match
      point. This optimization can save a huge amount of work in patterns with
      nested unlimited repeats that aren't going to match. Writing separate
      code for cased/caseless versions makes it go faster, as does using an
      autoincrement and backing off on a match.

      HOWEVER: when the subject string is very, very long, searching to its end
      can take a long time, and give bad performance on quite ordinary
      patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
      string... so we don't do this when the string is sufficiently long. */

      if (req_byte >= 0 && end_subject - current_subject < REQ_BYTE_MAX)
        {
        register const uschar *p = current_subject + ((first_byte >= 0)? 1 : 0);

        /* We don't need to repeat the search if we haven't yet reached the
        place we found it at last time. */

        if (p > req_byte_ptr)
          {
          if (req_byte_caseless)
            {
            while (p < end_subject)
              {
              register int pp = *p++;
              if (pp == req_byte || pp == req_byte2) { p--; break; }
              }
            }
          else
            {
            while (p < end_subject)
              {
              if (*p++ == req_byte) { p--; break; }
              }
            }

          /* If we can't find the required character, break the matching loop,
          which will cause a return of PCRE_ERROR_NOMATCH. */

          if (p >= end_subject) break;

          /* If we have found the required character, save the point where we
          found it, so that we don't search again next time round the loop if
          the start hasn't passed this character yet. */

          req_byte_ptr = p;
          }
        }
      }
    }   /* End of optimizations that are done when not restarting */
3276
  /* OK, now we can do the business */

  md->start_used_ptr = current_subject;

  rc = internal_dfa_exec(
    md,                                /* fixed match data */
    md->start_code,                    /* this subexpression's code */
    current_subject,                   /* where we currently are */
    start_offset,                      /* start offset in subject */
    offsets,                           /* offset vector */
    offsetcount,                       /* size of same */
    workspace,                         /* workspace vector */
    wscount,                           /* size of same */
    0,                                 /* function recurse level */
    0);                                /* regex recurse level */

  /* Anything other than "no match" means we are done, always; otherwise, carry
  on only if not anchored. */

  if (rc != PCRE_ERROR_NOMATCH || anchored) return rc;

  /* Advance to the next subject character unless we are at the end of a line
  and firstline is set. */

  if (firstline && IS_NEWLINE(current_subject)) break;
  current_subject++;
  if (utf8)
    {
    while (current_subject < end_subject && (*current_subject & 0xc0) == 0x80)
      current_subject++;
    }
  if (current_subject > end_subject) break;

  /* If we have just passed a CR and we are now at a LF, and the pattern does
  not contain any explicit matches for \r or \n, and the newline option is CRLF
  or ANY or ANYCRLF, advance the match position by one more character. */

  if (current_subject[-1] == CHAR_CR &&
      current_subject < end_subject &&
      *current_subject == CHAR_NL &&
      (re->flags & PCRE_HASCRORLF) == 0 &&
        (md->nltype == NLTYPE_ANY ||
         md->nltype == NLTYPE_ANYCRLF ||
         md->nllen == 2))
    current_subject++;

  }   /* "Bumpalong" loop */

return PCRE_ERROR_NOMATCH;
}

/* End of pcre_dfa_exec.c */
