Contents of /code/trunk/pcre_dfa_exec.c

Revision 642
Thu Jul 28 18:59:40 2011 UTC by ph10
File MIME type: text/plain
File size: 119452 byte(s)
Avoid false positive for infinite recursion by not checking conditionals at 
compile time, but add tests at runtime that also catch infinite mutual 
recursion.
/*************************************************
*      Perl-Compatible Regular Expressions       *
*************************************************/

/* PCRE is a library of functions to support regular expressions whose syntax
and semantics are as close as possible to those of the Perl 5 language (but see
below for why this module is different).

                       Written by Philip Hazel
           Copyright (c) 1997-2011 University of Cambridge

-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
      this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    * Neither the name of the University of Cambridge nor the names of its
      contributors may be used to endorse or promote products derived from
      this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
-----------------------------------------------------------------------------
*/


/* This module contains the external function pcre_dfa_exec(), which is an
alternative matching function that uses a sort of DFA algorithm (not a true
FSM). This is NOT Perl-compatible, but it has advantages in certain
applications. */


/* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
the performance of his patterns greatly. I could not use it as it stood, as it
was not thread safe, and made assumptions about pattern sizes. Also, it caused
test 7 to loop, and test 9 to crash with a segfault.

The issue is the check for duplicate states, which is done by a simple linear
search up the state list. (Grep for "duplicate" below to find the code.) For
many patterns, there will never be many states active at one time, so a simple
linear search is fine. In patterns that have many active states, it might be a
bottleneck. The suggested code used an indexing scheme to remember which states
had previously been used for each character, and avoided the linear search when
it knew there was no chance of a duplicate. This was implemented when adding
states to the state lists.

I wrote some thread-safe, not-limited code to try something similar at the time
of checking for duplicates (instead of when adding states), using index vectors
on the stack. It did give a 13% improvement with one specially constructed
pattern for certain subject strings, but on other strings and on many of the
simpler patterns in the test suite it did worse. The major problem, I think,
was the extra time to initialize the index. This had to be done for each call
of internal_dfa_exec(). (The supplied patch used a static vector, initialized
only once - I suspect this was the cause of the problems with the tests.)

Overall, I concluded that the gains in some cases did not outweigh the losses
in others, so I abandoned this code. */
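
/* Illustration only, not part of PCRE: a minimal sketch of the linear
duplicate check that the performance note refers to. The cut-down x_state type
and the names x_is_duplicate_state() and x_duplicate_selfcheck() are
hypothetical stand-ins; in this module the check is done inline on stateblock
entries inside internal_dfa_exec(). */

```c
#include <assert.h>

/* Hypothetical cut-down state: only the two fields the check compares. */
typedef struct x_state {
  int offset;   /* Offset to opcode */
  int count;    /* Count for repeats */
} x_state;

/* Return 1 if states[i] repeats an earlier entry with the same offset and
count, scanning linearly over entries 0..i-1. */
static int x_is_duplicate_state(const x_state *states, int i)
{
  int j;
  for (j = 0; j < i; j++)
    {
    if (states[j].offset == states[i].offset &&
        states[j].count == states[i].count)
      return 1;
    }
  return 0;
}

/* Small self-check showing the intended behaviour. */
void x_duplicate_selfcheck(void)
{
  x_state s[] = { {10, 0}, {14, 0}, {10, 0} };
  assert(!x_is_duplicate_state(s, 1));
  assert(x_is_duplicate_state(s, 2));
}
```

/* The cost of each check grows with the number of states already on the list,
which is why it only becomes a concern for patterns that keep many states
active at once. */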



#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#define NLBLOCK md             /* Block containing newline information */
#define PSSTART start_subject  /* Field containing processed string start */
#define PSEND   end_subject    /* Field containing processed string end */

#include "pcre_internal.h"


/* For use to indent debugging output */

#define SP " "


/*************************************************
*      Code parameters and static tables         *
*************************************************/

/* These are offsets that are used to turn the OP_TYPESTAR and friends opcodes
into others, under special conditions. A gap of 20 between the blocks should be
enough. The resulting opcodes don't have to be less than 256 because they are
never stored, so we push them well clear of the normal opcodes. */

#define OP_PROP_EXTRA       300
#define OP_EXTUNI_EXTRA     320
#define OP_ANYNL_EXTRA      340
#define OP_HSPACE_EXTRA     360
#define OP_VSPACE_EXTRA     380


/* This table identifies those opcodes that are followed immediately by a
character that is to be tested in some way. This makes it possible to
centralize the loading of these characters. In the case of Type * etc, the
"character" is the opcode for \D, \d, \S, \s, \W, or \w, which will always be a
small value. Non-zero values in the table are the offsets from the opcode where
the character is to be found. ***NOTE*** If the start of this table is
modified, the three tables that follow must also be modified. */

static const uschar coptable[] = {
  0,                             /* End */
  0, 0, 0, 0, 0,                 /* \A, \G, \K, \B, \b */
  0, 0, 0, 0, 0, 0,              /* \D, \d, \S, \s, \W, \w */
  0, 0, 0,                       /* Any, AllAny, Anybyte */
  0, 0,                          /* \P, \p */
  0, 0, 0, 0, 0,                 /* \R, \H, \h, \V, \v */
  0,                             /* \X */
  0, 0, 0, 0, 0, 0,              /* \Z, \z, ^, ^M, $, $M */
  1,                             /* Char */
  1,                             /* Chari */
  1,                             /* not */
  1,                             /* noti */
  /* Positive single-char repeats */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ?? */
  3, 3, 3,                       /* upto, minupto, exact */
  1, 1, 1, 3,                    /* *+, ++, ?+, upto+ */
  1, 1, 1, 1, 1, 1,              /* *I, *?I, +I, +?I, ?I, ??I */
  3, 3, 3,                       /* upto I, minupto I, exact I */
  1, 1, 1, 3,                    /* *+I, ++I, ?+I, upto+I */
  /* Negative single-char repeats - only for chars < 256 */
  1, 1, 1, 1, 1, 1,              /* NOT *, *?, +, +?, ?, ?? */
  3, 3, 3,                       /* NOT upto, minupto, exact */
  1, 1, 1, 3,                    /* NOT *+, ++, ?+, upto+ */
  1, 1, 1, 1, 1, 1,              /* NOT *I, *?I, +I, +?I, ?I, ??I */
  3, 3, 3,                       /* NOT upto I, minupto I, exact I */
  1, 1, 1, 3,                    /* NOT *+I, ++I, ?+I, upto+I */
  /* Positive type repeats */
  1, 1, 1, 1, 1, 1,              /* Type *, *?, +, +?, ?, ?? */
  3, 3, 3,                       /* Type upto, minupto, exact */
  1, 1, 1, 3,                    /* Type *+, ++, ?+, upto+ */
  /* Character class & ref repeats */
  0, 0, 0, 0, 0, 0,              /* *, *?, +, +?, ?, ?? */
  0, 0,                          /* CRRANGE, CRMINRANGE */
  0,                             /* CLASS */
  0,                             /* NCLASS */
  0,                             /* XCLASS - variable length */
  0,                             /* REF */
  0,                             /* REFI */
  0,                             /* RECURSE */
  0,                             /* CALLOUT */
  0,                             /* Alt */
  0,                             /* Ket */
  0,                             /* KetRmax */
  0,                             /* KetRmin */
  0,                             /* KetRpos */
  0,                             /* Reverse */
  0,                             /* Assert */
  0,                             /* Assert not */
  0,                             /* Assert behind */
  0,                             /* Assert behind not */
  0, 0, 0, 0, 0, 0,              /* ONCE, BRA, BRAPOS, CBRA, CBRAPOS, COND */
  0, 0, 0, 0, 0,                 /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
  0, 0,                          /* CREF, NCREF */
  0, 0,                          /* RREF, NRREF */
  0,                             /* DEF */
  0, 0, 0,                       /* BRAZERO, BRAMINZERO, BRAPOSZERO */
  0, 0, 0,                       /* MARK, PRUNE, PRUNE_ARG */
  0, 0, 0, 0,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG */
  0, 0, 0, 0,                    /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
  0, 0                           /* CLOSE, SKIPZERO */
};

/* This table identifies those opcodes that inspect a character. It is used to
remember the fact that a character could have been inspected when the end of
the subject is reached. ***NOTE*** If the start of this table is modified, the
two tables that follow must also be modified. */

static const uschar poptable[] = {
  0,                             /* End */
  0, 0, 0, 1, 1,                 /* \A, \G, \K, \B, \b */
  1, 1, 1, 1, 1, 1,              /* \D, \d, \S, \s, \W, \w */
  1, 1, 1,                       /* Any, AllAny, Anybyte */
  1, 1,                          /* \P, \p */
  1, 1, 1, 1, 1,                 /* \R, \H, \h, \V, \v */
  1,                             /* \X */
  0, 0, 0, 0, 0, 0,              /* \Z, \z, ^, ^M, $, $M */
  1,                             /* Char */
  1,                             /* Chari */
  1,                             /* not */
  1,                             /* noti */
  /* Positive single-char repeats */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ?? */
  1, 1, 1,                       /* upto, minupto, exact */
  1, 1, 1, 1,                    /* *+, ++, ?+, upto+ */
  1, 1, 1, 1, 1, 1,              /* *I, *?I, +I, +?I, ?I, ??I */
  1, 1, 1,                       /* upto I, minupto I, exact I */
  1, 1, 1, 1,                    /* *+I, ++I, ?+I, upto+I */
  /* Negative single-char repeats - only for chars < 256 */
  1, 1, 1, 1, 1, 1,              /* NOT *, *?, +, +?, ?, ?? */
  1, 1, 1,                       /* NOT upto, minupto, exact */
  1, 1, 1, 1,                    /* NOT *+, ++, ?+, upto+ */
  1, 1, 1, 1, 1, 1,              /* NOT *I, *?I, +I, +?I, ?I, ??I */
  1, 1, 1,                       /* NOT upto I, minupto I, exact I */
  1, 1, 1, 1,                    /* NOT *+I, ++I, ?+I, upto+I */
  /* Positive type repeats */
  1, 1, 1, 1, 1, 1,              /* Type *, *?, +, +?, ?, ?? */
  1, 1, 1,                       /* Type upto, minupto, exact */
  1, 1, 1, 1,                    /* Type *+, ++, ?+, upto+ */
  /* Character class & ref repeats */
  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ?? */
  1, 1,                          /* CRRANGE, CRMINRANGE */
  1,                             /* CLASS */
  1,                             /* NCLASS */
  1,                             /* XCLASS - variable length */
  0,                             /* REF */
  0,                             /* REFI */
  0,                             /* RECURSE */
  0,                             /* CALLOUT */
  0,                             /* Alt */
  0,                             /* Ket */
  0,                             /* KetRmax */
  0,                             /* KetRmin */
  0,                             /* KetRpos */
  0,                             /* Reverse */
  0,                             /* Assert */
  0,                             /* Assert not */
  0,                             /* Assert behind */
  0,                             /* Assert behind not */
  0, 0, 0, 0, 0, 0,              /* ONCE, BRA, BRAPOS, CBRA, CBRAPOS, COND */
  0, 0, 0, 0, 0,                 /* SBRA, SBRAPOS, SCBRA, SCBRAPOS, SCOND */
  0, 0,                          /* CREF, NCREF */
  0, 0,                          /* RREF, NRREF */
  0,                             /* DEF */
  0, 0, 0,                       /* BRAZERO, BRAMINZERO, BRAPOSZERO */
  0, 0, 0,                       /* MARK, PRUNE, PRUNE_ARG */
  0, 0, 0, 0,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG */
  0, 0, 0, 0,                    /* COMMIT, FAIL, ACCEPT, ASSERT_ACCEPT */
  0, 0                           /* CLOSE, SKIPZERO */
};

/* These 2 tables allow for compact code for testing for \D, \d, \S, \s, \W,
and \w */

static const uschar toptable1[] = {
  0, 0, 0, 0, 0, 0,
  ctype_digit, ctype_digit,
  ctype_space, ctype_space,
  ctype_word,  ctype_word,
  0, 0                            /* OP_ANY, OP_ALLANY */
};

static const uschar toptable2[] = {
  0, 0, 0, 0, 0, 0,
  ctype_digit, 0,
  ctype_space, 0,
  ctype_word,  0,
  1, 1                            /* OP_ANY, OP_ALLANY */
};
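
/* Illustration only, not part of PCRE: a minimal sketch of how the two tables
combine into one test, written as ((ctypes[c] & toptable1[op]) ^ toptable2[op])
!= 0 further down in this module. Everything prefixed x_ or X_ is a
hypothetical stand-in: the bit values, the opcode numbers (assumed to follow
the order of the tables above, so \D is entry 6), and the use of the C library
in place of PCRE's precomputed ctypes table. */

```c
#include <assert.h>
#include <ctype.h>

/* Assumed class bits and opcode numbers, for illustration only. */
enum { X_CTYPE_SPACE = 0x01, X_CTYPE_DIGIT = 0x04, X_CTYPE_WORD = 0x10 };
enum { X_NOT_DIGIT = 6, X_DIGIT, X_NOT_WHITESPACE, X_WHITESPACE,
       X_NOT_WORDCHAR, X_WORDCHAR, X_ANY, X_ALLANY };

static const unsigned char x_toptable1[] = {
  0, 0, 0, 0, 0, 0,
  X_CTYPE_DIGIT, X_CTYPE_DIGIT,
  X_CTYPE_SPACE, X_CTYPE_SPACE,
  X_CTYPE_WORD,  X_CTYPE_WORD,
  0, 0                            /* any, allany */
};

static const unsigned char x_toptable2[] = {
  0, 0, 0, 0, 0, 0,
  X_CTYPE_DIGIT, 0,
  X_CTYPE_SPACE, 0,
  X_CTYPE_WORD,  0,
  1, 1                            /* any, allany */
};

/* Build the class bits for one byte with the C library. */
static unsigned char x_ctype(unsigned char c)
{
  unsigned char bits = 0;
  if (isspace(c)) bits |= X_CTYPE_SPACE;
  if (isdigit(c)) bits |= X_CTYPE_DIGIT;
  if (isalnum(c) || c == '_') bits |= X_CTYPE_WORD;
  return bits;
}

/* Mask with table 1, then XOR with table 2. For the negated opcodes the XOR
flips the selected bit, so one expression serves both polarities. */
static int x_type_matches(int op, unsigned char c)
{
  return ((x_ctype(c) & x_toptable1[op]) ^ x_toptable2[op]) != 0;
}

/* Small self-check showing the intended behaviour. */
void x_type_selfcheck(void)
{
  assert(x_type_matches(X_DIGIT, '7'));
  assert(!x_type_matches(X_NOT_DIGIT, '7'));
}
```

/* In this sketch the entries for the any-character opcodes mask everything
out and then XOR with 1, so they always give a non-zero result. */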


/* Structure for holding data about a particular state, which is in effect the
current data for an active path through the match tree. It must consist
entirely of ints because the working vector we are passed, and which we put
these structures in, is a vector of ints. */

typedef struct stateblock {
  int offset;                     /* Offset to opcode */
  int count;                      /* Count for repeats */
  int data;                       /* Some use extra data */
} stateblock;

#define INTS_PER_STATEBLOCK  (sizeof(stateblock)/sizeof(int))
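
/* Illustration only, not part of PCRE: a minimal sketch of how a workspace of
ints can be divided into two equal state lists, following the arithmetic that
internal_dfa_exec() applies to wscount below. The x_ names are hypothetical
stand-ins. The first two ints are kept back because the function uses
workspace[0] and workspace[1] for its restart information. */

```c
#include <assert.h>

typedef struct x_stateblock {
  int offset;                     /* Offset to opcode */
  int count;                      /* Count for repeats */
  int data;                       /* Some use extra data */
} x_stateblock;

#define X_INTS_PER_STATEBLOCK  ((int)(sizeof(x_stateblock)/sizeof(int)))

/* Number of states that each of the two lists can hold. */
static int x_states_per_list(int wscount)
{
  wscount -= 2;                   /* Two ints reserved at the front */
  return (wscount - (wscount % (X_INTS_PER_STATEBLOCK * 2))) /
         (2 * X_INTS_PER_STATEBLOCK);
}

/* Point the two lists into the workspace and return the capacity of each. */
static int x_split_workspace(int *workspace, int wscount,
  x_stateblock **active, x_stateblock **fresh)
{
  int capacity = x_states_per_list(wscount);
  *active = (x_stateblock *)(workspace + 2);
  *fresh = *active + capacity;
  return capacity;
}

/* Small self-check: 20 ints leave room for two lists of 3 states each. */
void x_workspace_selfcheck(void)
{
  int ws[20];
  x_stateblock *active, *fresh;
  assert(x_split_workspace(ws, 20, &active, &fresh) == 3);
  assert((int *)fresh - ws == 2 + 3 * X_INTS_PER_STATEBLOCK);
}
```

/* Any ints left over after the division are simply unused, so a workspace
whose size is 2 plus a multiple of 6 wastes nothing under these assumptions. */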


#ifdef PCRE_DEBUG
/*************************************************
*             Print character string             *
*************************************************/

/* Character string printing function for debugging.

Arguments:
  p            points to string
  length       number of bytes
  f            where to print

Returns:       nothing
*/

static void
pchars(unsigned char *p, int length, FILE *f)
{
int c;
while (length-- > 0)
  {
  if (isprint(c = *(p++)))
    fprintf(f, "%c", c);
  else
    fprintf(f, "\\x%02x", c);
  }
}
#endif



/*************************************************
*    Execute a Regular Expression - DFA engine   *
*************************************************/

/* This internal function applies a compiled pattern to a subject string,
starting at a given point, using a DFA engine. This function is called from the
external one, possibly multiple times if the pattern is not anchored. The
function calls itself recursively for some kinds of subpattern.

Arguments:
  md                the match_data block with fixed information
  this_start_code   the opening bracket of this subexpression's code
  current_subject   where we currently are in the subject string
  start_offset      start offset in the subject string
  offsets           vector to contain the matching string offsets
  offsetcount       size of same
  workspace         vector of workspace
  wscount           size of same
  rlevel            function call recursion level

Returns:            > 0 => number of match offset pairs placed in offsets
                    = 0 => offsets overflowed; longest matches are present
                     -1 => failed to match
                   < -1 => some kind of unexpected problem

The following macros are used for adding states to the two state vectors (one
for the current character, one for the following character). */

#define ADD_ACTIVE(x,y) \
  if (active_count++ < wscount) \
    { \
    next_active_state->offset = (x); \
    next_active_state->count  = (y); \
    next_active_state++; \
    DPRINTF(("%.*sADD_ACTIVE(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_ACTIVE_DATA(x,y,z) \
  if (active_count++ < wscount) \
    { \
    next_active_state->offset = (x); \
    next_active_state->count  = (y); \
    next_active_state->data   = (z); \
    next_active_state++; \
    DPRINTF(("%.*sADD_ACTIVE_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_NEW(x,y) \
  if (new_count++ < wscount) \
    { \
    next_new_state->offset = (x); \
    next_new_state->count  = (y); \
    next_new_state++; \
    DPRINTF(("%.*sADD_NEW(%d,%d)\n", rlevel*2-2, SP, (x), (y))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

#define ADD_NEW_DATA(x,y,z) \
  if (new_count++ < wscount) \
    { \
    next_new_state->offset = (x); \
    next_new_state->count  = (y); \
    next_new_state->data   = (z); \
    next_new_state++; \
    DPRINTF(("%.*sADD_NEW_DATA(%d,%d,%d)\n", rlevel*2-2, SP, (x), (y), (z))); \
    } \
  else return PCRE_ERROR_DFA_WSSIZE

/* And now, here is the code */

static int
internal_dfa_exec(
  dfa_match_data *md,
  const uschar *this_start_code,
  const uschar *current_subject,
  int start_offset,
  int *offsets,
  int offsetcount,
  int *workspace,
  int wscount,
  int  rlevel)
{
stateblock *active_states, *new_states, *temp_states;
stateblock *next_active_state, *next_new_state;

const uschar *ctypes, *lcc, *fcc;
const uschar *ptr;
const uschar *end_code, *first_op;

dfa_recursion_info new_recursive;

int active_count, new_count, match_count;

/* Some fields in the md block are frequently referenced, so we load them into
independent variables in the hope that this will perform better. */

const uschar *start_subject = md->start_subject;
const uschar *end_subject = md->end_subject;
const uschar *start_code = md->start_code;

#ifdef SUPPORT_UTF8
BOOL utf8 = (md->poptions & PCRE_UTF8) != 0;
#else
BOOL utf8 = FALSE;
#endif

rlevel++;
offsetcount &= (-2);

wscount -= 2;
wscount = (wscount - (wscount % (INTS_PER_STATEBLOCK * 2))) /
          (2 * INTS_PER_STATEBLOCK);

DPRINTF(("\n%.*s---------------------\n"
  "%.*sCall to internal_dfa_exec f=%d\n",
  rlevel*2-2, SP, rlevel*2-2, SP, rlevel));

ctypes = md->tables + ctypes_offset;
lcc = md->tables + lcc_offset;
fcc = md->tables + fcc_offset;

match_count = PCRE_ERROR_NOMATCH;   /* A negative number */

active_states = (stateblock *)(workspace + 2);
next_new_state = new_states = active_states + wscount;
new_count = 0;

first_op = this_start_code + 1 + LINK_SIZE +
  ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
    *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)? 2:0);

/* The first thing in any (sub) pattern is a bracket of some sort. Push all
the alternative states onto the list, and find out where the end is. This
makes it possible to use this function recursively, when we want to stop at a
matching internal ket rather than at the end.

If the first opcode in the first alternative is OP_REVERSE, we are dealing with
a backward assertion. In that case, we have to find out the maximum amount to
move back, and set up each alternative appropriately. */

if (*first_op == OP_REVERSE)
  {
  int max_back = 0;
  int gone_back;

  end_code = this_start_code;
  do
    {
    int back = GET(end_code, 2+LINK_SIZE);
    if (back > max_back) max_back = back;
    end_code += GET(end_code, 1);
    }
  while (*end_code == OP_ALT);

  /* If we can't go back the amount required for the longest lookbehind
  pattern, go back as far as we can; some alternatives may still be viable. */

#ifdef SUPPORT_UTF8
  /* In character mode we have to step back character by character */

  if (utf8)
    {
    for (gone_back = 0; gone_back < max_back; gone_back++)
      {
      if (current_subject <= start_subject) break;
      current_subject--;
      while (current_subject > start_subject &&
             (*current_subject & 0xc0) == 0x80)
        current_subject--;
      }
    }
  else
#endif

  /* In byte-mode we can do this quickly. */

    {
    gone_back = (current_subject - max_back < start_subject)?
      (int)(current_subject - start_subject) : max_back;
    current_subject -= gone_back;
    }

  /* Save the earliest consulted character */

  if (current_subject < md->start_used_ptr)
    md->start_used_ptr = current_subject;

  /* Now we can process the individual branches. */

  end_code = this_start_code;
  do
    {
    int back = GET(end_code, 2+LINK_SIZE);
    if (back <= gone_back)
      {
      int bstate = (int)(end_code - start_code + 2 + 2*LINK_SIZE);
      ADD_NEW_DATA(-bstate, 0, gone_back - back);
      }
    end_code += GET(end_code, 1);
    }
  while (*end_code == OP_ALT);
  }

/* This is the code for a "normal" subpattern (not a backward assertion). The
start of a whole pattern is always one of these. If we are at the top level,
we may be asked to restart matching from the same point that we reached for a
previous partial match. We still have to scan through the top-level branches to
find the end state. */

else
  {
  end_code = this_start_code;

  /* Restarting */

  if (rlevel == 1 && (md->moptions & PCRE_DFA_RESTART) != 0)
    {
    do { end_code += GET(end_code, 1); } while (*end_code == OP_ALT);
    new_count = workspace[1];
    if (!workspace[0])
      memcpy(new_states, active_states, new_count * sizeof(stateblock));
    }

  /* Not restarting */

  else
    {
    int length = 1 + LINK_SIZE +
      ((*this_start_code == OP_CBRA || *this_start_code == OP_SCBRA ||
        *this_start_code == OP_CBRAPOS || *this_start_code == OP_SCBRAPOS)?
        2:0);
    do
      {
      ADD_NEW((int)(end_code - start_code + length), 0);
      end_code += GET(end_code, 1);
      length = 1 + LINK_SIZE;
      }
    while (*end_code == OP_ALT);
    }
  }

workspace[0] = 0;    /* Bit indicating which vector is current */

DPRINTF(("%.*sEnd state = %d\n", rlevel*2-2, SP, end_code - start_code));

/* Loop for scanning the subject */

ptr = current_subject;
for (;;)
  {
  int i, j;
  int clen, dlen;
  unsigned int c, d;
  int forced_fail = 0;
  BOOL could_continue = FALSE;

  /* Make the new state list into the active state list and empty the
  new state list. */

  temp_states = active_states;
  active_states = new_states;
  new_states = temp_states;
  active_count = new_count;
  new_count = 0;

  workspace[0] ^= 1;              /* Remember for the restarting feature */
  workspace[1] = active_count;

#ifdef PCRE_DEBUG
  printf("%.*sNext character: rest of subject = \"", rlevel*2-2, SP);
  pchars((uschar *)ptr, strlen((char *)ptr), stdout);
  printf("\"\n");

  printf("%.*sActive states: ", rlevel*2-2, SP);
  for (i = 0; i < active_count; i++)
    printf("%d/%d ", active_states[i].offset, active_states[i].count);
  printf("\n");
#endif

  /* Set the pointers for adding new states */

  next_active_state = active_states + active_count;
  next_new_state = new_states;

  /* Load the current character from the subject outside the loop, as many
  different states may want to look at it, and we assume that at least one
  will. */

  if (ptr < end_subject)
    {
    clen = 1;        /* Number of bytes in the character */
#ifdef SUPPORT_UTF8
    if (utf8) { GETCHARLEN(c, ptr, clen); } else
#endif  /* SUPPORT_UTF8 */
    c = *ptr;
    }
  else
    {
    clen = 0;        /* This indicates the end of the subject */
    c = NOTACHAR;    /* This value should never actually be used */
    }

  /* Scan up the active states and act on each one. The result of an action
  may be to add more states to the currently active list (e.g. on hitting a
  parenthesis) or it may be to put states on the new list, for considering
  when we move the character pointer on. */

  for (i = 0; i < active_count; i++)
    {
    stateblock *current_state = active_states + i;
    BOOL caseless = FALSE;
    const uschar *code;
    int state_offset = current_state->offset;
    int count, codevalue, rrc;

#ifdef PCRE_DEBUG
    printf ("%.*sProcessing state %d c=", rlevel*2-2, SP, state_offset);
    if (clen == 0) printf("EOL\n");
      else if (c > 32 && c < 127) printf("'%c'\n", c);
        else printf("0x%02x\n", c);
#endif

    /* A negative offset is a special case meaning "hold off going to this
    (negated) state until the number of characters in the data field have
    been skipped". */

    if (state_offset < 0)
      {
      if (current_state->data > 0)
        {
        DPRINTF(("%.*sSkipping this character\n", rlevel*2-2, SP));
        ADD_NEW_DATA(state_offset, current_state->count,
          current_state->data - 1);
        continue;
        }
      else
        {
        current_state->offset = state_offset = -state_offset;
        }
      }

    /* Check for a duplicate state with the same count, and skip if found.
    See the note at the head of this module about the possibility of improving
    performance here. */

    for (j = 0; j < i; j++)
      {
      if (active_states[j].offset == state_offset &&
          active_states[j].count == current_state->count)
        {
        DPRINTF(("%.*sDuplicate state: skipped\n", rlevel*2-2, SP));
        goto NEXT_ACTIVE_STATE;
        }
      }

    /* The state offset is the offset to the opcode */

    code = start_code + state_offset;
    codevalue = *code;

    /* If this opcode inspects a character, but we are at the end of the
    subject, remember the fact for use when testing for a partial match. */

    if (clen == 0 && poptable[codevalue] != 0)
      could_continue = TRUE;

    /* If this opcode is followed by an inline character, load it. It is
    tempting to test for the presence of a subject character here, but that
    is wrong, because sometimes zero repetitions of the subject are
    permitted.

    We also use this mechanism for opcodes such as OP_TYPEPLUS that take an
    argument that is not a data character - but is always one byte long. We
    have to take special action to deal with \P, \p, \H, \h, \V, \v and \X in
    this case. To keep the other cases fast, convert these ones to new opcodes.
    */

    if (coptable[codevalue] > 0)
      {
      dlen = 1;
#ifdef SUPPORT_UTF8
      if (utf8) { GETCHARLEN(d, (code + coptable[codevalue]), dlen); } else
#endif  /* SUPPORT_UTF8 */
      d = code[coptable[codevalue]];
      if (codevalue >= OP_TYPESTAR)
        {
        switch(d)
          {
          case OP_ANYBYTE: return PCRE_ERROR_DFA_UITEM;
          case OP_NOTPROP:
          case OP_PROP: codevalue += OP_PROP_EXTRA; break;
          case OP_ANYNL: codevalue += OP_ANYNL_EXTRA; break;
          case OP_EXTUNI: codevalue += OP_EXTUNI_EXTRA; break;
          case OP_NOT_HSPACE:
          case OP_HSPACE: codevalue += OP_HSPACE_EXTRA; break;
          case OP_NOT_VSPACE:
          case OP_VSPACE: codevalue += OP_VSPACE_EXTRA; break;
          default: break;
          }
        }
      }
    else
      {
      dlen = 0;         /* Not strictly necessary, but compilers moan */
      d = NOTACHAR;     /* if these variables are not set. */
      }


    /* Now process the individual opcodes */

    switch (codevalue)
      {
/* ========================================================================== */
      /* These cases are never obeyed. This is a fudge that causes a compile-
      time error if the vectors coptable or poptable, which are indexed by
      opcode, are not the correct length. It seems to be the only way to do
      such a check at compile time, as the sizeof() operator does not work
      in the C preprocessor. */

      case OP_TABLE_LENGTH:
      case OP_TABLE_LENGTH +
        ((sizeof(coptable) == OP_TABLE_LENGTH) &&
         (sizeof(poptable) == OP_TABLE_LENGTH)):
      break;

/* ========================================================================== */
      /* Reached a closing bracket. If not at the end of the pattern, carry
      on with the next opcode. For repeating opcodes, also add the repeat
      state. Note that KETRPOS will always be encountered at the end of the
      subpattern, because the possessive subpattern repeats are always handled
      using recursive calls. Thus, it never adds any new states.

      At the end of the (sub)pattern, unless we have an empty string and
      PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
      start of the subject, save the match data, shifting up all previous
      matches so we always have the longest first. */

      case OP_KET:
      case OP_KETRMIN:
      case OP_KETRMAX:
      case OP_KETRPOS:
      if (code != end_code)
        {
        ADD_ACTIVE(state_offset + 1 + LINK_SIZE, 0);
        if (codevalue != OP_KET)
          {
          ADD_ACTIVE(state_offset - GET(code, 1), 0);
          }
        }
      else
        {
        if (ptr > current_subject ||
            ((md->moptions & PCRE_NOTEMPTY) == 0 &&
              ((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
                current_subject > start_subject + md->start_offset)))
          {
          if (match_count < 0) match_count = (offsetcount >= 2)? 1 : 0;
            else if (match_count > 0 && ++match_count * 2 >= offsetcount)
              match_count = 0;
          count = ((match_count == 0)? offsetcount : match_count * 2) - 2;
          if (count > 0) memmove(offsets + 2, offsets, count * sizeof(int));
          if (offsetcount >= 2)
            {
            offsets[0] = (int)(current_subject - start_subject);
            offsets[1] = (int)(ptr - start_subject);
            DPRINTF(("%.*sSet matched string = \"%.*s\"\n", rlevel*2-2, SP,
              offsets[1] - offsets[0], current_subject));
            }
          if ((md->moptions & PCRE_DFA_SHORTEST) != 0)
            {
            DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
              "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel,
              match_count, rlevel*2-2, SP));
            return match_count;
            }
          }
        }
      break;

/* ========================================================================== */
      /* These opcodes add to the current list of states without looking
      at the current character. */

      /*-----------------------------------------------------------------*/
      case OP_ALT:
      do { code += GET(code, 1); } while (*code == OP_ALT);
      ADD_ACTIVE((int)(code - start_code), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_BRA:
      case OP_SBRA:
      do
        {
        ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
        code += GET(code, 1);
        }
      while (*code == OP_ALT);
      break;

      /*-----------------------------------------------------------------*/
      case OP_CBRA:
      case OP_SCBRA:
      ADD_ACTIVE((int)(code - start_code + 3 + LINK_SIZE), 0);
      code += GET(code, 1);
      while (*code == OP_ALT)
        {
        ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
        code += GET(code, 1);
        }
      break;

      /*-----------------------------------------------------------------*/
      case OP_BRAZERO:
      case OP_BRAMINZERO:
      ADD_ACTIVE(state_offset + 1, 0);
      code += 1 + GET(code, 2);
      while (*code == OP_ALT) code += GET(code, 1);
      ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_SKIPZERO:
      code += 1 + GET(code, 2);
      while (*code == OP_ALT) code += GET(code, 1);
      ADD_ACTIVE((int)(code - start_code + 1 + LINK_SIZE), 0);
      break;

      /*-----------------------------------------------------------------*/
      case OP_CIRC:
      if (ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0)
        { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_CIRCM:
      if ((ptr == start_subject && (md->moptions & PCRE_NOTBOL) == 0) ||
          (ptr != end_subject && WAS_NEWLINE(ptr)))
        { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_EOD:
      if (ptr >= end_subject)
        {
        if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
          could_continue = TRUE;
        else { ADD_ACTIVE(state_offset + 1, 0); }
        }
      break;

      /*-----------------------------------------------------------------*/
      case OP_SOD:
      if (ptr == start_subject) { ADD_ACTIVE(state_offset + 1, 0); }
      break;

      /*-----------------------------------------------------------------*/
      case OP_SOM:
      if (ptr == start_subject + start_offset) { ADD_ACTIVE(state_offset + 1, 0); }
      break;


876 /* ========================================================================== */
877 /* These opcodes inspect the next subject character, and sometimes
878 the previous one as well, but do not have an argument. The variable
879 clen contains the length of the current character and is zero if we are
880 at the end of the subject. */
881
882 /*-----------------------------------------------------------------*/
883 case OP_ANY:
884 if (clen > 0 && !IS_NEWLINE(ptr))
885 { ADD_NEW(state_offset + 1, 0); }
886 break;
887
888 /*-----------------------------------------------------------------*/
889 case OP_ALLANY:
890 if (clen > 0)
891 { ADD_NEW(state_offset + 1, 0); }
892 break;
893
894 /*-----------------------------------------------------------------*/
895 case OP_EODN:
896 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
897 could_continue = TRUE;
898 else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
899 { ADD_ACTIVE(state_offset + 1, 0); }
900 break;
901
902 /*-----------------------------------------------------------------*/
903 case OP_DOLL:
904 if ((md->moptions & PCRE_NOTEOL) == 0)
905 {
906 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
907 could_continue = TRUE;
908 else if (clen == 0 ||
909 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
910 (ptr == end_subject - md->nllen)
911 ))
912 { ADD_ACTIVE(state_offset + 1, 0); }
913 }
914 break;
915
916 /*-----------------------------------------------------------------*/
917 case OP_DOLLM:
918 if ((md->moptions & PCRE_NOTEOL) == 0)
919 {
920 if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
921 could_continue = TRUE;
922 else if (clen == 0 ||
923 ((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr)))
924 { ADD_ACTIVE(state_offset + 1, 0); }
925 }
926 else if (IS_NEWLINE(ptr))
927 { ADD_ACTIVE(state_offset + 1, 0); }
928 break;
929
930 /*-----------------------------------------------------------------*/
931
932 case OP_DIGIT:
933 case OP_WHITESPACE:
934 case OP_WORDCHAR:
935 if (clen > 0 && c < 256 &&
936 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0)
937 { ADD_NEW(state_offset + 1, 0); }
938 break;
939
940 /*-----------------------------------------------------------------*/
941 case OP_NOT_DIGIT:
942 case OP_NOT_WHITESPACE:
943 case OP_NOT_WORDCHAR:
944 if (clen > 0 && (c >= 256 ||
945 ((ctypes[c] & toptable1[codevalue]) ^ toptable2[codevalue]) != 0))
946 { ADD_NEW(state_offset + 1, 0); }
947 break;
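/* Illustration only, not part of PCRE: a minimal sketch of the mask-and-XOR
class test used by the two groups of cases above. The CT_* bits, the T_*
indices and the mask[]/flip[] contents are assumptions made for this example;
they stand in for ctypes[], toptable1[] and toptable2[], which are defined
elsewhere. The idea is that a negated type uses the same mask bit as its
positive partner, and XORs that bit back in so the result is non-zero exactly
when the bit is absent. */

```c
#include <assert.h>
#include <ctype.h>

enum { CT_DIGIT = 0x04, CT_WORD = 0x10 };          /* hypothetical ctype bits */
enum { T_NOT_DIGIT, T_DIGIT, T_NOT_WORD, T_WORD, T_COUNT };

/* mask[] selects the ctype bit to inspect; flip[] holds the same bit for a
   negated type and zero for a positive type. */
static const unsigned char mask[T_COUNT] = { CT_DIGIT, CT_DIGIT, CT_WORD, CT_WORD };
static const unsigned char flip[T_COUNT] = { CT_DIGIT, 0,        CT_WORD, 0 };

/* Build the class bits for one byte (stand-in for a ctypes[] lookup). */
static unsigned char ctype_of(unsigned char c)
{
  unsigned char bits = 0;
  if (isdigit(c)) bits |= CT_DIGIT;
  if (isalnum(c) || c == '_') bits |= CT_WORD;
  return bits;
}

/* Non-zero if byte c belongs to the (possibly negated) class `type`. */
static int type_matches(int type, unsigned char c)
{
  assert(type >= 0 && type < T_COUNT);
  return ((ctype_of(c) & mask[type]) ^ flip[type]) != 0;
}
```

/* With these tables one expression serves both the positive and the negated
type, so no extra branch on the opcode is needed. */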
948
949 /*-----------------------------------------------------------------*/
950 case OP_WORD_BOUNDARY:
951 case OP_NOT_WORD_BOUNDARY:
952 {
953 int left_word, right_word;
954
955 if (ptr > start_subject)
956 {
957 const uschar *temp = ptr - 1;
958 if (temp < md->start_used_ptr) md->start_used_ptr = temp;
959 #ifdef SUPPORT_UTF8
960 if (utf8) BACKCHAR(temp);
961 #endif
962 GETCHARTEST(d, temp);
963 #ifdef SUPPORT_UCP
964 if ((md->poptions & PCRE_UCP) != 0)
965 {
966 if (d == '_') left_word = TRUE; else
967 {
968 int cat = UCD_CATEGORY(d);
969 left_word = (cat == ucp_L || cat == ucp_N);
970 }
971 }
972 else
973 #endif
974 left_word = d < 256 && (ctypes[d] & ctype_word) != 0;
975 }
976 else left_word = FALSE;
977
978 if (clen > 0)
979 {
980 #ifdef SUPPORT_UCP
981 if ((md->poptions & PCRE_UCP) != 0)
982 {
983 if (c == '_') right_word = TRUE; else
984 {
985 int cat = UCD_CATEGORY(c);
986 right_word = (cat == ucp_L || cat == ucp_N);
987 }
988 }
989 else
990 #endif
991 right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
992 }
993 else right_word = FALSE;
994
995 if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
996 { ADD_ACTIVE(state_offset + 1, 0); }
997 }
998 break;
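/* Illustration only, not part of PCRE: a minimal sketch of the word-boundary
decision made above, restricted to single-byte ASCII subjects. The function
names and the `negated` flag are hypothetical; `negated` plays the role of
(codevalue == OP_NOT_WORD_BOUNDARY). A position is a boundary when exactly one
of its two neighbours is a word character, and positions outside the subject
count as non-word. */

```c
#include <ctype.h>
#include <stddef.h>

/* 1 if c is a letter, a digit or an underscore, otherwise 0. */
static int is_word_char(unsigned char c)
{
  return isalnum(c) || c == '_';
}

/* Non-zero if \b (negated == 0) or \B (negated != 0) holds at offset pos
   of the subject s[0..len). */
static int boundary_assertion_holds(const char *s, size_t len, size_t pos,
                                    int negated)
{
  int left_word  = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
  int right_word = pos < len && is_word_char((unsigned char)s[pos]);

  /* Same single comparison as in the case above: equal flags mean "not a
     boundary", which is what the negated assertion wants. */
  return (left_word == right_word) == (negated != 0);
}
```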
999
1000
1001 /*-----------------------------------------------------------------*/
1002 /* Check the next character by Unicode property. We will get here only
1003 if the support is in the binary; otherwise a compile-time error occurs.
1004 */
1005
1006 #ifdef SUPPORT_UCP
1007 case OP_PROP:
1008 case OP_NOTPROP:
1009 if (clen > 0)
1010 {
1011 BOOL OK;
1012 const ucd_record * prop = GET_UCD(c);
1013 switch(code[1])
1014 {
1015 case PT_ANY:
1016 OK = TRUE;
1017 break;
1018
1019 case PT_LAMP:
1020 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1021 prop->chartype == ucp_Lt;
1022 break;
1023
1024 case PT_GC:
1025 OK = _pcre_ucp_gentype[prop->chartype] == code[2];
1026 break;
1027
1028 case PT_PC:
1029 OK = prop->chartype == code[2];
1030 break;
1031
1032 case PT_SC:
1033 OK = prop->script == code[2];
1034 break;
1035
1036 /* These are specials for combination cases. */
1037
1038 case PT_ALNUM:
1039 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1040 _pcre_ucp_gentype[prop->chartype] == ucp_N;
1041 break;
1042
1043 case PT_SPACE: /* Perl space */
1044 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1045 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1046 break;
1047
1048 case PT_PXSPACE: /* POSIX space */
1049 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1050 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1051 c == CHAR_FF || c == CHAR_CR;
1052 break;
1053
1054 case PT_WORD:
1055 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1056 _pcre_ucp_gentype[prop->chartype] == ucp_N ||
1057 c == CHAR_UNDERSCORE;
1058 break;
1059
1060 /* Should never occur, but keep compilers from grumbling. */
1061
1062 default:
1063 OK = codevalue != OP_PROP;
1064 break;
1065 }
1066
1067 if (OK == (codevalue == OP_PROP)) { ADD_NEW(state_offset + 3, 0); }
1068 }
1069 break;
1070 #endif
1071
1072
1073
1074 /* ========================================================================== */
1075 /* These opcodes likewise inspect the subject character, but have an
1076 argument that is not a data character. It is one of these opcodes:
1077 OP_ANY, OP_ALLANY, OP_DIGIT, OP_NOT_DIGIT, OP_WHITESPACE, OP_NOT_WHITESPACE,
1078 OP_WORDCHAR, OP_NOT_WORDCHAR. The value is loaded into d. */
1079
1080 case OP_TYPEPLUS:
1081 case OP_TYPEMINPLUS:
1082 case OP_TYPEPOSPLUS:
1083 count = current_state->count; /* Already matched */
1084 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1085 if (clen > 0)
1086 {
1087 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1088 (c < 256 &&
1089 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1090 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1091 {
1092 if (count > 0 && codevalue == OP_TYPEPOSPLUS)
1093 {
1094 active_count--; /* Remove non-match possibility */
1095 next_active_state--;
1096 }
1097 count++;
1098 ADD_NEW(state_offset, count);
1099 }
1100 }
1101 break;
1102
1103 /*-----------------------------------------------------------------*/
1104 case OP_TYPEQUERY:
1105 case OP_TYPEMINQUERY:
1106 case OP_TYPEPOSQUERY:
1107 ADD_ACTIVE(state_offset + 2, 0);
1108 if (clen > 0)
1109 {
1110 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1111 (c < 256 &&
1112 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1113 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1114 {
1115 if (codevalue == OP_TYPEPOSQUERY)
1116 {
1117 active_count--; /* Remove non-match possibility */
1118 next_active_state--;
1119 }
1120 ADD_NEW(state_offset + 2, 0);
1121 }
1122 }
1123 break;
1124
1125 /*-----------------------------------------------------------------*/
1126 case OP_TYPESTAR:
1127 case OP_TYPEMINSTAR:
1128 case OP_TYPEPOSSTAR:
1129 ADD_ACTIVE(state_offset + 2, 0);
1130 if (clen > 0)
1131 {
1132 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1133 (c < 256 &&
1134 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1135 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1136 {
1137 if (codevalue == OP_TYPEPOSSTAR)
1138 {
1139 active_count--; /* Remove non-match possibility */
1140 next_active_state--;
1141 }
1142 ADD_NEW(state_offset, 0);
1143 }
1144 }
1145 break;
1146
1147 /*-----------------------------------------------------------------*/
1148 case OP_TYPEEXACT:
1149 count = current_state->count; /* Number already matched */
1150 if (clen > 0)
1151 {
1152 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1153 (c < 256 &&
1154 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1155 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1156 {
1157 if (++count >= GET2(code, 1))
1158 { ADD_NEW(state_offset + 4, 0); }
1159 else
1160 { ADD_NEW(state_offset, count); }
1161 }
1162 }
1163 break;
1164
1165 /*-----------------------------------------------------------------*/
1166 case OP_TYPEUPTO:
1167 case OP_TYPEMINUPTO:
1168 case OP_TYPEPOSUPTO:
1169 ADD_ACTIVE(state_offset + 4, 0);
1170 count = current_state->count; /* Number already matched */
1171 if (clen > 0)
1172 {
1173 if ((c >= 256 && d != OP_DIGIT && d != OP_WHITESPACE && d != OP_WORDCHAR) ||
1174 (c < 256 &&
1175 (d != OP_ANY || !IS_NEWLINE(ptr)) &&
1176 ((ctypes[c] & toptable1[d]) ^ toptable2[d]) != 0))
1177 {
1178 if (codevalue == OP_TYPEPOSUPTO)
1179 {
1180 active_count--; /* Remove non-match possibility */
1181 next_active_state--;
1182 }
1183 if (++count >= GET2(code, 1))
1184 { ADD_NEW(state_offset + 4, 0); }
1185 else
1186 { ADD_NEW(state_offset, count); }
1187 }
1188 }
1189 break;
1190
1191 /* ========================================================================== */
1192 /* These are virtual opcodes that are used when something like OP_TYPEPLUS
1193 has OP_PROP, OP_NOTPROP, OP_ANYNL, OP_EXTUNI, OP_VSPACE, OP_NOT_VSPACE,
1194 OP_HSPACE, or OP_NOT_HSPACE as its argument. Handling them separately keeps
1195 the code above fast for the other cases. The argument is in the d variable. */
1196
1197 #ifdef SUPPORT_UCP
1198 case OP_PROP_EXTRA + OP_TYPEPLUS:
1199 case OP_PROP_EXTRA + OP_TYPEMINPLUS:
1200 case OP_PROP_EXTRA + OP_TYPEPOSPLUS:
1201 count = current_state->count; /* Already matched */
1202 if (count > 0) { ADD_ACTIVE(state_offset + 4, 0); }
1203 if (clen > 0)
1204 {
1205 BOOL OK;
1206 const ucd_record * prop = GET_UCD(c);
1207 switch(code[2])
1208 {
1209 case PT_ANY:
1210 OK = TRUE;
1211 break;
1212
1213 case PT_LAMP:
1214 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1215 prop->chartype == ucp_Lt;
1216 break;
1217
1218 case PT_GC:
1219 OK = _pcre_ucp_gentype[prop->chartype] == code[3];
1220 break;
1221
1222 case PT_PC:
1223 OK = prop->chartype == code[3];
1224 break;
1225
1226 case PT_SC:
1227 OK = prop->script == code[3];
1228 break;
1229
1230 /* These are specials for combination cases. */
1231
1232 case PT_ALNUM:
1233 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1234 _pcre_ucp_gentype[prop->chartype] == ucp_N;
1235 break;
1236
1237 case PT_SPACE: /* Perl space */
1238 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1239 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1240 break;
1241
1242 case PT_PXSPACE: /* POSIX space */
1243 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1244 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1245 c == CHAR_FF || c == CHAR_CR;
1246 break;
1247
1248 case PT_WORD:
1249 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1250 _pcre_ucp_gentype[prop->chartype] == ucp_N ||
1251 c == CHAR_UNDERSCORE;
1252 break;
1253
1254 /* Should never occur, but keep compilers from grumbling. */
1255
1256 default:
1257 OK = d != OP_PROP;
1258 break;
1259 }
1260
1261 if (OK == (d == OP_PROP))
1262 {
1263 if (count > 0 && codevalue == OP_PROP_EXTRA + OP_TYPEPOSPLUS)
1264 {
1265 active_count--; /* Remove non-match possibility */
1266 next_active_state--;
1267 }
1268 count++;
1269 ADD_NEW(state_offset, count);
1270 }
1271 }
1272 break;
1273
1274 /*-----------------------------------------------------------------*/
1275 case OP_EXTUNI_EXTRA + OP_TYPEPLUS:
1276 case OP_EXTUNI_EXTRA + OP_TYPEMINPLUS:
1277 case OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS:
1278 count = current_state->count; /* Already matched */
1279 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1280 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1281 {
1282 const uschar *nptr = ptr + clen;
1283 int ncount = 0;
1284 if (count > 0 && codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSPLUS)
1285 {
1286 active_count--; /* Remove non-match possibility */
1287 next_active_state--;
1288 }
1289 while (nptr < end_subject)
1290 {
1291 int nd;
1292 int ndlen = 1;
1293 GETCHARLEN(nd, nptr, ndlen);
1294 if (UCD_CATEGORY(nd) != ucp_M) break;
1295 ncount++;
1296 nptr += ndlen;
1297 }
1298 count++;
1299 ADD_NEW_DATA(-state_offset, count, ncount);
1300 }
1301 break;
1302 #endif
1303
1304 /*-----------------------------------------------------------------*/
1305 case OP_ANYNL_EXTRA + OP_TYPEPLUS:
1306 case OP_ANYNL_EXTRA + OP_TYPEMINPLUS:
1307 case OP_ANYNL_EXTRA + OP_TYPEPOSPLUS:
1308 count = current_state->count; /* Already matched */
1309 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1310 if (clen > 0)
1311 {
1312 int ncount = 0;
1313 switch (c)
1314 {
1315 case 0x000b:
1316 case 0x000c:
1317 case 0x0085:
1318 case 0x2028:
1319 case 0x2029:
1320 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1321 goto ANYNL01;
1322
1323 case 0x000d:
1324 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1325 /* Fall through */
1326
1327 ANYNL01:
1328 case 0x000a:
1329 if (count > 0 && codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSPLUS)
1330 {
1331 active_count--; /* Remove non-match possibility */
1332 next_active_state--;
1333 }
1334 count++;
1335 ADD_NEW_DATA(-state_offset, count, ncount);
1336 break;
1337
1338 default:
1339 break;
1340 }
1341 }
1342 break;
1343
1344 /*-----------------------------------------------------------------*/
1345 case OP_VSPACE_EXTRA + OP_TYPEPLUS:
1346 case OP_VSPACE_EXTRA + OP_TYPEMINPLUS:
1347 case OP_VSPACE_EXTRA + OP_TYPEPOSPLUS:
1348 count = current_state->count; /* Already matched */
1349 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1350 if (clen > 0)
1351 {
1352 BOOL OK;
1353 switch (c)
1354 {
1355 case 0x000a:
1356 case 0x000b:
1357 case 0x000c:
1358 case 0x000d:
1359 case 0x0085:
1360 case 0x2028:
1361 case 0x2029:
1362 OK = TRUE;
1363 break;
1364
1365 default:
1366 OK = FALSE;
1367 break;
1368 }
1369
1370 if (OK == (d == OP_VSPACE))
1371 {
1372 if (count > 0 && codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSPLUS)
1373 {
1374 active_count--; /* Remove non-match possibility */
1375 next_active_state--;
1376 }
1377 count++;
1378 ADD_NEW_DATA(-state_offset, count, 0);
1379 }
1380 }
1381 break;
1382
1383 /*-----------------------------------------------------------------*/
1384 case OP_HSPACE_EXTRA + OP_TYPEPLUS:
1385 case OP_HSPACE_EXTRA + OP_TYPEMINPLUS:
1386 case OP_HSPACE_EXTRA + OP_TYPEPOSPLUS:
1387 count = current_state->count; /* Already matched */
1388 if (count > 0) { ADD_ACTIVE(state_offset + 2, 0); }
1389 if (clen > 0)
1390 {
1391 BOOL OK;
1392 switch (c)
1393 {
1394 case 0x09: /* HT */
1395 case 0x20: /* SPACE */
1396 case 0xa0: /* NBSP */
1397 case 0x1680: /* OGHAM SPACE MARK */
1398 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1399 case 0x2000: /* EN QUAD */
1400 case 0x2001: /* EM QUAD */
1401 case 0x2002: /* EN SPACE */
1402 case 0x2003: /* EM SPACE */
1403 case 0x2004: /* THREE-PER-EM SPACE */
1404 case 0x2005: /* FOUR-PER-EM SPACE */
1405 case 0x2006: /* SIX-PER-EM SPACE */
1406 case 0x2007: /* FIGURE SPACE */
1407 case 0x2008: /* PUNCTUATION SPACE */
1408 case 0x2009: /* THIN SPACE */
1409 case 0x200A: /* HAIR SPACE */
1410 case 0x202f: /* NARROW NO-BREAK SPACE */
1411 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1412 case 0x3000: /* IDEOGRAPHIC SPACE */
1413 OK = TRUE;
1414 break;
1415
1416 default:
1417 OK = FALSE;
1418 break;
1419 }
1420
1421 if (OK == (d == OP_HSPACE))
1422 {
1423 if (count > 0 && codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSPLUS)
1424 {
1425 active_count--; /* Remove non-match possibility */
1426 next_active_state--;
1427 }
1428 count++;
1429 ADD_NEW_DATA(-state_offset, count, 0);
1430 }
1431 }
1432 break;
1433
1434 /*-----------------------------------------------------------------*/
1435 #ifdef SUPPORT_UCP
1436 case OP_PROP_EXTRA + OP_TYPEQUERY:
1437 case OP_PROP_EXTRA + OP_TYPEMINQUERY:
1438 case OP_PROP_EXTRA + OP_TYPEPOSQUERY:
1439 count = 4;
1440 goto QS1;
1441
1442 case OP_PROP_EXTRA + OP_TYPESTAR:
1443 case OP_PROP_EXTRA + OP_TYPEMINSTAR:
1444 case OP_PROP_EXTRA + OP_TYPEPOSSTAR:
1445 count = 0;
1446
1447 QS1:
1448
1449 ADD_ACTIVE(state_offset + 4, 0);
1450 if (clen > 0)
1451 {
1452 BOOL OK;
1453 const ucd_record * prop = GET_UCD(c);
1454 switch(code[2])
1455 {
1456 case PT_ANY:
1457 OK = TRUE;
1458 break;
1459
1460 case PT_LAMP:
1461 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1462 prop->chartype == ucp_Lt;
1463 break;
1464
1465 case PT_GC:
1466 OK = _pcre_ucp_gentype[prop->chartype] == code[3];
1467 break;
1468
1469 case PT_PC:
1470 OK = prop->chartype == code[3];
1471 break;
1472
1473 case PT_SC:
1474 OK = prop->script == code[3];
1475 break;
1476
1477 /* These are specials for combination cases. */
1478
1479 case PT_ALNUM:
1480 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1481 _pcre_ucp_gentype[prop->chartype] == ucp_N;
1482 break;
1483
1484 case PT_SPACE: /* Perl space */
1485 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1486 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1487 break;
1488
1489 case PT_PXSPACE: /* POSIX space */
1490 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1491 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1492 c == CHAR_FF || c == CHAR_CR;
1493 break;
1494
1495 case PT_WORD:
1496 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1497 _pcre_ucp_gentype[prop->chartype] == ucp_N ||
1498 c == CHAR_UNDERSCORE;
1499 break;
1500
1501 /* Should never occur, but keep compilers from grumbling. */
1502
1503 default:
1504 OK = d != OP_PROP;
1505 break;
1506 }
1507
1508 if (OK == (d == OP_PROP))
1509 {
1510 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSSTAR ||
1511 codevalue == OP_PROP_EXTRA + OP_TYPEPOSQUERY)
1512 {
1513 active_count--; /* Remove non-match possibility */
1514 next_active_state--;
1515 }
1516 ADD_NEW(state_offset + count, 0);
1517 }
1518 }
1519 break;
1520
1521 /*-----------------------------------------------------------------*/
1522 case OP_EXTUNI_EXTRA + OP_TYPEQUERY:
1523 case OP_EXTUNI_EXTRA + OP_TYPEMINQUERY:
1524 case OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY:
1525 count = 2;
1526 goto QS2;
1527
1528 case OP_EXTUNI_EXTRA + OP_TYPESTAR:
1529 case OP_EXTUNI_EXTRA + OP_TYPEMINSTAR:
1530 case OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR:
1531 count = 0;
1532
1533 QS2:
1534
1535 ADD_ACTIVE(state_offset + 2, 0);
1536 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1537 {
1538 const uschar *nptr = ptr + clen;
1539 int ncount = 0;
1540 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSSTAR ||
1541 codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSQUERY)
1542 {
1543 active_count--; /* Remove non-match possibility */
1544 next_active_state--;
1545 }
1546 while (nptr < end_subject)
1547 {
1548 int nd;
1549 int ndlen = 1;
1550 GETCHARLEN(nd, nptr, ndlen);
1551 if (UCD_CATEGORY(nd) != ucp_M) break;
1552 ncount++;
1553 nptr += ndlen;
1554 }
1555 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1556 }
1557 break;
1558 #endif
1559
1560 /*-----------------------------------------------------------------*/
1561 case OP_ANYNL_EXTRA + OP_TYPEQUERY:
1562 case OP_ANYNL_EXTRA + OP_TYPEMINQUERY:
1563 case OP_ANYNL_EXTRA + OP_TYPEPOSQUERY:
1564 count = 2;
1565 goto QS3;
1566
1567 case OP_ANYNL_EXTRA + OP_TYPESTAR:
1568 case OP_ANYNL_EXTRA + OP_TYPEMINSTAR:
1569 case OP_ANYNL_EXTRA + OP_TYPEPOSSTAR:
1570 count = 0;
1571
1572 QS3:
1573 ADD_ACTIVE(state_offset + 2, 0);
1574 if (clen > 0)
1575 {
1576 int ncount = 0;
1577 switch (c)
1578 {
1579 case 0x000b:
1580 case 0x000c:
1581 case 0x0085:
1582 case 0x2028:
1583 case 0x2029:
1584 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1585 goto ANYNL02;
1586
1587 case 0x000d:
1588 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1589 /* Fall through */
1590
1591 ANYNL02:
1592 case 0x000a:
1593 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSSTAR ||
1594 codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSQUERY)
1595 {
1596 active_count--; /* Remove non-match possibility */
1597 next_active_state--;
1598 }
1599 ADD_NEW_DATA(-(state_offset + count), 0, ncount);
1600 break;
1601
1602 default:
1603 break;
1604 }
1605 }
1606 break;
1607
1608 /*-----------------------------------------------------------------*/
1609 case OP_VSPACE_EXTRA + OP_TYPEQUERY:
1610 case OP_VSPACE_EXTRA + OP_TYPEMINQUERY:
1611 case OP_VSPACE_EXTRA + OP_TYPEPOSQUERY:
1612 count = 2;
1613 goto QS4;
1614
1615 case OP_VSPACE_EXTRA + OP_TYPESTAR:
1616 case OP_VSPACE_EXTRA + OP_TYPEMINSTAR:
1617 case OP_VSPACE_EXTRA + OP_TYPEPOSSTAR:
1618 count = 0;
1619
1620 QS4:
1621 ADD_ACTIVE(state_offset + 2, 0);
1622 if (clen > 0)
1623 {
1624 BOOL OK;
1625 switch (c)
1626 {
1627 case 0x000a:
1628 case 0x000b:
1629 case 0x000c:
1630 case 0x000d:
1631 case 0x0085:
1632 case 0x2028:
1633 case 0x2029:
1634 OK = TRUE;
1635 break;
1636
1637 default:
1638 OK = FALSE;
1639 break;
1640 }
1641 if (OK == (d == OP_VSPACE))
1642 {
1643 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSSTAR ||
1644 codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSQUERY)
1645 {
1646 active_count--; /* Remove non-match possibility */
1647 next_active_state--;
1648 }
1649 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1650 }
1651 }
1652 break;
1653
1654 /*-----------------------------------------------------------------*/
1655 case OP_HSPACE_EXTRA + OP_TYPEQUERY:
1656 case OP_HSPACE_EXTRA + OP_TYPEMINQUERY:
1657 case OP_HSPACE_EXTRA + OP_TYPEPOSQUERY:
1658 count = 2;
1659 goto QS5;
1660
1661 case OP_HSPACE_EXTRA + OP_TYPESTAR:
1662 case OP_HSPACE_EXTRA + OP_TYPEMINSTAR:
1663 case OP_HSPACE_EXTRA + OP_TYPEPOSSTAR:
1664 count = 0;
1665
1666 QS5:
1667 ADD_ACTIVE(state_offset + 2, 0);
1668 if (clen > 0)
1669 {
1670 BOOL OK;
1671 switch (c)
1672 {
1673 case 0x09: /* HT */
1674 case 0x20: /* SPACE */
1675 case 0xa0: /* NBSP */
1676 case 0x1680: /* OGHAM SPACE MARK */
1677 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1678 case 0x2000: /* EN QUAD */
1679 case 0x2001: /* EM QUAD */
1680 case 0x2002: /* EN SPACE */
1681 case 0x2003: /* EM SPACE */
1682 case 0x2004: /* THREE-PER-EM SPACE */
1683 case 0x2005: /* FOUR-PER-EM SPACE */
1684 case 0x2006: /* SIX-PER-EM SPACE */
1685 case 0x2007: /* FIGURE SPACE */
1686 case 0x2008: /* PUNCTUATION SPACE */
1687 case 0x2009: /* THIN SPACE */
1688 case 0x200A: /* HAIR SPACE */
1689 case 0x202f: /* NARROW NO-BREAK SPACE */
1690 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1691 case 0x3000: /* IDEOGRAPHIC SPACE */
1692 OK = TRUE;
1693 break;
1694
1695 default:
1696 OK = FALSE;
1697 break;
1698 }
1699
1700 if (OK == (d == OP_HSPACE))
1701 {
1702 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSSTAR ||
1703 codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSQUERY)
1704 {
1705 active_count--; /* Remove non-match possibility */
1706 next_active_state--;
1707 }
1708 ADD_NEW_DATA(-(state_offset + count), 0, 0);
1709 }
1710 }
1711 break;
1712
1713 /*-----------------------------------------------------------------*/
1714 #ifdef SUPPORT_UCP
1715 case OP_PROP_EXTRA + OP_TYPEEXACT:
1716 case OP_PROP_EXTRA + OP_TYPEUPTO:
1717 case OP_PROP_EXTRA + OP_TYPEMINUPTO:
1718 case OP_PROP_EXTRA + OP_TYPEPOSUPTO:
1719 if (codevalue != OP_PROP_EXTRA + OP_TYPEEXACT)
1720 { ADD_ACTIVE(state_offset + 6, 0); }
1721 count = current_state->count; /* Number already matched */
1722 if (clen > 0)
1723 {
1724 BOOL OK;
1725 const ucd_record * prop = GET_UCD(c);
1726 switch(code[4])
1727 {
1728 case PT_ANY:
1729 OK = TRUE;
1730 break;
1731
1732 case PT_LAMP:
1733 OK = prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
1734 prop->chartype == ucp_Lt;
1735 break;
1736
1737 case PT_GC:
1738 OK = _pcre_ucp_gentype[prop->chartype] == code[5];
1739 break;
1740
1741 case PT_PC:
1742 OK = prop->chartype == code[5];
1743 break;
1744
1745 case PT_SC:
1746 OK = prop->script == code[5];
1747 break;
1748
1749 /* These are specials for combination cases. */
1750
1751 case PT_ALNUM:
1752 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1753 _pcre_ucp_gentype[prop->chartype] == ucp_N;
1754 break;
1755
1756 case PT_SPACE: /* Perl space */
1757 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1758 c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR;
1759 break;
1760
1761 case PT_PXSPACE: /* POSIX space */
1762 OK = _pcre_ucp_gentype[prop->chartype] == ucp_Z ||
1763 c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
1764 c == CHAR_FF || c == CHAR_CR;
1765 break;
1766
1767 case PT_WORD:
1768 OK = _pcre_ucp_gentype[prop->chartype] == ucp_L ||
1769 _pcre_ucp_gentype[prop->chartype] == ucp_N ||
1770 c == CHAR_UNDERSCORE;
1771 break;
1772
1773 /* Should never occur, but keep compilers from grumbling. */
1774
1775 default:
1776 OK = d != OP_PROP;
1777 break;
1778 }
1779
1780 if (OK == (d == OP_PROP))
1781 {
1782 if (codevalue == OP_PROP_EXTRA + OP_TYPEPOSUPTO)
1783 {
1784 active_count--; /* Remove non-match possibility */
1785 next_active_state--;
1786 }
1787 if (++count >= GET2(code, 1))
1788 { ADD_NEW(state_offset + 6, 0); }
1789 else
1790 { ADD_NEW(state_offset, count); }
1791 }
1792 }
1793 break;
1794
1795 /*-----------------------------------------------------------------*/
1796 case OP_EXTUNI_EXTRA + OP_TYPEEXACT:
1797 case OP_EXTUNI_EXTRA + OP_TYPEUPTO:
1798 case OP_EXTUNI_EXTRA + OP_TYPEMINUPTO:
1799 case OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO:
1800 if (codevalue != OP_EXTUNI_EXTRA + OP_TYPEEXACT)
1801 { ADD_ACTIVE(state_offset + 4, 0); }
1802 count = current_state->count; /* Number already matched */
1803 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
1804 {
1805 const uschar *nptr = ptr + clen;
1806 int ncount = 0;
1807 if (codevalue == OP_EXTUNI_EXTRA + OP_TYPEPOSUPTO)
1808 {
1809 active_count--; /* Remove non-match possibility */
1810 next_active_state--;
1811 }
1812 while (nptr < end_subject)
1813 {
1814 int nd;
1815 int ndlen = 1;
1816 GETCHARLEN(nd, nptr, ndlen);
1817 if (UCD_CATEGORY(nd) != ucp_M) break;
1818 ncount++;
1819 nptr += ndlen;
1820 }
1821 if (++count >= GET2(code, 1))
1822 { ADD_NEW_DATA(-(state_offset + 4), 0, ncount); }
1823 else
1824 { ADD_NEW_DATA(-state_offset, count, ncount); }
1825 }
1826 break;
1827 #endif
1828
1829 /*-----------------------------------------------------------------*/
1830 case OP_ANYNL_EXTRA + OP_TYPEEXACT:
1831 case OP_ANYNL_EXTRA + OP_TYPEUPTO:
1832 case OP_ANYNL_EXTRA + OP_TYPEMINUPTO:
1833 case OP_ANYNL_EXTRA + OP_TYPEPOSUPTO:
1834 if (codevalue != OP_ANYNL_EXTRA + OP_TYPEEXACT)
1835 { ADD_ACTIVE(state_offset + 4, 0); }
1836 count = current_state->count; /* Number already matched */
1837 if (clen > 0)
1838 {
1839 int ncount = 0;
1840 switch (c)
1841 {
1842 case 0x000b:
1843 case 0x000c:
1844 case 0x0085:
1845 case 0x2028:
1846 case 0x2029:
1847 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
1848 goto ANYNL03;
1849
1850 case 0x000d:
1851 if (ptr + 1 < end_subject && ptr[1] == 0x0a) ncount = 1;
1852 /* Fall through */
1853
1854 ANYNL03:
1855 case 0x000a:
1856 if (codevalue == OP_ANYNL_EXTRA + OP_TYPEPOSUPTO)
1857 {
1858 active_count--; /* Remove non-match possibility */
1859 next_active_state--;
1860 }
1861 if (++count >= GET2(code, 1))
1862 { ADD_NEW_DATA(-(state_offset + 4), 0, ncount); }
1863 else
1864 { ADD_NEW_DATA(-state_offset, count, ncount); }
1865 break;
1866
1867 default:
1868 break;
1869 }
1870 }
1871 break;
1872
1873 /*-----------------------------------------------------------------*/
1874 case OP_VSPACE_EXTRA + OP_TYPEEXACT:
1875 case OP_VSPACE_EXTRA + OP_TYPEUPTO:
1876 case OP_VSPACE_EXTRA + OP_TYPEMINUPTO:
1877 case OP_VSPACE_EXTRA + OP_TYPEPOSUPTO:
1878 if (codevalue != OP_VSPACE_EXTRA + OP_TYPEEXACT)
1879 { ADD_ACTIVE(state_offset + 4, 0); }
1880 count = current_state->count; /* Number already matched */
1881 if (clen > 0)
1882 {
1883 BOOL OK;
1884 switch (c)
1885 {
1886 case 0x000a:
1887 case 0x000b:
1888 case 0x000c:
1889 case 0x000d:
1890 case 0x0085:
1891 case 0x2028:
1892 case 0x2029:
1893 OK = TRUE;
1894 break;
1895
1896 default:
1897 OK = FALSE;
1898 }
1899
1900 if (OK == (d == OP_VSPACE))
1901 {
1902 if (codevalue == OP_VSPACE_EXTRA + OP_TYPEPOSUPTO)
1903 {
1904 active_count--; /* Remove non-match possibility */
1905 next_active_state--;
1906 }
1907 if (++count >= GET2(code, 1))
1908 { ADD_NEW_DATA(-(state_offset + 4), 0, 0); }
1909 else
1910 { ADD_NEW_DATA(-state_offset, count, 0); }
1911 }
1912 }
1913 break;
1914
1915 /*-----------------------------------------------------------------*/
1916 case OP_HSPACE_EXTRA + OP_TYPEEXACT:
1917 case OP_HSPACE_EXTRA + OP_TYPEUPTO:
1918 case OP_HSPACE_EXTRA + OP_TYPEMINUPTO:
1919 case OP_HSPACE_EXTRA + OP_TYPEPOSUPTO:
1920 if (codevalue != OP_HSPACE_EXTRA + OP_TYPEEXACT)
1921 { ADD_ACTIVE(state_offset + 4, 0); }
1922 count = current_state->count; /* Number already matched */
1923 if (clen > 0)
1924 {
1925 BOOL OK;
1926 switch (c)
1927 {
1928 case 0x09: /* HT */
1929 case 0x20: /* SPACE */
1930 case 0xa0: /* NBSP */
1931 case 0x1680: /* OGHAM SPACE MARK */
1932 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
1933 case 0x2000: /* EN QUAD */
1934 case 0x2001: /* EM QUAD */
1935 case 0x2002: /* EN SPACE */
1936 case 0x2003: /* EM SPACE */
1937 case 0x2004: /* THREE-PER-EM SPACE */
1938 case 0x2005: /* FOUR-PER-EM SPACE */
1939 case 0x2006: /* SIX-PER-EM SPACE */
1940 case 0x2007: /* FIGURE SPACE */
1941 case 0x2008: /* PUNCTUATION SPACE */
1942 case 0x2009: /* THIN SPACE */
1943 case 0x200A: /* HAIR SPACE */
1944 case 0x202f: /* NARROW NO-BREAK SPACE */
1945 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
1946 case 0x3000: /* IDEOGRAPHIC SPACE */
1947 OK = TRUE;
1948 break;
1949
1950 default:
1951 OK = FALSE;
1952 break;
1953 }
1954
1955 if (OK == (d == OP_HSPACE))
1956 {
1957 if (codevalue == OP_HSPACE_EXTRA + OP_TYPEPOSUPTO)
1958 {
1959 active_count--; /* Remove non-match possibility */
1960 next_active_state--;
1961 }
1962 if (++count >= GET2(code, 1))
1963 { ADD_NEW_DATA(-(state_offset + 4), 0, 0); }
1964 else
1965 { ADD_NEW_DATA(-state_offset, count, 0); }
1966 }
1967 }
1968 break;
1969
1970 /* ========================================================================== */
1971 /* These opcodes are followed by a character that is usually compared
1972 to the current subject character; it is loaded into d. We still get
1973 here even if there is no subject character, because in some cases zero
1974 repetitions are permitted. */
1975
1976 /*-----------------------------------------------------------------*/
1977 case OP_CHAR:
1978 if (clen > 0 && c == d) { ADD_NEW(state_offset + dlen + 1, 0); }
1979 break;
1980
1981 /*-----------------------------------------------------------------*/
1982 case OP_CHARI:
1983 if (clen == 0) break;
1984
1985 #ifdef SUPPORT_UTF8
1986 if (utf8)
1987 {
1988 if (c == d) { ADD_NEW(state_offset + dlen + 1, 0); } else
1989 {
1990 unsigned int othercase;
1991 if (c < 128) othercase = fcc[c]; else
1992
1993 /* If we have Unicode property support, we can use it to test the
1994 other case of the character. */
1995
1996 #ifdef SUPPORT_UCP
1997 othercase = UCD_OTHERCASE(c);
1998 #else
1999 othercase = NOTACHAR;
2000 #endif
2001
2002 if (d == othercase) { ADD_NEW(state_offset + dlen + 1, 0); }
2003 }
2004 }
2005 else
2006 #endif /* SUPPORT_UTF8 */
2007
2008 /* Non-UTF-8 mode */
2009 {
2010 if (lcc[c] == lcc[d]) { ADD_NEW(state_offset + 2, 0); }
2011 }
2012 break;
2013
2014
2015 #ifdef SUPPORT_UCP
2016 /*-----------------------------------------------------------------*/
2017 /* This is a tricky one because it can match more than one character.
2018 Find out how many characters to skip, and then set up a negative state
2019 to wait for them to pass before continuing. */
2020
2021 case OP_EXTUNI:
2022 if (clen > 0 && UCD_CATEGORY(c) != ucp_M)
2023 {
2024 const uschar *nptr = ptr + clen;
2025 int ncount = 0;
2026 while (nptr < end_subject)
2027 {
2028 int nclen = 1;
2029 GETCHARLEN(c, nptr, nclen);
2030 if (UCD_CATEGORY(c) != ucp_M) break;
2031 ncount++;
2032 nptr += nclen;
2033 }
2034 ADD_NEW_DATA(-(state_offset + 1), 0, ncount);
2035 }
2036 break;
2037 #endif
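/* Illustration only, not part of PCRE: a minimal sketch of the counting step
in OP_EXTUNI above. It works on an array of already-decoded code points rather
than on UTF-8 bytes, and is_mark() is a hypothetical stand-in for the test
UCD_CATEGORY(c) == ucp_M that recognises only the Combining Diacritical Marks
block. The returned count corresponds to ncount, the number of further
characters the negative state has to wait for. */

```c
#include <stddef.h>

/* Hypothetical mark test: U+0300..U+036F only. */
static int is_mark(unsigned int cp)
{
  return cp >= 0x0300 && cp <= 0x036F;
}

/* Returns the number of mark characters that follow the base character at
   cp[pos], or -1 if there is no base character there (end of input, or the
   character at pos is itself a mark). */
static int extra_marks(const unsigned int *cp, size_t len, size_t pos)
{
  int ncount = 0;
  size_t i;

  if (pos >= len || is_mark(cp[pos])) return -1;
  for (i = pos + 1; i < len && is_mark(cp[i]); i++) ncount++;
  return ncount;
}
```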
2038
2039 /*-----------------------------------------------------------------*/
2040 /* This is tricky, like EXTUNI, because it too can match more than one
2041 character (when CR is followed by LF). In this case, set up a negative
2042 state to wait for one character to pass before continuing. */
2043
2044 case OP_ANYNL:
2045 if (clen > 0) switch(c)
2046 {
2047 case 0x000b:
2048 case 0x000c:
2049 case 0x0085:
2050 case 0x2028:
2051 case 0x2029:
2052 if ((md->moptions & PCRE_BSR_ANYCRLF) != 0) break;
2053 /* Fall through */
2054 case 0x000a:
2055 ADD_NEW(state_offset + 1, 0);
2056 break;
2057
2058 case 0x000d:
2059 if (ptr + 1 < end_subject && ptr[1] == 0x0a)
2060 {
2061 ADD_NEW_DATA(-(state_offset + 1), 0, 1);
2062 }
2063 else
2064 {
2065 ADD_NEW(state_offset + 1, 0);
2066 }
2067 break;
2068 }
2069 break;
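/* Illustration only, not part of PCRE: a minimal sketch of the newline
classification in OP_ANYNL above, for single-byte subjects (so U+2028 and
U+2029 are left out). The function name and the anycrlf_only flag are
hypothetical; the flag plays the role of the PCRE_BSR_ANYCRLF test. Instead of
setting up a negative state, this sketch simply reports how many bytes the
newline sequence occupies, which is 2 only for CR followed by LF. */

```c
#include <stddef.h>

/* Returns the length in bytes of the newline sequence starting at s[pos],
   or 0 if no newline sequence starts there. */
static size_t newline_length(const unsigned char *s, size_t len, size_t pos,
                             int anycrlf_only)
{
  if (pos >= len) return 0;
  switch (s[pos])
  {
    case 0x0d:                                   /* CR, possibly CR LF */
      return (pos + 1 < len && s[pos + 1] == 0x0a) ? 2 : 1;

    case 0x0a:                                   /* LF */
      return 1;

    case 0x0b:                                   /* VT */
    case 0x0c:                                   /* FF */
    case 0x85:                                   /* NEL */
      return anycrlf_only ? 0 : 1;

    default:
      return 0;
  }
}
```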
2070
2071 /*-----------------------------------------------------------------*/
2072 case OP_NOT_VSPACE:
2073 if (clen > 0) switch(c)
2074 {
2075 case 0x000a:
2076 case 0x000b:
2077 case 0x000c:
2078 case 0x000d:
2079 case 0x0085:
2080 case 0x2028:
2081 case 0x2029:
2082 break;
2083
2084 default:
2085 ADD_NEW(state_offset + 1, 0);
2086 break;
2087 }
2088 break;
2089
2090 /*-----------------------------------------------------------------*/
2091 case OP_VSPACE:
2092 if (clen > 0) switch(c)
2093 {
2094 case 0x000a:
2095 case 0x000b:
2096 case 0x000c:
2097 case 0x000d:
2098 case 0x0085:
2099 case 0x2028:
2100 case 0x2029:
2101 ADD_NEW(state_offset + 1, 0);
2102 break;
2103
2104 default: break;
2105 }
2106 break;
2107
2108 /*-----------------------------------------------------------------*/
2109 case OP_NOT_HSPACE:
2110 if (clen > 0) switch(c)
2111 {
2112 case 0x09: /* HT */
2113 case 0x20: /* SPACE */
2114 case 0xa0: /* NBSP */
2115 case 0x1680: /* OGHAM SPACE MARK */
2116 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2117 case 0x2000: /* EN QUAD */
2118 case 0x2001: /* EM QUAD */
2119 case 0x2002: /* EN SPACE */
2120 case 0x2003: /* EM SPACE */
2121 case 0x2004: /* THREE-PER-EM SPACE */
2122 case 0x2005: /* FOUR-PER-EM SPACE */
2123 case 0x2006: /* SIX-PER-EM SPACE */
2124 case 0x2007: /* FIGURE SPACE */
2125 case 0x2008: /* PUNCTUATION SPACE */
2126 case 0x2009: /* THIN SPACE */
2127 case 0x200A: /* HAIR SPACE */
2128 case 0x202f: /* NARROW NO-BREAK SPACE */
2129 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2130 case 0x3000: /* IDEOGRAPHIC SPACE */
2131 break;
2132
2133 default:
2134 ADD_NEW(state_offset + 1, 0);
2135 break;
2136 }
2137 break;
2138
2139 /*-----------------------------------------------------------------*/
2140 case OP_HSPACE:
2141 if (clen > 0) switch(c)
2142 {
2143 case 0x09: /* HT */
2144 case 0x20: /* SPACE */
2145 case 0xa0: /* NBSP */
2146 case 0x1680: /* OGHAM SPACE MARK */
2147 case 0x180e: /* MONGOLIAN VOWEL SEPARATOR */
2148 case 0x2000: /* EN QUAD */
2149 case 0x2001: /* EM QUAD */
2150 case 0x2002: /* EN SPACE */
2151 case 0x2003: /* EM SPACE */
2152 case 0x2004: /* THREE-PER-EM SPACE */
2153 case 0x2005: /* FOUR-PER-EM SPACE */
2154 case 0x2006: /* SIX-PER-EM SPACE */
2155 case 0x2007: /* FIGURE SPACE */
2156 case 0x2008: /* PUNCTUATION SPACE */
2157 case 0x2009: /* THIN SPACE */
2158 case 0x200A: /* HAIR SPACE */
2159 case 0x202f: /* NARROW NO-BREAK SPACE */
2160 case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
2161 case 0x3000: /* IDEOGRAPHIC SPACE */
2162 ADD_NEW(state_offset + 1, 0);
2163 break;
2164 }
2165 break;
2166
2167 /*-----------------------------------------------------------------*/
2168 /* Match a negated single character casefully. This is only used for
2169 one-byte characters, that is, we know that d < 256. The character we are
2170 checking (c) can be multibyte. */
2171
2172 case OP_NOT:
2173 if (clen > 0 && c != d) { ADD_NEW(state_offset + dlen + 1, 0); }
2174 break;
2175
2176 /*-----------------------------------------------------------------*/
2177 /* Match a negated single character caselessly. This is only used for
2178 one-byte characters, that is, we know that d < 256. The character we are
2179 checking (c) can be multibyte. */
2180
2181 case OP_NOTI:
2182 if (clen > 0 && c != d && c != fcc[d])
2183 { ADD_NEW(state_offset + dlen + 1, 0); }
2184 break;
2185
2186 /*-----------------------------------------------------------------*/
2187 case OP_PLUSI:
2188 case OP_MINPLUSI:
2189 case OP_POSPLUSI:
2190 case OP_NOTPLUSI:
2191 case OP_NOTMINPLUSI:
2192 case OP_NOTPOSPLUSI:
2193 caseless = TRUE;
2194 codevalue -= OP_STARI - OP_STAR;
2195
2196 /* Fall through */
2197 case OP_PLUS:
2198 case OP_MINPLUS:
2199 case OP_POSPLUS:
2200 case OP_NOTPLUS:
2201 case OP_NOTMINPLUS:
2202 case OP_NOTPOSPLUS:
2203 count = current_state->count; /* Already matched */
2204 if (count > 0) { ADD_ACTIVE(state_offset + dlen + 1, 0); }
2205 if (clen > 0)
2206 {
2207 unsigned int otherd = NOTACHAR;
2208 if (caseless)
2209 {
2210 #ifdef SUPPORT_UTF8
2211 if (utf8 && d >= 128)
2212 {
2213 #ifdef SUPPORT_UCP
2214 otherd = UCD_OTHERCASE(d);
2215 #endif /* SUPPORT_UCP */
2216 }
2217 else
2218 #endif /* SUPPORT_UTF8 */
2219 otherd = fcc[d];
2220 }
2221 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2222 {
2223 if (count > 0 &&
2224 (codevalue == OP_POSPLUS || codevalue == OP_NOTPOSPLUS))
2225 {
2226 active_count--; /* Remove non-match possibility */
2227 next_active_state--;
2228 }
2229 count++;
2230 ADD_NEW(state_offset, count);
2231 }
2232 }
2233 break;
2234
2235 /*-----------------------------------------------------------------*/
2236 case OP_QUERYI:
2237 case OP_MINQUERYI:
2238 case OP_POSQUERYI:
2239 case OP_NOTQUERYI:
2240 case OP_NOTMINQUERYI:
2241 case OP_NOTPOSQUERYI:
2242 caseless = TRUE;
2243 codevalue -= OP_STARI - OP_STAR;
2244 /* Fall through */
2245 case OP_QUERY:
2246 case OP_MINQUERY:
2247 case OP_POSQUERY:
2248 case OP_NOTQUERY:
2249 case OP_NOTMINQUERY:
2250 case OP_NOTPOSQUERY:
2251 ADD_ACTIVE(state_offset + dlen + 1, 0);
2252 if (clen > 0)
2253 {
2254 unsigned int otherd = NOTACHAR;
2255 if (caseless)
2256 {
2257 #ifdef SUPPORT_UTF8
2258 if (utf8 && d >= 128)
2259 {
2260 #ifdef SUPPORT_UCP
2261 otherd = UCD_OTHERCASE(d);
2262 #endif /* SUPPORT_UCP */
2263 }
2264 else
2265 #endif /* SUPPORT_UTF8 */
2266 otherd = fcc[d];
2267 }
2268 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2269 {
2270 if (codevalue == OP_POSQUERY || codevalue == OP_NOTPOSQUERY)
2271 {
2272 active_count--; /* Remove non-match possibility */
2273 next_active_state--;
2274 }
2275 ADD_NEW(state_offset + dlen + 1, 0);
2276 }
2277 }
2278 break;
2279
2280 /*-----------------------------------------------------------------*/
2281 case OP_STARI:
2282 case OP_MINSTARI:
2283 case OP_POSSTARI:
2284 case OP_NOTSTARI:
2285 case OP_NOTMINSTARI:
2286 case OP_NOTPOSSTARI:
2287 caseless = TRUE;
2288 codevalue -= OP_STARI - OP_STAR;
2289 /* Fall through */
2290 case OP_STAR:
2291 case OP_MINSTAR:
2292 case OP_POSSTAR:
2293 case OP_NOTSTAR:
2294 case OP_NOTMINSTAR:
2295 case OP_NOTPOSSTAR:
2296 ADD_ACTIVE(state_offset + dlen + 1, 0);
2297 if (clen > 0)
2298 {
2299 unsigned int otherd = NOTACHAR;
2300 if (caseless)
2301 {
2302 #ifdef SUPPORT_UTF8
2303 if (utf8 && d >= 128)
2304 {
2305 #ifdef SUPPORT_UCP
2306 otherd = UCD_OTHERCASE(d);
2307 #endif /* SUPPORT_UCP */
2308 }
2309 else
2310 #endif /* SUPPORT_UTF8 */
2311 otherd = fcc[d];
2312 }
2313 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2314 {
2315 if (codevalue == OP_POSSTAR || codevalue == OP_NOTPOSSTAR)
2316 {
2317 active_count--; /* Remove non-match possibility */
2318 next_active_state--;
2319 }
2320 ADD_NEW(state_offset, 0);
2321 }
2322 }
2323 break;
2324
2325 /*-----------------------------------------------------------------*/
2326 case OP_EXACTI:
2327 case OP_NOTEXACTI:
2328 caseless = TRUE;
2329 codevalue -= OP_STARI - OP_STAR;
2330 /* Fall through */
2331 case OP_EXACT:
2332 case OP_NOTEXACT:
2333 count = current_state->count; /* Number already matched */
2334 if (clen > 0)
2335 {
2336 unsigned int otherd = NOTACHAR;
2337 if (caseless)
2338 {
2339 #ifdef SUPPORT_UTF8
2340 if (utf8 && d >= 128)
2341 {
2342 #ifdef SUPPORT_UCP
2343 otherd = UCD_OTHERCASE(d);
2344 #endif /* SUPPORT_UCP */
2345 }
2346 else
2347 #endif /* SUPPORT_UTF8 */
2348 otherd = fcc[d];
2349 }
2350 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2351 {
2352 if (++count >= GET2(code, 1))
2353 { ADD_NEW(state_offset + dlen + 3, 0); }
2354 else
2355 { ADD_NEW(state_offset, count); }
2356 }
2357 }
2358 break;
2359
2360 /*-----------------------------------------------------------------*/
2361 case OP_UPTOI:
2362 case OP_MINUPTOI:
2363 case OP_POSUPTOI:
2364 case OP_NOTUPTOI:
2365 case OP_NOTMINUPTOI:
2366 case OP_NOTPOSUPTOI:
2367 caseless = TRUE;
2368 codevalue -= OP_STARI - OP_STAR;
2369 /* Fall through */
2370 case OP_UPTO:
2371 case OP_MINUPTO:
2372 case OP_POSUPTO:
2373 case OP_NOTUPTO:
2374 case OP_NOTMINUPTO:
2375 case OP_NOTPOSUPTO:
2376 ADD_ACTIVE(state_offset + dlen + 3, 0);
2377 count = current_state->count; /* Number already matched */
2378 if (clen > 0)
2379 {
2380 unsigned int otherd = NOTACHAR;
2381 if (caseless)
2382 {
2383 #ifdef SUPPORT_UTF8
2384 if (utf8 && d >= 128)
2385 {
2386 #ifdef SUPPORT_UCP
2387 otherd = UCD_OTHERCASE(d);
2388 #endif /* SUPPORT_UCP */
2389 }
2390 else
2391 #endif /* SUPPORT_UTF8 */
2392 otherd = fcc[d];
2393 }
2394 if ((c == d || c == otherd) == (codevalue < OP_NOTSTAR))
2395 {
2396 if (codevalue == OP_POSUPTO || codevalue == OP_NOTPOSUPTO)
2397 {
2398 active_count--; /* Remove non-match possibility */
2399 next_active_state--;
2400 }
2401 if (++count >= GET2(code, 1))
2402 { ADD_NEW(state_offset + dlen + 3, 0); }
2403 else
2404 { ADD_NEW(state_offset, count); }
2405 }
2406 }
2407 break;
2408
2409
2410 /* ========================================================================== */
2411 /* These are the class-handling opcodes */
2412
2413 case OP_CLASS:
2414 case OP_NCLASS:
2415 case OP_XCLASS:
2416 {
2417 BOOL isinclass = FALSE;
2418 int next_state_offset;
2419 const uschar *ecode;
2420
2421 /* For a simple class, there is always just a 32-byte table, and we
2422 can set isinclass from it. */
2423
2424 if (codevalue != OP_XCLASS)
2425 {
2426 ecode = code + 33;
2427 if (clen > 0)
2428 {
2429 isinclass = (c > 255)? (codevalue == OP_NCLASS) :
2430 ((code[1 + c/8] & (1 << (c&7))) != 0);
2431 }
2432 }
2433
2434 /* An extended class may have a table or a list of single characters,
2435 ranges, or both, and it may be positive or negative. There's a
2436 function that sorts all this out. */
2437
2438 else
2439 {
2440 ecode = code + GET(code, 1);
2441 if (clen > 0) isinclass = _pcre_xclass(c, code + 1 + LINK_SIZE);
2442 }
2443
2444 /* At this point, isinclass is set for all kinds of class, and ecode
2445 points to the byte after the end of the class. If there is a
2446 quantifier, this is where it will be. */
2447
2448 next_state_offset = (int)(ecode - start_code);
2449
2450 switch (*ecode)
2451 {
2452 case OP_CRSTAR:
2453 case OP_CRMINSTAR:
2454 ADD_ACTIVE(next_state_offset + 1, 0);
2455 if (isinclass) { ADD_NEW(state_offset, 0); }
2456 break;
2457
2458 case OP_CRPLUS:
2459 case OP_CRMINPLUS:
2460 count = current_state->count; /* Already matched */
2461 if (count > 0) { ADD_ACTIVE(next_state_offset + 1, 0); }
2462 if (isinclass) { count++; ADD_NEW(state_offset, count); }
2463 break;
2464
2465 case OP_CRQUERY:
2466 case OP_CRMINQUERY:
2467 ADD_ACTIVE(next_state_offset + 1, 0);
2468 if (isinclass) { ADD_NEW(next_state_offset + 1, 0); }
2469 break;
2470
2471 case OP_CRRANGE:
2472 case OP_CRMINRANGE:
2473 count = current_state->count; /* Already matched */
2474 if (count >= GET2(ecode, 1))
2475 { ADD_ACTIVE(next_state_offset + 5, 0); }
2476 if (isinclass)
2477 {
2478 int max = GET2(ecode, 3);
2479 if (++count >= max && max != 0) /* Max 0 => no limit */
2480 { ADD_NEW(next_state_offset + 5, 0); }
2481 else
2482 { ADD_NEW(state_offset, count); }
2483 }
2484 break;
2485
2486 default:
2487 if (isinclass) { ADD_NEW(next_state_offset, 0); }
2488 break;
2489 }
2490 }
2491 break;
2492
2493 /* ========================================================================== */
2494 /* These are the opcodes for fancy brackets of various kinds. We have
2495 to use recursion in order to handle them. The "always failing" assertion
2496 (?!) is optimised to OP_FAIL when compiling, so we have to support that,
2497 though the other "backtracking verbs" are not supported. */
2498
2499 case OP_FAIL:
2500 forced_fail++; /* Count FAILs for multiple states */
2501 break;
2502
2503 case OP_ASSERT:
2504 case OP_ASSERT_NOT:
2505 case OP_ASSERTBACK:
2506 case OP_ASSERTBACK_NOT:
2507 {
2508 int rc;
2509 int local_offsets[2];
2510 int local_workspace[1000];
2511 const uschar *endasscode = code + GET(code, 1);
2512
2513 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2514
2515 rc = internal_dfa_exec(
2516 md, /* static match data */
2517 code, /* this subexpression's code */
2518 ptr, /* where we currently are */
2519 (int)(ptr - start_subject), /* start offset */
2520 local_offsets, /* offset vector */
2521 sizeof(local_offsets)/sizeof(int), /* size of same */
2522 local_workspace, /* workspace vector */
2523 sizeof(local_workspace)/sizeof(int), /* size of same */
2524 rlevel); /* function recursion level */
2525
2526 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2527 if ((rc >= 0) == (codevalue == OP_ASSERT || codevalue == OP_ASSERTBACK))
2528 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2529 }
2530 break;
2531
2532 /*-----------------------------------------------------------------*/
2533 case OP_COND:
2534 case OP_SCOND:
2535 {
2536 int local_offsets[1000];
2537 int local_workspace[1000];
2538 int codelink = GET(code, 1);
2539 int condcode;
2540
2541 /* Because of the way auto-callout works during compile, a callout item
2542 is inserted between OP_COND and an assertion condition. This does not
2543 happen for the other conditions. */
2544
2545 if (code[LINK_SIZE+1] == OP_CALLOUT)
2546 {
2547 rrc = 0;
2548 if (pcre_callout != NULL)
2549 {
2550 pcre_callout_block cb;
2551 cb.version = 1; /* Version 1 of the callout block */
2552 cb.callout_number = code[LINK_SIZE+2];
2553 cb.offset_vector = offsets;
2554 cb.subject = (PCRE_SPTR)start_subject;
2555 cb.subject_length = (int)(end_subject - start_subject);
2556 cb.start_match = (int)(current_subject - start_subject);
2557 cb.current_position = (int)(ptr - start_subject);
2558 cb.pattern_position = GET(code, LINK_SIZE + 3);
2559 cb.next_item_length = GET(code, 3 + 2*LINK_SIZE);
2560 cb.capture_top = 1;
2561 cb.capture_last = -1;
2562 cb.callout_data = md->callout_data;
2563 if ((rrc = (*pcre_callout)(&cb)) < 0) return rrc; /* Abandon */
2564 }
2565 if (rrc > 0) break; /* Fail this thread */
2566 code += _pcre_OP_lengths[OP_CALLOUT]; /* Skip callout data */
2567 }
2568
2569 condcode = code[LINK_SIZE+1];
2570
2571 /* Back reference conditions are not supported */
2572
2573 if (condcode == OP_CREF || condcode == OP_NCREF)
2574 return PCRE_ERROR_DFA_UCOND;
2575
2576 /* The DEFINE condition is always false */
2577
2578 if (condcode == OP_DEF)
2579 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2580
2581 /* The only supported version of OP_RREF is for the value RREF_ANY,
2582 which means "test if in any recursion". We can't test for specifically
2583 recursed groups. */
2584
2585 else if (condcode == OP_RREF || condcode == OP_NRREF)
2586 {
2587 int value = GET2(code, LINK_SIZE+2);
2588 if (value != RREF_ANY) return PCRE_ERROR_DFA_UCOND;
2589 if (md->recursive != NULL)
2590 { ADD_ACTIVE(state_offset + LINK_SIZE + 4, 0); }
2591 else { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2592 }
2593
2594 /* Otherwise, the condition is an assertion */
2595
2596 else
2597 {
2598 int rc;
2599 const uschar *asscode = code + LINK_SIZE + 1;
2600 const uschar *endasscode = asscode + GET(asscode, 1);
2601
2602 while (*endasscode == OP_ALT) endasscode += GET(endasscode, 1);
2603
2604 rc = internal_dfa_exec(
2605 md, /* fixed match data */
2606 asscode, /* this subexpression's code */
2607 ptr, /* where we currently are */
2608 (int)(ptr - start_subject), /* start offset */
2609 local_offsets, /* offset vector */
2610 sizeof(local_offsets)/sizeof(int), /* size of same */
2611 local_workspace, /* workspace vector */
2612 sizeof(local_workspace)/sizeof(int), /* size of same */
2613 rlevel); /* function recursion level */
2614
2615 if (rc == PCRE_ERROR_DFA_UITEM) return rc;
2616 if ((rc >= 0) ==
2617 (condcode == OP_ASSERT || condcode == OP_ASSERTBACK))
2618 { ADD_ACTIVE((int)(endasscode + LINK_SIZE + 1 - start_code), 0); }
2619 else
2620 { ADD_ACTIVE(state_offset + codelink + LINK_SIZE + 1, 0); }
2621 }
2622 }
2623 break;
2624
2625 /*-----------------------------------------------------------------*/
2626 case OP_RECURSE:
2627 {
2628 dfa_recursion_info *ri;
2629 int local_offsets[1000];
2630 int local_workspace[1000];
2631 const uschar *callpat = start_code + GET(code, 1);
2632 int recno = (callpat == md->start_code)? 0 :
2633 GET2(callpat, 1 + LINK_SIZE);
2634 int rc;
2635
2636 DPRINTF(("%.*sStarting regex recursion\n", rlevel*2-2, SP));
2637
2638 /* Check for repeating a recursion without advancing the subject
2639 pointer. This should catch convoluted mutual recursions. (Some simple
2640 cases are caught at compile time.) */
2641
2642 for (ri = md->recursive; ri != NULL; ri = ri->prevrec)
2643 if (recno == ri->group_num && ptr == ri->subject_position)
2644 return PCRE_ERROR_RECURSELOOP;
2645
2646 /* Remember this recursion and where we started it so as to
2647 catch infinite loops. */
2648
2649 new_recursive.group_num = recno;
2650 new_recursive.subject_position = ptr;
2651 new_recursive.prevrec = md->recursive;
2652 md->recursive = &new_recursive;
2653
2654 rc = internal_dfa_exec(
2655 md, /* fixed match data */
2656 callpat, /* this subexpression's code */
2657 ptr, /* where we currently are */
2658 (int)(ptr - start_subject), /* start offset */
2659 local_offsets, /* offset vector */
2660 sizeof(local_offsets)/sizeof(int), /* size of same */
2661 local_workspace, /* workspace vector */
2662 sizeof(local_workspace)/sizeof(int), /* size of same */
2663 rlevel); /* function recursion level */
2664
2665 md->recursive = new_recursive.prevrec; /* Done this recursion */
2666
2667 DPRINTF(("%.*sReturn from regex recursion: rc=%d\n", rlevel*2-2, SP,
2668 rc));
2669
2670 /* Ran out of internal offsets */
2671
2672 if (rc == 0) return PCRE_ERROR_DFA_RECURSE;
2673
2674 /* For each successfully matched substring, set up the next state with a
2675 count of characters to skip before trying it. Note that the count is in
2676 characters, not bytes. */
2677
2678 if (rc > 0)
2679 {
2680 for (rc = rc*2 - 2; rc >= 0; rc -= 2)
2681 {
2682 const uschar *p = start_subject + local_offsets[rc];
2683 const uschar *pp = start_subject + local_offsets[rc+1];
2684 int charcount = local_offsets[rc+1] - local_offsets[rc];
2685 while (p < pp) if ((*p++ & 0xc0) == 0x80) charcount--;
2686 if (charcount > 0)
2687 {
2688 ADD_NEW_DATA(-(state_offset + LINK_SIZE + 1), 0, (charcount - 1));
2689 }
2690 else
2691 {
2692 ADD_ACTIVE(state_offset + LINK_SIZE + 1, 0);
2693 }
2694 }
2695 }
2696 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2697 }
2698 break;
2699
2700 /*-----------------------------------------------------------------*/
2701 case OP_BRAPOS:
2702 case OP_SBRAPOS:
2703 case OP_CBRAPOS:
2704 case OP_SCBRAPOS:
2705 case OP_BRAPOSZERO:
2706 {
2707 int charcount, matched_count;
2708 const uschar *local_ptr = ptr;
2709 BOOL allow_zero;
2710
2711 if (codevalue == OP_BRAPOSZERO)
2712 {
2713 allow_zero = TRUE;
2714 codevalue = *(++code); /* Codevalue will be one of above BRAs */
2715 }
2716 else allow_zero = FALSE;
2717
2718 /* Loop to match the subpattern as many times as possible as if it were
2719 a complete pattern. */
2720
2721 for (matched_count = 0;; matched_count++)
2722 {
2723 int local_offsets[2];
2724 int local_workspace[1000];
2725
2726 int rc = internal_dfa_exec(
2727 md, /* fixed match data */
2728 code, /* this subexpression's code */
2729 local_ptr, /* where we currently are */
2730 (int)(ptr - start_subject), /* start offset */
2731 local_offsets, /* offset vector */
2732 sizeof(local_offsets)/sizeof(int), /* size of same */
2733 local_workspace, /* workspace vector */
2734 sizeof(local_workspace)/sizeof(int), /* size of same */
2735 rlevel); /* function recursion level */
2736
2737 /* Failed to match */
2738
2739 if (rc < 0)
2740 {
2741 if (rc != PCRE_ERROR_NOMATCH) return rc;
2742 break;
2743 }
2744
2745 /* Matched: break the loop if zero characters matched. */
2746
2747 charcount = local_offsets[1] - local_offsets[0];
2748 if (charcount == 0) break;
2749 local_ptr += charcount; /* Advance temporary position ptr */
2750 }
2751
2752 /* At this point we have matched the subpattern matched_count
2753 times, and local_ptr is pointing to the character after the end of the
2754 last match. */
2755
2756 if (matched_count > 0 || allow_zero)
2757 {
2758 const uschar *end_subpattern = code;
2759 int next_state_offset;
2760
2761 do { end_subpattern += GET(end_subpattern, 1); }
2762 while (*end_subpattern == OP_ALT);
2763 next_state_offset =
2764 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2765
2766 /* Optimization: if there are no more active states, and there
2767 are no new states yet set up, then skip over the subject string
2768 right here, to save looping. Otherwise, set up the new state to swing
2769 into action when the end of the matched substring is reached. */
2770
2771 if (i + 1 >= active_count && new_count == 0)
2772 {
2773 ptr = local_ptr;
2774 clen = 0;
2775 ADD_NEW(next_state_offset, 0);
2776 }
2777 else
2778 {
2779 const uschar *p = ptr;
2780 const uschar *pp = local_ptr;
2781 charcount = (int)(pp - p);
2782 while (p < pp) if ((*p++ & 0xc0) == 0x80) charcount--;
2783 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2784 }
2785 }
2786 }
2787 break;
2788
2789 /*-----------------------------------------------------------------*/
2790 case OP_ONCE:
2791 {
2792 int local_offsets[2];
2793 int local_workspace[1000];
2794
2795 int rc = internal_dfa_exec(
2796 md, /* fixed match data */
2797 code, /* this subexpression's code */
2798 ptr, /* where we currently are */
2799 (int)(ptr - start_subject), /* start offset */
2800 local_offsets, /* offset vector */
2801 sizeof(local_offsets)/sizeof(int), /* size of same */
2802 local_workspace, /* workspace vector */
2803 sizeof(local_workspace)/sizeof(int), /* size of same */
2804 rlevel); /* function recursion level */
2805
2806 if (rc >= 0)
2807 {
2808 const uschar *end_subpattern = code;
2809 int charcount = local_offsets[1] - local_offsets[0];
2810 int next_state_offset, repeat_state_offset;
2811
2812 do { end_subpattern += GET(end_subpattern, 1); }
2813 while (*end_subpattern == OP_ALT);
2814 next_state_offset =
2815 (int)(end_subpattern - start_code + LINK_SIZE + 1);
2816
2817 /* If the end of this subpattern is KETRMAX or KETRMIN, we must
2818 arrange for the repeat state also to be added to the relevant list.
2819 Calculate the offset, or set -1 for no repeat. */
2820
2821 repeat_state_offset = (*end_subpattern == OP_KETRMAX ||
2822 *end_subpattern == OP_KETRMIN)?
2823 (int)(end_subpattern - start_code - GET(end_subpattern, 1)) : -1;
2824
2825 /* If we have matched an empty string, add the next state at the
2826 current character pointer. This is important so that the duplicate
2827 checking kicks in, which is what breaks infinite loops that match an
2828 empty string. */
2829
2830 if (charcount == 0)
2831 {
2832 ADD_ACTIVE(next_state_offset, 0);
2833 }
2834
2835 /* Optimization: if there are no more active states, and there
2836 are no new states yet set up, then skip over the subject string
2837 right here, to save looping. Otherwise, set up the new state to swing
2838 into action when the end of the matched substring is reached. */
2839
2840 else if (i + 1 >= active_count && new_count == 0)
2841 {
2842 ptr += charcount;
2843 clen = 0;
2844 ADD_NEW(next_state_offset, 0);
2845
2846 /* If we are adding a repeat state at the new character position,
2847 we must fudge things so that it is the only current state.
2848 Otherwise, it might be a duplicate of one we processed before, and
2849 that would cause it to be skipped. */
2850
2851 if (repeat_state_offset >= 0)
2852 {
2853 next_active_state = active_states;
2854 active_count = 0;
2855 i = -1;
2856 ADD_ACTIVE(repeat_state_offset, 0);
2857 }
2858 }
2859 else
2860 {
2861 const uschar *p = start_subject + local_offsets[0];
2862 const uschar *pp = start_subject + local_offsets[1];
2863 while (p < pp) if ((*p++ & 0xc0) == 0x80) charcount--;
2864 ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
2865 if (repeat_state_offset >= 0)
2866 { ADD_NEW_DATA(-repeat_state_offset, 0, (charcount - 1)); }
2867 }
2868 }
2869 else if (rc != PCRE_ERROR_NOMATCH) return rc;
2870 }
2871 break;
2872
2873
2874 /* ========================================================================== */
2875 /* Handle callouts */
2876
2877 case OP_CALLOUT:
2878 rrc = 0;
2879 if (pcre_callout != NULL)
2880 {
2881 pcre_callout_block cb;
2882 cb.version = 1; /* Version 1 of the callout block */
2883 cb.callout_number = code[1];
2884 cb.offset_vector = offsets;
2885 cb.subject = (PCRE_SPTR)start_subject;
2886 cb.subject_length = (int)(end_subject - start_subject);
2887 cb.start_match = (int)(current_subject - start_subject);
2888 cb.current_position = (int)(ptr - start_subject);
2889 cb.pattern_position = GET(code, 2);
2890 cb.next_item_length = GET(code, 2 + LINK_SIZE);
2891 cb.capture_top = 1;
2892 cb.capture_last = -1;
2893 cb.callout_data = md->callout_data;
2894 if ((rrc = (*pcre_callout)(&cb)) < 0) return rrc; /* Abandon */
2895 }
2896 if (rrc == 0)
2897 { ADD_ACTIVE(state_offset + _pcre_OP_lengths[OP_CALLOUT], 0); }
2898 break;
2899
2900
2901 /* ========================================================================== */
2902 default: /* Unsupported opcode */
2903 return PCRE_ERROR_DFA_UITEM;
2904 }
2905
2906 NEXT_ACTIVE_STATE: continue;
2907
2908 } /* End of loop scanning active states */
2909
2910 /* We have finished the processing at the current subject character. If no
2911 new states have been set for the next character, we have found all the
2912 matches that we are going to find. If we are at the top level and partial
2913 matching has been requested, check for appropriate conditions.
2914
2915 The "forced_fail" variable counts the number of (*F) encountered for the
2916 character. If it is equal to the original active_count (saved in
2917 workspace[1]) it means that (*F) was found on every active state. In this
2918 case we don't want to give a partial match.
2919
2920 The "could_continue" variable is true if a state could have continued but
2921 for the fact that the end of the subject was reached. */
2922
2923 if (new_count <= 0)
2924 {
2925 if (rlevel == 1 && /* Top level, and */
2926 could_continue && /* Some could go on */
2927 forced_fail != workspace[1] && /* Not all forced fail & */
2928 ( /* either... */
2929 (md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
2930 || /* or... */
2931 ((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
2932 match_count < 0) /* no matches */
2933 ) && /* And... */
2934 ptr >= end_subject && /* Reached end of subject */
2935 ptr > md->start_used_ptr) /* Inspected non-empty string */
2936 {
2937 if (offsetcount >= 2)
2938 {
2939 offsets[0] = (int)(md->start_used_ptr - start_subject);
2940 offsets[1] = (int)(end_subject - start_subject);
2941 }
2942 match_count = PCRE_ERROR_PARTIAL;
2943 }
2944
2945 DPRINTF(("%.*sEnd of internal_dfa_exec %d: returning %d\n"
2946 "%.*s---------------------\n\n", rlevel*2-2, SP, rlevel, match_count,
2947 rlevel*2-2, SP));
2948 break; /* In effect, "return", but see the comment below */
2949 }
2950
2951 /* One or more states are active for the next character. */
2952
2953 ptr += clen; /* Advance to next subject character */
2954 } /* Loop to move along the subject string */
2955
2956 /* Control gets here from "break" a few lines above. We do it this way because
2957 if we use "return" above, we have compiler trouble. Some compilers warn if
2958 there's nothing here because they think the function doesn't return a value. On
2959 the other hand, if we put a dummy statement here, some more clever compilers
2960 complain that it can't be reached. Sigh. */
2961
2962 return match_count;
2963 }
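
/* Several of the cases above (OP_RECURSE, OP_BRAPOS, OP_ONCE) turn a matched
byte length into a character count by subtracting one for every UTF-8
continuation byte. The following is a minimal standalone sketch of that step.
The name utf8_char_count is hypothetical and is not a PCRE function; the input
is assumed to be valid UTF-8. */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: return the number of UTF-8
characters in the byte range [p, pp). Start from the byte count and subtract
one for each continuation byte (bit pattern 10xxxxxx). */

static int utf8_char_count(const unsigned char *p, const unsigned char *pp)
{
int charcount;
assert(p <= pp);                  /* The range must not be reversed */
charcount = (int)(pp - p);        /* Start with the number of bytes */
while (p < pp) if ((*p++ & 0xc0) == 0x80) charcount--;
return charcount;
}
```

/* For the six-byte string "a\xc3\xa9\xe2\x82\xac" (one 1-byte, one 2-byte and
one 3-byte character) the helper returns 3. In a non-UTF-8 run there are no
multi-byte characters to account for, so the byte count is already the
character count. */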
2964
2965
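
/* The OP_CLASS/OP_NCLASS case above tests a character against a 32-byte bit
table: bit (c & 7) of byte (c / 8). The following sketch shows that lookup in
isolation. The names class_map, class_map_clear, class_map_add and
class_map_test are hypothetical and are not PCRE functions. It is assumed
here that the table of a negated class has already been inverted when the
table was built, so the "negated" flag matters only for characters above 255,
as in the code above. */

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 256-bit map, one bit per character value 0..255 */

typedef struct { unsigned char bits[32]; } class_map;

static void class_map_clear(class_map *m)
{
memset(m->bits, 0, sizeof(m->bits));
}

static void class_map_add(class_map *m, unsigned int c)
{
assert(c < 256);                          /* Only one-byte values fit */
m->bits[c/8] |= (unsigned char)(1u << (c&7));
}

/* Characters above 255 cannot be in the table, so they match only when the
class is a negated one. Otherwise the result comes from the bit table. */

static int class_map_test(const class_map *m, unsigned int c, int negated)
{
if (c > 255) return negated != 0;
return (m->bits[c/8] & (1u << (c&7))) != 0;
}
```

/* After clearing a map and adding 'a', class_map_test() is true for 'a' and
false for 'b'; for the value 0x100 it is true only if the negated flag is
set. */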
2966
2967
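
/* The OP_RECURSE case above detects runaway recursion by walking a chain of
records, one per active recursion, and looking for one with the same group
number at the same subject position. The following is a minimal sketch of
that check. The type rec_info and the function is_recursion_loop are
hypothetical stand-ins for illustration; they are not the definitions used by
PCRE. */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one active recursion; each record points to the
record of the recursion that encloses it. */

typedef struct rec_info {
  struct rec_info *prevrec;                 /* Enclosing recursion, or NULL */
  int group_num;                            /* Group being recursed into */
  const unsigned char *subject_position;    /* Where the recursion started */
} rec_info;

/* Return 1 if recursing into group_num at pos would repeat a recursion that
is already active at the same position (no progress has been made), else 0. */

static int is_recursion_loop(const rec_info *chain, int group_num,
  const unsigned char *pos)
{
const rec_info *ri;
assert(pos != NULL);
for (ri = chain; ri != NULL; ri = ri->prevrec)
  if (ri->group_num == group_num && ri->subject_position == pos) return 1;
return 0;
}
```

/* Because each record can live on the stack of the function call that
created it, the chain needs no heap allocation: a new record is linked in
before the recursive call and unlinked after it returns, which is the pattern
followed with new_recursive above. */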
2968 /*************************************************
2969 * Execute a Regular Expression - DFA engine *
2970 *************************************************/
2971
2972 /* This external function applies a compiled re to a subject string using a DFA
2973 engine. This function calls the internal function multiple times if the pattern
2974 is not anchored.
2975
2976 Arguments:
2977 argument_re points to the compiled expression
2978 extra_data points to extra data or is NULL
2979 subject points to the subject string
2980 length length of subject string (may contain binary zeros)
2981 start_offset where to start in the subject string
2982 options option bits
2983 offsets vector of match offsets
2984 offsetcount size of same
2985 workspace workspace vector
2986 wscount size of same
2987
2988 Returns: > 0 => number of match offset pairs placed in offsets
2989 = 0 => offsets overflowed; longest matches are present
2990 -1 => failed to match
2991 < -1 => some kind of unexpected problem
2992 */
2993
2994 PCRE_EXP_DEFN int PCRE_CALL_CONVENTION
2995 pcre_dfa_exec(const pcre *argument_re, const pcre_extra *extra_data,
2996 const char *subject, int length, int start_offset, int options, int *offsets,
2997 int offsetcount, int *workspace, int wscount)
2998 {
2999 real_pcre *re = (real_pcre *)argument_re;
3000 dfa_match_data match_block;
3001 dfa_match_data *md = &match_block;
3002 BOOL utf8, anchored, startline, firstline;
3003 const uschar *current_subject, *end_subject, *lcc;
3004
3005 pcre_study_data internal_study;
3006 const pcre_study_data *study = NULL;
3007 real_pcre internal_re;
3008
3009 const uschar *req_byte_ptr;
3010 const uschar *start_bits = NULL;
3011 BOOL first_byte_caseless = FALSE;
3012 BOOL req_byte_caseless = FALSE;
3013 int first_byte = -1;
3014 int req_byte = -1;
3015 int req_byte2 = -1;
3016 int newline;
3017
3018 /* Plausibility checks */
3019
3020 if ((options & ~PUBLIC_DFA_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
3021 if (re == NULL || subject == NULL || workspace == NULL ||
3022 (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
3023 if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
3024 if (wscount < 20) return PCRE_ERROR_DFA_WSSIZE;
3025 if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
3026
3027 /* We need to find the pointer to any study data before we test for byte
3028 flipping, so we scan the extra_data block first. This may set two fields in the
3029 match block, so we must initialize them beforehand. However, the other fields
3030 in the match block must not be set until after the byte flipping. */
3031
3032 md->tables = re->tables;
3033 md->callout_data = NULL;
3034
3035 if (extra_data != NULL)
3036 {
3037 unsigned int flags = extra_data->flags;
3038 if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
3039 study = (const pcre_study_data *)extra_data->study_data;
3040 if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0) return PCRE_ERROR_DFA_UMLIMIT;
3041 if ((flags & PCRE_EXTRA_MATCH_LIMIT_RECURSION) != 0)
3042 return PCRE_ERROR_DFA_UMLIMIT;
3043 if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
3044 md->callout_data = extra_data->callout_data;
3045 if ((flags & PCRE_EXTRA_TABLES) != 0)
3046 md->tables = extra_data->tables;
3047 }
3048
3049 /* Check that the first field in the block is the magic number. If it is not,
3050 test for a regex that was compiled on a host of opposite endianness. If this is
3051 the case, flipped values are put in internal_re and internal_study if there was
3052 study data too. */
3053
3054 if (re->magic_number != MAGIC_NUMBER)
3055 {
3056 re = _pcre_try_flipped(re, &internal_re, study, &internal_study);
3057 if (re == NULL) return PCRE_ERROR_BADMAGIC;
3058 if (study != NULL) study = &internal_study;
3059 }
3060
3061 /* Set some local values */
3062
3063 current_subject = (const unsigned char *)subject + start_offset;
3064 end_subject = (const unsigned char *)subject + length;
3065 req_byte_ptr = current_subject - 1;
3066
3067 #ifdef SUPPORT_UTF8
3068 utf8 = (re->options & PCRE_UTF8) != 0;
3069 #else
3070 utf8 = FALSE;
3071 #endif
3072
3073 anchored = (options & (PCRE_ANCHORED|PCRE_DFA_RESTART)) != 0 ||
3074 (re->options & PCRE_ANCHORED) != 0;
3075
3076 /* The remaining fixed data for passing around. */
3077
3078 md->start_code = (const uschar *)argument_re +
3079 re->name_table_offset + re->name_count * re->name_entry_size;
3080 md->start_subject = (const unsigned char *)subject;
3081 md->end_subject = end_subject;
3082 md->start_offset = start_offset;
3083 md->moptions = options;
3084 md->poptions = re->options;
3085
3086 /* If the BSR option is not set at match time, copy what was set
3087 at compile time. */
3088
3089 if ((md->moptions & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) == 0)
3090 {
3091 if ((re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)) != 0)
3092 md->moptions |= re->options & (PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE);
3093 #ifdef BSR_ANYCRLF
3094 else md->moptions |= PCRE_BSR_ANYCRLF;
3095 #endif
3096 }
3097
3098 /* Handle different types of newline. The three bits give eight cases. If
3099 nothing is set at run time, whatever was used at compile time applies. */
3100
3101 switch ((((options & PCRE_NEWLINE_BITS) == 0)? re->options : (pcre_uint32)options) &
3102 PCRE_NEWLINE_BITS)
3103 {
3104 case 0: newline = NEWLINE; break; /* Compile-time default */
3105 case PCRE_NEWLINE_CR: newline = CHAR_CR; break;
3106 case PCRE_NEWLINE_LF: newline = CHAR_NL; break;
3107 case PCRE_NEWLINE_CR+
3108 PCRE_NEWLINE_LF: newline = (CHAR_CR << 8) | CHAR_NL; break;
3109 case PCRE_NEWLINE_ANY: newline = -1; break;
3110 case PCRE_NEWLINE_ANYCRLF: newline = -2; break;
3111 default: return PCRE_ERROR_BADNEWLINE;
3112 }
3113
3114 if (newline == -2)
3115 {
3116 md->nltype = NLTYPE_ANYCRLF;
3117 }
3118 else if (newline < 0)
3119 {
3120 md->nltype = NLTYPE_ANY;
3121 }
3122 else
3123 {
3124 md->nltype = NLTYPE_FIXED;
3125 if (newline > 255)
3126 {
3127 md->nllen = 2;
3128 md->nl[0] = (newline >> 8) & 255;
3129 md->nl[1] = newline & 255;
3130 }
3131 else
3132 {
3133 md->nllen = 1;
3134 md->nl[0] = newline;
3135 }
3136 }
3137
3138 /* Check a UTF-8 string if required. If the check fails and the offsets vector
3139 has room for at least two entries, the error offset and reason code are passed back in it. */
3140
3141 #ifdef SUPPORT_UTF8
3142 if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
3143 {
3144 int erroroffset;
3145 int errorcode = _pcre_valid_utf8((uschar *)subject, length, &erroroffset);
3146 if (errorcode != 0)
3147 {
3148 if (offsetcount >= 2)
3149 {
3150 offsets[0] = erroroffset;
3151 offsets[1] = errorcode;
3152 }
3153 return (errorcode <= PCRE_UTF8_ERR5 && (options & PCRE_PARTIAL_HARD) != 0)?
3154 PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
3155 }
3156 if (start_offset > 0 && start_offset < length &&
3157 (((USPTR)subject)[start_offset] & 0xc0) == 0x80)
3158 return PCRE_ERROR_BADUTF8_OFFSET;
3159 }
3160 #endif
3161
3162 /* If the exec call supplied NULL for tables, use the inbuilt ones. This
3163 is a feature that makes it possible to save compiled regexes and re-use them
3164 in other programs later. */
3165
3166 if (md->tables == NULL) md->tables = _pcre_default_tables;
3167
3168 /* The lower casing table and the "must be at the start of a line" flag are
3169 used in a loop when finding where to start. */
3170
3171 lcc = md->tables + lcc_offset;
3172 startline = (re->flags & PCRE_STARTLINE) != 0;
3173 firstline = (re->options & PCRE_FIRSTLINE) != 0;
3174
3175 /* Set up the first character to match, if available. The first_byte value is
3176 never set for an anchored regular expression, but the anchoring may be forced
3177 at run time, so we have to test for anchoring. The first char may be unset for
3178 an unanchored pattern, of course. If there's no first char and the pattern was
3179 studied, there may be a bitmap of possible first characters. */
3180
3181 if (!anchored)
3182 {
3183 if ((re->flags & PCRE_FIRSTSET) != 0)
3184 {
3185 first_byte = re->first_byte & 255;
3186 if ((first_byte_caseless = ((re->first_byte & REQ_CASELESS) != 0)) == TRUE)
3187 first_byte = lcc[first_byte];
3188 }
3189 else
3190 {
3191 if (!startline && study != NULL &&
3192 (study->flags & PCRE_STUDY_MAPPED) != 0)
3193 start_bits = study->start_bits;
3194 }
3195 }
3196
3197 /* For anchored or unanchored matches, there may be a "last known required
3198 character" set. */
3199
3200 if ((re->flags & PCRE_REQCHSET) != 0)
3201 {
3202 req_byte = re->req_byte & 255;
3203 req_byte_caseless = (re->req_byte & REQ_CASELESS) != 0;
3204 req_byte2 = (md->tables + fcc_offset)[req_byte]; /* case flipped */
3205 }
3206
3207 /* Call the main matching function, looping for a non-anchored regex after a
3208 failed match. If not restarting, perform certain optimizations at the start of
3209 a match. */
3210
3211 for (;;)
3212 {
3213 int rc;
3214
3215 if ((options & PCRE_DFA_RESTART) == 0)
3216 {
3217 const uschar *save_end_subject = end_subject;
3218
3219 /* If firstline is TRUE, the start of the match is constrained to the first
3220 line of a multiline string. Implement this by temporarily adjusting
3221 end_subject so that we stop scanning at a newline. If the match fails at
3222 the newline, later code breaks this loop. */
3223
3224 if (firstline)
3225 {
3226 USPTR t = current_subject;
3227 #ifdef SUPPORT_UTF8
3228 if (utf8)
3229 {
3230 while (t < md->end_subject && !IS_NEWLINE(t))
3231 {
3232 t++;
3233 while (t < end_subject && (*t & 0xc0) == 0x80) t++;
3234 }
3235 }
3236 else
3237 #endif
3238 while (t < md->end_subject && !IS_NEWLINE(t)) t++;
3239 end_subject = t;
3240 }
3241
3242 /* There are some optimizations that avoid running the match if a known
3243 starting point is not found. However, there is an option that disables
3244 these, for testing and for ensuring that all callouts do actually occur.
3245 The option can be set in the regex by (*NO_START_OPT) or passed in
3246 match-time options. */
3247
3248 if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
3249 {
3250 /* Advance to a known first byte. */
3251
3252 if (first_byte >= 0)
3253 {
3254 if (first_byte_caseless)
3255 while (current_subject < end_subject &&
3256 lcc[*current_subject] != first_byte)
3257 current_subject++;
3258 else
3259 while (current_subject < end_subject &&
3260 *current_subject != first_byte)
3261 current_subject++;
3262 }
3263
3264 /* Or to just after a linebreak for a multiline match if possible */
3265
3266 else if (startline)
3267 {
3268 if (current_subject > md->start_subject + start_offset)
3269 {
3270 #ifdef SUPPORT_UTF8
3271 if (utf8)
3272 {
3273 while (current_subject < end_subject &&
3274 !WAS_NEWLINE(current_subject))
3275 {
3276 current_subject++;
3277 while(current_subject < end_subject &&
3278 (*current_subject & 0xc0) == 0x80)
3279 current_subject++;
3280 }
3281 }
3282 else
3283 #endif
3284 while (current_subject < end_subject && !WAS_NEWLINE(current_subject))
3285 current_subject++;
3286
3287 /* If we have just passed a CR and the newline option is ANY or
3288 ANYCRLF, and we are now at a LF, advance the match position by one
3289 more character. */
3290
3291 if (current_subject[-1] == CHAR_CR &&
3292 (md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
3293 current_subject < end_subject &&
3294 *current_subject == CHAR_NL)
3295 current_subject++;
3296 }
3297 }
3298
3299 /* Or to a non-unique first char after study */
3300
3301 else if (start_bits != NULL)
3302 {
3303 while (current_subject < end_subject)
3304 {
3305 register unsigned int c = *current_subject;
3306 if ((start_bits[c/8] & (1 << (c&7))) == 0)
3307 {
3308 current_subject++;
3309 #ifdef SUPPORT_UTF8
3310 if (utf8)
3311 while(current_subject < end_subject &&
3312 (*current_subject & 0xc0) == 0x80) current_subject++;
3313 #endif
3314 }
3315 else break;
3316 }
3317 }
3318 }
3319
3320 /* Restore fudged end_subject */
3321
3322 end_subject = save_end_subject;
3323
3324 /* The following two optimizations are disabled for partial matching or if
3325 disabling is explicitly requested (and of course, by the test above, this
3326 code is not obeyed when restarting after a partial match). */
3327
3328 if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0 &&
3329 (options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
3330 {
3331 /* If the pattern was studied, a minimum subject length may be set. This
3332 is a lower bound; there may be no string of exactly that length that matches the
3333 pattern. Although the value is, strictly, in characters, we treat it as
3334 bytes to avoid spending too much time in this optimization. */
3335
3336 if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
3337 (pcre_uint32)(end_subject - current_subject) < study->minlength)
3338 return PCRE_ERROR_NOMATCH;
3339
3340 /* If req_byte is set, we know that that character must appear in the
3341 subject for the match to succeed. If the first character is set, req_byte
3342 must be later in the subject; otherwise the test starts at the match
3343 point. This optimization can save a huge amount of work in patterns with
3344 nested unlimited repeats that aren't going to match. Writing separate
3345 code for cased/caseless versions makes it go faster, as does using an
3346 autoincrement and backing off on a match.
3347
3348 HOWEVER: when the subject string is very, very long, searching to its end
3349 can take a long time, and give bad performance on quite ordinary
3350 patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
3351 string... so we don't do this when the string is sufficiently long. */
3352
3353 if (req_byte >= 0 && end_subject - current_subject < REQ_BYTE_MAX)
3354 {
3355 register const uschar *p = current_subject + ((first_byte >= 0)? 1 : 0);
3356
3357 /* We don't need to repeat the search if we haven't yet reached the
3358 place we found it at last time. */
3359
3360 if (p > req_byte_ptr)
3361 {
3362 if (req_byte_caseless)
3363 {
3364 while (p < end_subject)
3365 {
3366 register int pp = *p++;
3367 if (pp == req_byte || pp == req_byte2) { p--; break; }
3368 }
3369 }
3370 else
3371 {
3372 while (p < end_subject)
3373 {
3374 if (*p++ == req_byte) { p--; break; }
3375 }
3376 }
3377
3378 /* If we can't find the required character, break the matching loop,
3379 which will cause a return of PCRE_ERROR_NOMATCH. */
3380
3381 if (p >= end_subject) break;
3382
3383 /* If we have found the required character, save the point where we
3384 found it, so that we don't search again next time round the loop if
3385 the start hasn't passed this character yet. */
3386
3387 req_byte_ptr = p;
3388 }
3389 }
3390 }
3391 } /* End of optimizations that are done when not restarting */
3392
3393 /* OK, now we can do the business */
3394
3395 md->start_used_ptr = current_subject;
3396 md->recursive = NULL;
3397
3398 rc = internal_dfa_exec(
3399 md, /* fixed match data */
3400 md->start_code, /* this subexpression's code */
3401 current_subject, /* where we currently are */
3402 start_offset, /* start offset in subject */
3403 offsets, /* offset vector */
3404 offsetcount, /* size of same */
3405 workspace, /* workspace vector */
3406 wscount, /* size of same */
3407 0); /* function recurse level */
3408
3409 /* Anything other than "no match" means we are done, always; otherwise, carry
3410 on only if not anchored. */
3411
3412 if (rc != PCRE_ERROR_NOMATCH || anchored) return rc;
3413
3414 /* Advance to the next subject character unless we are at the end of a line
3415 and firstline is set. */
3416
3417 if (firstline && IS_NEWLINE(current_subject)) break;
3418 current_subject++;
3419 if (utf8)
3420 {
3421 while (current_subject < end_subject && (*current_subject & 0xc0) == 0x80)
3422 current_subject++;
3423 }
3424 if (current_subject > end_subject) break;
3425
3426 /* If we have just passed a CR and we are now at a LF, and the pattern does
3427 not contain any explicit matches for \r or \n, and the newline option is CRLF
3428 or ANY or ANYCRLF, advance the match position by one more character. */
3429
3430 if (current_subject[-1] == CHAR_CR &&
3431 current_subject < end_subject &&
3432 *current_subject == CHAR_NL &&
3433 (re->flags & PCRE_HASCRORLF) == 0 &&
3434 (md->nltype == NLTYPE_ANY ||
3435 md->nltype == NLTYPE_ANYCRLF ||
3436 md->nllen == 2))
3437 current_subject++;
3438
3439 } /* "Bumpalong" loop */
3440
3441 return PCRE_ERROR_NOMATCH;
3442 }
3443
3444 /* End of pcre_dfa_exec.c */
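
The "non-unique first char after study" optimization above tests each subject byte against a 256-bit map and, in UTF-8 mode, steps over continuation bytes so the next candidate is always the start of a character. The following is a minimal standalone sketch of that scan, not code from PCRE itself: the names sketch_set_start_bit and sketch_scan_start_bits, the 32-byte array parameter, and the offset-based interface are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: mark byte value c as a possible first byte.
   The map holds 256 bits, one per byte value, packed into 32 bytes. */
void sketch_set_start_bit(unsigned char bits[32], unsigned int c)
{
  assert(bits != NULL && c < 256);
  bits[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Hypothetical helper: return the offset of the first position at or after
   start whose byte is marked in the map, or length if there is none. When
   utf8 is non-zero, bytes of the form 10xxxxxx (continuation bytes) that
   follow a rejected byte are skipped as well. */
size_t sketch_scan_start_bits(const unsigned char *subject, size_t length,
                              size_t start, const unsigned char bits[32],
                              int utf8)
{
  size_t i = start;
  assert(subject != NULL && bits != NULL && start <= length);
  while (i < length)
  {
    unsigned int c = subject[i];
    if ((bits[c / 8] & (1u << (c & 7))) != 0) break;   /* candidate found */
    i++;
    if (utf8)
      while (i < length && (subject[i] & 0xc0) == 0x80) i++;
  }
  return i;
}
```

A caller would zero a 32-byte array, mark the bytes that can start a match with sketch_set_start_bit, and then call sketch_scan_start_bits before each match attempt. A return value equal to length means no candidate start position remains.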
