Diff of /code/trunk/doc/pcre.txt

revision 123 by ph10, Mon Mar 12 15:19:06 2007 UTC revision 197 by ph10, Tue Jul 31 10:50:18 2007 UTC
# Line 72  USER DOCUMENTATION Line 72  USER DOCUMENTATION
72         of searching. The sections are as follows:         of searching. The sections are as follows:
73    
74           pcre              this document           pcre              this document
75             pcre-config       show PCRE installation configuration information
76           pcreapi           details of PCRE's native C API           pcreapi           details of PCRE's native C API
77           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
78           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
# Line 113  LIMITATIONS Line 114  LIMITATIONS
114         There is no limit to the number of parenthesized subpatterns, but there         There is no limit to the number of parenthesized subpatterns, but there
115         can be no more than 65535 capturing subpatterns.         can be no more than 65535 capturing subpatterns.
116    
117           If  a  non-capturing subpattern with an unlimited repetition quantifier
118           can match an empty string, there is a limit of 1000 on  the  number  of
119           times  it  can  be  repeated while not matching an empty string - if it
120           does match an empty string, the loop is immediately broken.
121    
122         The maximum length of name for a named subpattern is 32 characters, and         The maximum length of name for a named subpattern is 32 characters, and
123         the maximum number of named subpatterns is 10000.         the maximum number of named subpatterns is 10000.
124    
125         The maximum length of a subject string is the largest  positive  number         The  maximum  length of a subject string is the largest positive number
126         that  an integer variable can hold. However, when using the traditional         that an integer variable can hold. However, when using the  traditional
127         matching function, PCRE uses recursion to handle subpatterns and indef-         matching function, PCRE uses recursion to handle subpatterns and indef-
128         inite  repetition.  This means that the available stack space may limit         inite repetition.  This means that the available stack space may  limit
129         the size of a subject string that can be processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
130         For a discussion of stack issues, see the pcrestack documentation.         For a discussion of stack issues, see the pcrestack documentation.
131    
132    
133  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
134    
135         From  release  3.3,  PCRE  has  had  some support for character strings         From release 3.3, PCRE has  had  some  support  for  character  strings
136         encoded in the UTF-8 format. For release 4.0 this was greatly  extended         encoded  in the UTF-8 format. For release 4.0 this was greatly extended
137         to  cover  most common requirements, and in release 5.0 additional sup-         to cover most common requirements, and in release 5.0  additional  sup-
138         port for Unicode general category properties was added.         port for Unicode general category properties was added.
139    
140         In order to process UTF-8 strings, you must build PCRE to include UTF-8         In order to process UTF-8 strings, you must build PCRE to include UTF-8
141         support  in  the  code,  and, in addition, you must call pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
142         with the PCRE_UTF8 option flag. When you do this, both the pattern  and         with  the PCRE_UTF8 option flag. When you do this, both the pattern and
143         any  subject  strings  that are matched against it are treated as UTF-8         any subject strings that are matched against it are  treated  as  UTF-8
144         strings instead of just strings of bytes.         strings instead of just strings of bytes.
145    
146         If you compile PCRE with UTF-8 support, but do not use it at run  time,         If  you compile PCRE with UTF-8 support, but do not use it at run time,
147         the  library will be a bit bigger, but the additional run time overhead         the library will be a bit bigger, but the additional run time  overhead
148         is limited to testing the PCRE_UTF8 flag occasionally, so should not be         is limited to testing the PCRE_UTF8 flag occasionally, so should not be
149         very big.         very big.
150    
151         If PCRE is built with Unicode character property support (which implies         If PCRE is built with Unicode character property support (which implies
152         UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
153         ported.  The available properties that can be tested are limited to the         ported.  The available properties that can be tested are limited to the
154         general category properties such as Lu for an upper case letter  or  Nd         general  category  properties such as Lu for an upper case letter or Nd
155         for  a  decimal number, the Unicode script names such as Arabic or Han,         for a decimal number, the Unicode script names such as Arabic  or  Han,
156         and the derived properties Any and L&. A full  list  is  given  in  the         and  the  derived  properties  Any  and L&. A full list is given in the
157         pcrepattern documentation. Only the short names for properties are sup-         pcrepattern documentation. Only the short names for properties are sup-
158         ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let-         ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
159         ter},  is  not  supported.   Furthermore,  in Perl, many properties may         ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
160         optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE         optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
161         does not support this.         does not support this.
162    
163         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
164    
165         1.  When you set the PCRE_UTF8 flag, the strings passed as patterns and         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
166         subjects are checked for validity on entry to the  relevant  functions.         subjects  are  checked for validity on entry to the relevant functions.
167         If an invalid UTF-8 string is passed, an error return is given. In some         If an invalid UTF-8 string is passed, an error return is given. In some
168         situations, you may already know  that  your  strings  are  valid,  and         situations,  you  may  already  know  that  your strings are valid, and
169         therefore want to skip these checks in order to improve performance. If         therefore want to skip these checks in order to improve performance. If
170         you set the PCRE_NO_UTF8_CHECK flag at compile time  or  at  run  time,         you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
171         PCRE  assumes  that  the  pattern or subject it is given (respectively)         PCRE assumes that the pattern or subject  it  is  given  (respectively)
172         contains only valid UTF-8 codes. In this case, it does not diagnose  an         contains  only valid UTF-8 codes. In this case, it does not diagnose an
173         invalid  UTF-8 string. If you pass an invalid UTF-8 string to PCRE when         invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
174         PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program  may         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
175         crash.         crash.
176    
177         2.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
178         two-byte UTF-8 character if the value is greater than 127.         two-byte UTF-8 character if the value is greater than 127.
179    
180         3. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8         3.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
181         characters for values greater than \177.         characters for values greater than \177.
182    
183         4.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
184         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
185    
186         5. The dot metacharacter matches one UTF-8 character instead of a  sin-         5.  The dot metacharacter matches one UTF-8 character instead of a sin-
187         gle byte.         gle byte.
188    
189         6.  The  escape sequence \C can be used to match a single byte in UTF-8         6. The escape sequence \C can be used to match a single byte  in  UTF-8
190         mode, but its use can lead to some strange effects.  This  facility  is         mode,  but  its  use can lead to some strange effects. This facility is
191         not available in the alternative matching function, pcre_dfa_exec().         not available in the alternative matching function, pcre_dfa_exec().
192    
193         7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
194         test characters of any code value, but the characters that PCRE  recog-         test  characters of any code value, but the characters that PCRE recog-
195         nizes  as  digits,  spaces,  or  word characters remain the same set as         nizes as digits, spaces, or word characters  remain  the  same  set  as
196         before, all with values less than 256. This remains true even when PCRE         before, all with values less than 256. This remains true even when PCRE
197         includes  Unicode  property support, because to do otherwise would slow         includes Unicode property support, because to do otherwise  would  slow
198         down PCRE in many common cases. If you really want to test for a  wider         down  PCRE in many common cases. If you really want to test for a wider
199         sense  of,  say,  "digit",  you must use Unicode property tests such as         sense of, say, "digit", you must use Unicode  property  tests  such  as
200         \p{Nd}.         \p{Nd}.
201    
202         8. Similarly, characters that match the POSIX named  character  classes         8.  Similarly,  characters that match the POSIX named character classes
203         are all low-valued characters.         are all low-valued characters.
204    
205         9.  Case-insensitive  matching  applies only to characters whose values         9. However, the Perl 5.10 horizontal and vertical  whitespace  matching
206         are less than 128, unless PCRE is built with Unicode property  support.         escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
207         Even  when  Unicode  property support is available, PCRE still uses its         acters.
208         own character tables when checking the case of  low-valued  characters,  
209         so  as not to degrade performance.  The Unicode property information is         10. Case-insensitive matching applies only to characters  whose  values
210           are  less than 128, unless PCRE is built with Unicode property support.
211           Even when Unicode property support is available, PCRE  still  uses  its
212           own  character  tables when checking the case of low-valued characters,
213           so as not to degrade performance.  The Unicode property information  is
214         used only for characters with higher values. Even when Unicode property         used only for characters with higher values. Even when Unicode property
215         support is available, PCRE supports case-insensitive matching only when         support is available, PCRE supports case-insensitive matching only when
216         there is a one-to-one mapping between a letter's  cases.  There  are  a         there  is  a  one-to-one  mapping between a letter's cases. There are a
217         small  number  of  many-to-one  mappings in Unicode; these are not sup-         small number of many-to-one mappings in Unicode;  these  are  not  sup-
218         ported by PCRE.         ported by PCRE.
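
Items 2 and 3 above say that \xb3 and octal escapes up to \777 stand for two-byte UTF-8 characters once the value exceeds 127 (\177). The following is a minimal sketch of that encoding step, not code taken from PCRE; the helper name encode_small_code_point and the demo function are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Hypothetical helper (not part of PCRE): encode a code point below
   0x800 as UTF-8. Returns the number of bytes written (1 or 2), or 0
   if the value is outside the range this sketch handles. */
int encode_small_code_point(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {                 /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 128..2047: lead byte + one trail byte */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;                        /* larger values are not covered here */
}

/* Illustration of the escapes mentioned in the list above. */
void demo_small_code_points(void)
{
    unsigned char buf[2];

    assert(encode_small_code_point(0xB3, buf) == 2);   /* \xb3 */
    assert(buf[0] == 0xC2 && buf[1] == 0xB3);

    assert(encode_small_code_point(0777, buf) == 2);   /* \777 */
    assert(buf[0] == 0xC7 && buf[1] == 0xBF);

    assert(encode_small_code_point(0177, buf) == 1);   /* \177 stays one byte */
}
```

The sketch stops at 0x7FF because that is the largest value a two-byte sequence can carry, which also covers the \777 upper bound quoted in item 3.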
219    
220    
# Line 214  AUTHOR Line 224  AUTHOR
224         University Computing Service         University Computing Service
225         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
226    
227         Putting an actual email address here seems to have been a spam  magnet,         Putting  an actual email address here seems to have been a spam magnet,
228         so I've taken it away. If you want to email me, use my initial and sur-         so I've taken it away. If you want to email me, use  my  two  initials,
229         name, separated by a dot, at the domain ucs.cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
230    
231    
232  REVISION  REVISION
233    
234         Last updated: 06 March 2007         Last updated: 30 July 2007
235         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
236  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
237    
# Line 244  PCRE BUILD-TIME OPTIONS Line 254  PCRE BUILD-TIME OPTIONS
254    
255           ./configure --help           ./configure --help
256    
257         The following sections describe certain options whose names begin  with         The following sections include  descriptions  of  options  whose  names
258         --enable  or  --disable. These settings specify changes to the defaults         begin with --enable or --disable. These settings specify changes to the
259         for the configure command. Because of the  way  that  configure  works,         defaults for the configure command. Because of the way  that  configure
260         --enable  and  --disable  always  come  in  pairs, so the complementary         works,  --enable  and --disable always come in pairs, so the complemen-
261         option always exists as well, but as it specifies the  default,  it  is         tary option always exists as well, but as it specifies the default,  it
262         not described.         is not described.
263    
264    
265  C++ SUPPORT  C++ SUPPORT
# Line 288  UNICODE CHARACTER PROPERTY SUPPORT Line 298  UNICODE CHARACTER PROPERTY SUPPORT
298         to the configure command. This implies UTF-8 support, even if you  have         to the configure command. This implies UTF-8 support, even if you  have
299         not explicitly requested it.         not explicitly requested it.
300    
301         Including  Unicode  property  support  adds around 90K of tables to the         Including  Unicode  property  support  adds around 30K of tables to the
302         PCRE library, approximately doubling its size. Only the  general  cate-         PCRE library. Only the general category properties such as  Lu  and  Nd
303         gory  properties  such as Lu and Nd are supported. Details are given in         are supported. Details are given in the pcrepattern documentation.
        the pcrepattern documentation.  
304    
305    
306  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
307    
308         By default, PCRE interprets character 10 (linefeed, LF)  as  indicating         By  default,  PCRE interprets character 10 (linefeed, LF) as indicating
309         the  end  of  a line. This is the normal newline character on Unix-like         the end of a line. This is the normal newline  character  on  Unix-like
310         systems. You can compile PCRE to use character 13 (carriage return, CR)         systems. You can compile PCRE to use character 13 (carriage return, CR)
311         instead, by adding         instead, by adding
312    
313           --enable-newline-is-cr           --enable-newline-is-cr
314    
315         to  the  configure  command.  There  is  also  a --enable-newline-is-lf         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
316         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
317    
318         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 313  CODE VALUE OF NEWLINE Line 322  CODE VALUE OF NEWLINE
322    
323         to the configure command. There is a fourth option, specified by         to the configure command. There is a fourth option, specified by
324    
325             --enable-newline-is-anycrlf
326    
327           which causes PCRE to recognize any of the three sequences  CR,  LF,  or
328           CRLF as indicating a line ending. Finally, a fifth option, specified by
329    
330           --enable-newline-is-any           --enable-newline-is-any
331    
332         which causes PCRE to recognize any Unicode newline sequence.         causes PCRE to recognize any Unicode newline sequence.
333    
334         Whatever  line  ending convention is selected when PCRE is built can be         Whatever line ending convention is selected when PCRE is built  can  be
335         overridden when the library functions are called. At build time  it  is         overridden  when  the library functions are called. At build time it is
336         conventional to use the standard for your operating system.         conventional to use the standard for your operating system.
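
As a rough illustration of the "anycrlf" convention described above (CR, LF, or CRLF each count as one line ending), here is a small sketch. It is not PCRE's implementation; the function names are hypothetical, and it deliberately ignores the wider set of Unicode newline sequences covered by --enable-newline-is-any.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the line ending that starts
   at s[pos] under an "anycrlf" style rule, or 0 if there is none. */
size_t anycrlf_newline_length(const char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return 0;
    if (s[pos] == '\r')              /* CR, possibly followed by LF */
        return (pos + 1 < len && s[pos + 1] == '\n') ? 2 : 1;
    if (s[pos] == '\n')              /* a lone LF */
        return 1;
    return 0;
}

/* Count line endings, treating CRLF as a single ending. */
size_t count_line_endings(const char *s, size_t len)
{
    size_t pos = 0, count = 0;

    while (pos < len) {
        size_t n = anycrlf_newline_length(s, len, pos);
        if (n > 0) {
            count++;
            pos += n;
        } else {
            pos++;
        }
    }
    return count;
}

void demo_anycrlf(void)
{
    /* One CRLF, one CR, and one LF: three line endings in total. */
    assert(count_line_endings("a\r\nb\rc\nd", 8) == 3);
}
```

Checking for CR first and then peeking at the next byte is what keeps CRLF from being counted twice.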
337    
338    
339  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
340    
341         The  PCRE building process uses libtool to build both shared and static         The PCRE building process uses libtool to build both shared and  static
342         Unix libraries by default. You can suppress one of these by adding  one         Unix  libraries by default. You can suppress one of these by adding one
343         of         of
344    
345           --disable-shared           --disable-shared
# Line 337  BUILDING SHARED AND STATIC LIBRARIES Line 351  BUILDING SHARED AND STATIC LIBRARIES
351  POSIX MALLOC USAGE  POSIX MALLOC USAGE
352    
353         When PCRE is called through the POSIX interface (see the pcreposix doc-         When PCRE is called through the POSIX interface (see the pcreposix doc-
354         umentation), additional working storage is  required  for  holding  the         umentation),  additional  working  storage  is required for holding the
355         pointers  to capturing substrings, because PCRE requires three integers         pointers to capturing substrings, because PCRE requires three  integers
356         per substring, whereas the POSIX interface provides only  two.  If  the         per  substring,  whereas  the POSIX interface provides only two. If the
357         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
358         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
359         The default threshold above which the stack is no longer used is 10; it         The default threshold above which the stack is no longer used is 10; it
# Line 352  POSIX MALLOC USAGE Line 366  POSIX MALLOC USAGE
366    
367  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
368    
369         Within a compiled pattern, offset values are used  to  point  from  one         Within  a  compiled  pattern,  offset values are used to point from one
370         part  to another (for example, from an opening parenthesis to an alter-         part to another (for example, from an opening parenthesis to an  alter-
371         nation metacharacter). By default, two-byte values are used  for  these         nation  metacharacter).  By default, two-byte values are used for these
372         offsets,  leading  to  a  maximum size for a compiled pattern of around         offsets, leading to a maximum size for a  compiled  pattern  of  around
373         64K. This is sufficient to handle all but the most  gigantic  patterns.         64K.  This  is sufficient to handle all but the most gigantic patterns.
374         Nevertheless,  some  people do want to process enormous patterns, so it         Nevertheless, some people do want to process enormous patterns,  so  it
375         is possible to compile PCRE to use three-byte or four-byte  offsets  by         is  possible  to compile PCRE to use three-byte or four-byte offsets by
376         adding a setting such as         adding a setting such as
377    
378           --with-link-size=3           --with-link-size=3
379    
380         to  the  configure  command.  The value given must be 2, 3, or 4. Using         to the configure command. The value given must be 2,  3,  or  4.  Using
381         longer offsets slows down the operation of PCRE because it has to  load         longer  offsets slows down the operation of PCRE because it has to load
382         additional bytes when handling them.         additional bytes when handling them.
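
The "around 64K" figure follows from the width of the offset field. The sketch below only does that arithmetic for the three permitted link sizes; the function name is hypothetical, and the result is the number of distinct offset values a field of that width can hold, which is an upper bound rather than a limit that PCRE is known to enforce exactly.

```c
#include <assert.h>

/* Hypothetical helper: how many distinct offsets fit in a field of
   link_size bytes. Returns 0 for values that configure would reject. */
unsigned long long offsets_representable(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return 1ULL << (8 * link_size);
}

void demo_link_sizes(void)
{
    assert(offsets_representable(2) == 65536ULL);        /* 64K, the default */
    assert(offsets_representable(3) == 16777216ULL);     /* 16M */
    assert(offsets_representable(4) == 4294967296ULL);   /* 4G */
}
```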
383    
        If  you  build  PCRE with an increased link size, test 2 (and test 5 if  
        you are using UTF-8) will fail. Part of the output of these tests is  a  
        representation  of the compiled pattern, and this changes with the link  
        size.  
   
384    
385  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
386    
# Line 390  AVOIDING EXCESSIVE STACK USAGE Line 399  AVOIDING EXCESSIVE STACK USAGE
399    
400         to  the  configure  command. With this configuration, PCRE will use the         to  the  configure  command. With this configuration, PCRE will use the
401         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-
402         ment  functions.  Separate  functions are provided because the usage is         ment  functions. By default these point to malloc() and free(), but you
403         very predictable: the block sizes requested are always  the  same,  and         can replace the pointers so that your own functions are used.
404         the  blocks  are always freed in reverse order. A calling program might  
405         be able to implement optimized functions that perform better  than  the         Separate functions are  provided  rather  than  using  pcre_malloc  and
406         standard  malloc()  and  free()  functions.  PCRE  runs noticeably more         pcre_free  because  the  usage  is  very  predictable:  the block sizes
407         slowly when built in this way. This option affects only the pcre_exec()         requested are always the same, and  the  blocks  are  always  freed  in
408         function; it is not relevant for the pcre_dfa_exec() function.         reverse  order.  A calling program might be able to implement optimized
409           functions that perform better  than  malloc()  and  free().  PCRE  runs
410           noticeably more slowly when built in this way. This option affects only
411           the  pcre_exec()  function;  it   is   not   relevant   for   the
412           pcre_dfa_exec() function.
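
Because the requested blocks are always the same size and are freed in reverse order, a replacement allocator can be little more than a stack of fixed-size slots. The following is a minimal sketch under stated assumptions: the names lifo_stack_malloc and lifo_stack_free, the 512-byte slot size, and the pool depth of 64 are all invented for illustration, and the real block size PCRE requests is not given in this text. Requests that do not fit fall back to malloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE  512   /* assumed upper bound on one request */
#define BLOCK_COUNT 64    /* assumed pool depth */

/* The extra members only force suitable alignment for each slot. */
typedef union {
    long double   align_ld;
    long long     align_ll;
    void         *align_ptr;
    unsigned char bytes[BLOCK_SIZE];
} pool_block;

static pool_block pool[BLOCK_COUNT];
static size_t pool_top = 0;          /* number of pool slots in use */

/* True if p points into the static pool rather than the heap. */
static int in_pool(const void *p)
{
    uintptr_t a  = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)&pool[0];
    uintptr_t hi = (uintptr_t)&pool[BLOCK_COUNT];
    return a >= lo && a < hi;
}

void *lifo_stack_malloc(size_t size)
{
    if (size <= BLOCK_SIZE && pool_top < BLOCK_COUNT)
        return pool[pool_top++].bytes;   /* push: take the next free slot */
    return malloc(size);                 /* too big or pool full: use heap */
}

void lifo_stack_free(void *block)
{
    if (block == NULL)
        return;
    if (in_pool(block)) {
        /* Relies on reverse-order freeing: block must be the top slot. */
        assert(pool_top > 0 && block == (void *)pool[pool_top - 1].bytes);
        pool_top--;                      /* pop */
    } else {
        free(block);
    }
}
```

In a program linked against a PCRE library built this way, the two functions would be installed by assigning them to the pcre_stack_malloc and pcre_stack_free variables before calling pcre_exec(). The sketch is not thread-safe; a multithreaded caller would need one pool per thread or a lock.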
413    
414    
415  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
# Line 429  LIMITING PCRE RESOURCE USAGE Line 442  LIMITING PCRE RESOURCE USAGE
442         time.         time.
443    
444    
445    CREATING CHARACTER TABLES AT BUILD TIME
446    
447           PCRE uses fixed tables for processing characters whose code values  are
448           less  than 256. By default, PCRE is built with a set of tables that are
449           distributed in the file pcre_chartables.c.dist. These  tables  are  for
450           ASCII codes only. If you add
451    
452             --enable-rebuild-chartables
453    
454           to  the  configure  command, the distributed tables are no longer used.
455           Instead, a program called dftables is compiled and  run.  This  outputs
456           the source for a new set of tables, created in the default locale of your
457           C runtime system. (This method of replacing the tables does not work if
458           you  are cross compiling, because dftables is run on the local host. If
459           you need to create alternative tables when cross  compiling,  you  will
460           have to do so "by hand".)
461    
462    
463  USING EBCDIC CODE  USING EBCDIC CODE
464    
465         PCRE assumes by default that it will run in an  environment  where  the         PCRE  assumes  by  default that it will run in an environment where the
466         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
467         PCRE can, however, be compiled to  run  in  an  EBCDIC  environment  by         This  is  the  case for most computer operating systems. PCRE can, how-
468         adding         ever, be compiled to run in an EBCDIC environment by adding
469    
470           --enable-ebcdic           --enable-ebcdic
471    
472         to the configure command.         to the configure command. This setting implies --enable-rebuild-charta-
473           bles.  You  should  only  use  it if you know that you are in an EBCDIC
474           environment (for example, an IBM mainframe operating system).
475    
476    
477  SEE ALSO  SEE ALSO
# Line 455  AUTHOR Line 488  AUTHOR
488    
489  REVISION  REVISION
490    
491         Last updated: 06 March 2007         Last updated: 30 July 2007
492         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
493  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
494    
# Line 508  REGULAR EXPRESSIONS AS TREES Line 541  REGULAR EXPRESSIONS AS TREES
541    
542  THE STANDARD MATCHING ALGORITHM  THE STANDARD MATCHING ALGORITHM
543    
544         In the terminology of Jeffrey Friedl's book Mastering  Regular  Expres-         In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
545         sions,  the  standard  algorithm  is  an "NFA algorithm". It conducts a         sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
546         depth-first search of the pattern tree. That is, it  proceeds  along  a         depth-first search of the pattern tree. That is, it  proceeds  along  a
547         single path through the tree, checking that the subject matches what is         single path through the tree, checking that the subject matches what is
548         required. When there is a mismatch, the algorithm  tries  any  alterna-         required. When there is a mismatch, the algorithm  tries  any  alterna-
# Line 591  THE ALTERNATIVE MATCHING ALGORITHM Line 624  THE ALTERNATIVE MATCHING ALGORITHM
624         ence  as  the  condition or test for a specific group recursion are not         ence  as  the  condition or test for a specific group recursion are not
625         supported.         supported.
626    
627         5. Callouts are supported, but the value of the  capture_top  field  is         5. Because many paths through the tree may be  active,  the  \K  escape
628           sequence, which resets the start of the match when encountered (but may
629           be on some paths and not on others), is not  supported.  It  causes  an
630           error if encountered.
631    
632           6.  Callouts  are  supported, but the value of the capture_top field is
633         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
634    
635         6.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The \C escape sequence, which (in the standard algorithm) matches a
636         single byte, even in UTF-8 mode, is not supported because the  alterna-         single  byte, even in UTF-8 mode, is not supported because the alterna-
637         tive  algorithm  moves  through  the  subject string one character at a         tive algorithm moves through the subject  string  one  character  at  a
638         time, for all active paths through the tree.         time, for all active paths through the tree.
639    
640    
641  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
642    
643         Using the alternative matching algorithm provides the following  advan-         Using  the alternative matching algorithm provides the following advan-
644         tages:         tages:
645    
646         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
647         ically found, and in particular, the longest match is  found.  To  find         ically  found,  and  in particular, the longest match is found. To find
648         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
649         things with callouts.         things with callouts.
650    
651         2. There is much better support for partial matching. The  restrictions         2.  There is much better support for partial matching. The restrictions
652         on  the content of the pattern that apply when using the standard algo-         on the content of the pattern that apply when using the standard  algo-
653         rithm for partial matching do not apply to the  alternative  algorithm.         rithm  for  partial matching do not apply to the alternative algorithm.
654         For  non-anchored patterns, the starting position of a partial match is         For non-anchored patterns, the starting position of a partial match  is
655         available.         available.
656    
657         3. Because the alternative algorithm  scans  the  subject  string  just         3.  Because  the  alternative  algorithm  scans the subject string just
658         once,  and  never  needs to backtrack, it is possible to pass very long         once, and never needs to backtrack, it is possible to  pass  very  long
659         subject strings to the matching function in  several  pieces,  checking         subject  strings  to  the matching function in several pieces, checking
660         for partial matching each time.         for partial matching each time.
661    
662    
# Line 626  DISADVANTAGES OF THE ALTERNATIVE ALGORIT Line 664  DISADVANTAGES OF THE ALTERNATIVE ALGORIT
664    
665         The alternative algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
666    
667         1.  It  is  substantially  slower  than the standard algorithm. This is         1. It is substantially slower than  the  standard  algorithm.  This  is
668         partly because it has to search for all possible matches, but  is  also         partly  because  it has to search for all possible matches, but is also
669         because it is less susceptible to optimization.         because it is less susceptible to optimization.
670    
671         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
# Line 645  AUTHOR Line 683  AUTHOR
683    
684  REVISION  REVISION
685    
686         Last updated: 06 March 2007         Last updated: 29 May 2007
687         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
688  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
689    
# Line 828  PCRE API OVERVIEW Line 866  PCRE API OVERVIEW
866    
867  NEWLINES  NEWLINES
868    
869         PCRE  supports four different conventions for indicating line breaks in         PCRE  supports five different conventions for indicating line breaks in
870         strings: a single CR (carriage return) character, a  single  LF  (line-         strings: a single CR (carriage return) character, a  single  LF  (line-
871         feed)  character,  the two-character sequence CRLF, or any Unicode new-         feed) character, the two-character sequence CRLF, any of the three pre-
872         line sequence.  The Unicode newline sequences are the three  just  men-         ceding, or any Unicode newline sequence. The Unicode newline  sequences
873         tioned, plus the single characters VT (vertical tab, U+000B), FF (form-         are  the  three just mentioned, plus the single characters VT (vertical
874         feed, U+000C), NEL (next line, U+0085), LS  (line  separator,  U+2028),         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
875         and PS (paragraph separator, U+2029).         separator, U+2028), and PS (paragraph separator, U+2029).
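
The conventions above can be illustrated with a small stand-alone model. This is only a sketch, not PCRE's own code: the enum and the function name newline_length() are hypothetical, the Unicode "any" convention is left out for brevity, and it assumes that under the "any of the three preceding" convention a CR immediately followed by LF counts as one two-character newline.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; they are not PCRE identifiers. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Return the length in bytes of the newline that starts at s[pos],
   or 0 if no newline of the chosen convention starts there. */
size_t newline_length(const char *s, size_t len, size_t pos,
                      enum nl_convention conv)
{
    int cr, lf, crlf;

    if (pos >= len) return 0;
    cr = (s[pos] == '\r');
    lf = (s[pos] == '\n');
    crlf = cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (conv) {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    /* Assumption: CRLF is taken as one newline, else a lone CR or LF. */
    case NL_ANYCRLF: return crlf ? 2 : (cr || lf) ? 1 : 0;
    }
    return 0;
}
```

For the subject "a\r\nb", position 1 is a newline under the CR, CRLF, and ANYCRLF conventions (with lengths 1, 2, and 2 respectively), but not under the LF convention.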
876    
877         Each  of  the first three conventions is used by at least one operating         Each  of  the first three conventions is used by at least one operating
878         system as its standard newline sequence. When PCRE is built, a  default         system as its standard newline sequence. When PCRE is built, a  default
# Line 868  SAVING PRECOMPILED PATTERNS FOR LATER US Line 906  SAVING PRECOMPILED PATTERNS FOR LATER US
906         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
907         later time, possibly by a different program, and even on a  host  other         later time, possibly by a different program, and even on a  host  other
908         than  the  one  on  which  it  was  compiled.  Details are given in the         than  the  one  on  which  it  was  compiled.  Details are given in the
909         pcreprecompile documentation.         pcreprecompile documentation. However, compiling a  regular  expression
910           with  one version of PCRE for use with a different version is not guar-
911           anteed to work and may cause crashes.
912    
913    
914  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
# Line 899  CHECKING BUILD-TIME OPTIONS Line 939  CHECKING BUILD-TIME OPTIONS
939    
940         The output is an integer whose value specifies  the  default  character         The output is an integer whose value specifies  the  default  character
941         sequence  that is recognized as meaning "newline". The four values that         sequence  that is recognized as meaning "newline". The five values that
942         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, and -1 for ANY.         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
943         The default should normally be the standard sequence for your operating         and  -1  for  ANY. The default should normally be the standard sequence
944         system.         for your operating system.
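
As an illustration of the values listed above, the following sketch maps the integer to a readable name. The helper newline_config_name() is hypothetical and not part of PCRE; the integer itself would normally be obtained by calling pcre_config() with PCRE_CONFIG_NEWLINE. Note that 3338 is simply the two byte values 13 (CR) and 10 (LF) packed together as 13*256 + 10.

```c
/* Hypothetical helper: translate a PCRE_CONFIG_NEWLINE result into a
   name. The numeric values are the ones listed in the text above. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";     /* (13 << 8) | 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";
    }
}
```

A caller could pass the value straight to a diagnostic message, treating "unknown" as a sign that the library is newer than the program expects.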
945    
946           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
947    
# Line 1125  COMPILING A PATTERN Line 1165  COMPILING A PATTERN
1165           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1166           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1167           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1168             PCRE_NEWLINE_ANYCRLF
1169           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1170    
1171         These  options  override the default newline definition that was chosen         These  options  override the default newline definition that was chosen
1172         when PCRE was built. Setting the first or the second specifies  that  a         when PCRE was built. Setting the first or the second specifies  that  a
1173         newline  is  indicated  by a single character (CR or LF, respectively).         newline  is  indicated  by a single character (CR or LF, respectively).
1174         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1175         two-character  CRLF  sequence.  Setting PCRE_NEWLINE_ANY specifies that         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1176         any Unicode newline sequence should be recognized. The Unicode  newline         that any of the three preceding sequences should be recognized. Setting
1177         sequences  are  the three just mentioned, plus the single characters VT         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1178         (vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085),         recognized. The Unicode newline sequences are the three just mentioned,
1179         LS  (line separator, U+2028), and PS (paragraph separator, U+2029). The         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1180         last two are recognized only in UTF-8 mode.         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1181           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1182           UTF-8 mode.
1183    
1184         The newline setting in the  options  word  uses  three  bits  that  are         The newline setting in the  options  word  uses  three  bits  that  are
1185         treated  as  a  number, giving eight possibilities. Currently only five         treated as a number, giving eight possibilities. Currently only six are
1186         are used (default plus the four values above). This means that  if  you         used (default plus the five values above). This means that if  you  set
1187         set  more  than  one  newline option, the combination may or may not be         more  than one newline option, the combination may or may not be sensi-
1188         sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is  equiva-         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1189         lent  to PCRE_NEWLINE_CRLF, but other combinations yield unused numbers         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1190         and cause an error.         cause an error.
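
The three-bit encoding can be sketched as follows. The EX_ constants are stand-ins defined locally for this example; their values are assumed to mirror the PCRE_NEWLINE_xxx definitions in pcre.h for this release, so check your own header before relying on them. The function names are hypothetical.

```c
/* Assumed to match the PCRE_NEWLINE_xxx values in pcre.h. */
#define EX_NEWLINE_CR       0x00100000UL
#define EX_NEWLINE_LF       0x00200000UL
#define EX_NEWLINE_CRLF     0x00300000UL
#define EX_NEWLINE_ANY      0x00400000UL
#define EX_NEWLINE_ANYCRLF  0x00500000UL
#define EX_NEWLINE_MASK     0x00700000UL   /* the three newline bits */

/* Extract the newline setting as a number in the range 0..7. */
unsigned int newline_field(unsigned long options)
{
    return (unsigned int)((options & EX_NEWLINE_MASK) >> 20);
}

/* Values 0 (default) to 5 are in use; 6 and 7 are unused numbers. */
int newline_option_is_valid(unsigned long options)
{
    return newline_field(options) <= 5;
}
```

Under these assumed values, combining CR with LF gives the same field value as CRLF, whereas combining LF with ANY gives 6, one of the unused numbers that causes an error.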
1191    
1192         The only time that a line break is specially recognized when  compiling         The only time that a line break is specially recognized when  compiling
1193         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
# Line 1230  COMPILATION ERROR CODES Line 1273  COMPILATION ERROR CODES
1273           26  malformed number or name after (?(           26  malformed number or name after (?(
1274           27  conditional group contains more than two branches           27  conditional group contains more than two branches
1275           28  assertion expected after (?(           28  assertion expected after (?(
1276           29  (?R or (?digits must be followed by )           29  (?R or (?[+-]digits must be followed by )
1277           30  unknown POSIX class name           30  unknown POSIX class name
1278           31  POSIX collating elements are not supported           31  POSIX collating elements are not supported
1279           32  this version of PCRE is not compiled with PCRE_UTF8 support           32  this version of PCRE is not compiled with PCRE_UTF8 support
# Line 1259  COMPILATION ERROR CODES Line 1302  COMPILATION ERROR CODES
1302           54  DEFINE group contains more than one branch           54  DEFINE group contains more than one branch
1303           55  repeating a DEFINE group is not allowed           55  repeating a DEFINE group is not allowed
1304           56  inconsistent NEWLINE options           56  inconsistent NEWLINE options
1305             57  \g is not followed by a braced name or an optionally braced
1306                   non-zero number
1307             58  (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number
1308    
1309    
1310  STUDYING A PATTERN  STUDYING A PATTERN
# Line 1310  STUDYING A PATTERN Line 1356  STUDYING A PATTERN
1356  LOCALE SUPPORT  LOCALE SUPPORT
1357    
1358         PCRE handles caseless matching, and determines whether  characters  are         PCRE handles caseless matching, and determines whether  characters  are
1359         letters  digits,  or whatever, by reference to a set of tables, indexed         letters,  digits, or whatever, by reference to a set of tables, indexed
1360         by character value. When running in UTF-8 mode, this  applies  only  to         by character value. When running in UTF-8 mode, this  applies  only  to
1361         characters  with  codes  less than 128. Higher-valued codes never match         characters  with  codes  less than 128. Higher-valued codes never match
1362         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1363         with  Unicode  character property support. The use of locales with Uni-         with  Unicode  character property support. The use of locales with Uni-
1364         code is discouraged.         code is discouraged. If you are handling characters with codes  greater
1365           than  128, you should either use UTF-8 and Unicode, or use locales, but
1366         An internal set of tables is created in the default C locale when  PCRE         not try to mix the two.
1367         is  built.  This  is  used when the final argument of pcre_compile() is  
1368         NULL, and is sufficient for many applications. An  alternative  set  of         PCRE contains an internal set of tables that are used  when  the  final
1369         tables  can,  however, be supplied. These may be created in a different         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1370         locale from the default. As more and more applications change to  using         applications.  Normally, the internal tables recognize only ASCII char-
1371         Unicode, the need for this locale support is expected to die away.         acters. However, when PCRE is built, it is possible to cause the inter-
1372           nal tables to be rebuilt in the default "C" locale of the local system,
1373         External  tables  are  built by calling the pcre_maketables() function,         which may cause them to be different.
1374         which has no arguments, in the relevant locale. The result can then  be  
1375         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For         The  internal tables can always be overridden by tables supplied by the
1376         example, to build and use tables that are appropriate  for  the  French         application that calls PCRE. These may be created in a different locale
1377         locale  (where  accented  characters  with  values greater than 128 are         from  the  default.  As more and more applications change to using Uni-
1378           code, the need for this locale support is expected to die away.
1379    
1380           External tables are built by calling  the  pcre_maketables()  function,
1381           which  has no arguments, in the relevant locale. The result can then be
1382           passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
1383           example,  to  build  and use tables that are appropriate for the French
1384           locale (where accented characters with  values  greater  than  128  are
1385         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1386    
1387           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1388           tables = pcre_maketables();           tables = pcre_maketables();
1389           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1390    
1391           The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1392           if you are using Windows, the name for the French locale is "french".
1393    
1394         When pcre_maketables() runs, the tables are built  in  memory  that  is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1395         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1396         that the memory containing the tables remains available for as long  as         that the memory containing the tables remains available for as long  as
# Line 1437  INFORMATION ABOUT A PATTERN Line 1493  INFORMATION ABOUT A PATTERN
1493         returned. The fourth argument should point to an unsigned char *  vari-         returned. The fourth argument should point to an unsigned char *  vari-
1494         able.         able.
1495    
1496             PCRE_INFO_JCHANGED
1497    
1498           Return  1  if the (?J) option setting is used in the pattern, otherwise
1499           0. The fourth argument should point to an int variable. The (?J) inter-
1500           nal option setting changes the local PCRE_DUPNAMES option.
1501    
1502           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1503    
1504         Return  the  value of the rightmost literal byte that must exist in any         Return  the  value of the rightmost literal byte that must exist in any
# Line 1491  INFORMATION ABOUT A PATTERN Line 1553  INFORMATION ABOUT A PATTERN
1553         name-to-number map, remember that the length of the entries  is  likely         name-to-number map, remember that the length of the entries  is  likely
1554         to be different for each compiled pattern.         to be different for each compiled pattern.
1555    
1556             PCRE_INFO_OKPARTIAL
1557    
1558           Return  1 if the pattern can be used for partial matching, otherwise 0.
1559           The fourth argument should point to an int  variable.  The  pcrepartial
1560           documentation  lists  the restrictions that apply to patterns when par-
1561           tial matching is used.
1562    
1563           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1564    
1565         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1566         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1567         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1568         by any top-level option settings within the pattern itself.         by any top-level option settings at the start of the pattern itself. In
1569           other words, they are the options that will be in force  when  matching
1570           starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1571           the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1572           and PCRE_EXTENDED.
1573    
1574         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1575         alternatives begin with one of the following:         alternatives begin with one of the following:
1576    
1577           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1512  INFORMATION ABOUT A PATTERN Line 1585  INFORMATION ABOUT A PATTERN
1585    
1586           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1587    
1588         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1589         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1590         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1591         size_t variable.         size_t variable.
# Line 1520  INFORMATION ABOUT A PATTERN Line 1593  INFORMATION ABOUT A PATTERN
1593           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1594    
1595         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1596         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1597         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1598         created by pcre_study(). The fourth argument should point to  a  size_t         created  by  pcre_study(). The fourth argument should point to a size_t
1599         variable.         variable.
1600    
1601    
# Line 1530  OBSOLETE INFO FUNCTION Line 1603  OBSOLETE INFO FUNCTION
1603    
1604         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1605    
1606         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1607         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1608         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1609         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1610         lowing negative numbers:         lowing negative numbers:
1611    
1612           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1613           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1614    
1615         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1616         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1617         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1618    
1619         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1620         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1621         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1622    
1623    
# Line 1552  REFERENCE COUNTS Line 1625  REFERENCE COUNTS
1625    
1626         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1627    
1628         The  pcre_refcount()  function is used to maintain a reference count in         The pcre_refcount() function is used to maintain a reference  count  in
1629         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1630         benefit  of  applications  that  operate  in an object-oriented manner,         benefit of applications that  operate  in  an  object-oriented  manner,
1631         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1632         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1633    
1634         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1635         zero.  It is changed only by calling this function, whose action is  to         zero.   It is changed only by calling this function, whose action is to
1636         add  the  adjust  value  (which may be positive or negative) to it. The         add the adjust value (which may be positive or  negative)  to  it.  The
1637         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1638         is  constrained to lie between 0 and 65535, inclusive. If the new value         is constrained to lie between 0 and 65535, inclusive. If the new  value
1639         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
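
The clamping rule just described can be modelled in a few lines. This is a sketch of the documented behaviour only, not PCRE's implementation; refcount_adjust() is a hypothetical name, and the count is held in a plain int supplied by the caller rather than inside a compiled pattern.

```c
/* Model of the documented rule: add adjust to the count, force the
   result into the range 0..65535, store it, and return the new value. */
int refcount_adjust(int *count, int adjust)
{
    long long v = (long long)*count + adjust;   /* avoid int overflow */

    if (v < 0) v = 0;
    if (v > 65535) v = 65535;
    *count = (int)v;
    return *count;
}
```

Starting from zero, an adjustment of 2 yields 2, a further adjustment of -5 yields 0 rather than -3, and an adjustment of 70000 yields 65535.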
1640    
1641         Except when it is zero, the reference count is not correctly  preserved         Except  when it is zero, the reference count is not correctly preserved
1642         if  a  pattern  is  compiled on one host and then transferred to a host         if a pattern is compiled on one host and then  transferred  to  a  host
1643         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
1644    
1645    
# Line 1576  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1649  MATCHING A PATTERN: THE TRADITIONAL FUNC
1649              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1650              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1651    
1652         The  function pcre_exec() is called to match a subject string against a         The function pcre_exec() is called to match a subject string against  a
1653         compiled pattern, which is passed in the code argument. If the  pattern         compiled  pattern, which is passed in the code argument. If the pattern
1654         has been studied, the result of the study should be passed in the extra         has been studied, the result of the study should be passed in the extra
1655         argument. This function is the main matching facility of  the  library,         argument.  This  function is the main matching facility of the library,
1656         and it operates in a Perl-like manner. For specialist use there is also         and it operates in a Perl-like manner. For specialist use there is also
1657         an alternative matching function, which is described below in the  sec-         an  alternative matching function, which is described below in the sec-
1658         tion about the pcre_dfa_exec() function.         tion about the pcre_dfa_exec() function.
1659    
1660         In  most applications, the pattern will have been compiled (and option-         In most applications, the pattern will have been compiled (and  option-
1661         ally studied) in the same process that calls pcre_exec().  However,  it         ally  studied)  in the same process that calls pcre_exec(). However, it
1662         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1663         later in different processes, possibly even on different hosts.  For  a         later  in  different processes, possibly even on different hosts. For a
1664         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1665    
1666         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1606  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1679  MATCHING A PATTERN: THE TRADITIONAL FUNC
1679    
1680     Extra data for pcre_exec()     Extra data for pcre_exec()
1681    
1682         If  the  extra argument is not NULL, it must point to a pcre_extra data         If the extra argument is not NULL, it must point to a  pcre_extra  data
1683         block. The pcre_study() function returns such a block (when it  doesn't         block.  The pcre_study() function returns such a block (when it doesn't
1684         return  NULL), but you can also create one for yourself, and pass addi-         return NULL), but you can also create one for yourself, and pass  addi-
1685         tional information in it. The pcre_extra block contains  the  following         tional  information  in it. The pcre_extra block contains the following
1686         fields (not necessarily in this order):         fields (not necessarily in this order):
1687    
1688           unsigned long int flags;           unsigned long int flags;
# Line 1619  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1692  MATCHING A PATTERN: THE TRADITIONAL FUNC
1692           void *callout_data;           void *callout_data;
1693           const unsigned char *tables;           const unsigned char *tables;
1694    
1695         The  flags  field  is a bitmap that specifies which of the other fields         The flags field is a bitmap that specifies which of  the  other  fields
1696         are set. The flag bits are:         are set. The flag bits are:
1697    
1698           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
# Line 1628  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1701  MATCHING A PATTERN: THE TRADITIONAL FUNC
1701           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1702           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1703    
1704         Other flag bits should be set to zero. The study_data field is  set  in         Other  flag  bits should be set to zero. The study_data field is set in
1705         the  pcre_extra  block  that is returned by pcre_study(), together with         the pcre_extra block that is returned by  pcre_study(),  together  with
1706         the appropriate flag bit. You should not set this yourself, but you may         the appropriate flag bit. You should not set this yourself, but you may
1707         add  to  the  block by setting the other fields and their corresponding         add to the block by setting the other fields  and  their  corresponding
1708         flag bits.         flag bits.
1709    
1710         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1711         a  vast amount of resources when running patterns that are not going to         a vast amount of resources when running patterns that are not going  to
1712         match, but which have a very large number  of  possibilities  in  their         match,  but  which  have  a very large number of possibilities in their
1713         search  trees.  The  classic  example  is  the  use of nested unlimited         search trees. The classic  example  is  the  use  of  nested  unlimited
1714         repeats.         repeats.
1715    
1716         Internally, PCRE uses a function called match() which it calls  repeat-         Internally,  PCRE uses a function called match() which it calls repeat-
1717         edly  (sometimes  recursively). The limit set by match_limit is imposed         edly (sometimes recursively). The limit set by match_limit  is  imposed
1718         on the number of times this function is called during  a  match,  which         on  the  number  of times this function is called during a match, which
1719         has  the  effect  of  limiting the amount of backtracking that can take         has the effect of limiting the amount of  backtracking  that  can  take
1720         place. For patterns that are not anchored, the count restarts from zero         place. For patterns that are not anchored, the count restarts from zero
1721         for each position in the subject string.         for each position in the subject string.
1722    
1723         The  default  value  for  the  limit can be set when PCRE is built; the         The default value for the limit can be set  when  PCRE  is  built;  the
1724         default default is 10 million, which handles all but the  most  extreme         default  default  is 10 million, which handles all but the most extreme
1725         cases.  You  can  override  the  default by supplying pcre_exec() with a         cases. You can override the default  by  supplying  pcre_exec()  with  a
1726         pcre_extra    block    in    which    match_limit    is    set,     and         pcre_extra     block    in    which    match_limit    is    set,    and
1727         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1728         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1729    
1730         The match_limit_recursion field is similar to match_limit, but  instead         The  match_limit_recursion field is similar to match_limit, but instead
1731         of limiting the total number of times that match() is called, it limits         of limiting the total number of times that match() is called, it limits
1732         the depth of recursion. The recursion depth is a  smaller  number  than         the  depth  of  recursion. The recursion depth is a smaller number than
1733         the  total number of calls, because not all calls to match() are recur-         the total number of calls, because not all calls to match() are  recur-
1734         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
1735    
1736         Limiting  the  recursion  depth  limits the amount of stack that can be         Limiting the recursion depth limits the amount of  stack  that  can  be
1737         used, or, when PCRE has been compiled to use memory on the heap instead         used, or, when PCRE has been compiled to use memory on the heap instead
1738         of the stack, the amount of heap memory that can be used.         of the stack, the amount of heap memory that can be used.
1739    
1740         The  default  value  for  match_limit_recursion can be set when PCRE is         The default value for match_limit_recursion can be  set  when  PCRE  is
1741         built; the default default  is  the  same  value  as  the  default  for         built;  the  default  default  is  the  same  value  as the default for
1742         match_limit.  You can override the default by supplying pcre_exec() with         match_limit. You can override the default by supplying pcre_exec()  with
1743         a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and         a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1744         PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1745         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1746    
1747         The pcre_callout field is used in conjunction with the  "callout"  fea-         The  pcre_callout  field is used in conjunction with the "callout" fea-
1748         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1749    
1750         The  tables  field  is  used  to  pass  a  character  tables pointer to         The tables field  is  used  to  pass  a  character  tables  pointer  to
1751         pcre_exec(); this overrides the value that is stored with the  compiled         pcre_exec();  this overrides the value that is stored with the compiled
1752         pattern.  A  non-NULL value is stored with the compiled pattern only if         pattern. A non-NULL value is stored with the compiled pattern  only  if
1753         custom tables were supplied to pcre_compile() via  its  tableptr  argu-         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1754         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1755         PCRE's internal tables to be used. This facility is  helpful  when  re-         PCRE's  internal  tables  to be used. This facility is helpful when re-
1756         using  patterns  that  have been saved after compiling with an external         using patterns that have been saved after compiling  with  an  external
1757         set of tables, because the external tables  might  be  at  a  different         set  of  tables,  because  the  external tables might be at a different
1758         address  when  pcre_exec() is called. See the pcreprecompile documenta-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
1759         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
1760    
1761     Option bits for pcre_exec()     Option bits for pcre_exec()
1762    
1763         The unused bits of the options argument for pcre_exec() must  be  zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
1764         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1765         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1766         PCRE_PARTIAL.         PCRE_PARTIAL.
1767    
1768           PCRE_ANCHORED           PCRE_ANCHORED
1769    
1770         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1771         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1772         turned  out to be anchored by virtue of its contents, it cannot be made         turned out to be anchored by virtue of its contents, it cannot be  made
1773         unanchored at matching time.         unanchored at matching time.
1774    
1775           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1776           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1777           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1778             PCRE_NEWLINE_ANYCRLF
1779           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
1780    
1781         These options override  the  newline  definition  that  was  chosen  or         These  options  override  the  newline  definition  that  was chosen or
1782         defaulted  when the pattern was compiled. For details, see the descrip-         defaulted when the pattern was compiled. For details, see the  descrip-
1783         tion of pcre_compile()  above.  During  matching,  the  newline  choice         tion  of  pcre_compile()  above.  During  matching,  the newline choice
1784         affects  the  behaviour  of the dot, circumflex, and dollar metacharac-         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
1785         ters. It may also alter the way the match position is advanced after  a         ters.  It may also alter the way the match position is advanced after a
1786         match  failure  for  an  unanchored  pattern. When PCRE_NEWLINE_CRLF or         match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,
1787         PCRE_NEWLINE_ANY is set, and a match attempt  fails  when  the  current         PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY is set, and a match attempt
1788         position  is  at a CRLF sequence, the match position is advanced by two         fails when the current position is at a CRLF sequence, the match  posi-
1789         characters instead of one, in other words, to after the CRLF.         tion  is  advanced by two characters instead of one, in other words, to
1790           after the CRLF.
1791    
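The advance rule described above can be sketched in C. This is only an illustration of the documented behaviour, not PCRE's internal code: the helper name next_start_offset() and its crlf_is_newline flag are hypothetical, and the sketch assumes single-byte characters (no UTF-8 mode).

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API. Returns the offset at
   which the next match attempt would begin after an attempt starting at
   'pos' has failed. 'crlf_is_newline' is nonzero when CRLF counts as a
   newline (PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF or PCRE_NEWLINE_ANY). */
static size_t next_start_offset(const char *subject, size_t length,
                                size_t pos, int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over the whole CRLF pair */
    return pos + 1;       /* ordinary single-character advance */
}
```

With the subject "a\r\nb", a failure at offset 1 moves the start point to offset 3 when CRLF is a recognized newline, and to offset 2 otherwise.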
1792           PCRE_NOTBOL           PCRE_NOTBOL
1793    
# Line 1985  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2060  MATCHING A PATTERN: THE TRADITIONAL FUNC
2060         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2061         description above.         description above.
2062    
          PCRE_ERROR_NULLWSLIMIT    (-22)  
   
        When a group that can match an empty  substring  is  repeated  with  an  
        unbounded  upper  limit, the subject position at the start of the group  
        must be remembered, so that a test for an empty string can be made when  
        the  end  of the group is reached. Some workspace is required for this;  
        if it runs out, this error is given.  
   
2063           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
2064    
2065         An invalid combination of PCRE_NEWLINE_xxx options was given.         An invalid combination of PCRE_NEWLINE_xxx options was given.
2066    
2067         Error numbers -16 to -20 are not used by pcre_exec().         Error numbers -16 to -20 and -22 are not used by pcre_exec().
2068    
2069    
2070  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
# Line 2132  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2199  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2199    
2200         These  functions call pcre_get_stringnumber(), and if it succeeds, they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
2201         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2202         ate.         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2203           behaviour may not be what you want (see the next section).
2204    
2205    
2206  DUPLICATE SUBPATTERN NAMES  DUPLICATE SUBPATTERN NAMES
# Line 2140  DUPLICATE SUBPATTERN NAMES Line 2208  DUPLICATE SUBPATTERN NAMES
2208         int pcre_get_stringtable_entries(const pcre *code,         int pcre_get_stringtable_entries(const pcre *code,
2209              const char *name, char **first, char **last);              const char *name, char **first, char **last);
2210    
2211         When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for         When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
2212         subpatterns are not required to  be  unique.  Normally,  patterns  with         subpatterns  are  not  required  to  be unique. Normally, patterns with
2213         duplicate  names  are such that in any one match, only one of the named         duplicate names are such that in any one match, only one of  the  named
2214         subpatterns participates. An example is shown in the pcrepattern  docu-         subpatterns  participates. An example is shown in the pcrepattern docu-
2215         mentation. When duplicates are present, pcre_copy_named_substring() and         mentation. When duplicates are present, pcre_copy_named_substring() and
2216         pcre_get_named_substring() return the first substring corresponding  to         pcre_get_named_substring()  return the first substring corresponding to
2217         the  given  name  that  is  set.  If  none  are set, an empty string is         the given name that is set.  If  none  are  set,  an  empty  string  is
2218         returned.  The pcre_get_stringnumber() function returns one of the num-         returned.  The pcre_get_stringnumber() function returns one of the num-
2219         bers  that are associated with the name, but it is not defined which it         bers that are associated with the name, but it is not defined which  it
2220         is.         is.
2221    
2222         If you want to get full details of all captured substrings for a  given         If  you want to get full details of all captured substrings for a given
2223         name,  you  must  use  the pcre_get_stringtable_entries() function. The         name, you must use  the  pcre_get_stringtable_entries()  function.  The
2224         first argument is the compiled pattern, and the second is the name. The         first argument is the compiled pattern, and the second is the name. The
2225         third  and  fourth  are  pointers to variables which are updated by the         third and fourth are pointers to variables which  are  updated  by  the
2226         function. After it has run, they point to the first and last entries in         function. After it has run, they point to the first and last entries in
2227         the  name-to-number  table  for  the  given  name.  The function itself         the name-to-number table  for  the  given  name.  The  function  itself
2228         returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if         returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
2229         there  are none. The format of the table is described above in the sec-         there are none. The format of the table is described above in the  sec-
2230         tion entitled Information about a  pattern.   Given  all  the  relevant         tion  entitled  Information  about  a  pattern.  Given all the relevant
2231         entries  for the name, you can extract each of their numbers, and hence         entries for the name, you can extract each of their numbers, and  hence
2232         the captured data, if any.         the captured data, if any.
2233    
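As a rough sketch of how the entries might be walked: each entry in the name-to-number table starts with the group number in two bytes, most significant byte first, followed by the zero-terminated name. The helper collect_group_numbers() below is hypothetical. It takes the two pointers that pcre_get_stringtable_entries() sets and the entry length that it returns.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API. 'first' and 'last'
   are the pointers set by pcre_get_stringtable_entries(), and
   'entry_size' is that function's (positive) return value. Stores at
   most 'max' group numbers in 'numbers' and returns the total number
   of table entries for the name. */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *numbers, int max)
{
    int count = 0;
    const char *entry;

    for (entry = first; entry <= last; entry += entry_size) {
        const unsigned char *p = (const unsigned char *)entry;
        if (count < max)
            numbers[count] = (p[0] << 8) | p[1];  /* big-endian number */
        count++;
    }
    return count;
}
```

Each number obtained this way can then be passed to pcre_copy_substring() or pcre_get_substring() to fetch the captured data, if that group was set.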
2234    
2235  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2236    
2237         The traditional matching function uses a  similar  algorithm  to  Perl,         The  traditional  matching  function  uses a similar algorithm to Perl,
2238         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2239         the subject. If you want to find all possible matches, or  the  longest         the  subject.  If you want to find all possible matches, or the longest
2240         possible  match,  consider using the alternative matching function (see         possible match, consider using the alternative matching  function  (see
2241         below) instead. If you cannot use the alternative function,  but  still         below)  instead.  If you cannot use the alternative function, but still
2242         need  to  find all possible matches, you can kludge it up by making use         need to find all possible matches, you can kludge it up by  making  use
2243         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2244         tation.         tation.
2245    
2246         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2247         tern.  When your callout function is called, extract and save the  cur-         tern.   When your callout function is called, extract and save the cur-
2248         rent  matched  substring.  Then  return  1, which forces pcre_exec() to         rent matched substring. Then return  1,  which  forces  pcre_exec()  to
2249         backtrack and try other alternatives. Ultimately, when it runs  out  of         backtrack  and  try other alternatives. Ultimately, when it runs out of
2250         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2251    
2252    
# Line 2189  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2257  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2257              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2258              int *workspace, int wscount);              int *workspace, int wscount);
2259    
2260         The  function  pcre_dfa_exec()  is  called  to  match  a subject string         The function pcre_dfa_exec()  is  called  to  match  a  subject  string
2261         against a compiled pattern, using a matching algorithm that  scans  the         against  a  compiled pattern, using a matching algorithm that scans the
2262         subject  string  just  once, and does not backtrack. This has different         subject string just once, and does not backtrack.  This  has  different
2263         characteristics to the normal algorithm, and  is  not  compatible  with         characteristics  to  the  normal  algorithm, and is not compatible with
2264         Perl.  Some  of the features of PCRE patterns are not supported. Never-         Perl. Some of the features of PCRE patterns are not  supported.  Never-
2265         theless, there are times when this kind of matching can be useful.  For         theless,  there are times when this kind of matching can be useful. For
2266         a discussion of the two matching algorithms, see the pcrematching docu-         a discussion of the two matching algorithms, see the pcrematching docu-
2267         mentation.         mentation.
2268    
2269         The arguments for the pcre_dfa_exec() function  are  the  same  as  for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2270         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
2271         ent way, and this is described below. The other  common  arguments  are         ent  way,  and  this is described below. The other common arguments are
2272         used  in  the  same way as for pcre_exec(), so their description is not         used in the same way as for pcre_exec(), so their  description  is  not
2273         repeated here.         repeated here.
2274    
2275         The two additional arguments provide workspace for  the  function.  The         The  two  additional  arguments provide workspace for the function. The
2276         workspace  vector  should  contain at least 20 elements. It is used for         workspace vector should contain at least 20 elements. It  is  used  for
2277         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2278         workspace  will  be  needed for patterns and subjects where there are a         workspace will be needed for patterns and subjects where  there  are  a
2279         lot of potential matches.         lot of potential matches.
2280    
2281         Here is an example of a simple call to pcre_dfa_exec():         Here is an example of a simple call to pcre_dfa_exec():
# Line 2229  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2297  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2297    
2298     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2299    
2300         The unused bits of the options argument  for  pcre_dfa_exec()  must  be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2301         zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2302         LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,  PCRE_NO_UTF8_CHECK,         LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
2303         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2304         three of these are the same as for pcre_exec(), so their description is         three of these are the same as for pcre_exec(), so their description is
2305         not repeated here.         not repeated here.
2306    
2307           PCRE_PARTIAL           PCRE_PARTIAL
2308    
2309         This  has  the  same general effect as it does for pcre_exec(), but the         This has the same general effect as it does for  pcre_exec(),  but  the
2310         details  are  slightly  different.  When  PCRE_PARTIAL   is   set   for         details   are   slightly   different.  When  PCRE_PARTIAL  is  set  for
2311         pcre_dfa_exec(),  the  return code PCRE_ERROR_NOMATCH is converted into         pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is  converted  into
2312         PCRE_ERROR_PARTIAL if the end of the subject  is  reached,  there  have         PCRE_ERROR_PARTIAL  if  the  end  of the subject is reached, there have
2313         been no complete matches, but there is still at least one matching pos-         been no complete matches, but there is still at least one matching pos-
2314         sibility. The portion of the string that provided the partial match  is         sibility.  The portion of the string that provided the partial match is
2315         set as the first matching string.         set as the first matching string.
2316    
2317           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2318    
2319         Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2320         stop as soon as it has found one match. Because of the way the alterna-         stop as soon as it has found one match. Because of the way the alterna-
2321         tive  algorithm  works, this is necessarily the shortest possible match         tive algorithm works, this is necessarily the shortest  possible  match
2322         at the first possible matching point in the subject string.         at the first possible matching point in the subject string.
2323    
2324           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2325    
2326         When pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option,  and         When  pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option, and
2327         returns  a  partial  match, it is possible to call it again, with addi-         returns a partial match, it is possible to call it  again,  with  addi-
2328         tional subject characters, and have it continue with  the  same  match.         tional  subject  characters,  and have it continue with the same match.
2329         The  PCRE_DFA_RESTART  option requests this action; when it is set, the         The PCRE_DFA_RESTART option requests this action; when it is  set,  the
2330         workspace and wscount options must reference the same vector as  before         workspace  and wscount options must reference the same vector as before
2331         because  data  about  the  match so far is left in them after a partial         because data about the match so far is left in  them  after  a  partial
2332         match. There is more discussion of this  facility  in  the  pcrepartial         match.  There  is  more  discussion of this facility in the pcrepartial
2333         documentation.         documentation.
2334    
2335     Successful returns from pcre_dfa_exec()     Successful returns from pcre_dfa_exec()
2336    
2337         When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-         When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
2338         string in the subject. Note, however, that all the matches from one run         string in the subject. Note, however, that all the matches from one run
2339         of  the  function  start  at the same point in the subject. The shorter         of the function start at the same point in  the  subject.  The  shorter
2340         matches are all initial substrings of the longer matches. For  example,         matches  are all initial substrings of the longer matches. For example,
2341         if the pattern         if the pattern
2342    
2343           <.*>           <.*>
# Line 2284  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2352  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2352           <something> <something else>           <something> <something else>
2353           <something> <something else> <something further>           <something> <something else> <something further>
2354    
2355         On  success,  the  yield of the function is a number greater than zero,         On success, the yield of the function is a number  greater  than  zero,
2356         which is the number of matched substrings.  The  substrings  themselves         which  is  the  number of matched substrings. The substrings themselves
2357         are  returned  in  ovector. Each string uses two elements; the first is         are returned in ovector. Each string uses two elements;  the  first  is
2358         the offset to the start, and the second is the offset to  the  end.  In         the  offset  to  the start, and the second is the offset to the end. In
2359         fact,  all  the  strings  have the same start offset. (Space could have         fact, all the strings have the same start  offset.  (Space  could  have
2360         been saved by giving this only once, but it was decided to retain  some         been  saved by giving this only once, but it was decided to retain some
2361         compatibility  with  the  way pcre_exec() returns data, even though the         compatibility with the way pcre_exec() returns data,  even  though  the
2362         meaning of the strings is different.)         meaning of the strings is different.)
2363    
2364         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
2365         est  matching  string is given first. If there were too many matches to         est matching string is given first. If there were too many  matches  to
2366         fit into ovector, the yield of the function is zero, and the vector  is         fit  into ovector, the yield of the function is zero, and the vector is
2367         filled with the longest matches.         filled with the longest matches.
2368    
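The layout of ovector after a successful call can be illustrated with a few small helpers. The names dfa_match_count(), dfa_match_length() and print_dfa_matches() are hypothetical. The sketch assumes that a zero yield means every complete pair in ovector holds a match, as the paragraph above states.

```c
#include <stdio.h>

/* Hypothetical helpers, not part of the PCRE API. 'rc' is the value
   returned by pcre_dfa_exec(); 'ovector' and 'ovecsize' are the
   arguments that were passed to it. */

/* Number of offset pairs in ovector that describe matches. */
static int dfa_match_count(int rc, int ovecsize)
{
    if (rc < 0) return 0;              /* error or no match */
    if (rc == 0) return ovecsize / 2;  /* too many matches: vector is full */
    return rc;
}

/* Length of match number i, where 0 is the longest match. */
static int dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}

/* Print every returned match, longest first. */
static void print_dfa_matches(const char *subject, int rc,
                              const int *ovector, int ovecsize)
{
    int i, n = dfa_match_count(rc, ovecsize);

    for (i = 0; i < n; i++)
        printf("%2d: %.*s\n", i, dfa_match_length(ovector, i),
               subject + ovector[2 * i]);
}
```

Because every match starts at the same offset, only the end offsets differ from one pair to the next.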
2369     Error returns from pcre_dfa_exec()     Error returns from pcre_dfa_exec()
2370    
2371         The  pcre_dfa_exec()  function returns a negative number when it fails.         The pcre_dfa_exec() function returns a negative number when  it  fails.
2372         Many of the errors are the same  as  for  pcre_exec(),  and  these  are         Many  of  the  errors  are  the  same as for pcre_exec(), and these are
2373         described  above.   There are in addition the following errors that are         described above.  There are in addition the following errors  that  are
2374         specific to pcre_dfa_exec():         specific to pcre_dfa_exec():
2375    
2376           PCRE_ERROR_DFA_UITEM      (-16)           PCRE_ERROR_DFA_UITEM      (-16)
2377    
2378         This return is given if pcre_dfa_exec() encounters an item in the  pat-         This  return is given if pcre_dfa_exec() encounters an item in the pat-
2379         tern  that  it  does not support, for instance, the use of \C or a back         tern that it does not support, for instance, the use of \C  or  a  back
2380         reference.         reference.
2381    
2382           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
2383    
2384         This return is given if pcre_dfa_exec()  encounters  a  condition  item         This  return  is  given  if pcre_dfa_exec() encounters a condition item
2385         that  uses  a back reference for the condition, or a test for recursion         that uses a back reference for the condition, or a test  for  recursion
2386         in a specific group. These are not supported.         in a specific group. These are not supported.
2387    
2388           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
2389    
2390         This return is given if pcre_dfa_exec() is called with an  extra  block         This  return  is given if pcre_dfa_exec() is called with an extra block
2391         that contains a setting of the match_limit field. This is not supported         that contains a setting of the match_limit field. This is not supported
2392         (it is meaningless).         (it is meaningless).
2393    
2394           PCRE_ERROR_DFA_WSSIZE     (-19)           PCRE_ERROR_DFA_WSSIZE     (-19)
2395    
2396         This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the         This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
2397         workspace vector.         workspace vector.
2398    
2399           PCRE_ERROR_DFA_RECURSE    (-20)           PCRE_ERROR_DFA_RECURSE    (-20)
2400    
2401         When  a  recursive subpattern is processed, the matching function calls         When a recursive subpattern is processed, the matching  function  calls
2402         itself recursively, using private vectors for  ovector  and  workspace.         itself  recursively,  using  private vectors for ovector and workspace.
2403         This  error  is  given  if  the output vector is not large enough. This         This error is given if the output vector  is  not  large  enough.  This
2404         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
2405    
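A caller may want to turn these return codes into diagnostics. The following is a minimal sketch: the function dfa_error_text() and its message strings are hypothetical, and the constants are defined locally (with the values listed above) only so that the fragment compiles on its own. Real code would take them from pcre.h.

```c
#include <stddef.h>

/* Values as listed in the text; normally supplied by <pcre.h>. */
#ifndef PCRE_ERROR_DFA_UITEM
#define PCRE_ERROR_DFA_UITEM    (-16)
#define PCRE_ERROR_DFA_UCOND    (-17)
#define PCRE_ERROR_DFA_UMLIMIT  (-18)
#define PCRE_ERROR_DFA_WSSIZE   (-19)
#define PCRE_ERROR_DFA_RECURSE  (-20)
#endif

/* Hypothetical helper: returns a short description for the errors that
   are specific to pcre_dfa_exec(), or NULL for any other value. */
static const char *dfa_error_text(int rc)
{
    switch (rc) {
    case PCRE_ERROR_DFA_UITEM:   return "unsupported item in pattern";
    case PCRE_ERROR_DFA_UCOND:   return "unsupported condition in pattern";
    case PCRE_ERROR_DFA_UMLIMIT: return "match_limit is not supported";
    case PCRE_ERROR_DFA_WSSIZE:  return "workspace vector too small";
    case PCRE_ERROR_DFA_RECURSE: return "output vector too small in recursion";
    default:                     return NULL;
    }
}
```

Of these, only PCRE_ERROR_DFA_WSSIZE can be cured by retrying the same call, with a larger workspace. The first three require a change to the pattern or to the extra block.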
2406    
2407  SEE ALSO  SEE ALSO
2408    
2409         pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),         pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
2410         pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).         pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
2411    
2412    
2413  AUTHOR  AUTHOR
# Line 2351  AUTHOR Line 2419  AUTHOR
2419    
2420  REVISION  REVISION
2421    
2422         Last updated: 06 March 2007         Last updated: 30 July 2007
2423         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2424  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2425    
# Line 2379  PCRE CALLOUTS Line 2447  PCRE CALLOUTS
2447         default value is zero.  For  example,  this  pattern  has  two  callout         default value is zero.  For  example,  this  pattern  has  two  callout
2448         points:         points:
2449    
2450           (?C1)eabc(?C2)def           (?C1)abc(?C2)def
2451    
2452         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
2453         called, PCRE automatically  inserts  callouts,  all  with  number  255,         called, PCRE automatically  inserts  callouts,  all  with  number  255,
# Line 2454  THE CALLOUT INTERFACE Line 2522  THE CALLOUT INTERFACE
2522         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
2523         were passed to pcre_exec().         were passed to pcre_exec().
2524    
2525         The start_match field contains the offset within the subject  at  which         The start_match field normally contains the offset within  the  subject
2526         the  current match attempt started. If the pattern is not anchored, the         at  which  the  current  match  attempt started. However, if the escape
2527         callout function may be called several times from the same point in the         sequence \K has been encountered, this value is changed to reflect  the
2528         pattern for different starting points in the subject.         modified  starting  point.  If the pattern is not anchored, the callout
2529           function may be called several times from the same point in the pattern
2530           for different starting points in the subject.
2531    
2532         The  current_position  field  contains the offset within the subject of         The  current_position  field  contains the offset within the subject of
2533         the current match pointer.         the current match pointer.
# Line 2520  AUTHOR Line 2590  AUTHOR
2590    
2591  REVISION  REVISION
2592    
2593         Last updated: 06 March 2007         Last updated: 29 May 2007
2594         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2595  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2596    
# Line 2536  DIFFERENCES BETWEEN PCRE AND PERL Line 2606  DIFFERENCES BETWEEN PCRE AND PERL
2606    
2607         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2608         handle regular expressions. The differences described here  are  mainly         handle regular expressions. The differences described here  are  mainly
2609         with  respect  to  Perl 5.8, though PCRE version 7.0 contains some fea-         with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
2610         tures that are expected to be in the forthcoming Perl 5.10.         some features that are expected to be in the forthcoming Perl 5.10.
2611    
2612         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
2613         of  what  it does have are given in the section on UTF-8 support in the         of  what  it does have are given in the section on UTF-8 support in the
# Line 2615  DIFFERENCES BETWEEN PCRE AND PERL Line 2685  DIFFERENCES BETWEEN PCRE AND PERL
2685         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2686    
2687         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2688         cial  meaning  is  faulted.  Otherwise,  like  Perl,  the  backslash is         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
2689         ignored. (Perl can be made to issue a warning.)         ignored.  (Perl can be made to issue a warning.)
2690    
2691         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
2692         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
# Line 2648  AUTHOR Line 2718  AUTHOR
2718    
2719  REVISION  REVISION
2720    
2721         Last updated: 06 March 2007         Last updated: 13 June 2007
2722         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2723  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2724    
# Line 2681  PCRE REGULAR EXPRESSION DETAILS Line 2751  PCRE REGULAR EXPRESSION DETAILS
2751         ported  by  PCRE when its main matching function, pcre_exec(), is used.         ported  by  PCRE when its main matching function, pcre_exec(), is used.
2752         From  release  6.0,   PCRE   offers   a   second   matching   function,         From  release  6.0,   PCRE   offers   a   second   matching   function,
2753         pcre_dfa_exec(),  which matches using a different algorithm that is not         pcre_dfa_exec(),  which matches using a different algorithm that is not
2754         Perl-compatible. The advantages and disadvantages  of  the  alternative         Perl-compatible. Some of the features discussed below are not available
2755         function, and how it differs from the normal function, are discussed in         when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
2756         the pcrematching page.         alternative function, and how it differs from the normal function,  are
2757           discussed in the pcrematching page.
2758    
2759    
2760  CHARACTERS AND METACHARACTERS  CHARACTERS AND METACHARACTERS
2761    
2762         A regular expression is a pattern that is  matched  against  a  subject         A  regular  expression  is  a pattern that is matched against a subject
2763         string  from  left  to right. Most characters stand for themselves in a         string from left to right. Most characters stand for  themselves  in  a
2764         pattern, and match the corresponding characters in the  subject.  As  a         pattern,  and  match  the corresponding characters in the subject. As a
2765         trivial example, the pattern         trivial example, the pattern
2766    
2767           The quick brown fox           The quick brown fox
2768    
2769         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
2770         caseless matching is specified (the PCRE_CASELESS option), letters  are         caseless  matching is specified (the PCRE_CASELESS option), letters are
2771         matched  independently  of case. In UTF-8 mode, PCRE always understands         matched independently of case. In UTF-8 mode, PCRE  always  understands
2772         the concept of case for characters whose values are less than  128,  so         the  concept  of case for characters whose values are less than 128, so
2773         caseless  matching  is always possible. For characters with higher val-         caseless matching is always possible. For characters with  higher  val-
2774         ues, the concept of case is supported if PCRE is compiled with  Unicode         ues,  the concept of case is supported if PCRE is compiled with Unicode
2775         property  support,  but  not  otherwise.   If  you want to use caseless         property support, but not otherwise.   If  you  want  to  use  caseless
2776         matching for characters 128 and above, you must  ensure  that  PCRE  is         matching  for  characters  128  and above, you must ensure that PCRE is
2777         compiled with Unicode property support as well as with UTF-8 support.         compiled with Unicode property support as well as with UTF-8 support.
2778    
2779         The  power  of  regular  expressions  comes from the ability to include         The power of regular expressions comes  from  the  ability  to  include
2780         alternatives and repetitions in the pattern. These are encoded  in  the         alternatives  and  repetitions in the pattern. These are encoded in the
2781         pattern by the use of metacharacters, which do not stand for themselves         pattern by the use of metacharacters, which do not stand for themselves
2782         but instead are interpreted in some special way.         but instead are interpreted in some special way.
2783    
2784         There are two different sets of metacharacters: those that  are  recog-         There  are  two different sets of metacharacters: those that are recog-
2785         nized  anywhere in the pattern except within square brackets, and those         nized anywhere in the pattern except within square brackets, and  those
2786         that are recognized within square brackets.  Outside  square  brackets,         that  are  recognized  within square brackets. Outside square brackets,
2787         the metacharacters are as follows:         the metacharacters are as follows:
2788    
2789           \      general escape character with several uses           \      general escape character with several uses
# Line 2731  CHARACTERS AND METACHARACTERS Line 2802  CHARACTERS AND METACHARACTERS
2802                  also "possessive quantifier"                  also "possessive quantifier"
2803           {      start min/max quantifier           {      start min/max quantifier
2804    
2805         Part  of  a  pattern  that is in square brackets is called a "character         Part of a pattern that is in square brackets  is  called  a  "character
2806         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
2807    
2808           \      general escape character           \      general escape character
# Line 2741  CHARACTERS AND METACHARACTERS Line 2812  CHARACTERS AND METACHARACTERS
2812                    syntax)                    syntax)
2813           ]      terminates the character class           ]      terminates the character class
2814    
2815         The following sections describe the use of each of the  metacharacters.         The  following sections describe the use of each of the metacharacters.
2816    
2817    
2818  BACKSLASH  BACKSLASH
2819    
2820         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2821         a non-alphanumeric character, it takes away any  special  meaning  that         a  non-alphanumeric  character,  it takes away any special meaning that
2822         character  may  have.  This  use  of  backslash  as an escape character         character may have. This  use  of  backslash  as  an  escape  character
2823         applies both inside and outside character classes.         applies both inside and outside character classes.
2824    
2825         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
2826         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
2827         character would otherwise be interpreted as a metacharacter, so  it  is         character  would  otherwise be interpreted as a metacharacter, so it is
2828         always  safe  to  precede  a non-alphanumeric with backslash to specify         always safe to precede a non-alphanumeric  with  backslash  to  specify
2829         that it stands for itself. In particular, if you want to match a  back-         that  it stands for itself. In particular, if you want to match a back-
2830         slash, you write \\.         slash, you write \\.
2831    
2832         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
2833         the pattern (other than in a character class) and characters between  a         the  pattern (other than in a character class) and characters between a
2834         # outside a character class and the next newline are ignored. An escap-         # outside a character class and the next newline are ignored. An escap-
2835         ing backslash can be used to include a whitespace  or  #  character  as         ing  backslash  can  be  used to include a whitespace or # character as
2836         part of the pattern.         part of the pattern.
2837    
2838         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
2839         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
2840         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
2841         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
2842         tion. Note the following examples:         tion. Note the following examples:
2843    
2844           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 2777  BACKSLASH Line 2848  BACKSLASH
2848           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
2849           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
2850    
2851         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
2852         classes.         classes.
2853    
2854     Non-printing characters     Non-printing characters
2855    
2856         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
2857         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
2858         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
2859         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
2860         editing, it is usually easier  to  use  one  of  the  following  escape         editing,  it  is  usually  easier  to  use  one of the following escape
2861         sequences than the binary character it represents:         sequences than the binary character it represents:
2862    
2863           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 2800  BACKSLASH Line 2871  BACKSLASH
2871           \xhh      character with hex code hh           \xhh      character with hex code hh
2872           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh..
2873    
2874         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
2875         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
2876         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
2877         becomes hex 7B.         becomes hex 7B.
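
The \cx rule above can be sketched as a small Python function. This is only a model of the arithmetic described in the text, not PCRE code; the name control_escape_value is hypothetical, and rejecting non-ASCII input is an assumption of the sketch.

```python
def control_escape_value(x: str) -> int:
    # Model of the \cx rule: a lower case letter is first converted to
    # upper case, then bit 6 (hex 40) of the character is inverted.
    if len(x) != 1 or ord(x) > 127:
        # Assumption of this sketch: only single ASCII characters are accepted.
        raise ValueError("expected a single ASCII character")
    if "a" <= x <= "z":
        x = x.upper()
    return ord(x) ^ 0x40


print(hex(control_escape_value("z")))  # \cz
print(hex(control_escape_value("{")))  # \c{
print(hex(control_escape_value(";")))  # \c;
```

The three calls correspond to the \cz, \c{ and \c; cases quoted in the text.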
2878    
2879         After \x, from zero to two hexadecimal digits are read (letters can  be         After  \x, from zero to two hexadecimal digits are read (letters can be
2880         in  upper  or  lower case). Any number of hexadecimal digits may appear         in upper or lower case). Any number of hexadecimal  digits  may  appear
2881         between \x{ and }, but the value of the character  code  must  be  less         between  \x{  and  },  but the value of the character code must be less
2882         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2883         the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than         the  maximum  hexadecimal  value is 7FFFFFFF). If characters other than
2884         hexadecimal  digits  appear between \x{ and }, or if there is no termi-         hexadecimal digits appear between \x{ and }, or if there is  no  termi-
2885         nating }, this form of escape is not recognized.  Instead, the  initial         nating  }, this form of escape is not recognized.  Instead, the initial
2886         \x will be interpreted as a basic hexadecimal escape, with no following         \x will be interpreted as a basic hexadecimal escape, with no following
2887         digits, giving a character whose value is zero.         digits, giving a character whose value is zero.
2888    
2889         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2890         two  syntaxes  for  \x. There is no difference in the way they are han-         two syntaxes for \x. There is no difference in the way  they  are  han-
2891         dled. For example, \xdc is exactly the same as \x{dc}.         dled. For example, \xdc is exactly the same as \x{dc}.
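
The two \x syntaxes can be modelled roughly as follows. This is a sketch under stated assumptions, not the PCRE implementation: parse_hex_escape is a hypothetical name, it receives the pattern text that follows "\x", it raises a Python error where the text says only that the value "must be less than" the limit, and it treats empty braces as unrecognized (the text does not cover that case).

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_hex_escape(after_x: str, utf8: bool = False):
    # Returns (character value, number of characters consumed after "\x").
    if after_x.startswith("{"):
        close = after_x.find("}")
        body = after_x[1:close] if close != -1 else ""
        if close != -1 and body and all(c in HEX_DIGITS for c in body):
            value = int(body, 16)
            limit = 2**31 if utf8 else 256
            if value >= limit:
                # Assumption: model the documented limit as an error.
                raise ValueError("character code too large for this mode")
            return value, close + 1
        # Not a well-formed \x{...}: the initial \x is taken as a basic
        # hexadecimal escape with no digits, giving the value zero.
        return 0, 0
    # Basic form: from zero to two hexadecimal digits.
    n = 0
    while n < 2 and n < len(after_x) and after_x[n] in HEX_DIGITS:
        n += 1
    return (int(after_x[:n], 16) if n else 0), n
```

With this model, parse_hex_escape("dc") and parse_hex_escape("{dc}") yield the same character value, which is the equivalence the text states for \xdc and \x{dc}.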
2892    
2893         After \0 up to two further octal digits are read. If  there  are  fewer         After  \0  up  to two further octal digits are read. If there are fewer
2894         than  two  digits,  just  those  that  are  present  are used. Thus the         than two digits, just  those  that  are  present  are  used.  Thus  the
2895         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
2896         (code  value 7). Make sure you supply two digits after the initial zero         (code value 7). Make sure you supply two digits after the initial  zero
2897         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
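
The \0 rule can be sketched in the same way. The function below is a hypothetical model (parse_zero_escape is not a PCRE name); it receives the pattern text that follows "\0" and reads at most two further octal digits.

```python
def parse_zero_escape(after_zero: str):
    # Returns (character value, number of characters consumed after "\0").
    n = 0
    while n < 2 and n < len(after_zero) and after_zero[n] in "01234567":
        n += 1
    # With no further digits the value is binary zero.
    return (int(after_zero[:n], 8) if n else 0), n
```

Applied to the text after the leading "\0" in \0113, the model consumes "11" and leaves "3" as a literal, which is why two digits should be supplied when an octal digit follows.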
2898    
2899         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2900         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
2901         its as a decimal number. If the number is less than  10,  or  if  there         its  as  a  decimal  number. If the number is less than 10, or if there
2902         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
2903         expression, the entire  sequence  is  taken  as  a  back  reference.  A         expression,  the  entire  sequence  is  taken  as  a  back reference. A
2904         description  of how this works is given later, following the discussion         description of how this works is given later, following the  discussion
2905         of parenthesized subpatterns.         of parenthesized subpatterns.
2906    
2907         Inside a character class, or if the decimal number is  greater  than  9         Inside  a  character  class, or if the decimal number is greater than 9
2908         and  there have not been that many capturing subpatterns, PCRE re-reads         and there have not been that many capturing subpatterns, PCRE  re-reads
2909         up to three octal digits following the backslash, and uses them to gen-         up to three octal digits following the backslash, and uses them to gen-
2910         erate  a data character. Any subsequent digits stand for themselves. In         erate a data character. Any subsequent digits stand for themselves.  In
2911         non-UTF-8 mode, the value of a character specified  in  octal  must  be         non-UTF-8  mode,  the  value  of a character specified in octal must be
2912         less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For         less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
2913         example:         example:
2914    
2915           \040   is another way of writing a space           \040   is another way of writing a space
# Line 2856  BACKSLASH Line 2927  BACKSLASH
2927           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
2928                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
2929    
2930         Note that octal values of 100 or greater must not be  introduced  by  a         Note  that  octal  values of 100 or greater must not be introduced by a
2931         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
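
The decision between a back reference and an octal character can be summarized in a short model. This is a sketch of the rule as described above, not PCRE's parser: interpret_backslash_digits and its parameters are hypothetical, digits is the run of decimal digits after the backslash (first digit not 0), and the per-mode limits (\400 and \777) are not checked.

```python
def interpret_backslash_digits(digits: str, captures_so_far: int,
                               in_class: bool = False):
    # Returns ("backref", number, "") or ("char", value, literal_rest).
    if not digits or digits[0] == "0" or any(c not in "0123456789" for c in digits):
        raise ValueError("expected decimal digits with a non-zero first digit")
    number = int(digits)
    # Outside a class: a number below 10, or one with at least that many
    # previous capturing left parentheses, is a back reference.
    if not in_class and (number < 10 or number <= captures_so_far):
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character;
    # any subsequent digits stand for themselves.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    value = int(digits[:n], 8) if n else 0
    return ("char", value, digits[n:])
```

For example, interpret_backslash_digits("81", 0) gives a binary zero followed by the literal "81", while interpret_backslash_digits("11", 11) gives back reference 11.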
2932    
2933         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
2934         inside and outside character classes. In addition, inside  a  character         inside  and  outside character classes. In addition, inside a character
2935         class,  the  sequence \b is interpreted as the backspace character (hex         class, the sequence \b is interpreted as the backspace  character  (hex
2936         08), and the sequences \R and \X are interpreted as the characters  "R"         08),  and the sequences \R and \X are interpreted as the characters "R"
2937         and  "X", respectively. Outside a character class, these sequences have         and "X", respectively. Outside a character class, these sequences  have
2938         different meanings (see below).         different meanings (see below).
2939    
2940     Absolute and relative back references     Absolute and relative back references
2941    
2942         The sequence \g followed by a positive or negative  number,  optionally         The  sequence  \g followed by a positive or negative number, optionally
2943         enclosed  in  braces,  is  an absolute or relative back reference. Back         enclosed in braces, is an absolute or relative back reference. A  named
2944         references are discussed later, following the discussion  of  parenthe-         back  reference can be coded as \g{name}. Back references are discussed
2945         sized subpatterns.         later, following the discussion of parenthesized subpatterns.
2946    
2947     Generic character types     Generic character types
2948    
# Line 2880  BACKSLASH Line 2951  BACKSLASH
2951    
2952           \d     any decimal digit           \d     any decimal digit
2953           \D     any character that is not a decimal digit           \D     any character that is not a decimal digit
2954             \h     any horizontal whitespace character
2955             \H     any character that is not a horizontal whitespace character
2956           \s     any whitespace character           \s     any whitespace character
2957           \S     any character that is not a whitespace character           \S     any character that is not a whitespace character
2958             \v     any vertical whitespace character
2959             \V     any character that is not a vertical whitespace character
2960           \w     any "word" character           \w     any "word" character
2961           \W     any "non-word" character           \W     any "non-word" character
2962    
2963         Each pair of escape sequences partitions the complete set of characters         Each pair of escape sequences partitions the complete set of characters
2964         into  two disjoint sets. Any given character matches one, and only one,         into two disjoint sets. Any given character matches one, and only  one,
2965         of each pair.         of each pair.
2966    
2967         These character type sequences can appear both inside and outside char-         These character type sequences can appear both inside and outside char-
2968         acter  classes.  They each match one character of the appropriate type.         acter classes. They each match one character of the  appropriate  type.
2969         If the current matching point is at the end of the subject string,  all         If  the current matching point is at the end of the subject string, all
2970         of them fail, since there is no character to match.         of them fail, since there is no character to match.
2971    
2972         For  compatibility  with Perl, \s does not match the VT character (code         For compatibility with Perl, \s does not match the VT  character  (code
2973         11).  This makes it different from the POSIX "space" class. The  \s         11).   This makes it different from the POSIX "space" class. The \s
2974         characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If         characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If
2975         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
2976         ter. In PCRE, it never does.)         ter. In PCRE, it never does.
   
        A "word" character is an underscore or any character less than 256 that  
        is a letter or digit. The definition of  letters  and  digits  is  con-  
        trolled  by PCRE's low-valued character tables, and may vary if locale-  
        specific matching is taking place (see "Locale support" in the  pcreapi  
        page).  For  example,  in  the  "fr_FR" (French) locale, some character  
        codes greater than 128 are used for accented  letters,  and  these  are  
        matched by \w.  
2977    
2978         In  UTF-8 mode, characters with values greater than 128 never match \d,         In UTF-8 mode, characters with values greater than 128 never match  \d,
2979         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2980         code  character  property support is available. The use of locales with         code character property support is available.  These  sequences  retain
2981         Unicode is discouraged.         their original meanings from before UTF-8 support was available, mainly
2982           for efficiency reasons.
2983    
2984           The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
2985           the  other  sequences, these do match certain high-valued codepoints in
2986           UTF-8 mode.  The horizontal space characters are:
2987    
2988             U+0009     Horizontal tab
2989             U+0020     Space
2990             U+00A0     Non-break space
2991             U+1680     Ogham space mark
2992             U+180E     Mongolian vowel separator
2993             U+2000     En quad
2994             U+2001     Em quad
2995             U+2002     En space
2996             U+2003     Em space
2997             U+2004     Three-per-em space
2998             U+2005     Four-per-em space
2999             U+2006     Six-per-em space
3000             U+2007     Figure space
3001             U+2008     Punctuation space
3002             U+2009     Thin space
3003             U+200A     Hair space
3004             U+202F     Narrow no-break space
3005             U+205F     Medium mathematical space
3006             U+3000     Ideographic space
3007    
3008           The vertical space characters are:
3009    
3010             U+000A     Linefeed
3011             U+000B     Vertical tab
3012             U+000C     Formfeed
3013             U+000D     Carriage return
3014             U+0085     Next line
3015             U+2028     Line separator
3016             U+2029     Paragraph separator
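
The two lists above translate directly into lookup sets. The following Python sketch only restates the tables; the names HORIZONTAL_SPACE, VERTICAL_SPACE, matches_h and matches_v are hypothetical and are not part of PCRE.

```python
# Code points matched by \h, copied from the horizontal space list.
HORIZONTAL_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))      # U+2000 (en quad) to U+200A (hair space)
    + [0x202F, 0x205F, 0x3000]
)

# Code points matched by \v, copied from the vertical space list.
VERTICAL_SPACE = frozenset(
    [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]
)


def matches_h(ch: str) -> bool:
    # True if \h would match this character; \H is the complement.
    return ord(ch) in HORIZONTAL_SPACE


def matches_v(ch: str) -> bool:
    # True if \v would match this character; \V is the complement.
    return ord(ch) in VERTICAL_SPACE
```

Note that VT (U+000B) is in the vertical set even though \s does not match it.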
3017    
3018           A "word" character is an underscore or any character less than 256 that
3019           is  a  letter  or  digit.  The definition of letters and digits is con-
3020           trolled by PCRE's low-valued character tables, and may vary if  locale-
3021           specific  matching is taking place (see "Locale support" in the pcreapi
3022           page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
3023           systems,  or "french" in Windows, some character codes greater than 128
3024           are used for accented letters, and these are matched by \w. The use  of
3025           locales with Unicode is discouraged.
3026    
3027     Newline sequences     Newline sequences
3028    
3029         Outside a character class, the escape sequence \R matches  any  Unicode         Outside  a  character class, the escape sequence \R matches any Unicode
3030         newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is         newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R  is
3031         equivalent to the following:         equivalent to the following:
3032    
3033           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
3034    
3035         This is an example of an "atomic group", details  of  which  are  given         This  is  an  example  of an "atomic group", details of which are given
3036         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
3037         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
3038         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3039         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
3040         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
3041    
3042         In  UTF-8  mode, two additional characters whose codepoints are greater         In UTF-8 mode, two additional characters whose codepoints  are  greater
3043         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3044         rator,  U+2029).   Unicode character property support is not needed for         rator, U+2029).  Unicode character property support is not  needed  for
3045         these characters to be recognized.         these characters to be recognized.
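
The behaviour of \R described above can be modelled without a regex engine. This is a sketch, not PCRE code: match_R is a hypothetical helper that returns how many characters \R would consume at a given position, with the utf8 flag standing in for UTF-8 mode.

```python
SINGLE_NEWLINES = "\n\x0b\f\r\x85"      # LF, VT, FF, CR, NEL
UTF8_ONLY_NEWLINES = "\u2028\u2029"     # LS and PS, added in UTF-8 mode


def match_R(subject: str, pos: int, utf8: bool = False) -> int:
    # CR followed by LF is tried first and taken as one unit, which
    # mirrors the atomic group: the pair is never split afterwards.
    if subject.startswith("\r\n", pos):
        return 2
    if pos < len(subject):
        ch = subject[pos]
        if ch in SINGLE_NEWLINES or (utf8 and ch in UTF8_ONLY_NEWLINES):
            return 1
    return 0  # no newline sequence at this position
```

Because the CR LF pair is consumed whole, a pattern such as \R\n cannot match the two-character subject CR LF in this model: there is nothing left for \n after \R has matched.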
3046    
3047         Inside a character class, \R matches the letter "R".         Inside a character class, \R matches the letter "R".
# Line 2938  BACKSLASH Line 3049  BACKSLASH
3049     Unicode character properties     Unicode character properties
3050    
3051         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
3052         tional  escape  sequences  to  match character properties are available         tional escape sequences that match characters with specific  properties
3053         when UTF-8 mode is selected. They are:         are  available.   When not in UTF-8 mode, these sequences are of course
3054           limited to testing characters whose codepoints are less than  256,  but
3055           they do work in this mode.  The extra escape sequences are:
3056    
3057           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
3058           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
3059           \X       an extended Unicode sequence           \X       an extended Unicode sequence
3060    
3061         The property names represented by xx above are limited to  the  Unicode         The  property  names represented by xx above are limited to the Unicode
3062         script names, the general category properties, and "Any", which matches         script names, the general category properties, and "Any", which matches
3063         any character (including newline). Other properties such as "InMusical-         any character (including newline). Other properties such as "InMusical-
3064         Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does         Symbols" are not currently supported by PCRE. Note  that  \P{Any}  does
3065         not match any characters, so always causes a match failure.         not match any characters, so always causes a match failure.
3066    
3067         Sets of Unicode characters are defined as belonging to certain scripts.         Sets of Unicode characters are defined as belonging to certain scripts.
3068         A  character from one of these sets can be matched using a script name.         A character from one of these sets can be matched using a script  name.
3069         For example:         For example:
3070    
3071           \p{Greek}           \p{Greek}
3072           \P{Han}           \P{Han}
3073    
3074         Those that are not part of an identified script are lumped together  as         Those  that are not part of an identified script are lumped together as
3075         "Common". The current list of scripts is:         "Common". The current list of scripts is:
3076    
3077         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
3078         Buhid,  Canadian_Aboriginal,  Cherokee,  Common,   Coptic,   Cuneiform,         Buhid,   Canadian_Aboriginal,   Cherokee,  Common,  Coptic,  Cuneiform,
3079         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
3080         Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,  Hira-         Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
3081         gana,  Inherited,  Kannada,  Katakana,  Kharoshthi,  Khmer, Lao, Latin,         gana, Inherited, Kannada,  Katakana,  Kharoshthi,  Khmer,  Lao,  Latin,
3082         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
3083         Ogham,  Old_Italic,  Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,         Ogham, Old_Italic, Old_Persian, Oriya, Osmanya,  Phags_Pa,  Phoenician,
3084         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,
3085         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
3086    
3087         Each  character has exactly one general category property, specified by         Each character has exactly one general category property, specified  by
3088         a two-letter abbreviation. For compatibility with Perl, negation can be         a two-letter abbreviation. For compatibility with Perl, negation can be
3089         specified  by  including a circumflex between the opening brace and the         specified by including a circumflex between the opening brace  and  the
3090         property name. For example, \p{^Lu} is the same as \P{Lu}.         property name. For example, \p{^Lu} is the same as \P{Lu}.
3091    
3092         If only one letter is specified with \p or \P, it includes all the gen-         If only one letter is specified with \p or \P, it includes all the gen-
3093         eral  category properties that start with that letter. In this case, in         eral category properties that start with that letter. In this case,  in
3094         the absence of negation, the curly brackets in the escape sequence  are         the  absence of negation, the curly brackets in the escape sequence are
3095         optional; these two examples have the same effect:         optional; these two examples have the same effect:
3096    
3097           \p{L}           \p{L}
# Line 3030  BACKSLASH Line 3143  BACKSLASH
3143           Zp    Paragraph separator           Zp    Paragraph separator
3144           Zs    Space separator           Zs    Space separator
3145    
3146         The  special property L& is also supported: it matches a character that         The special property L& is also supported: it matches a character  that
3147         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
3148         classified as a modifier or "other".         classified as a modifier or "other".
3149    
3150         The  long  synonyms  for  these  properties that Perl supports (such as         The long synonyms for these properties  that  Perl  supports  (such  as
3151         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
3152         any of these properties with "Is".         any of these properties with "Is".
3153    
3154         No character that is in the Unicode table has the Cn (unassigned) prop-         No character that is in the Unicode table has the Cn (unassigned) prop-
3155         erty.  Instead, this property is assumed for any code point that is not         erty.  Instead, this property is assumed for any code point that is not
3156         in the Unicode table.         in the Unicode table.
3157    
3158         Specifying  caseless  matching  does not affect these escape sequences.         Specifying caseless matching does not affect  these  escape  sequences.
3159         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
3160    
3161         The \X escape matches any number of Unicode  characters  that  form  an         The  \X  escape  matches  any number of Unicode characters that form an
3162         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
3163    
3164           (?>\PM\pM*)           (?>\PM\pM*)
3165    
3166         That  is,  it matches a character without the "mark" property, followed         That is, it matches a character without the "mark"  property,  followed
3167         by zero or more characters with the "mark"  property,  and  treats  the         by  zero  or  more  characters with the "mark" property, and treats the
3168         sequence  as  an  atomic group (see below).  Characters with the "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
3169         property are typically accents that affect the preceding character.         property  are  typically  accents  that affect the preceding character.
3170           None of them have codepoints less than 256, so  in  non-UTF-8  mode  \X
3171           matches any one character.
3172    
3173         Matching characters by Unicode property is not fast, because  PCRE  has         Matching  characters  by Unicode property is not fast, because PCRE has
3174         to  search  a  structure  that  contains data for over fifteen thousand         to search a structure that contains  data  for  over  fifteen  thousand
3175         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
3176         \w do not use Unicode properties in PCRE.         \w do not use Unicode properties in PCRE.
3177    
3178       Resetting the match start
3179    
3180           The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
3181           ously  matched  characters  not  to  be  included  in the final matched
3182           sequence. For example, the pattern:
3183    
3184             foo\Kbar
3185    
3186           matches "foobar", but reports that it has matched "bar".  This  feature
3187           is  similar  to  a lookbehind assertion (described below).  However, in
3188           this case, the part of the subject before the real match does not  have
3189           to  be of fixed length, as lookbehind assertions do. The use of \K does
3190           not interfere with the setting of captured  substrings.   For  example,
3191           when the pattern
3192    
3193             (foo)\Kbar
3194    
3195           matches "foobar", the first substring is still set to "foo".
3196    
3197     Simple assertions     Simple assertions
3198    
3199         The  final use of backslash is for certain simple assertions. An asser-         The  final use of backslash is for certain simple assertions. An asser-
# Line 3275  SQUARE BRACKETS AND CHARACTER CLASSES Line 3409  SQUARE BRACKETS AND CHARACTER CLASSES
3409         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
3410         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
3411         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
3412         character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches         character tables for a French locale are in  use,  [\xc8-\xcb]  matches
3413         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
3414         concept of case for characters with values greater than 128  only  when         concept of case for characters with values greater than 128  only  when
3415         it is compiled with Unicode property support.         it is compiled with Unicode property support.
# Line 3460  SUBPATTERNS Line 3594  SUBPATTERNS
3594         "Saturday".         "Saturday".
3595    
3596    
3597    DUPLICATE SUBPATTERN NUMBERS
3598    
3599           Perl 5.10 introduced a feature whereby each alternative in a subpattern
3600           uses the same numbers for its capturing parentheses. Such a  subpattern
3601           starts  with (?| and is itself a non-capturing subpattern. For example,
3602           consider this pattern:
3603    
3604             (?|(Sat)ur|(Sun))day
3605    
3606           Because the two alternatives are inside a (?| group, both sets of  cap-
3607           turing  parentheses  are  numbered one. Thus, when the pattern matches,
3608           you can look at captured substring number  one,  whichever  alternative
3609           matched.  This  construct  is useful when you want to capture part, but
3610           not all, of one of a number of alternatives. Inside a (?| group, paren-
3611           theses  are  numbered as usual, but the number is reset at the start of
3612           each branch. The numbers of any capturing buffers that follow the  sub-
3613           pattern  start after the highest number used in any branch. The follow-
3614           ing example is taken from the Perl documentation.  The  numbers  under-
3615           neath show in which buffer the captured content will be stored.
3616    
3617             # before  ---------------branch-reset----------- after
3618             / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3619             # 1            2         2  3        2     3     4
3620    
3621           A  backreference  or  a  recursive call to a numbered subpattern always
3622           refers to the first one in the pattern with the given number.
3623    
3624           An alternative approach to using this "branch reset" feature is to  use
3625           duplicate named subpatterns, as described in the next section.
3626    
3627    
3628  NAMED SUBPATTERNS  NAMED SUBPATTERNS
3629    
3630         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying  capturing  parentheses  by number is simple, but it can be
# Line 3499  NAMED SUBPATTERNS Line 3664  NAMED SUBPATTERNS
3664           (?<DN>Sat)(?:urday)?           (?<DN>Sat)(?:urday)?
3665    
3666         There  are  five capturing substrings, but only one is ever set after a         There  are  five capturing substrings, but only one is ever set after a
3667         match.  The convenience  function  for  extracting  the  data  by  name         match.  (An alternative way of solving this problem is to use a "branch
3668         returns  the  substring  for  the first (and in this example, the only)         reset" subpattern, as described in the previous section.)
3669         subpattern of that name that matched.  This  saves  searching  to  find  
3670         which  numbered  subpattern  it  was. If you make a reference to a non-         The  convenience  function  for extracting the data by name returns the
3671         unique named subpattern from elsewhere in the  pattern,  the  one  that         substring for the first (and in this example, the only)  subpattern  of
3672         corresponds  to  the  lowest number is used. For further details of the         that  name  that  matched.  This saves searching to find which numbered
3673         interfaces for handling named subpatterns, see the  pcreapi  documenta-         subpattern it was. If you make a reference to a non-unique  named  sub-
3674         tion.         pattern  from elsewhere in the pattern, the one that corresponds to the
3675           lowest number is used. For further details of the interfaces  for  han-
3676           dling named subpatterns, see the pcreapi documentation.
3677    
3678    
3679  REPETITION  REPETITION
# Line 3821  BACK REFERENCES Line 3988  BACK REFERENCES
3988         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
3989         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
3990    
3991         Back references to named subpatterns use the Perl  syntax  \k<name>  or         There are several different ways of writing back  references  to  named
3992         \k'name'  or  the  Python  syntax (?P=name). We could rewrite the above         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
3993         example in either of the following ways:         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
3994           unified back reference syntax, in which \g can be used for both numeric
3995           and named references, is also supported. We  could  rewrite  the  above
3996           example in any of the following ways:
3997    
3998           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
3999             (?'p1'(?i)rah)\s+\k{p1}
4000           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
4001             (?<p1>(?i)rah)\s+\g{p1}
4002    
4003         A subpattern that is referenced by  name  may  appear  in  the  pattern         A  subpattern  that  is  referenced  by  name may appear in the pattern
4004         before or after the reference.         before or after the reference.
4005    
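As a rough illustration of the named back references described above, the following sketch uses Python's re module as a stand-in; it is not PCRE. Python accepts only the (?P<name>...) and (?P=name) forms, and it needs the inline option scoped as (?i:...), so the pattern is adapted accordingly. The names NAMED_REF and whole_match are made up for this example.

```python
import re

# Python adaptation of (?P<p1>(?i)rah)\s+(?P=p1). The caseless option is
# scoped to the group as (?i:...) because Python does not accept a bare
# (?i) in the middle of a pattern.
NAMED_REF = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

def whole_match(subject):
    """Return True if the entire subject matches NAMED_REF."""
    return NAMED_REF.fullmatch(subject) is not None

if __name__ == "__main__":
    for subject in ("rah rah", "RAH RAH", "RAH rah"):
        print(repr(subject), whole_match(subject))
```

The back reference compares against the text that was actually captured, so the mixed-case subject is rejected even though the group itself was matched caselessly.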
4006         There  may be more than one back reference to the same subpattern. If a         There may be more than one back reference to the same subpattern. If  a
4007         subpattern has not actually been used in a particular match,  any  back         subpattern  has  not actually been used in a particular match, any back
4008         references to it always fail. For example, the pattern         references to it always fail. For example, the pattern
4009    
4010           (a|(bc))\2           (a|(bc))\2
4011    
4012         always  fails if it starts to match "a" rather than "bc". Because there         always fails if it starts to match "a" rather than "bc". Because  there
4013         may be many capturing parentheses in a pattern,  all  digits  following         may  be  many  capturing parentheses in a pattern, all digits following
4014         the  backslash  are taken as part of a potential back reference number.         the backslash are taken as part of a potential back  reference  number.
4015         If the pattern continues with a digit character, some delimiter must be         If the pattern continues with a digit character, some delimiter must be
4016         used  to  terminate  the back reference. If the PCRE_EXTENDED option is         used to terminate the back reference. If the  PCRE_EXTENDED  option  is
4017         set, this can be whitespace.  Otherwise an  empty  comment  (see  "Com-         set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
4018         ments" below) can be used.         ments" below) can be used.
4019    
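The behaviour of a back reference to a subpattern that took no part in the match can be sketched as follows. Python's re module is used here only as a convenient stand-in and is not PCRE, although it treats this particular pattern the same way. The names UNSET_REF and whole_match are assumptions made for the example.

```python
import re

# Group 2 is set only when the "bc" alternative is taken, so \2 has
# nothing to compare against when the match starts with "a".
UNSET_REF = re.compile(r"(a|(bc))\2")

def whole_match(subject):
    """Return True if the entire subject matches UNSET_REF."""
    return UNSET_REF.fullmatch(subject) is not None

if __name__ == "__main__":
    for subject in ("bcbc", "a", "aa"):
        print(repr(subject), whole_match(subject))
```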
4020         A  back reference that occurs inside the parentheses to which it refers         A back reference that occurs inside the parentheses to which it  refers
4021         fails when the subpattern is first used, so, for example,  (a\1)  never         fails  when  the subpattern is first used, so, for example, (a\1) never
4022         matches.   However,  such references can be useful inside repeated sub-         matches.  However, such references can be useful inside  repeated  sub-
4023         patterns. For example, the pattern         patterns. For example, the pattern
4024    
4025           (a|b\1)+           (a|b\1)+
4026    
4027         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
4028         ation  of  the  subpattern,  the  back  reference matches the character         ation of the subpattern,  the  back  reference  matches  the  character
4029         string corresponding to the previous iteration. In order  for  this  to         string  corresponding  to  the previous iteration. In order for this to
4030         work,  the  pattern must be such that the first iteration does not need         work, the pattern must be such that the first iteration does  not  need
4031         to match the back reference. This can be done using alternation, as  in         to  match the back reference. This can be done using alternation, as in
4032         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
4033    
4034    
4035  ASSERTIONS  ASSERTIONS
4036    
4037         An  assertion  is  a  test on the characters following or preceding the         An assertion is a test on the characters  following  or  preceding  the
4038         current matching point that does not actually consume  any  characters.         current  matching  point that does not actually consume any characters.
4039         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
4040         described above.         described above.
4041    
4042         More complicated assertions are coded as  subpatterns.  There  are  two         More  complicated  assertions  are  coded as subpatterns. There are two
4043         kinds:  those  that  look  ahead of the current position in the subject         kinds: those that look ahead of the current  position  in  the  subject
4044         string, and those that look  behind  it.  An  assertion  subpattern  is         string,  and  those  that  look  behind  it. An assertion subpattern is
4045         matched  in  the  normal way, except that it does not cause the current         matched in the normal way, except that it does not  cause  the  current
4046         matching position to be changed.         matching position to be changed.
4047    
4048         Assertion subpatterns are not capturing subpatterns,  and  may  not  be         Assertion  subpatterns  are  not  capturing subpatterns, and may not be
4049         repeated,  because  it  makes no sense to assert the same thing several         repeated, because it makes no sense to assert the  same  thing  several
4050         times. If any kind of assertion contains capturing  subpatterns  within         times.  If  any kind of assertion contains capturing subpatterns within
4051         it,  these are counted for the purposes of numbering the capturing sub-         it, these are counted for the purposes of numbering the capturing  sub-
4052         patterns in the whole pattern.  However, substring capturing is carried         patterns in the whole pattern.  However, substring capturing is carried
4053         out  only  for  positive assertions, because it does not make sense for         out only for positive assertions, because it does not  make  sense  for
4054         negative assertions.         negative assertions.
4055    
4056     Lookahead assertions     Lookahead assertions
# Line 3888  ASSERTIONS Line 4060  ASSERTIONS
4060    
4061           \w+(?=;)           \w+(?=;)
4062    
4063         matches  a word followed by a semicolon, but does not include the semi-         matches a word followed by a semicolon, but does not include the  semi-
4064         colon in the match, and         colon in the match, and
4065    
4066           foo(?!bar)           foo(?!bar)
4067    
4068         matches any occurrence of "foo" that is not  followed  by  "bar".  Note         matches  any  occurrence  of  "foo" that is not followed by "bar". Note
4069         that the apparently similar pattern         that the apparently similar pattern
4070    
4071           (?!foo)bar           (?!foo)bar
4072    
4073         does  not  find  an  occurrence  of "bar" that is preceded by something         does not find an occurrence of "bar"  that  is  preceded  by  something
4074         other than "foo"; it finds any occurrence of "bar" whatsoever,  because         other  than "foo"; it finds any occurrence of "bar" whatsoever, because
4075         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
4076         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
4077    
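The three lookahead patterns above can be tried out with the sketch below. It uses Python's re module, which is not PCRE but handles these particular lookahead forms in the same way. The helper name first_match and the subject strings are made up for the example.

```python
import re

def first_match(pattern, subject):
    """Return (start offset, matched text) for the first match, or None."""
    m = re.search(pattern, subject)
    return (m.start(), m.group()) if m else None

if __name__ == "__main__":
    # The semicolon is tested for but is not part of the match.
    print(first_match(r"\w+(?=;)", "end;"))
    # Only the second "foo" is not followed by "bar".
    print(first_match(r"foo(?!bar)", "foobar foobaz"))
    # The assertion is tested where "bar" starts, so it is always true.
    print(first_match(r"(?!foo)bar", "foobar"))
```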
4078         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
4079         most  convenient  way  to  do  it  is with (?!) because an empty string         most convenient way to do it is  with  (?!)  because  an  empty  string
4080         always matches, so an assertion that requires there not to be an  empty         always  matches, so an assertion that requires there not to be an empty
4081         string must always fail.         string must always fail.
4082    
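A minimal sketch of forcing a failure with (?!) is shown below, again using Python's re module as a stand-in rather than PCRE. The pattern and the name forced_failure are assumptions chosen for the example: the first alternative can never succeed, so any match must come from the second.

```python
import re

def forced_failure(subject):
    """Search with a pattern whose first alternative always fails."""
    # "a(?!)" matches "a" and then hits the empty negative lookahead,
    # which can never be satisfied, so only the "b" alternative can match.
    m = re.search(r"a(?!)|b", subject)
    return m.group() if m else None

if __name__ == "__main__":
    print(forced_failure("ab"))
    print(forced_failure("aaa"))
```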
4083     Lookbehind assertions     Lookbehind assertions
4084    
4085         Lookbehind  assertions start with (?<= for positive assertions and (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
4086         for negative assertions. For example,         for negative assertions. For example,
4087    
4088           (?<!foo)bar           (?<!foo)bar
4089    
4090         does find an occurrence of "bar" that is not  preceded  by  "foo".  The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
4091         contents  of  a  lookbehind  assertion are restricted such that all the         contents of a lookbehind assertion are restricted  such  that  all  the
4092         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4093         eral  top-level  alternatives,  they  do  not all have to have the same         eral top-level alternatives, they do not all  have  to  have  the  same
4094         fixed length. Thus         fixed length. Thus
4095    
4096           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 3927  ASSERTIONS Line 4099  ASSERTIONS
4099    
4100           (?<!dogs?|cats?)           (?<!dogs?|cats?)
4101    
4102         causes an error at compile time. Branches that match  different  length         causes  an  error at compile time. Branches that match different length
4103         strings  are permitted only at the top level of a lookbehind assertion.         strings are permitted only at the top level of a lookbehind  assertion.
4104         This is an extension compared with  Perl  (at  least  for  5.8),  which         This  is  an  extension  compared  with  Perl (at least for 5.8), which
4105         requires  all branches to match the same length of string. An assertion         requires all branches to match the same length of string. An  assertion
4106         such as         such as
4107    
4108           (?<=ab(c|de))           (?<=ab(c|de))
4109    
4110         is not permitted, because its single top-level  branch  can  match  two         is  not  permitted,  because  its single top-level branch can match two
4111         different  lengths,  but  it is acceptable if rewritten to use two top-         different lengths, but it is acceptable if rewritten to  use  two  top-
4112         level branches:         level branches:
4113    
4114           (?<=abc|abde)           (?<=abc|abde)
4115    
4116         The implementation of lookbehind assertions is, for  each  alternative,         In some cases, the Perl 5.10 escape sequence \K (see above) can be used
4117         to  temporarily  move the current position back by the fixed length and         instead of a lookbehind assertion; this is not restricted to  a  fixed-
4118           length.
4119    
4120           The  implementation  of lookbehind assertions is, for each alternative,
4121           to temporarily move the current position back by the fixed  length  and
4122         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
4123         rent position, the assertion fails.         rent position, the assertion fails.
4124    
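The effect of a fixed-length lookbehind, including the case where too few characters precede the current position, can be sketched as follows. Python's re module is used as a stand-in and is not PCRE; in particular, Python requires every alternative in a lookbehind to have the same length, so only single-branch assertions are shown. The name match_start and the subject strings are made up for the example.

```python
import re

def match_start(pattern, subject):
    """Return the start offset of the first match, or None."""
    m = re.search(pattern, subject)
    return m.start() if m else None

if __name__ == "__main__":
    # "bar" is found only when the three characters before it are not "foo".
    print(match_start(r"(?<!foo)bar", "foobar"))
    print(match_start(r"(?<!foo)bar", "bazbar"))
    # With fewer than four characters before "e", the positive assertion
    # cannot step back far enough, so it fails.
    print(match_start(r"(?<=abcd)e", "cde"))
    print(match_start(r"(?<=abcd)e", "abcde"))
```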
4125         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
4126         mode) to appear in lookbehind assertions, because it makes it  impossi-         mode)  to appear in lookbehind assertions, because it makes it impossi-
4127         ble  to  calculate the length of the lookbehind. The \X and \R escapes,         ble to calculate the length of the lookbehind. The \X and  \R  escapes,
4128         which can match different numbers of bytes, are also not permitted.         which can match different numbers of bytes, are also not permitted.
4129    
4130         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
4131         assertions  to  specify  efficient  matching  at the end of the subject         assertions to specify efficient matching at  the  end  of  the  subject
4132         string. Consider a simple pattern such as         string. Consider a simple pattern such as
4133    
4134           abcd$           abcd$
4135    
4136         when applied to a long string that does  not  match.  Because  matching         when  applied  to  a  long string that does not match. Because matching
4137         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
4138         and then see if what follows matches the rest of the  pattern.  If  the         and  then  see  if what follows matches the rest of the pattern. If the
4139         pattern is specified as         pattern is specified as
4140    
4141           ^.*abcd$           ^.*abcd$
4142    
4143         the  initial .* matches the entire string at first, but when this fails         the initial .* matches the entire string at first, but when this  fails
4144         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
4145         last  character,  then all but the last two characters, and so on. Once         last character, then all but the last two characters, and so  on.  Once
4146         again the search for "a" covers the entire string, from right to  left,         again  the search for "a" covers the entire string, from right to left,
4147         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
4148    
4149           ^.*+(?<=abcd)           ^.*+(?<=abcd)
4150    
4151         there  can  be  no backtracking for the .*+ item; it can match only the         there can be no backtracking for the .*+ item; it can  match  only  the
4152         entire string. The subsequent lookbehind assertion does a  single  test         entire  string.  The subsequent lookbehind assertion does a single test
4153         on  the last four characters. If it fails, the match fails immediately.         on the last four characters. If it fails, the match fails  immediately.
4154         For long strings, this approach makes a significant difference  to  the         For  long  strings, this approach makes a significant difference to the
4155         processing time.         processing time.
4156    
4157     Using multiple assertions     Using multiple assertions
# Line 3984  ASSERTIONS Line 4160  ASSERTIONS
4160    
4161           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
4162    
4163         matches  "foo" preceded by three digits that are not "999". Notice that         matches "foo" preceded by three digits that are not "999". Notice  that
4164         each of the assertions is applied independently at the  same  point  in         each  of  the  assertions is applied independently at the same point in
4165         the  subject  string.  First  there  is a check that the previous three         the subject string. First there is a  check  that  the  previous  three
4166         characters are all digits, and then there is  a  check  that  the  same         characters  are  all  digits,  and  then there is a check that the same
4167         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
4168         ceded by six characters, the first three of which are digits and the last         ceded by six characters, the first three of which are digits and the last
4169         three  of  which  are not "999". For example, it doesn't match "123abc-         three of which are not "999". For example, it  doesn't  match  "123abc-
4170         foo". A pattern to do that is         foo". A pattern to do that is
4171    
4172           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
4173    
4174         This time the first assertion looks at the  preceding  six  characters,         This  time  the  first assertion looks at the preceding six characters,
4175         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
4176         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
4177    
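The difference between the two patterns above can be checked with a short sketch. It uses Python's re module as a stand-in for PCRE; both lookbehinds are of fixed length, so Python accepts them unchanged. The names THREE_BEFORE, SIX_BEFORE and found are assumptions made for the example.

```python
import re

# Both assertions are tested at the same point, just before "foo".
THREE_BEFORE = re.compile(r"(?<=\d{3})(?<!999)foo")
# The first assertion now looks six characters back; the second still
# looks at the three characters immediately before "foo".
SIX_BEFORE = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None

if __name__ == "__main__":
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
        print(subject, found(THREE_BEFORE, subject), found(SIX_BEFORE, subject))
```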
# Line 4003  ASSERTIONS Line 4179  ASSERTIONS
4179    
4180           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
4181    
4182         matches an occurrence of "baz" that is preceded by "bar" which in  turn         matches  an occurrence of "baz" that is preceded by "bar" which in turn
4183         is not preceded by "foo", while         is not preceded by "foo", while
4184    
4185           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
4186    
4187         is  another pattern that matches "foo" preceded by three digits and any         is another pattern that matches "foo" preceded by three digits and  any
4188         three characters that are not "999".         three characters that are not "999".
4189    
4190    
4191  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
4192    
4193         It is possible to cause the matching process to obey a subpattern  con-         It  is possible to cause the matching process to obey a subpattern con-
4194         ditionally  or to choose between two alternative subpatterns, depending         ditionally or to choose between two alternative subpatterns,  depending
4195         on the result of an assertion, or whether a previous capturing  subpat-         on  the result of an assertion, or whether a previous capturing subpat-
4196         tern  matched  or not. The two possible forms of conditional subpattern         tern matched or not. The two possible forms of  conditional  subpattern
4197         are         are
4198    
4199           (?(condition)yes-pattern)           (?(condition)yes-pattern)
4200           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
4201    
4202         If the condition is satisfied, the yes-pattern is used;  otherwise  the         If  the  condition is satisfied, the yes-pattern is used; otherwise the
4203         no-pattern  (if  present)  is used. If there are more than two alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
4204         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
4205    
4206         There are four kinds of condition: references  to  subpatterns,  refer-         There  are  four  kinds of condition: references to subpatterns, refer-
4207         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
4208    
4209     Checking for a used subpattern by number     Checking for a used subpattern by number
4210    
4211         If  the  text between the parentheses consists of a sequence of digits,         If the text between the parentheses consists of a sequence  of  digits,
4212         the condition is true if the capturing subpattern of  that  number  has         the  condition  is  true if the capturing subpattern of that number has
4213         previously matched.         previously matched. An alternative notation is to  precede  the  digits
4214           with a plus or minus sign. In this case, the subpattern number is rela-
4215           tive rather than absolute.  The most recently opened parentheses can be
4216           referenced  by  (?(-1),  the  next most recent by (?(-2), and so on. In
4217           looping constructs it can also make sense to refer to subsequent groups
4218           with constructs such as (?(+2).
4219    
4220         Consider  the  following  pattern, which contains non-significant white         Consider  the  following  pattern, which contains non-significant white
4221         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
# Line 4053  CONDITIONAL SUBPATTERNS Line 4234  CONDITIONAL SUBPATTERNS
4234         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
4235         optionally enclosed in parentheses.         optionally enclosed in parentheses.
4236    
4237           If you were embedding this pattern in a larger one,  you  could  use  a
4238           relative reference:
4239    
4240             ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
4241    
4242           This  makes  the  fragment independent of the parentheses in the larger
4243           pattern.
4244    
4245     Checking for a used subpattern by name     Checking for a used subpattern by name
4246    
4247         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
# Line 4194  RECURSIVE PATTERNS Line 4383  RECURSIVE PATTERNS
4383           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
4384    
4385         We  have  put the pattern into parentheses, and caused the recursion to         We  have  put the pattern into parentheses, and caused the recursion to
4386         refer to them instead of the whole pattern. In a larger pattern,  keep-         refer to them instead of the whole pattern.
4387         ing  track  of parenthesis numbers can be tricky. It may be more conve-  
4388         nient to use named parentheses instead. The Perl  syntax  for  this  is         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
4389         (?&name);  PCRE's  earlier syntax (?P>name) is also supported. We could         tricky.  This is made easier by the use of relative references. (A Perl
4390         rewrite the above example as follows:         5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write
4391           (?-2) to refer to the second most recently opened parentheses preceding
4392           the recursion. In other  words,  a  negative  number  counts  capturing
4393           parentheses leftwards from the point at which it is encountered.
4394    
4395           It  is  also  possible  to refer to subsequently opened parentheses, by
4396           writing references such as (?+2). However, these  cannot  be  recursive
4397           because  the  reference  is  not inside the parentheses that are refer-
4398           enced. They are always "subroutine" calls, as  described  in  the  next
4399           section.
4400    
4401           An  alternative  approach is to use named parentheses instead. The Perl
4402           syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
4403           supported. We could rewrite the above example as follows:
4404    
4405           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )           (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
4406    
4407         If there is more than one subpattern with the same name,  the  earliest         If  there  is more than one subpattern with the same name, the earliest
4408         one  is used. This particular example pattern contains nested unlimited         one is used.
4409         repeats, and so the use of atomic grouping for matching strings of non-  
4410         parentheses  is  important when applying the pattern to strings that do         This particular example pattern that we have been looking  at  contains
4411         not match. For example, when this pattern is applied to         nested  unlimited repeats, and so the use of atomic grouping for match-
4412           ing strings of non-parentheses is important when applying  the  pattern
4413           to strings that do not match. For example, when this pattern is applied
4414           to
4415    
4416           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4417    
# Line 4256  SUBPATTERNS AS SUBROUTINES Line 4461  SUBPATTERNS AS SUBROUTINES
4461         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
4462         by  name)  is used outside the parentheses to which it refers, it oper-         by  name)  is used outside the parentheses to which it refers, it oper-
4463         ates like a subroutine in a programming language. The "called"  subpat-         ates like a subroutine in a programming language. The "called"  subpat-
4464         tern  may  be defined before or after the reference. An earlier example         tern may be defined before or after the reference. A numbered reference
4465         pointed out that the pattern         can be absolute or relative, as in these examples:
4466    
4467             (...(absolute)...)...(?2)...
4468             (...(relative)...)...(?-1)...
4469             (...(?+1)...(relative)...
4470    
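The absolute and relative forms shown above differ only in how the group number is obtained. As a rough illustration (not PCRE's own code), the arithmetic can be sketched as below. The function name resolve_relative_group and the idea of passing in the count of capturing parentheses already opened to the left of the reference are assumptions made for this sketch.

```c
#include <assert.h>

/*
 * Resolve a relative group reference such as (?-1) or (?+2) to an
 * absolute group number.
 *
 *   opened_so_far  number of capturing parentheses opened to the left
 *                  of the reference
 *   rel            the signed relative value (must be non-zero)
 *
 * Returns the absolute group number, or -1 if a negative reference
 * would point before group 1. This is a hypothetical helper for
 * illustration only.
 */
int resolve_relative_group(int opened_so_far, int rel)
{
    assert(opened_so_far >= 0 && rel != 0);

    /* -1 means the most recently opened group, +1 the next one to open. */
    int absolute = (rel < 0) ? opened_so_far + rel + 1
                             : opened_so_far + rel;

    return (absolute >= 1) ? absolute : -1;
}
```

For the recursive pattern ( \( ( (?>[^()]+) | (?1) )* \) ), two capturing groups are open at the point of the call, so resolve_relative_group(2, -2) gives 1, which is why (?-2) can replace (?1) there.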
4471           An earlier example pointed out that the pattern
4472    
4473           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4474    
# Line 4279  SUBPATTERNS AS SUBROUTINES Line 4490  SUBPATTERNS AS SUBROUTINES
4490         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
4491         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
4492    
4493           (abc)(?i:(?1))           (abc)(?i:(?-1))
4494    
4495         It matches "abcabc". It does not match "abcABC" because the  change  of         It matches "abcabc". It does not match "abcABC" because the  change  of
4496         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
# Line 4334  AUTHOR Line 4545  AUTHOR
4545    
4546  REVISION  REVISION
4547    
4548         Last updated: 06 March 2007         Last updated: 19 June 2007
4549         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4550  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4551    
# Line 4415  RESTRICTED PATTERNS FOR PCRE_PARTIAL Line 4626  RESTRICTED PATTERNS FOR PCRE_PARTIAL
4626    
4627         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the         If PCRE_PARTIAL is set for a pattern  that  does  not  conform  to  the
4628         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL         restrictions,  pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
4629         (-13).         (-13).  You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo()  to
4630           find out if a compiled pattern can be used for partial matching.
4631    
4632    
4633  EXAMPLE OF PARTIAL MATCHING USING PCRETEST  EXAMPLE OF PARTIAL MATCHING USING PCRETEST
4634    
4635         If the escape sequence \P is present  in  a  pcretest  data  line,  the         If  the  escape  sequence  \P  is  present in a pcretest data line, the
4636         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that         PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
4637         uses the date example quoted above:         uses the date example quoted above:
4638    
# Line 4437  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 4649  EXAMPLE OF PARTIAL MATCHING USING PCRETE
4649           data> j\P           data> j\P
4650           No match           No match
4651    
4652         The first data string is matched  completely,  so  pcretest  shows  the         The  first  data  string  is  matched completely, so pcretest shows the
4653         matched  substrings.  The  remaining four strings do not match the com-         matched substrings. The remaining four strings do not  match  the  com-
4654         plete pattern, but the first two are partial matches.  The  same  test,         plete  pattern,  but  the first two are partial matches. The same test,
4655         using  pcre_dfa_exec()  matching  (by means of the \D escape sequence),         using pcre_dfa_exec() matching (by means of the  \D  escape  sequence),
4656         produces the following output:         produces the following output:
4657    
4658             re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4659           data> 25jun04\P\D           data> 25jun04\P\D
4660            0: 25jun04            0: 25jun04
4661           data> 23dec3\P\D           data> 23dec3\P\D
# Line 4455  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 4667  EXAMPLE OF PARTIAL MATCHING USING PCRETE
4667           data> j\P\D           data> j\P\D
4668           No match           No match
4669    
4670         Notice that in this case the portion of the string that was matched  is         Notice  that in this case the portion of the string that was matched is
4671         made available.         made available.
4672    
4673    
4674  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()  MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
4675    
4676         When a partial match has been found using pcre_dfa_exec(), it is possi-         When a partial match has been found using pcre_dfa_exec(), it is possi-
4677         ble to continue the match by  providing  additional  subject  data  and         ble  to  continue  the  match  by providing additional subject data and
4678         calling  pcre_dfa_exec()  again  with the same compiled regular expres-         calling pcre_dfa_exec() again with the same  compiled  regular  expres-
4679         sion, this time setting the PCRE_DFA_RESTART option. You must also pass         sion, this time setting the PCRE_DFA_RESTART option. You must also pass
4680         the  same working space as before, because this is where details of the         the same working space as before, because this is where details of  the
4681         previous partial match are stored. Here is an example  using  pcretest,         previous  partial  match are stored. Here is an example using pcretest,
4682         using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and         using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and
4683         \D are as above):         \D are as above):
4684    
4685             re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4686           data> 23ja\P\D           data> 23ja\P\D
4687           Partial match: 23ja           Partial match: 23ja
4688           data> n05\R\D           data> n05\R\D
4689            0: n05            0: n05
4690    
4691         The first call has "23ja" as the subject, and requests  partial  match-         The  first  call has "23ja" as the subject, and requests partial match-
4692         ing;  the  second  call  has  "n05"  as  the  subject for the continued         ing; the second call  has  "n05"  as  the  subject  for  the  continued
4693         (restarted) match.  Notice that when the match is  complete,  only  the         (restarted)  match.   Notice  that when the match is complete, only the
4694         last  part  is  shown;  PCRE  does not retain the previously partially-         last part is shown; PCRE does  not  retain  the  previously  partially-
4695         matched string. It is up to the calling program to do that if it  needs         matched  string. It is up to the calling program to do that if it needs
4696         to.         to.
4697    
4698         You  can  set  PCRE_PARTIAL  with  PCRE_DFA_RESTART to continue partial         You can set PCRE_PARTIAL  with  PCRE_DFA_RESTART  to  continue  partial
4699         matching over multiple segments. This facility can be used to pass very         matching over multiple segments. This facility can be used to pass very
4700         long  subject  strings to pcre_dfa_exec(). However, some care is needed         long subject strings to pcre_dfa_exec(). However, some care  is  needed
4701         for certain types of pattern.         for certain types of pattern.
4702    
4703         1. If the pattern contains tests for the beginning or end  of  a  line,         1.  If  the  pattern contains tests for the beginning or end of a line,
4704         you  need  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-         you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options,  as  appropri-
4705         ate, when the subject string for any call does not contain  the  begin-         ate,  when  the subject string for any call does not contain the begin-
4706         ning or end of a line.         ning or end of a line.
4707    
4708         2.  If  the  pattern contains backward assertions (including \b or \B),         2. If the pattern contains backward assertions (including  \b  or  \B),
4709         you need to arrange for some overlap in the subject  strings  to  allow         you  need  to  arrange for some overlap in the subject strings to allow
4710         for  this.  For  example, you could pass the subject in chunks that are         for this. For example, you could pass the subject in  chunks  that  are
4711         500 bytes long, but in a buffer of 700 bytes, with the starting  offset         500  bytes long, but in a buffer of 700 bytes, with the starting offset
4712         set to 200 and the previous 200 bytes at the start of the buffer.         set to 200 and the previous 200 bytes at the start of the buffer.
4713    
4714         3.  Matching a subject string that is split into multiple segments does         3. Matching a subject string that is split into multiple segments  does
4715         not always produce exactly the same result as matching over one  single         not  always produce exactly the same result as matching over one single
4716         long  string.   The  difference arises when there are multiple matching         long string.  The difference arises when there  are  multiple  matching
4717         possibilities, because a partial match result is given only when  there         possibilities,  because a partial match result is given only when there
4718         are  no  completed  matches  in a call to fBpcre_dfa_exec(). This means         are no completed matches in a call to pcre_dfa_exec(). This means  that
4719         that as soon as the shortest match has been found,  continuation  to  a         as  soon  as  the  shortest match has been found, continuation to a new
4720         new  subject  segment  is  no  longer possible.  Consider this pcretest         subject segment is no longer possible.  Consider this pcretest example:
        example:  
4721    
4722             re> /dog(sbody)?/             re> /dog(sbody)?/
4723           data> do\P\D           data> do\P\D
# Line 4517  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4728  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4728            0: dogsbody            0: dogsbody
4729            1: dog            1: dog
4730    
4731         The pattern matches the words "dog" or "dogsbody". When the subject  is         The  pattern matches the words "dog" or "dogsbody". When the subject is
4732         presented  in  several  parts  ("do" and "gsb" being the first two) the         presented in several parts ("do" and "gsb" being  the  first  two)  the
4733         match stops when "dog" has been found, and it is not possible  to  con-         match  stops  when "dog" has been found, and it is not possible to con-
4734         tinue.  On  the  other  hand,  if  "dogsbody"  is presented as a single         tinue. On the other hand,  if  "dogsbody"  is  presented  as  a  single
4735         string, both matches are found.         string, both matches are found.
4736    
4737         Because of this phenomenon, it does not usually make  sense  to  end  a         Because  of  this  phenomenon,  it does not usually make sense to end a
4738         pattern that is going to be matched in this way with a variable repeat.         pattern that is going to be matched in this way with a variable repeat.
4739    
4740         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
# Line 4532  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4743  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4743    
4744           1234|3789           1234|3789
4745    
4746         If the first part of the subject is "ABC123", a partial  match  of  the         If  the  first  part of the subject is "ABC123", a partial match of the
4747         first  alternative  is found at offset 3. There is no partial match for         first alternative is found at offset 3. There is no partial  match  for
4748         the second alternative, because such a match does not start at the same         the second alternative, because such a match does not start at the same
4749         point  in  the  subject  string. Attempting to continue with the string         point in the subject string. Attempting to  continue  with  the  string
4750         "789" does not yield a match because only those alternatives that match         "789" does not yield a match because only those alternatives that match
4751         at  one point in the subject are remembered. The problem arises because         at one point in the subject are remembered. The problem arises  because
4752         the start of the second alternative matches within the  first  alterna-         the  start  of the second alternative matches within the first alterna-
4753         tive. There is no problem with anchored patterns or patterns such as:         tive. There is no problem with anchored patterns or patterns such as:
4754    
4755           1234|ABCD           1234|ABCD
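The overlap arrangement described in point 2 above (500-byte chunks in a 700-byte buffer, with the previous 200 bytes kept at the front) can be sketched as follows. This is only the buffer handling; no PCRE function is called. The names segment_buffer and next_segment are hypothetical, and the sketch assumes the whole subject is available through a pointer, which a real streaming program would replace with reads from a file or socket.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { OVERLAP = 200, CHUNK = 500, BUF_SIZE = OVERLAP + CHUNK };

/* Hypothetical buffer: new data always starts at data[OVERLAP]; the
   context kept from earlier segments ends just before that point. */
typedef struct {
    char   data[BUF_SIZE];
    size_t ctx_len;   /* valid context bytes ending at data[OVERLAP]   */
    size_t new_len;   /* bytes of the current segment at data[OVERLAP] */
} segment_buffer;

/*
 * Load the next segment of src into sb, keeping up to OVERLAP bytes of
 * the previous contents as context. Returns the number of new bytes
 * (0 when the subject is exhausted). The buffer must be zeroed before
 * the first call.
 */
size_t next_segment(segment_buffer *sb, const char *src,
                    size_t src_len, size_t *pos)
{
    assert(*pos <= src_len);

    /* Slide the tail of the old contents down so that it ends at OVERLAP. */
    size_t valid = sb->ctx_len + sb->new_len;
    size_t keep  = valid < (size_t)OVERLAP ? valid : (size_t)OVERLAP;
    memmove(sb->data + OVERLAP - keep,
            sb->data + OVERLAP + sb->new_len - keep, keep);
    sb->ctx_len = keep;

    /* Copy in at most CHUNK new bytes after the context. */
    size_t n = src_len - *pos;
    if (n > (size_t)CHUNK) n = CHUNK;
    memcpy(sb->data + OVERLAP, src + *pos, n);
    *pos += n;
    sb->new_len = n;
    return n;
}
```

Under these assumptions, the subject passed to the matching function would begin at data + OVERLAP - ctx_len, have length ctx_len + new_len, and use ctx_len as the starting offset. On the first call ctx_len is 0, so no context precedes the data.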
# Line 4555  AUTHOR Line 4766  AUTHOR
4766    
4767  REVISION  REVISION
4768    
4769         Last updated: 06 March 2007         Last updated: 04 June 2007
4770         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4771  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4772    
# Line 4580  SAVING AND RE-USING PRECOMPILED PCRE PAT Line 4791  SAVING AND RE-USING PRECOMPILED PCRE PAT
4791         ent  host  and  run them there. This works even if the new host has the         ent  host  and  run them there. This works even if the new host has the
4792         opposite endianness to the one on which  the  patterns  were  compiled.         opposite endianness to the one on which  the  patterns  were  compiled.
4793         There  may  be a small performance penalty, but it should be insignifi-         There  may  be a small performance penalty, but it should be insignifi-
4794         cant.         cant. However, compiling regular expressions with one version  of  PCRE
4795           for  use  with  a  different  version is not guaranteed to work and may
4796           cause crashes.
4797    
4798    
4799  SAVING A COMPILED PATTERN  SAVING A COMPILED PATTERN
# Line 4663  RE-USING A PRECOMPILED PATTERN Line 4876  RE-USING A PRECOMPILED PATTERN
4876    
4877  COMPATIBILITY WITH DIFFERENT PCRE RELEASES  COMPATIBILITY WITH DIFFERENT PCRE RELEASES
4878    
4879         The layout of the control block that is at the start of the  data  that         In general, it is safest to  recompile  all  saved  patterns  when  you
4880         makes  up  a  compiled pattern was changed for release 5.0. If you have         update  to  a new PCRE release, though not all updates actually require
4881         any saved patterns that were compiled with  previous  releases  (not  a         this. Recompiling is definitely needed for release 7.2.
        facility  that  was  previously advertised), you will have to recompile  
        them for release 5.0 and above.  
   
        If you have any saved patterns in UTF-8 mode that use  \p  or  \P  that  
        were  compiled  with any release up to and including 6.4, you will have  
        to recompile them for release 6.5 and above.  
   
        All saved patterns from earlier releases must be recompiled for release  
        7.0  or  higher,  because  there was an internal reorganization at that  
        release.  
4882    
4883    
4884  AUTHOR  AUTHOR
# Line 4687  AUTHOR Line 4890  AUTHOR
4890    
4891  REVISION  REVISION
4892    
4893         Last updated: 06 March 2007         Last updated: 13 June 2007
4894         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
4895  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4896    
# Line 5155  MATCHING INTERFACE Line 5358  MATCHING INTERFACE
5358         return false (because the empty string is not a valid number):         return false (because the empty string is not a valid number):
5359    
5360            int number;            int number;
5361            pcrecpp::RE::FullMatch("abc", "[a-z]+(\d+)?", &number);            pcrecpp::RE::FullMatch("abc", "[a-z]+(\\d+)?", &number);
5362    
5363         The matching interface supports at most 16 arguments per call.  If  you         The matching interface supports at most 16 arguments per call.  If  you
5364         need    more,    consider    using    the    more   general   interface         need    more,    consider    using    the    more   general   interface
# Line 5422  PCRE SAMPLE PROGRAM Line 5625  PCRE SAMPLE PROGRAM
5625         bility  of  matching an empty string. Comments in the code explain what         bility  of  matching an empty string. Comments in the code explain what
5626         is going on.         is going on.
5627    
5628         If PCRE is installed in the standard include  and  library  directories         The demonstration program is automatically built if you use  "./config-
5629         for  your  system, you should be able to compile the demonstration pro-         ure;make"  to  build PCRE. Otherwise, if PCRE is installed in the stan-
5630         gram using this command:         dard include and library directories for your  system,  you  should  be
5631           able to compile the demonstration program using this command:
5632    
5633           gcc -o pcredemo pcredemo.c -lpcre           gcc -o pcredemo pcredemo.c -lpcre
5634    
5635         If PCRE is installed elsewhere, you may need to add additional  options         If  PCRE is installed elsewhere, you may need to add additional options
5636         to  the  command line. For example, on a Unix-like system that has PCRE         to the command line. For example, on a Unix-like system that  has  PCRE
5637         installed in /usr/local, you  can  compile  the  demonstration  program         installed  in  /usr/local,  you  can  compile the demonstration program
5638         using a command like this:         using a command like this:
5639    
5640           gcc -o pcredemo -I/usr/local/include pcredemo.c \           gcc -o pcredemo -I/usr/local/include pcredemo.c \
5641               -L/usr/local/lib -lpcre               -L/usr/local/lib -lpcre
5642    
5643         Once  you  have  compiled the demonstration program, you can run simple         Once you have compiled the demonstration program, you  can  run  simple
5644         tests like this:         tests like this:
5645    
5646           ./pcredemo 'cat|dog' 'the cat sat on the mat'           ./pcredemo 'cat|dog' 'the cat sat on the mat'
5647           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'           ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
5648    
5649         Note that there is a  much  more  comprehensive  test  program,  called         Note  that  there  is  a  much  more comprehensive test program, called
5650         pcretest,  which  supports  many  more  facilities  for testing regular         pcretest, which supports  many  more  facilities  for  testing  regular
5651         expressions and the PCRE library. The pcredemo program is provided as a         expressions and the PCRE library. The pcredemo program is provided as a
5652         simple coding example.         simple coding example.
5653    
# Line 5451  PCRE SAMPLE PROGRAM Line 5655  PCRE SAMPLE PROGRAM
5655         the standard library directory, you may get an error like this when you         the standard library directory, you may get an error like this when you
5656         try to run pcredemo:         try to run pcredemo:
5657    
5658           ld.so.1:  a.out:  fatal:  libpcre.so.0:  open failed: No such file or           ld.so.1: a.out: fatal: libpcre.so.0: open failed:  No  such  file  or
5659         directory         directory
5660    
5661         This is caused by the way shared library support works  on  those  sys-         This  is  caused  by the way shared library support works on those sys-
5662         tems. You need to add         tems. You need to add
5663    
5664           -R/usr/local/lib           -R/usr/local/lib
# Line 5471  AUTHOR Line 5675  AUTHOR
5675    
5676  REVISION  REVISION
5677    
5678         Last updated: 06 March 2007         Last updated: 13 June 2007
5679         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
5680  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5681  PCRESTACK(3)                                                      PCRESTACK(3)  PCRESTACK(3)                                                      PCRESTACK(3)
# Line 5541  PCRE DISCUSSION OF STACK USAGE Line 5745  PCRE DISCUSSION OF STACK USAGE
5745         In environments where stack memory is constrained, you  might  want  to         In environments where stack memory is constrained, you  might  want  to
5746         compile  PCRE to use heap memory instead of stack for remembering back-         compile  PCRE to use heap memory instead of stack for remembering back-
5747         up points. This makes it run a lot more slowly, however. Details of how         up points. This makes it run a lot more slowly, however. Details of how
5748         to do this are given in the pcrebuild documentation.         to do this are given in the pcrebuild documentation. When built in this
5749           way, instead of using the stack, PCRE obtains and frees memory by call-
5750         In  Unix-like environments, there is not often a problem with the stack         ing  the  functions  that  are  pointed to by the pcre_stack_malloc and
5751         unless very long strings are involved,  though  the  default  limit  on         pcre_stack_free variables. By default,  these  point  to  malloc()  and
5752         stack  size  varies  from system to system. Values from 8Mb to 64Mb are         free(),  but you can replace the pointers to cause PCRE to use your own
5753           functions. Since the block sizes are always the same,  and  are  always
5754           freed in reverse order, it may be possible to implement customized mem-
5755           ory handlers that are more efficient than the standard functions.
5756    
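Because the blocks are of one size and are freed in reverse order, a simple cache of freed blocks is one possible customized handler. The following is a minimal sketch under that assumption, not a recommended implementation. The names pool_stack_malloc, pool_stack_free and pool_release are hypothetical, and the code is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every block. The union keeps the user
   area suitably aligned (max_align_t needs C11). */
typedef union block_header {
    struct { union block_header *next; size_t size; } info;
    max_align_t align;
} block_header;

static block_header *free_list = NULL;   /* most recently freed block first */

/* Same signature as malloc(): reuse a cached block when one fits. */
void *pool_stack_malloc(size_t size)
{
    block_header *h = free_list;
    if (h != NULL && h->info.size >= size) {
        free_list = h->info.next;         /* pop the cached block */
    } else {
        h = malloc(sizeof(block_header) + size);
        if (h == NULL) return NULL;
        h->info.size = size;
    }
    return h + 1;                         /* user area follows the header */
}

/* Same signature as free(): push the block onto the cache. */
void pool_stack_free(void *p)
{
    if (p == NULL) return;
    block_header *h = (block_header *)p - 1;
    h->info.next = free_list;
    free_list = h;
}

/* Return all cached blocks to the system, e.g. at program exit. */
void pool_release(void)
{
    while (free_list != NULL) {
        block_header *next = free_list->info.next;
        free(free_list);
        free_list = next;
    }
}
```

In a program linked with a PCRE built to use the heap, the two handlers would be installed by assigning them to the pcre_stack_malloc and pcre_stack_free variables before the first match. The sketch never shrinks its cache on its own, so pool_release() has to be called explicitly.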
5757           In Unix-like environments, there is not often a problem with the  stack
5758           unless  very  long  strings  are  involved, though the default limit on
5759           stack size varies from system to system. Values from 8Mb  to  64Mb  are
5760         common. You can find your default limit by running the command:         common. You can find your default limit by running the command:
5761    
5762           ulimit -s           ulimit -s
5763    
5764         Unfortunately, the effect of running out of  stack  is  often  SIGSEGV,         Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
5765         though  sometimes  a more explicit error message is given. You can nor-         though sometimes a more explicit error message is given. You  can  nor-
5766         mally increase the limit on stack size by code such as this:         mally increase the limit on stack size by code such as this:
5767    
5768           struct rlimit rlim;           struct rlimit rlim;
# Line 5559  PCRE DISCUSSION OF STACK USAGE Line 5770  PCRE DISCUSSION OF STACK USAGE
5770           rlim.rlim_cur = 100*1024*1024;           rlim.rlim_cur = 100*1024*1024;
5771           setrlimit(RLIMIT_STACK, &rlim);           setrlimit(RLIMIT_STACK, &rlim);
5772    
5773         This reads the current limits (soft and hard) using  getrlimit(),  then         This  reads  the current limits (soft and hard) using getrlimit(), then
5774         attempts  to  increase  the  soft limit to 100Mb using setrlimit(). You         attempts to increase the soft limit to  100Mb  using  setrlimit().  You
5775         must do this before calling pcre_exec().         must do this before calling pcre_exec().
5776    
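A slightly fuller version of the fragment above, with error checking, might look like the sketch below. The function name raise_soft_stack_limit is hypothetical. The sketch assumes a POSIX system, never lowers an existing limit, and clamps the request to the hard limit.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Try to raise the soft stack limit to at least 'wanted' bytes.
 * Returns 0 on success (including when the limit is already large
 * enough) and -1 on failure. Hypothetical helper for illustration.
 */
int raise_soft_stack_limit(rlim_t wanted)
{
    struct rlimit rlim;

    assert(wanted > 0);
    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is already sufficient. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* The soft limit cannot exceed the hard limit. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

A call such as raise_soft_stack_limit((rlim_t)100 * 1024 * 1024) corresponds to the 100Mb request in the fragment above. If the hard limit is lower than the request, the soft limit ends up at the hard limit.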
5777         PCRE has an internal counter that can be used to  limit  the  depth  of         PCRE  has  an  internal  counter that can be used to limit the depth of
5778         recursion,  and  thus cause pcre_exec() to give an error code before it         recursion, and thus cause pcre_exec() to give an error code  before  it
5779         runs out of stack. By default, the limit is very  large,  and  unlikely         runs  out  of  stack. By default, the limit is very large, and unlikely
5780         ever  to operate. It can be changed when PCRE is built, and it can also         ever to operate. It can be changed when PCRE is built, and it can  also
5781         be set when pcre_exec() is called. For details of these interfaces, see         be set when pcre_exec() is called. For details of these interfaces, see
5782         the pcrebuild and pcreapi documentation.         the pcrebuild and pcreapi documentation.
5783    
5784         As a very rough rule of thumb, you should reckon on about 500 bytes per         As a very rough rule of thumb, you should reckon on about 500 bytes per
5785         recursion. Thus, if you want to limit your  stack  usage  to  8Mb,  you         recursion.  Thus,  if  you  want  to limit your stack usage to 8Mb, you
5786         should  set  the  limit at 16000 recursions. A 64Mb stack, on the other         should set the limit at 16000 recursions. A 64Mb stack,  on  the  other
5787         hand, can support around 128000 recursions. The pcretest  test  program         hand,  can  support around 128000 recursions. The pcretest test program
5788         has a command line option (-S) that can be used to increase the size of         has a command line option (-S) that can be used to increase the size of
5789         its stack.         its stack.
5790    
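The rule of thumb above is a simple division, sketched here for checking the figures. The function name recursion_limit_for_stack is hypothetical, and the figure of 500 bytes per recursion is only the rough estimate quoted above, not a measured value.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Estimate how many recursions fit in a given stack budget.
 * Hypothetical helper; both arguments are in bytes.
 */
unsigned long recursion_limit_for_stack(size_t stack_bytes,
                                        size_t bytes_per_recursion)
{
    assert(bytes_per_recursion > 0);
    return (unsigned long)(stack_bytes / bytes_per_recursion);
}
```

With 500 bytes per recursion, an 8Mb budget (8*1024*1024 bytes) gives 16777 and a 64Mb budget gives 134217. The rounder figures of 16000 and 128000 quoted above are a little lower, which leaves some headroom for the rest of the program's stack usage.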
# Line 5587  AUTHOR Line 5798  AUTHOR
5798    
5799  REVISION  REVISION
5800    
5801         Last updated: 12 March 2007         Last updated: 05 June 2007
5802         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
5803  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5804    

